I’m a skeptic, satiated by large raw data sets, analysis & inference

Speak to anyone who knows me, and they will likely characterize me as a skeptical, pessimistic, anxious, intense, and persistent individual.

If someone sends me a spreadsheet and then calls to walk me through the numbers, my immediate assumption is that the purpose of the follow-up call is to shape my perception. If someone provides me a composite of the figures without the raw data, visible formulas and documented data sources, I also assume manipulation. With this said, I am a realist, and I am willing to accept manipulation, but I am honest about that acceptance rather than convincing myself otherwise. I am just wired to be vigilant.

For me, the glass being half-full represents a lack of fear of it being half-empty. I am motivated to refill the glass by the reality that it is half-empty and by what is likely an unhealthy fear of dying from dehydration, but it works for me. From my perspective, the half-empty glass is not depressing or a demotivator; it is a potential reality. Now don’t get me wrong, I know there is water in the glass and death is not imminent, but I am incredibly aware of and grateful for the opportunity to find a water source to refill my glass.

I spend my days listening to dozens of pitches: where I need to focus, why I need to do x or y, what I am missing out on by not doing x or y, etc. The pitches almost always start with a half-full perspective, selling the positive, but it’s amazing how, when it doesn’t go the way the pitchman expects, the approach shifts to the half-empty perspective, relying on FOMO (fear of missing out) as a last-ditch attempt at motivation.

Now let’s face it, no one likes to miss out, but as a realist, I recognize that I can’t do everything, so decisions are required. Forks in the road appear every minute of every hour of every day, and I am greeted at each fork by a host espousing the merits of their righteous path. For someone like me, these decisions need to be my own, driven by raw data (as raw as it can be), analysis and inference. I try to check the near-term outcomes at the door and focus on visualizing the long-term strategic outcomes, the vision. In my mind, tactical activities require little to no thought; they just happen. For example, a visionary looking for a more sustainable model for garbage disposal doesn’t stop taking their garbage to the curb every Monday and Thursday. Accepting what is and executing without much thought IMO avoids paralysis and makes room in your life and brain for what will be.

So now we arrive at the origin of this blog. I have to make personal and professional bets on where the market is going, what is most relevant and where I should focus my time. Of course, I have a subjective opinion on where I believe the market is going, but I like to validate my opinions with some data. With so many people, organizations and news outlets selling their version of the future, the question becomes: how do I validate my opinions objectively? Social chatter is meaningful to me, as is sentiment analysis. The great news is that with a little Python, the use of some APIs and the ELK stack, it’s pretty easy to collect data from social media platforms, store it, analyze it and draw some conclusions. One question that is very relevant to me is which technologies and which OEMs (original equipment manufacturers) have mindshare. I’ve been pulling social media data for a few weeks using #hashtags to see what techs and OEMs have the most buzz; I have also been doing sentiment analysis to see if the buzz is good or bad.
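To give a feel for the idea, here is a minimal sketch of hashtag buzz counting with a very naive sentiment score. It is not my actual pipeline: the sample posts, the #vendora and #vendorb tags and the tiny word lists are all made up for illustration, and a real setup would pull posts from an API and index them in ELK rather than use a hard-coded list.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample posts standing in for data pulled from a social media API.
SAMPLE_POSTS = [
    "Loving the new #vendora release, great work",
    "#vendora upgrade was painful and slow",
    "Great demo of #vendorb today",
    "#vendorb and #vendora together are great",
]

# Toy sentiment word lists (an assumption, not a real sentiment model).
POSITIVE = {"loving", "great", "good"}
NEGATIVE = {"painful", "slow", "bad"}

HASHTAG = re.compile(r"#(\w+)")
WORD = re.compile(r"[a-z']+")


def score(text):
    """Count positive words minus negative words in one post."""
    words = WORD.findall(text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def buzz(posts):
    """Return (mentions per hashtag, summed sentiment per hashtag)."""
    counts = Counter()
    sentiment = defaultdict(int)
    for post in posts:
        post_score = score(post)
        # Count each hashtag once per post, case-insensitively.
        for tag in {t.lower() for t in HASHTAG.findall(post)}:
            counts[tag] += 1
            sentiment[tag] += post_score
    return counts, dict(sentiment)


if __name__ == "__main__":
    counts, sentiment = buzz(SAMPLE_POSTS)
    for tag, mentions in counts.most_common():
        print(tag, mentions, sentiment[tag])
```

The mention count is the "buzz" and the summed score is a crude stand-in for whether that buzz is good or bad.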

Here is my view of the market using social media buzz to determine mindshare (it actually feels pretty on the money):

Continental seat 21D stream of consciousness…

Sitting on a Continental flight in seat 21D from Austin, TX to Newark, NJ, I thought I would write a post (maybe this would be better classified as a rant or a stream of consciousness) on change and vision. I realize that this is more of a culmination of random thoughts than anything else, but what the heck, I’m bored. Well, not really bored, but I need a break from work. Over the past few weeks I have been working relentlessly on PowerPoint presentations and writing a lot while I am flying, so I thought I would use the last hour or so of this flight to dump a bunch of random thoughts onto paper. Clear some room in my brain for new ideas 🙂

As human beings it is very difficult to step outside our skin and change our perspective. This inability to readily change our perspective is, in my opinion, the key contributor to our resistance to change and our inability to create a generally acceptable vision. The thought that we may be removed from our comfort zone to a place where we are no longer the authority, and forced into the unknown, can be frightening. A few years ago (more like 10 years ago) a good friend of mine introduced me to a concept called “Remote Viewing”. “Remote Viewing” is a technique employed and taught by the CIA as a way to visualize things through the eyes of another person (from the CIA’s perspective this would be the enemy). I was blown away, and for years I believed that I could learn “Remote Viewing” and it would hold the key to my success in life: the ability to fully commit to someone else’s perspective, to the degree that I would see the situation exactly as they would. I could then take this knowledge and apply it to the mission of morphing their perspective. The ability to do this would provide me the unfair advantage I was looking for. Well, needless to say, I don’t think anyone has to worry. I have not mastered “Remote Viewing”, but then again, if I had, would I tell you? 🙂

Interestingly enough, as an outgrowth of remote viewing I became interested in something called Neuro-linguistic programming. NLP is a little bit more grounded than Remote Viewing. I researched NLP and purchased and read 4 books on the topic. Once you read the concept you will understand why it was so intriguing to me, and for those of you who know me it should be even more evident why I would be so captivated by the concept. My wife can attest to the fact that I am definitely not the life of the party with my philosophical opinions on topics such as “Remote Viewing”, NLP and other bizarre topics like the Bible Code. As a matter of fact, years ago she put a moratorium on such topics in public settings, thus I am forced to write about them on my blog 🙂

The term “Neuro-linguistic programming” is pretty self-explanatory. It is a set of techniques, axioms and beliefs most often leveraged for personal development. It is rooted in the study of body language and language patterns. The techniques are predicated upon the principle that all behaviors (whether excellent or dysfunctional) are not random, but have a practically determinable structure. Wow, what an amazing field: the ability to interpret subjective reality. My thought process when I began educating myself on NLP was that I would learn how to morph subjective reality into objective reality.

So I purchased several books on NLP and began to battle through the very dry material in an attempt to learn the techniques and leverage them as a way to get inside my audience’s head, understand their subjective reality and create an objective reality aligned with my subjective reality. The ability to lead is firmly rooted in this concept, the creation of a reality that moves people.

What the hell am I talking about, oh yeah…

Why am I writing this? Change is inevitable, and our ability to cope and stay grounded is based on our ability to create a subjective reality. The ability to understand and alter the subjective reality of others, in my opinion, ultimately holds the key to our success. Now the tough part: determining how to do this. My adaptation is a mix that works for me but may not work for others:

  1. Defined End-State – Having a clear, defensible definition of the end-state is critical.
  2. Vision – Have a vision. The ability to clearly and concisely convey your vision is key. It’s not over once you achieve the end-state. A vision should not be finite. The end-state should be the horizon, once we reach the horizon a new horizon is created.
  3. Faith – The only thing that matters is the end-state. While milestones and the measurement of progress are applicable, they should not define the map for the journey. Execution of a vision requires a fluid, adaptable philosophy executed by individuals with insatiable desire, not individuals with a need to hit milestones. This, in my opinion, is by far the most difficult aspect of executing against a successful vision. This is not an exact science but rather an art. Oftentimes the painting is meaningless until the last brush stroke is applied to the canvas.
  4. Commitment – Stay committed and vigilant. The horizon is distant and will often seem like a mirage, the ability to stay focused on the vision and committed to the destination is paramount.
  5. Flexibility – Be flexible, the ability to influence is predicated on the ability to morph and adapt to detours in the road to the end-state.

Finally, for those of you who were Boy Scouts: “Be Prepared”…

Ahhhhh…. I feel refreshed!

Thinking out loud…

Once again I read a blog post this morning on Mark Lewis’s blog that I felt compelled to comment on.  Unfortunately EMC has opted to disable the comment function of Mark’s TypePad blog???  What’s the deal with this?  Hopefully someone will realize that comments should be turned on sooner rather than later.

Nonetheless it is nice to see Mark commenting on the applicability of expert knowledge to great technology. What a great quote: “Technology is no cure for stupidity.” I believe that any great solution is built on great technology, expert knowledge (intellectual property) and well-honed process. The ability to apply technology to holistic business strategy is a difficult thing for many organizations to visualize; all too often tactical infrastructure requirements bubble to the top and take precedence, forcing many organizations to abandon strategic vision. I propose that with the application of expert knowledge, tactical problems can be solved and aligned with a strategic vision. Tactical behavior with a disregard for strategic alignment will continue to make it very difficult for organizations to realize the maximum potential of many of the technology solutions that they are deploying and implementing. Ultimately this is a self-fulfilling prophecy. Let’s face it – there is a ton of parity in the marketplace; the ability to apply expert knowledge to a business problem is now where the solution value lies. Is there still business value in the bowels of brick and mortar technology, or should we be focused on solving business problems and identifying solution providers who provide our organization with the highest probability for success? The analogy I like to use is: if you were to build a brick structure, who provides the most value, the mason, the quarry, the kiln process, etc.? Most of us discount the value of the quarry and kiln process, as a brick is a brick. The expert knowledge of the mason is more often subject to heavy scrutiny because the success of the project relies on the mason. The mason holds the knowledge, and the consumer is entrusting the expert to source the right brick for the job, apply their expert knowledge and complete the project to specifications, on time and on budget.

As a side note, this brings up another thought – depth vs. breadth, a debate that I am often engaged in. Is a highly skilled brick mason qualified to do tile work? As an educated consumer I would not hire a brick mason to lay my tile floor. Although these are adjacent skills, the discrete skills required to deliver tile work effectively and efficiently most probably do not exist in a brick mason.

Would love your thoughts on this rant….


Humbleness is far more productive than expertness!

Prompted by a recent “Nick Burns” (Saturday Night Live character) event I needed to vent.

I never cease to be amazed by the vast number of people who put themselves in situations assuming they are the smartest person in a conversation; inevitably this makes them the dumbest person in the conversation. I am guilty of this on occasion, but I try to learn from my mistakes as often as possible. As a general rule I always assume that most people are smarter than I am. I have always taken this approach because I have never met someone who, given the opportunity, could not teach me something. The passing of knowledge from one person to another tends to be more fruitful and enjoyable when coupled with a little humility. I have also always approached things in this manner because it gives me a reason to push myself as hard as humanly possible. In my opinion the key to putting yourself in a position of authority rests solely on an individual’s desire and ability to study and comprehend the existing information on a particular topic, coupled with the intellect to derive new conclusions. I truly believe that at the rate things change on this great planet, at any given time I know <1% of what I want or need to know. This philosophy has provided me with an insatiable thirst to know more and hopefully the ability to learn enough to hold an intelligent conversation, but I would never consider myself an expert on anything; there is just too much to learn!

I feel much better now!

Data classification and the need for ontology…

How many organizations are struggling with data classification and building an ILM (Information Lifecycle Management) strategy? In the storage industry today we often talk about the differences between ILM and HSM (Hierarchical Storage Management), but does taxonomy actually provide enough to realize true ILM? Rather than taxonomy we should be addressing ontology. Ontology, as defined in the realm of computer science, is a data model that represents a domain and is used to reason about the objects in that domain and the relations between them. Known categorization systems used today were designed to optimize linear seek time, not to optimize or categorize the intellectual aspects of information. Classification and categorization techniques used today, while presented as organizing information, are actually categorizing the physical objects that contain ideas or information. The industry is attempting to leverage traditional categorization methods by using tags, which create metadata to try to depict the ideas and information inside the containers. More intelligent categorization methods today are applying lexicons to attempt to automate the generation of metadata. Again, I find it odd that we, as an industry obsessed with data classification and life cycle management, do not address ontology and our approach to ontology on a daily basis.

Ontology would need to consider owners, users, participants, openness of the domain, the potential for the control set to be altered, and signal loss. The storage industry has avoided true ontology because the undertaking is massive. Until an ontological method for classification and categorization is developed, can we ever really achieve true ILM?

A thesaurus of terms, words or tags is an absolute requirement to enable true ILM. A canonical example: imagine someone searches the web (the largest known corpus of data) for “Movie” and another user searches a repository for “Cinema”. Would the return be the same? Most likely, if the search is of a full text index, the answer would be no. The reliance on tags to categorize a document using multiple words or terms makes it difficult to enforce and deliver true plug-and-play categorization and ILM. Now we also have to consider the signal loss: imagine a search of the web for “… Politics” and “… Agenda”. While they might appear to be synonymous, they may or may not be.
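A tiny sketch of the thesaurus idea, purely illustrative: the term mappings, document IDs and tags below are all made up, and a real ontology would carry far more than a flat synonym table. The point is only that “Movie” and “Cinema” return the same documents once both are normalized to one canonical term, while terms that merely appear synonymous are left alone.

```python
# Hypothetical thesaurus mapping variant terms to one canonical term.
THESAURUS = {"movie": "film", "cinema": "film", "film": "film"}

# Hypothetical repository: document id -> free-form tags.
DOCUMENTS = {
    "doc1": ["Movie", "review"],
    "doc2": ["Cinema", "history"],
    "doc3": ["Politics"],
}


def canonical(term):
    """Normalize a term to its canonical form, or leave it as-is if unknown."""
    lowered = term.lower()
    return THESAURUS.get(lowered, lowered)


def search(term, documents=DOCUMENTS):
    """Return the sorted ids of documents tagged with the term or a synonym."""
    wanted = canonical(term)
    return sorted(
        doc_id
        for doc_id, tags in documents.items()
        if wanted in {canonical(tag) for tag in tags}
    )


if __name__ == "__main__":
    print(search("Movie"))
    print(search("Cinema"))
    print(search("Agenda"))
```

Note that “Politics” and “Agenda” are deliberately not mapped to each other here, which is the signal loss question in miniature: someone has to decide whether they are the same thing.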

This is a complex problem that is not easily addressed, but I believe there is a definite long-term requirement for a transition to an ontological approach.


Compliance and Collisions

Over the past couple of years compliance has become a major buzzword in the storage industry. Regulatory bodies such as the SEC and the Federal Government have mandated that organizations begin adhering to the over 16,000 worldwide regulations.

There are numerous technologies that the storage industry has responded with, many of them legacy technologies and many of them new, more advanced technologies which have changed the face of compliance and long-term archiving. While traditional technologies such as WORM[1] optical and WORM tape continue to play an active role in compliance and long-term archiving, CAS (Content Addressable Storage) has emerged as the technology of choice. Organizations can now cost-effectively host petabytes of archive data online with reliability, availability, manageability, and serviceability that surpasses that of traditional WORM devices.

While Content Addressable Storage has revolutionized the compliance and long-term archiving marketplace, there have been some concerns raised in the past 6 to 9 months. This article will examine at a high level the workings of Content Addressable Storage and some of the associated concerns.

The basic premise for Content Addressable Storage is that data sent to a compliant device is hashed and stored on the device. The hash acts as a digital fingerprint for the data; in theory the only way to generate a duplicate hash is to hash the exact same data. The concept of digitally fingerprinting data has provided benefits beyond compliance and guaranteed authenticity. Hashing has facilitated single instance storage, so data with an identical fingerprint can be deduplicated; this type of functionality has a positive cascading effect throughout an organization. CAS vendors provide hashing algorithm options with their products; the more common hashing algorithms are MD5[2], SHA-1[3], SHA-256, and SHA-512.
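The basic premise can be sketched in a few lines of Python. This is a toy in-memory illustration and not how any vendor's CAS product works: the `ContentStore` class is hypothetical, and SHA-256 is chosen here only as an example algorithm.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the hash of the data is its address."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The fingerprint of the content doubles as its storage address.
        address = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same address, so it is stored once.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Re-hash on read to confirm the content still matches its fingerprint.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data

    def __len__(self):
        return len(self._blobs)


if __name__ == "__main__":
    store = ContentStore()
    first = store.put(b"quarterly report")
    second = store.put(b"quarterly report")  # duplicate write
    print(first == second, len(store))
```

Writing the same bytes twice yields the same address and a single stored copy, which is the single instance storage benefit, and the re-hash on read is the authenticity check.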

Criteria               MD5        SHA-1       SHA-256
Hash length            128 bit    160 bit     256 bit
Maximum size of data   Infinite   2^64 bits   2^64 bits
Main advantage         Speed      Security    Perceived as more secure

Figure 1 displays the relative performance and response time of the MD5, SHA-1 and SHA-512 algorithms. It is important to note the increased overhead associated with larger hash sizes.

Figure 1: Hashing performance metrics


Source: http://msdn.microsoft.com/library/default.asp?url=/library/

Recently collisions have been discovered for both MD5 and SHA-1. Essentially these collisions were produced in the lab and were found by generating data that produces the same MD5 or SHA-1 hash. Figure 2 is an example of two hex data strings that cause an MD5 collision. Figure 3 is the Perl source that can be used to prove the collision.

Figure 2: MD5 Collision Example


Figure 3: Example Perl Script to demonstrate MD5 collision (requires Digest::MD5 and Digest::SHA1)


The above example represents a hash collision, H(M) = H(M1). This hash collision was lab generated by looking for random theoretical data that would cause a hash collision.

There are two common cryptosystem attacks that this article will concentrate on, the brute force attack and the birthday attack. A brute force and birthday attack both solve for collisions by generating M and M1 until there is a hash collision.

The MD5 hash function produces 128-bit values, whereas SHA-1 produces 160-bit values and SHA-256 produces 256-bit values. The question becomes: how many bits do we need for security? Practically, 2^128, 2^160 and 2^256 are all more than large enough to thwart a brute force attack that simply searches randomly for colliding pairs (M, M1). However, a birthday attack reduces the size of the search space to roughly the square root of the original size. Thus, MD5 has roughly the same resistance to the birthday attack as a cryptosystem with 64-bit keys would have to a brute force attack. Similarly, SHA-1’s effective size in terms of birthday attack resistance is only 80 bits, etc.

The birthday attack is named for the birthday paradox, which simply states that there is approximately a 50-50 chance that two people in a room of 23 strangers have the same birthday. For a complete description of the birthday paradox click the following link http://en.wikipedia.org/wiki/Birthday_paradox.
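The 50-50 figure is easy to check. The sketch below assumes 365 equally likely birthdays and ignores leap years; the function name is just for illustration.

```python
def birthday_collision_probability(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_unique = 1.0
    for i in range(people):
        # Person i must avoid the i birthdays already taken.
        p_all_unique *= (days - i) / days
    return 1.0 - p_all_unique


if __name__ == "__main__":
    for n in (10, 22, 23, 40):
        print(n, round(birthday_collision_probability(n), 4))
```

With 23 people the probability comes out just over one half, and with 22 it is just under, which is where the number 23 comes from.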

A birthday attack essentially creates random messages, takes their hash value, and checks to see if that hash value has been encountered before. For MD5, as an example, an attacker could expect to find collisions after trying 2^64 messages. Given today’s computing power, this is a difficult, but not impossible task.
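To see the square-root effect without waiting for 2^64 hashes, here is a sketch that runs the same procedure against MD5 truncated to 24 bits. The truncation and the use of counter-based messages rather than random ones are assumptions made so the demo finishes quickly and reproducibly; it says nothing about attacking full MD5.

```python
import hashlib
from itertools import count


def truncated_md5(message: bytes, bits: int) -> int:
    """Return the top `bits` bits of the MD5 digest as an integer."""
    digest = hashlib.md5(message).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def birthday_attack(bits):
    """Hash distinct messages until two share the same truncated hash."""
    seen = {}  # truncated hash -> first message that produced it
    for i in count():
        message = str(i).encode()
        h = truncated_md5(message, bits)
        if h in seen:
            return seen[h], message, i + 1  # i + 1 messages were tried
        seen[h] = message


if __name__ == "__main__":
    m1, m2, tries = birthday_attack(24)
    print(m1, m2, tries)
```

For a 24-bit hash a collision is expected after a number of tries on the order of 2^12, far fewer than the 2^24 possible values, and the same square-root relationship is what gives 2^64 for the full 128 bits.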

While collisions to both MD5 and SHA-1 have been found using both brute force and birthday attacks these are not real world examples. The concept of generating artificial data until a collision is found in no way threatens the authenticity or integrity of existing data.

The two hash attacks that can cause authenticity and data integrity problems are a 1st preimage and 2nd preimage attack.

A 1st preimage attack is best described as given X (X represents an existing hash) solve for M, or by the equation H(M) = X. A 1st preimage attack has never been successful against any of the mentioned hashing algorithms.

A 2nd preimage attack can be described as a given M solve for M1 where the hashes are equal. This can be represented by the equation H(M) = H(M1). A 2nd preimage attack has also never been successful against any of the mentioned hashing algorithms.

While brute force and birthday attacks provide reason for concern for both MD5 and SHA-1, the key question to consider is: what damage can generating two bogus messages with the same hash do? Why is this important?

Imagine for a moment that an adversary constructs two messages with the same hash where one message appears legitimate or innocuous. For example, suppose the attacker, Charlie, discovers that the message “I, Bob, agree to pay Charlie $ 5000.00 on 4/12/2005.” has the same hash as “I, Bob, agree to pay Charlie $18542841.54 on 9/27/2012.” Charlie could then try to get the victim, Bob, to digitally sign the first message (e.g., by purchasing $5000 of goods). Charlie would then claim that Bob actually signed the second message, and “prove” this assertion by showing that Bob’s signature matches the second message. While in theory this is possible, it would require a 2nd preimage attack, which has never been successfully perpetrated against any of the aforementioned algorithms.

A SHA-1 attack requires an estimated 2^69, or approximately 590 billion billion, hash computations. The amount of computational power required to generate a hash collision is far beyond the average desktop computer. To put this in perspective, using 10,000 custom ASICs[4] that can each perform 2 billion hash operations per second, the attack would take about one year. Moore’s Law[5] will make the attack more practical over time, and there are also community initiatives that may make it more feasible in the near future (http://www.certainkey.com/dnet/).
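A quick check of that arithmetic, taking the estimate as 2^69 hash computations and using the ASIC figures quoted above. The constant and function names are mine, and a 365-day year is assumed.

```python
HASHES_NEEDED = 2 ** 69                     # estimated work for the SHA-1 attack
ASICS = 10_000                              # number of custom ASICs
HASHES_PER_SECOND_PER_ASIC = 2_000_000_000  # 2 billion hashes per second each
SECONDS_PER_YEAR = 365 * 24 * 60 * 60       # assumes a 365-day year


def attack_years(hashes=HASHES_NEEDED, asics=ASICS,
                 rate=HASHES_PER_SECOND_PER_ASIC):
    """Years needed to compute `hashes` hashes across all ASICs."""
    total_rate = asics * rate  # hashes per second for the whole farm
    return hashes / total_rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{HASHES_NEEDED:.2e} hashes, {attack_years():.2f} years")
```

2^69 is about 5.9 × 10^20, and at a combined 2 × 10^13 hashes per second the run time lands a little under one year, which matches the estimate.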

In conclusion, there is no reason to believe that we are even close to a successful 1st preimage or 2nd preimage attack. Practically speaking, there is no cause for concern over the use of either the MD5 or SHA-1 algorithm. Customers who are concerned with MD5 and/or SHA-1 algorithms should inquire about alternative hashing algorithms; most vendors will support multiple.

[1] Write Once Read Many
[2] Message-Digest algorithm 5
[3] Secure Hash Algorithm
[4] Application Specific Integrated Circuits

The devil’s in the mathematical detail…

Looking forward to my tax relief in New Jersey 🙂 What a joke. A 1% increase in sales tax is now the great hope to lower my property tax in New Jersey. This is Jon Corzine’s master plan; it is hard to believe that he was the Chairman and CEO of Goldman Sachs from 1994 to 1999. Let’s just do some simple math: to lower taxes an average of $1000.00 a year per NJ resident (sounds like a number that would get me excited), working with a .01 increase per dollar in sales tax (the 1% tax increase), the state would have to see an average of $100,000.00 per resident in sales of taxable items. Is it me or does this seem ludicrous? Let’s assume the state does a phenomenal job investing (note this oxymoronic statement: when was the last time the local, state or federal government did a phenomenal job investing?) and this is a multi-year plan; I would still rather have my 1% back.

It is also my understanding that only a portion of the 1% is going toward property tax breaks; some portion will be going towards the already depleted state workers’ pension fund (another fine example of the state government investing geniuses at work).

Some tax facts about the state of New Jersey before we get to my conclusion.

New Jersey is one of the 37 states that collect property taxes at both the state and local levels. As in most states, local governments collect far more. New Jersey’s localities collected $18,225,594,000 in property taxes in fiscal year 2004, which is the latest year the Census Bureau published state-by-state property tax collections. At the state level, New Jersey collected $3,660,000 in property taxes during FY 2004, making its combined state/local property taxes $18,229,254,000. At $2,099, New Jersey’s combined per capita collections were the highest in the nation.

Estimated at 10.8% of income, New Jersey’s state/local tax burden percentage ranks 17th highest nationally, above the national average of 10.6%. New Jersey tax payers pay $5,234 per-capita in state and local taxes.

New Jersey’s personal income tax system consists of six brackets and a top rate of 8.97% kicking in at an income level of $500,000. Among states levying personal income taxes, New Jersey’s top rate ranks 6th highest nationally. New Jersey’s 2004 individual income tax collections were $852 per person.

New Jersey levies a 6% general sales or use tax on consumers, which is above the national median of 5%. State and local governments combined collect approximately $721 per capita in general sales taxes, ranking 31st highest nationally. New Jersey’s gasoline tax stands at 14.5 cents per gallon and ranks 4th lowest nationally. New Jersey’s cigarette tax stands at $2.40 per pack of twenty and ranks 2nd highest nationally. The sales tax was adopted in 1966, the gasoline tax in 1927 and the cigarette tax in 1948.

So at $721 per capita in sales tax collected at a 6% rate, that puts per capita taxable sales at $12,016.67. Assuming the entire 1% tax increase was going to lower property taxes, this would yield a potential reduction of $120.17, which I guess, if you apply that to the average state/local property taxes of $2,099.00, would be a 5.7% reduction.
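For anyone who wants to check my math, here is the back-of-the-envelope calculation as a few lines of Python. The figures are the ones quoted above; the function names are mine, and the big assumption is that the whole 1% goes to property tax relief.

```python
SALES_TAX_PER_CAPITA = 721.00      # per capita general sales tax collected
CURRENT_RATE = 0.06                # existing 6% sales tax
RATE_INCREASE = 0.01               # proposed 1% increase
PROPERTY_TAX_PER_CAPITA = 2099.00  # combined state/local per capita


def taxable_sales_per_capita():
    """Taxable sales implied by the tax collected at the current rate."""
    return SALES_TAX_PER_CAPITA / CURRENT_RATE


def extra_revenue_per_capita():
    """Additional revenue per capita from the rate increase alone."""
    return taxable_sales_per_capita() * RATE_INCREASE


def property_tax_reduction_pct():
    """Extra revenue as a percentage of per capita property tax."""
    return extra_revenue_per_capita() / PROPERTY_TAX_PER_CAPITA * 100


def sales_needed_for_relief(relief):
    """Taxable sales per resident needed to fund `relief` dollars at 1%."""
    return relief / RATE_INCREASE


if __name__ == "__main__":
    print(round(taxable_sales_per_capita(), 2))
    print(round(extra_revenue_per_capita(), 2))
    print(round(property_tax_reduction_pct(), 1))
    print(round(sales_needed_for_relief(1000.00), 2))
```

The last function is the $100,000 figure from earlier: at one cent on the dollar, $1000 of relief needs $100,000 of taxable sales per resident.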

I won’t hold my breath for my property tax decrease. But maybe I should stop avoiding the guy walking around my neighborhood reassessing the homes; maybe I have it all wrong and he is looking to lower my taxes. Sure he is! I would like to see a little more detail on this plan. Right now I am less than excited.