Data classification and the need for ontology…

How many organizations are struggling with data classification and building an ILM (Information Lifecycle Management) strategy? In the storage industry today we often talk about the differences between ILM and HSM (Hierarchical Storage Management), but does taxonomy actually provide enough to realize true ILM? Rather than taxonomy we should be addressing ontology. Ontology, as defined in the realm of computer science, is a data model that represents a domain and is used to reason about the objects in that domain and the relations between them. The categorization systems in use today were designed to optimize linear seek time, not to categorize the intellectual aspects of information. Classification and categorization techniques, while presented as organizing information, are actually categorizing the physical objects that contain ideas or information. The industry is attempting to leverage traditional categorization methods by using tags, which create metadata that tries to depict the ideas and information inside the containers. More intelligent categorization methods apply lexicons in an attempt to automate the generation of metadata. Again, I find it odd that we, as an industry obsessed with data classification and lifecycle management, do not address ontology and our approach to ontology on a daily basis.

Ontology would need to consider owners, users, participants, the openness of the domain, the potential for the control set to be altered, and signal loss. The storage industry has avoided true ontology because the undertaking is massive. Until an ontological method for classification and categorization is developed, can we ever really achieve true ILM?

A thesaurus of terms, words or tags is an absolute requirement to enable true ILM. A canonical example: imagine one user searches the web (the largest known corpus of data) for “Movie” and another user searches a repository for “Cinema”. Would the results be the same? Most likely, if the search is of a full text index, the answer would be no. The reliance on tags that categorize a document using multiple words or terms makes it difficult to enforce and deliver true plug-and-play categorization and ILM. We also have to consider signal loss: imagine a search of the web for “… Politics” and “… Agenda”. While the two might appear to be synonymous, they may or may not be.
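To make the thesaurus idea concrete, here is a minimal Python sketch, not a real classification engine. The synonym table, the document names, and the helper functions are all hypothetical, invented purely for illustration. It shows how mapping every tag and every query term to one canonical concept lets a search for “Movie” and a search for “Cinema” return the same documents.

```python
# Hypothetical thesaurus: every synonym maps to one canonical concept.
THESAURUS = {
    "movie": "film",
    "cinema": "film",
    "film": "film",
    "motion picture": "film",
}


def canonical(term):
    """Map a tag or query term to its canonical concept (itself if unknown)."""
    key = term.strip().lower()
    return THESAURUS.get(key, key)


def build_index(documents):
    """Index document names by the canonical form of each of their tags."""
    index = {}
    for name, tags in documents.items():
        for tag in tags:
            index.setdefault(canonical(tag), set()).add(name)
    return index


def search(index, term):
    """Return the set of documents whose tags share the term's concept."""
    return index.get(canonical(term), set())


# Hypothetical tagged documents.
DOCUMENTS = {
    "review.doc": ["Movie"],
    "listings.pdf": ["Cinema"],
    "budget.xls": ["Finance"],
}
INDEX = build_index(DOCUMENTS)

print(search(INDEX, "Movie") == search(INDEX, "Cinema"))
```

A flat synonym table like this does nothing about the signal loss problem: it would happily treat “Politics” and “Agenda” as identical if someone listed them that way, which is exactly why a richer ontological model is needed.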

This is a complex problem that is not easily addressed, but I believe there is a definite long term requirement for a transition to an ontological approach.

-RJB

Embrace or disrupt?

So I am sitting on the plane on my way from the east coast to the west coast listening to a speech that Dave DeWalt gave in April 2006 at the Software 2006 Conference in Santa Clara, California. Dave DeWalt is discussing decisions to embrace or disrupt the stack (software stack). He referenced a statistic that I found very interesting: 75% of the profit in the software marketplace comes from three companies and 50% of that number is generated by Microsoft. It sounds crazy but I have no doubt that it is true. Some organizations have embraced the Microsoft model and identified areas where they can play or specific market segments where they can specialize. Google, on the other hand, has decided to leverage its lead over Microsoft’s MSN division to disrupt and hopefully capture the desktop space. Google is attacking the desktop market with applications like Gmail, Google Spreadsheets, and the most recent application, Writely (an online word processor). There is also talk that Google will release a Linux distribution based on the Debian-derived distro Ubuntu; the rumor is that this distro will be called Gubuntu.

I found Dave’s talk interesting because he referenced a couple of my favorite books, “The World Is Flat” by Thomas Friedman and “The Only Sustainable Edge” by John Hagel III and John Seely Brown. Dave also hits on many of the topics that Clayton M. Christensen discusses in another one of my favorite books, “The Innovator’s Dilemma”.

-RJB

ATA vs. VTL, is there a right or wrong answer?

For years customers have been facing problems with backup windows, Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO). The falling cost of storage has fueled the use of disk to increase the speed of backups and provide a potential solution to backup window, RPO and RTO issues. Customers today are leveraging disk based backup solutions to augment their existing tape solutions in an effort to decrease backup time and recovery time as well as prolong investments in aging tape technology. Disk based backup solutions have taken numerous forms. Leveraging ATA or LC-FC (low cost fibre channel) is a popular low cost solution that can normally be implemented in conjunction with an enterprise storage consolidation or integrated into an existing storage strategy.

The use of ATA or LC-FC can be a very economical introduction into the world of B2D (backup to disk). Often larger organizations shy away from traditional backup to disk because of the associated process change required and the potential increase in operational complexity. Organizations looking to benefit from the speed of disk without the need for process change may consider a VTL (virtual tape library) strategy. Virtual tape also offers operational simplicity and, in many cases, native IP based replication, compression and/or data de-duplication. VTL devices are purpose built and optimized for backup, which makes VTL a compelling choice. The caveat with VTL devices is that the simplicity of an emulated tape device also brings many of the limitations and licensing costs associated with physical tape.

Organizations should also consider SLA requirements that typically encompass backup windows, RPO and RTO. What will the backup data flow be once a B2D solution is implemented? Will the architected B2D solution meet all the requirements of the SLA? In most cases the current state may look like D2T (production disk to tape), D2T2T (production disk to onsite tape copy to offsite tape copy) or D2Clone/Snap2Tape (disk to array based clone/snap to tape). Once a B2D strategy is employed the flow may look like any one of the following: D2D2T, D2VT2T, or D2Clone/Snap2D2T, etc… The point here is that there are more ways than ever to implement backup solutions today. The pros and cons of each solution should be considered relative to the desired and/or required SLA, RPO and RTO.

-RJB

EMC snatches up RSA Security for 2.1 billion…

The storage, or should I say security, market is really interesting right now. It appears that storage and security are converging as fast as, if not faster than, IP and storage. The trend was initiated by Symantec’s acquisition of Veritas for 13 billion and now EMC’s acquisition of RSA Security for 2.1 billion. It’s not just the big boys crossing over; security company SonicWall is dabbling in the storage space. SonicWall is breaking into the storage market with a solution focused on backup and recovery. The SonicWall appliance is targeted at the SMB space and provides features like CDP, file versioning, SQL support, Exchange support and bare metal recovery at an extremely affordable price.

It remains to be seen what EMC will do with RSA Security but I can see key management and products like Reporting and Compliance Manager being leveraged very quickly. With all eyes on data security and compliance the acquisition of RSA puts EMC in the role of innovator in a sector that is very hot.

The RSA technology should provide EMC a huge advantage as it begins to architect end-to-end solutions where data security and chain of custody can be guaranteed. Combining key authentication with logging tools like Network Intelligence or SenSage for security event management could provide a level of data security that the data security market desperately needs.

Finally, one has to wonder what this will mean for data security appliance makers who rely on EMC as a significant source of revenue. How long will it be before EMC leverages the RSA technology to embed data encryption?

-RJB

Adding RSS feeds to personalized search pages…

A number of people have asked me how to add content from gotitsolutions.org, storagesquid.org, recoverymonkey.org and vi411.org to their personalized MSN, Yahoo, Google, etc… home pages. Each of these providers allows users to add RSS feeds as content to their home page.

I have simplified this process. For those of you who use My Yahoo, MSN, Google or AOL as your personalized home page (this should be most of you), you can just click the add button under syndicate on my blog page and the RSS feed will be added to your home page.

If you use another service I am making the assumption that you are technical and you know how to add an RSS feed to your page. 🙂

After adding the feed URL you should be able to return to your home page and the new content should be listed. Below is an example of my google home page.

google homepage

Good luck.
-RJB

Compliance and Collisions

Over the past couple of years compliance has become a major buzz word in the storage industry. Regulatory bodies such as the SEC and Federal Government have mandated that organizations begin adhering to the over 16,000 worldwide regulations.

There are numerous technologies that the storage industry has responded with; many of them legacy technologies and many of them new more advanced technologies which have changed the face of compliance and long term archiving. While traditional technologies such as WORM[1] optical and WORM tape continue to play an active role in compliance and long term archiving, CAS (Content Addressable Storage) has emerged as the technology of choice. Organizations can now cost effectively host petabytes of archive data online with reliability, availability, manageability, and serviceability that surpasses that of traditional WORM devices.

While Content Addressable Storage has revolutionized the compliance and long term archiving market place there have been some concerns raised in the past 6 to 9 months. This article will examine at a high level the workings of Content Addressable Storage and some of the associated concerns.

The basic premise for Content Addressable Storage is that data sent to a compliant device is hashed and stored on the device. The hash acts as a digital fingerprint for the data, in theory the only way to generate a duplicate hash is to hash the exact same data. The concept of digitally fingerprinting data has provided benefits beyond compliance and guaranteed authenticity. Hashing has facilitated single instance storage so data with an identical fingerprint can be deduplicated; this type of functionality has a positive cascading effect throughout an organization. CAS vendors provide hashing algorithm options with their products, the more common hashing algorithms are MD5[2], SHA-1[3], SHA-256, and SHA-512.

Criteria              MD5       SHA-1      SHA-256
Key Length            128 bit   160 bit    256 bit
Maximum Size of data  Infinite  2^64 bits  2^64 bits
Main advantage        Speed     Security   Perceived as more secure
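The fingerprinting and single instance storage idea can be sketched in a few lines of Python. This is a toy illustration only: the `ContentStore` class and its methods are hypothetical names made up for this example, and it assumes SHA-256 as the hashing option. It does not represent any vendor's CAS implementation.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the SHA-256 digest is the address."""

    def __init__(self):
        self._objects = {}

    def put(self, data):
        """Store bytes and return their fingerprint, which is the address."""
        address = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same address, so it is kept once.
        self._objects.setdefault(address, data)
        return address

    def get(self, address):
        """Return stored bytes after re-checking them against the address."""
        data = self._objects[address]
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its fingerprint")
        return data

    def __len__(self):
        return len(self._objects)


store = ContentStore()
first = store.put(b"quarterly report")
duplicate = store.put(b"quarterly report")
revised = store.put(b"quarterly report v2")

print(first == duplicate, first != revised, len(store))
```

Writing the same content twice yields the same address and only one stored object, which is the deduplication benefit described above. Re-hashing on read is what lets the store vouch for authenticity.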

Figure 1 displays the relative performance and response time of the MD5, SHA-1 and SHA-512 algorithms. It is important to note the increased overhead associated with larger hash sizes.

Figure 1: Hashing performance metrics

image001

Source: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnbda/html/bdadotnetarch15.asp

Recently collisions have been discovered for both MD5 and SHA-1. Essentially these collisions were produced in the lab and were found by generating data that produces the same MD5 or SHA-1 hash. Figure 2 is an example of two hex data strings that cause an MD5 collision. Figure 3 is the Perl source that can be used to demonstrate the collision.

Figure 2: MD5 Collision Example

image002

Figure 3: Example Perl Script to demonstrate MD5 collision (requires Digest::MD5 and Digest::SHA1)

image003

The above example represents a hash collision, H(M) = H(M1). This hash collision was lab generated by looking for random theoretical data that would cause a hash collision.

There are two common cryptosystem attacks that this article will concentrate on, the brute force attack and the birthday attack. A brute force and birthday attack both solve for collisions by generating M and M1 until there is a hash collision.

The MD5 hash function produces 128-bit values, whereas SHA-1 produces 160-bit values and SHA-256 produces 256-bit values. The question becomes: how many bits do we need for security? Practically, 2^128, 2^160 and 2^256 are all more than large enough to thwart a brute force attack that simply searches randomly for colliding pairs (M, M1). However, a birthday attack reduces the size of the search space to roughly the square root of the original size. Thus, MD5 has roughly the same resistance to the birthday attack as a cryptosystem with 64-bit keys would have to a brute force attack. Similarly, SHA-1’s effective size in terms of birthday attack resistance is only 80 bits, etc.

The birthday attack is named for the birthday paradox, which simply states that there is approximately a 50-50 chance that two people in a room of 23 strangers have the same birthday. For a complete description of the birthday paradox see http://en.wikipedia.org/wiki/Birthday_paradox.
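The 50-50 figure is easy to check. The short Python sketch below assumes 365 equally likely birthdays and ignores leap years; the function name is just an illustrative choice. It multiplies out the chance that everyone in the room has a distinct birthday and subtracts that from one.

```python
def shared_birthday_probability(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


p22 = shared_birthday_probability(22)
p23 = shared_birthday_probability(23)
print(f"22 people: {p22:.3f}, 23 people: {p23:.3f}")
```

The probability crosses one half between 22 and 23 people, which is where the familiar "room of 23" comes from.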

A birthday attack essentially creates random messages, takes their hash value, and checks to see if that hash value has been encountered before. For MD5, as an example, an attacker could expect to find collisions after trying 2^64 messages. Given today’s computing power, this is a difficult, but not impossible task.
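The search loop is simple enough to sketch in Python, with two loudly stated assumptions. First, the hash is truncated to 24 bits so the search finishes in moments; against a full 128-bit MD5 the same loop would need on the order of 2^64 tries. Second, the messages come from a counter rather than a random source, purely to keep the example repeatable. The function names are illustrative.

```python
import hashlib


def truncated_md5(message, bits):
    """Return the top `bits` bits of the MD5 digest as an integer."""
    digest = hashlib.md5(message).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def birthday_collision(bits):
    """Hash distinct messages until two share the same truncated hash."""
    seen = {}  # truncated hash value -> message that produced it
    counter = 0
    while True:
        message = b"message-%d" % counter
        value = truncated_md5(message, bits)
        if value in seen:
            return seen[value], message, counter + 1
        seen[value] = message
        counter += 1


first, second, tries = birthday_collision(24)
print(first, second, tries)
```

With a 24-bit hash there are about 16.7 million possible values, yet a collision typically turns up after a number of tries on the order of 2^12, a few thousand. That is the square root effect described above.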

While collisions for both MD5 and SHA-1 have been found using both brute force and birthday attacks, these are not real world examples. The concept of generating artificial data until a collision is found in no way threatens the authenticity or integrity of existing data.

The two hash attacks that can cause authenticity and data integrity problems are a 1st preimage and 2nd preimage attack.

A 1st preimage attack is best described as given X (X represents an existing hash) solve for M, or by the equation H(M) = X. A 1st preimage attack has never been successful against any of the mentioned hashing algorithms.

A 2nd preimage attack can be described as: given M, solve for a different message M1 where the hashes are equal. This can be represented by the equation H(M) = H(M1). A 2nd preimage attack has also never been successful against any of the mentioned hashing algorithms.

While brute force and birthday attacks provide reason for concern for both MD5 and SHA-1, the key question to consider is what damage can generating two bogus messages with the same hash do? Why is this important?

Imagine for a moment that an adversary constructs two messages with the same hash where one message appears legitimate or innocuous. For example, suppose the attacker discovers that the message “I, Bob, agree to pay Charlie $5000.00 on 4/12/2005.” has the same hash as “I, Bob, agree to pay Charlie $18542841.54 on 9/27/2012.” Charlie could then try to get the victim to digitally sign the first message (e.g., by purchasing $5000 of goods). Charlie would then claim that Bob actually signed the second message, and “prove” this assertion by showing that Bob’s signature matches the second message. While in theory this is possible, it would require a 2nd preimage attack, which has never been successfully perpetrated against any of the aforementioned algorithms.

A SHA-1 attack requires an estimated 2^69, or approximately 590 billion billion, hash computations. The amount of computational power required to generate a hash collision is far beyond the average desktop computer. To put this in perspective, using 10,000 custom ASICs[4] that can each perform 2 billion hash operations per second, the attack would take about one year. Moore’s Law[5] will make the attack more practical over time and there are also community initiatives that may make it more feasible in the near future (http://www.certainkey.com/dnet/).
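The one year estimate can be reproduced with simple arithmetic. The Python sketch below uses only the figures quoted above (2^69 hash computations, 10,000 ASICs, 2 billion hashes per second each) and assumes a 365 day year; the variable names are illustrative.

```python
HASH_COMPUTATIONS = 2 ** 69                  # estimated work for the attack
ASIC_COUNT = 10_000                          # custom chips working in parallel
HASHES_PER_SECOND_PER_ASIC = 2_000_000_000   # 2 billion per chip
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Total throughput of the whole cluster, then time to finish the work.
hashes_per_second = ASIC_COUNT * HASHES_PER_SECOND_PER_ASIC
seconds = HASH_COMPUTATIONS / hashes_per_second
years = seconds / SECONDS_PER_YEAR

print(f"{HASH_COMPUTATIONS:.2e} hashes, roughly {years:.2f} years")
```

2^69 is roughly 5.9 x 10^20, and at 2 x 10^13 hashes per second the run comes out just under one year.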

In conclusion, there is no reason to believe that we are even close to a successful 1st preimage or 2nd preimage attack. Practically speaking, there is no cause for concern over the use of either the MD5 or SHA-1 algorithm. Customers who are concerned with the MD5 and/or SHA-1 algorithms should inquire about alternative hashing algorithms; most vendors will support multiple.


[1] Write Once Read Many
[2] Message-Digest algorithm 5
[3] Secure Hash Algorithm
[4] Application Specific Integrated Circuits
-RJB

Vendor Neutrality?

Throughout the IT marketplace VARs and system integrators spend a significant amount of time discussing vendor neutrality. Is it really possible to be vendor neutral? What size VAR or systems integrator has the ability to execute a vendor neutral strategy?

Companies such as Accenture and CSC have the critical mass to truly execute on a vendor neutral plan. In my travels I often run into companies with employees numbering in the 10s, not even 100s, who tout a vendor neutral approach. How many products can a 40 person company have an in depth understanding of? With so many products in the marketplace all competing to solve similar problems, I would prefer a vendor who intimately understands the solutions they are proposing, which in turn should increase the probability of a successful implementation.

-RJB

Holographic storage?

Ironically I was sitting at lunch today and was asked by one of my associates about holographic storage. This is ironic because I was reading about the technology just this weekend.

To put the future of storage technology into perspective a ferroelectric storage drive device the size of an iPod nano or 3.5 inch drive could hold enough MP3 music to play for 300,000 years without repeating a song or enough DVD quality video to play movies for 10,000 years without repetition. I would say this technology would be pretty appealing to the storage market.

What is available today? InPhase Technologies plans to release their first generation drive at the end of 2006. This drive will use write-once (WORM) discs designed for fixed content archiving, with a capacity of about 300 GB. Re-writable discs with 1.6 TB capacities are planned for release in 2009. As well as enterprise class storage, InPhase is also considering small consumer targeted devices with capacities ranging from 2 GB on a postage stamp to 210 GB on a credit card. They predict that the 200R will have a media shelf life of about 50 years, compared to 10 years for tape.

InPhase claims that the Tapestry system has a recording density of 200 Gigabits / square inch and can read data at 27 MB/s. Contrast this with traditional magnetic disk that has a recording density of 120 Mbpsi and the newer perpendicular recording disks with a density of 240 Mbpsi. High performance tapes can read data at 40 MB/s or more.

Tapestry uses a twin polymer implementation for the storage medium. The recording medium polymer is dissolved inside a solid matrix polymer. This 2-chemistry combination is a 1.5 mm thick recording material that is sandwiched between two plastic plates; there is no metallic layer such as used in DVD storage. Data is stored by crossing two separate laser beams inside the polymer, which records pages of data. One laser beam writes the data and the second laser beam is used as a reference beam. Individual pages can hold approximately 1 megabit, and multiple pages are recorded by varying the angle of incidence and wavelength of the reference beam. 252 ‘pages’ are collected together into one ‘book’, and fifteen books or 3780 pages can all be stored in the same piece of recording material. The media used today will most often be referred to as HVD (holographic versatile disc).

While holographic and ferroelectric technology is extremely interesting, most of these technologies are in the incubation phases of research and development. While the technology does hold the promise of revolutionizing the storage market, I would not hold off on the 500 GB magnetic hard disk purchase for a storage device that holds 15 Terabytes per square inch. I think it may be a while.

-RJB

Desktop blogging application

This weekend I was sitting at home and doing a significant amount of blogging from my Adirondack chair on my lawn and getting fairly frustrated with the HTML online editor that WordPress and Serendipity offer. As an alternative I started typing the blogs in AbiWord and cutting and pasting them into the online HTML editor when I was ready to post; not the most elegant solution but better than the previous method. Late on Saturday night I began searching around for a desktop blogging application and tested Qumana, w.bloggar, and BlogDesk. By far my favorite is BlogDesk. It just works and offers all the right features.

-RJB

The devil’s in the mathematical detail…

Looking forward to my tax relief in New Jersey 🙂 What a joke. A 1% increase in sales tax is now the great hope to lower my property tax in New Jersey. This is Jon Corzine’s master plan; it is hard to believe that he was the Chairman and CEO of Goldman Sachs from 1994 to 1999. Let’s just do some simple math. To lower taxes an average of $1000.00 a year per NJ resident (sounds like a number that would get me excited) with a .01 (1%) increase per dollar in sales tax, each resident would have to spend an average of $100,000.00 a year on sales taxable items. Is it me or does this seem ludicrous? Let’s assume the state does a phenomenal job investing (note this oxymoronic statement, when was the last time the local, state or federal government did a phenomenal job investing?) and this is a multi-year plan. I would still rather have my 1% back.

It is also my understanding that only a portion of the 1% is going toward property tax breaks, some portion will be going towards the already depleted state workers pension fund (another fine example of the state government investing geniuses at work).

Some tax facts about the state of New Jersey before we get to my conclusion.

New Jersey is one of the 37 states that collect property taxes at both the state and local levels. As in most states, local governments collect far more. New Jersey’s localities collected $18,225,594,000 in property taxes in fiscal year 2004, which is the latest year the Census Bureau published state-by-state property tax collections. At the state level, New Jersey collected $3,660,000 in property taxes during FY 2004, making its combined state/local property taxes $18,229,254,000. At $2,099, New Jersey’s combined per capita collections were the highest in the nation.

Estimated at 10.8% of income, New Jersey’s state/local tax burden percentage ranks 17th highest nationally, above the national average of 10.6%. New Jersey tax payers pay $5,234 per-capita in state and local taxes.

New Jersey’s personal income tax system consists of six brackets and a top rate of 8.97% kicking in at an income level of $500,000. Among states levying personal income taxes, New Jersey’s top rate ranks 6th highest nationally. New Jersey’s 2004 individual income tax collections were $852 per person.

New Jersey levies a 6% general sales or use tax on consumers, which is above the national median of 5%. State and local governments combined collect approximately $721 per capita in general sales taxes, ranking 31st highest nationally. New Jersey’s gasoline tax stands at 14.5 cents per gallon and ranks 4th lowest nationally. New Jersey’s cigarette tax stands at $2.40 per pack of twenty and ranks 2nd highest nationally. The sales tax was adopted in 1966, the gasoline tax in 1927 and the cigarette tax in 1948.

So at $721 per capita in sales tax collected at 6%, per capita taxable spending comes to about $12,016.67. Assuming the entire 1% tax increase went toward lowering property taxes, this would yield a potential reduction of about $120.17 which, I guess, if you apply that to the average state/local property taxes of $2,099.00 would be a 5.7% reduction.
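For anyone who wants to check my math, here it is as a few lines of Python. The inputs are the figures quoted in this post. The one assumption worth flagging is that the entire $721 per capita is collected at the 6% rate, which is my simplification; the variable names are just labels.

```python
CURRENT_RATE = 0.06               # existing general sales tax rate
RATE_INCREASE = 0.01              # the proposed 1% increase
SALES_TAX_PER_CAPITA = 721.00     # sales tax collected per person
PROPERTY_TAX_PER_CAPITA = 2099.00 # combined state/local property tax
TARGET_RELIEF = 1000.00           # the relief that would get me excited

# Taxable spending each resident would need to fund $1000 of relief.
spending_needed = TARGET_RELIEF / RATE_INCREASE

# Taxable spending implied by today's collections, then the extra 1% on it.
taxable_spending = SALES_TAX_PER_CAPITA / CURRENT_RATE
extra_revenue = taxable_spending * RATE_INCREASE
reduction = extra_revenue / PROPERTY_TAX_PER_CAPITA

print(f"Spending needed for $1000 relief: ${spending_needed:,.2f}")
print(f"Implied taxable spending: ${taxable_spending:,.2f}")
print(f"Extra revenue per capita: ${extra_revenue:,.2f}")
print(f"Property tax reduction: {reduction:.1%}")
```

The gap between the $100,000 of spending needed and the roughly $12,000 actually implied is the whole story: about $120 of relief, not $1000.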

I won’t hold my breath for my property tax decrease. But maybe I should stop avoiding the guy walking around my neighborhood reassessing the homes; maybe I have it all wrong and he is looking to lower my taxes. Sure he is! I would like to see a little more detail on this plan. Right now I am less than excited.

-RJB