Compliance and Collisions

Over the past couple of years compliance has become a major buzz word in the storage industry. Regulatory bodies such as the SEC and Federal Government have mandated that organizations begin adhering to the over 16,000 worldwide regulations.There are numerous technologies that the storage industry has responded with; many of them legacy technologies and many of them new more advanced technologies which have changed the face of compliance and long term archiving. While traditional technologies such as WORM[1] optical and WORM tape continue to play an active role in compliance and long term archiving, CAS (Content Addressable Storage) has emerged as the technology of choice. Organizations can now cost effectively host petabytes of archive data online with reliability, availability, manageability, and serviceability that surpasses that of traditional WORM devices.While Content Addressable Storage has revolutionized the compliance and long term archiving market place there have been some concerns raised in the past 6 to 9 months. This article will examine at a high level the workings of Content Addressable Storage and some of the associated concerns.The basic premise for Content Addressable Storage is that data sent to a compliant device is hashed and stored on the device. The hash acts as a digital fingerprint for the data, in theory the only way to generate a duplicate hash is to hash the exact same data. The concept of digitally fingerprinting data has provided benefits beyond compliance and guaranteed authenticity. Hashing has facilitated single instance storage so data with an identical fingerprint can be deduplicated; this type of functionality has a positive cascading effect throughout an organization. CAS vendors provide hashing algorithm options with their products, the more common hashing algorithms are MD5[2], SHA-1[3], SHA-256, and SHA-512.

Criteria MD5 SHA-1 SHA-256
Key Length 128 bit 160 bit 256 bit
Maximum Size of data Infinite 2^64 bits 2^64 bits
Main advantage Speed Security Perceived as more secure

Figure 1 displays the relative performance and response time of the MD5, SHA-1 and SHA-512 algorithms. It is important to note the increased overhead associated with larger hash sizes.

Figure 1: Hashing performance metrics



Recently collisions have been discovered for both MD5 and SHA-1. Essentially these collisions were produced in the lab and were found by generating data that produces the same MD5 or SHA-1 hash. Figure 2 is an example of two hex data stings that cause an MD5 collision. Figure 3 is the PERL source that can be used to prove that the collision.

Figure 2: MD5 Collision Example


Figure 3: Example Perl Script to demonstrate MD5 collision (requires Digest::MD5 and Digest::SHA1)


The above example represents a hash collision, H(M) = H(M?1). This hash collision was lab generated by looking for random theoretical data that would cause a hash collision.

There are two common cryptosystem attacks that this article will concentrate on, the brute force attack and the birthday attack. A brute force and birthday attack both solve for collisions by generating M and M1 until there is a hash collision.

The MD5 hash function produces 128-bit values, whereas SHA?1 produces 160-bit values and SHA-256 produces a 256-bit value. The question becomes how many bits do we need for security? Practically 2128, 2160 and 2256 are all more than large enough to thwart a brute force attack that simply searches randomly for colliding pairs (M,M1). However, a Birthday Attack reduces the size of the search space to roughly the square root of the original size. Thus, MD5 has roughly the same resistance to the birthday attack as a cryptosystem with 64-bit keys would have to a brute force attack. Similarly, SHA?1?s effective size in terms of birthday attack resistance is only 80-bits, etc?.

The birthday attack is named for the birthday paradox, which simply states that there is approximately a 50?50 chance that two people in a room of 23 strangers have the same birthday. For a complete description of the birthday paradox click the following link

A birthday attack essentially creates random messages, takes their hash value, and checks to see if that hash value has been encountered before. For MD5, as an example, an attacker could expect to find collisions after trying 264 messages. Given today’s computing power, this is a difficult, but not impossible task.

While collisions to both MD5 and SHA-1 have been found using both brute force and birthday attacks these are not real world examples. The concept of generating artificial data until a collision is found in no way threatens the authenticity or integrity of existing data.

The two hash attacks that can cause authenticity and data integrity problems are a 1st preimage and 2nd preimage attack.

A 1st preimage attack is best described as given X (X represents an existing hash) solve for M, or by the equation H(M) = X. A 1st preimage attack has never been successful against any of the mentioned hashing algorithms.

A 2nd preimage attack can be described as a given M solve for M1 where the hashes are equal. This can be represented by the equation H(M) = H(M1). A 2nd preimage attack has also never been successful against any of the mentioned hashing algorithms.

While brute force and birthday attacks provide reason for concern for both MD5 and SHA-1 the key is to consider is what damage can generating two bogus messages with the same hash do? Why is this important?

Imagine for a moment that an adversary constructs two messages with the same hash where one message appears legitimate or innocuous. For example, suppose the attacker discovers that the message “I, Bob, agree to pay Charlie $ 5000.00 on 4/12/2005.” has the same hash as “I, Bob, agree to pay Charlie $18542841.54 on 9/27/2012.” Charlie could then try to get the victim to digitally sign the first message (e.g., by purchasing $5000 of goods). Charlie would then claim that Bob actually signed the second message, and “prove” this assertion by showing that Bob’s signature matches the second message. While in theory this possible it would require a 2nd preimage attack which has never been successfully perpetrated against any of the aforementioned algorithms.

A SHA-1 attack requires an estimated 269 or approximately 590 billion hash computations. The amount of computational power required to generate a hash collision is far beyond the average desktop computer. To put this in perspective using 10,000 custom ASICs[4] that can each perform 2 billion hash operations per second, the attack would take about one year. Moore’s Law[5] will make the attack more practical over time and there are also community initiatives that may make more feasible in the near future (

In conclusion there is no reason to believe that we are even close to a successful 1st preimage or 2nd preimage attack. Pratically speaking there is no cause for concern over the use of either the MD5 or SHA-1 algorithm. Customers who are concerned with MD5 and/or SHA-1 algorithms should inquire about alternative hashing algorithms, most vendors will support multiple.

[1] Write Once Read Many
[2] Message-Digest algorithm 5
[3] Secure Hash Algorithm
[4] Application Specific Integrated Circuits

Leave a Reply

Your email address will not be published. Required fields are marked *