In line with my previous post “The simple value of archiving…” I thought I would post a CAS (Content Addressable Storage) overview, because again it was something I presented during the “Evolution of Disaster Recovery” seminar series that I felt we could have gone much deeper on. Unlike traditional storage, “Content Addressable Storage” (CAS) uses a content address, or globally unique ID, to represent the binary contents of the file, the location of the file, etc. This is in contrast to traditional file systems such as UFS and NTFS, which use file names to identify files. A content address is typically calculated by running the file through a digest (hash) algorithm (e.g. MD5, SHA-1, SHA-256, SHA-512), and the resulting hash identifies the file contents.
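To make the idea of a content address concrete, here is a minimal Python sketch using the standard hashlib module. The function name content_address and the default choice of SHA-256 are my own assumptions for illustration; a real CAS product may use a different algorithm and address format.

```python
import hashlib

def content_address(data: bytes, algorithm: str = "sha256") -> str:
    """Return a hex digest computed only from the bytes passed in.

    Hypothetical helper for illustration: the file name and location
    play no part in the result.
    """
    return hashlib.new(algorithm, data).hexdigest()

# Same bytes always give the same address; different bytes give a different one.
address_dad = content_address(b"Hello dad")
address_mom = content_address(b"Hello mom")
print(address_dad)
print(address_mom)
print(content_address(b"Hello dad", "md5"))
```

Note that the input is just bytes. Nothing about where the data lives or what it is called goes into the calculation, which is the whole point of addressing by content.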
Exercise 1:
If you are curious to see how hashing works, you can download md5sum from here. Next, open MS Word, Notepad, VI, etc., type something, and save the doc as “filename.doc”. Now open a command prompt (DOS window) and run “md5sum filename.doc”. This will return something like “b3a6616fb5cee0f1669b1d13dd4c98cb *filename.doc”. Now open “filename.doc” and change a couple of letters. Do not delete or add characters, because this will change the file size (which makes the demonstration a little less powerful). For instance, if you typed “Hello dad”, change it to “Hello mom” and save as “filename.doc”. The file should be identical in size to the previous version. Now run “md5sum filename.doc” again. The output is a different identifier, because md5sum examines the binary makeup of the file, not the file name, location, etc.
Right now on your file system you only have one document called “filename.doc”, which contains “Hello mom”; the version containing “Hello dad” is gone. If this had been saved to a content addressable storage device, both instances would have been saved, because although they have the same name they are in fact unique. After this exercise you can probably see the value for compliance, corporate governance, revision control, etc.
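If you would rather not install md5sum, the same exercise can be sketched in Python. This is only an approximation of the steps above: it writes plain bytes rather than a real Word document, uses a temporary directory, and the helper name md5_of_file is made up for the example.

```python
import hashlib
import os
import tempfile

def md5_of_file(path: str) -> str:
    """Hash the file's bytes, roughly what md5sum reports for a file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "filename.doc")

    # First version of the document.
    with open(path, "wb") as f:
        f.write(b"Hello dad")
    first_size, first_hash = os.path.getsize(path), md5_of_file(path)

    # Overwrite with same-length content, as in the exercise.
    with open(path, "wb") as f:
        f.write(b"Hello mom")
    second_size, second_hash = os.path.getsize(path), md5_of_file(path)

print("same size:", first_size == second_size)
print("same hash:", first_hash == second_hash)
```

The name and size stay the same across both saves, yet the hash changes, and on the ordinary file system only the second version survives.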
Exercise 2:
Create another doc named “filename.doc” and run “md5sum filename.doc”. Now copy “filename.doc” to “filename2.doc” and “filename3.doc”, and run “md5sum ….” on these two new instances of filename.doc. You will notice that the hash is identical. On a traditional file system we have consumed 3x the space required, because the only identifier is the file name, which is unique. On a CAS device the file names would be stored as pointers to the single stored copy of “filename.doc”. This is what we call single instance storage. In practice, this dramatically reduces the storage capacity required by eliminating much of the duplication present on most traditional file systems.
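Single instance storage can be sketched as a toy in-memory store. Everything here (the ToyCasStore class, its put and get methods, the use of SHA-256) is hypothetical and only meant to show the idea of names pointing at one stored copy; it is not how any particular CAS product is implemented.

```python
import hashlib

class ToyCasStore:
    """Toy single-instance store: content is kept once per unique hash."""

    def __init__(self):
        self.blobs = {}  # content address -> bytes (stored once)
        self.names = {}  # file name -> content address (a pointer)

    def put(self, name: str, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        # Only store the bytes if this content has not been seen before.
        self.blobs.setdefault(address, data)
        self.names[name] = address
        return address

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]

store = ToyCasStore()
for name in ("filename.doc", "filename2.doc", "filename3.doc"):
    store.put(name, b"Hello mom")

print("names stored:", len(store.names))
print("copies of content stored:", len(store.blobs))
```

Three names end up pointing at one copy of the content, where a traditional file system would have written the same bytes three times.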
The following graphic is a simplistic representation of how “Content Addressable Storage” works:
Hope this adds some additional clarification to the discussions we had at the seminars. If you have any comments, concerns, corrections or questions, please comment on this post.
-RJB