Today we finished the 3-week, 9-city “Evolution of Disaster Recovery” tour with our most interactive session thus far. Congratulations, San Diego, you were by far one of the most interactive groups. We are thinking about what our next road show topic might be, maybe “Intelligent Information Management for the SMB”. I am interested in what the folks who attended this road show would like to hear about. Thanks again for taking time out of your busy schedules to come and listen to us. If you have any suggestions for the next show topic, please do not hesitate to post a comment. Thanks again!
-RJB
In line with my previous post “The simple value of archiving…”, I thought I would post a CAS (Content Addressable Storage) overview, because again it was something that I presented during the “Evolution of Disaster Recovery” seminar series where I felt we could have gone much deeper. Unlike traditional file systems such as UFS and NTFS, which use file names to identify files, “Content Addressable Storage” (CAS) uses a content address, or globally unique ID, to represent the binary contents of the file, the location of the file, etc. A content address is typically calculated by running the file contents through a hash (message digest) algorithm (e.g. MD5, SHA-1, SHA-256, SHA-512); the resulting hash identifies the file contents.
Exercise 1:
If you are curious to see how hashing works, you can download md5sum from here. Next, open MS Word, Notepad, VI, etc., type something and save the doc as “filename.doc”. Now open a command prompt (DOS window) and run “md5sum filename.doc”. This will return something like “b3a6616fb5cee0f1669b1d13dd4c98cb *filename.doc”. Now open the file “filename.doc” and change a couple of letters. Do not delete or add characters, because this will change the file size (which makes the demonstration a little less powerful). For instance, if you typed “Hello dad”, change it to “Hello mom” and save as “filename.doc”. The file should be identical in size to the previous version. Now run “md5sum filename.doc” again. The output is a different identifier, because md5sum examines the binary makeup of the file, not the file name, location, etc. Right now on your file system you only have one document called “filename.doc”, which contains “Hello mom”; the version containing “Hello dad” is gone. If this had been saved to a content addressable storage device, both instances would have been saved, because although they have the same name they are in fact unique. After this exercise you can probably see the value for compliance, corporate governance, revision control, etc.
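For readers without md5sum handy, the same exercise can be sketched in Python with the standard hashlib module. This is only an illustration: it hashes the two strings directly rather than saved .doc files, and the helper name md5_hex is made up for the example.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

original = b"Hello dad"
modified = b"Hello mom"

# Same length, so size alone cannot tell the two versions apart.
print(len(original), len(modified))

# The digests differ because the content differs; no file name is involved.
print(md5_hex(original))
print(md5_hex(modified))
```

The two printed hashes differ even though the byte counts match, which is the point of the exercise.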
Exercise 2:
Create another doc named “filename.doc” and run “md5sum filename.doc”. Now copy “filename.doc” to “filename2.doc” and “filename3.doc”, and run “md5sum ….” on these two new instances of filename.doc. You will notice that the hash is identical. On a traditional file system we have consumed 3x the space required, because the only identifier is the file name, which is unique. On a CAS device the file names would be stored as pointers to the single stored copy of “filename.doc”. This is what we call single instance storage. In practice, single instance storage can dramatically reduce the storage capacity required by eliminating much of the duplication present on most traditional file systems.
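Exercise 2 can also be sketched as a toy single-instance store in Python. This is a minimal illustration of the idea, not how any particular CAS product works; the class SingleInstanceStore and its methods are hypothetical, and MD5 is used only to match the exercise.

```python
import hashlib

class SingleInstanceStore:
    """Toy content-addressed store: one stored blob per unique content."""

    def __init__(self):
        self.blobs = {}   # content address -> file contents
        self.names = {}   # file name -> content address (the "pointer")

    def put(self, name: str, data: bytes) -> str:
        address = hashlib.md5(data).hexdigest()
        # Store the bytes only if this content has not been seen before.
        self.blobs.setdefault(address, data)
        self.names[name] = address
        return address

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]

store = SingleInstanceStore()
content = b"Hello mom"
for name in ("filename.doc", "filename2.doc", "filename3.doc"):
    store.put(name, content)

print(len(store.names))  # 3 file names
print(len(store.blobs))  # 1 stored copy of the content
```

Three names point at one stored blob, so the content is kept once rather than three times.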
The following graphic is a simplistic representation of how “Content Addressable Storage” works:

Hope this adds some additional clarification to the discussions we had at the seminars. If you have any comments, concerns, corrections or questions, please comment on this post.
-RJB
I am flying from the east coast to the west coast to close out the “Evolution of Disaster Recovery” seminar series, with shows 7, 8 and 9 in LA, Orange County and San Diego respectively. As part of the seminar I spent a significant amount of time discussing Backup, Recovery and Archiving (BURA). I thought it might be a worthwhile endeavor to document this a little further, and seeing that I have 6 hours to kill, there is no time like the present.
The following example assumes a 5 week backup rotation, which is fairly common. In the event that you have a backup rotation that archives monthly backups for one year and yearly backups for 7 years, etc., the model grows substantially. Typically, most backup policies consist of incremental backups taken Monday through Thursday with a weekly full backup on Fridays. Tape sets are vaulted for 4 weeks; in week six, week five’s tapes are recycled. This rotation maintains 4 offsite copies at any point in time. Assuming a typical weekly rate of change of 10%, data duplication is massive, which extends backup times and raises cost due to the amount of media required. The following is a graphical representation of a typical 5 week rotation:

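To put rough numbers on the duplication, here is a back-of-the-envelope sketch in Python. Only the 10% weekly change rate and the 4 vaulted sets come from the example above; the 1 TB production size is a made-up figure, and the model assumes each vaulted set holds one full backup plus a week of incrementals that together equal the weekly change.

```python
# Hypothetical inputs: only the 10% weekly change rate and the
# four vaulted weekly sets come from the post.
PRODUCTION_TB = 1.0    # assumed size of the production data set
WEEKLY_CHANGE = 0.10   # weekly rate of change
VAULTED_WEEKS = 4      # tape sets held offsite at any point in time

def offsite_tb(production_tb, weekly_change, vaulted_weeks):
    """Total data offsite: one full plus a week of incrementals per set."""
    weekly_set = production_tb + production_tb * weekly_change
    return vaulted_weeks * weekly_set

def unique_tb(production_tb, weekly_change, vaulted_weeks):
    """Rough upper bound on distinct data: one base copy plus weekly changes."""
    return production_tb + vaulted_weeks * production_tb * weekly_change

total = offsite_tb(PRODUCTION_TB, WEEKLY_CHANGE, VAULTED_WEEKS)
unique = unique_tb(PRODUCTION_TB, WEEKLY_CHANGE, VAULTED_WEEKS)
print(f"offsite: {total:.1f} TB")
print(f"unique: {unique:.1f} TB")
print(f"duplicated: {total - unique:.1f} TB")
```

Under these assumptions, most of what sits in the vault is repeated copies of data that did not change from week to week.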
By introducing an archiving strategy we can greatly reduce the amount of data that is duplicated in the offsite facility and remove stale or unwanted data from the production data set, which greatly improves backup and recovery times. The archive is an active archive, which means that archived data is replaced by a stub (not a shortcut; stubs are not traversed during backups) and moved to an archiving platform of choice such as ATA (Advanced Technology Attachment), NAS (Network Attached Storage), CAS (Content Addressable Storage), tape, optical, etc. The user experience is seamless. A sample of what an archiving strategy might look like is represented by the following graphic:

Some duplication will continue to exist, because we may have frequently accessed data that we choose not to archive. The archive is static; any data that is read or modified is pulled back into the production data set, so there is no need to back up the archive on a daily or weekly basis. We refresh the archive backup following an archiving process, which in this example takes place monthly.
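A rough way to see the effect is to compare offsite volume with and without the archive. The Python sketch below is an illustration under stated assumptions, not a sizing tool: the 1 TB data set and the 60% archived fraction are made-up numbers, changes are assumed to touch only the active data, and a single archive backup is assumed to be held offsite.

```python
def offsite_tb_with_archive(production_tb, weekly_change, vaulted_weeks,
                            archived_fraction):
    """Rough offsite volume when part of the data set lives in the archive.

    Assumes changes only affect the active (non-archived) data and that
    one archive backup, refreshed after each archiving run, is held offsite.
    """
    active_tb = production_tb * (1.0 - archived_fraction)
    archive_tb = production_tb * archived_fraction
    weekly_set = active_tb + active_tb * weekly_change
    return vaulted_weeks * weekly_set + archive_tb

# Hypothetical inputs, apart from the 10% change rate and 4 vaulted sets.
without_archive = offsite_tb_with_archive(1.0, 0.10, 4, archived_fraction=0.0)
with_archive = offsite_tb_with_archive(1.0, 0.10, 4, archived_fraction=0.6)

print(f"without archive: {without_archive:.2f} TB offsite")
print(f"with archive:    {with_archive:.2f} TB offsite")
```

The weekly fulls shrink because they only cover the active data, which is also why backup and recovery windows improve.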
-RJB
As a follow-up to my previous post on modifying the prefetch registry key, I found a nice little FREE tool to clean up Windows temp files and the prefetch directory, for those of you who don’t want to mess with the registry settings and do the manual cleanup. The tool is CCleaner and can be found here.
-RJB
August 19th, 2006 by rbocchinfuso Tips
EMC had a recent management re-org. Steve Duplessie does a nice job on his blog laying out the new roles of top execs at EMC.
-RJB
Here is a podcast preview from our Dallas seminar. The fully combined and edited podcast should be released sometime in September. We are also discussing a video podcast of the Irvine, CA seminar but no promises. Enjoy.
-RJB
So I have been reading other storage industry pundits’ opinions on the uses for LC-FC (Low Cost Fibre Channel) and ATA (Advanced Technology Attachment) disk drives. The use of ATA disks in mid-tier storage devices and virtual tape libraries as a primary disk-to-disk (D2D) backup and/or archiving target has become quite pervasive. While the drives lack some fibre channel (FC) features like tagged command queuing, and their mean time between failures (MTBF) is not as high as that of fibre channel drives, they work well in applications like backup-to-disk (B2D) due to their ability to achieve tolerable sequential read and write speeds. The industry as a whole has now begun to leverage LC-FC disks in enterprise class storage subsystems. While fundamentally this is a great idea, the marketing and packaging of these solutions needs a bit of work. Most enterprise class subsystems leverage platform software for functions like snapshots and replication, and many vendors price these platform software applications by capacity. Why would anyone install 500GB LC-FC drives in an enterprise class subsystem and push their platform software licensing through the roof? It would be more cost effective to purchase a mid-tier storage device with 15K rpm fibre channel drives to use as a tier 2 or 3 storage platform, and this solution would also most likely be higher performing. This is a fundamental problem: if a storage device can truly accommodate multi-tier storage requirements, why should the addition of tier 2 or tier 3 storage capacity, where functionality like synchronous or asynchronous replication is typically not required, raise the cost of my platform software licensing? Until this is resolved it will be difficult to realize the vision of a single multi-tier platform.
-RJB
Well, we just wrapped up week 2 of "The Evolution of Disaster Recovery" seminar series. This week we visited Washington DC, NY and Boston. The week was topped off by what I would call the best seminar thus far, in Boston, with good attendee interaction and some of the best questions we have seen to date. West coast, there is still time to nudge out Boston for best seminar. I think we are putting together some great questions from all the seminars and I am really excited to publish the podcast sometime in September. Next week we wrap up on the west coast with stops in LA, Orange County and San Diego, and I am looking forward to a capacity crowd. The highlight of the show thus far for me was a comment we received on one of the comment cards which read "It was really nice to attend a seminar where the presenters knew the material so well". Hopefully you have been happy with the content and walked away with some valuable insight. We are reading every single comment card, and next quarter we will be looking to improve the material and format based on them. Thanks again for attending!
-RJB
Just arrived at my hotel in Boston and I am preparing to deliver “The Evolution of Disaster Recovery” seminar in our sixth city tomorrow morning. I put my new video iPod (that Cisco so graciously awarded me after passing a certification exam) to good use on the way up from Newark. I watched “The Code (Linux)”, a Finnish Linux documentary. As a long time Linux hacker and a technology history buff I was already familiar with most of the content that was presented in the film, but there were a couple of concepts that got me excited, so excited I needed to blog on them. The first was a statement by Linus Torvalds. For those of you who don’t know who Linus is, he is generally recognized as the creator of the Linux kernel. Linus has arguably directed one of the most complex collaborative software development initiatives in the history of computing. We can learn a lot from him and the Linux development effort on how to motivate and extract the most from people. Linus discusses how the management of the Linux kernel development project morphed from a hierarchical management structure to what he describes today as a “web of trust” where developers are empowered to act. Eric Raymond, the author of “The Cathedral and the Bazaar” and all around open source czar, also has some phenomenal words of wisdom. Eric talks about what motivates open source engineers outside of monetary gain:
- Artistic pride, the satisfaction derived from good craftsman-like work
- An idealistic feeling that you are part of something larger than yourself
- A general desire to help and deliver good solutions
- Increasing one’s reputation and stature within the community
I found these points to be interesting because I have always embraced the philosophy of empowerment and mentoring over dictating policy and managing to that policy. In my opinion the difference between a good organization and an insanely great organization is the ability to apply concepts such as the ones discussed above, so everyone participates in a culture where free thinking, innovation, empowerment, reputation, recognition, and responsibility are allowed to flourish. The concept of totalitarian rule is ignorant and stifles innovation. Empowerment is the key to innovation and ultimately greatness!
-RJB
August 16th, 2006 by rbocchinfuso Musings
All, the much anticipated presentation is now available. I hope everyone has enjoyed the seminars thus far.
Click here to download the presentation.
-RJB