Deduplication Redux
I don’t know for sure, but I think I have lost a big account because my potential customer has been seduced and blinded by Deduplication. My competitor for this account promised the customer huge savings in storage space from his “amazing deduplication technology.” This customer had already been burned by another company that promised the same thing and failed to deliver. Now it looks like it’s going to happen again.
It’s not that our software can’t do global deduplication, it’s that NO online backup software can do it at a reasonable price while staying secure and compliant. I’ve written on the subject a few times. Here’s another article: http://blog.remote-backup.com/deduplication-the-latest-buzzword-in-online-backup-and-why-it-is-impossible/
Global deduplication is essentially forced File Sharing that you don’t want your end users to find out about, because if they did, they would not be happy with it.
Here it is in a nutshell: The Server stores only one copy of any file. Subsequent copies of identical files are not stored on the Server; instead, only small pointers are stored which say, “Hey, I’m not your file, but THERE is one that is identical to it!” Pointers are much smaller than files, so theoretically you can save disk space.
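To make the mechanism concrete, here is a toy sketch in Python – hypothetical names, not RBackup’s or any vendor’s actual implementation – of a deduplicating store: hash each file, keep the first copy, and record every identical file after that as nothing more than a pointer.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: one blob per unique hash."""

    def __init__(self):
        self.blobs = {}    # hash -> actual file contents (stored once)
        self.catalog = {}  # (user, filename) -> hash (a tiny "pointer")

    def backup(self, user, filename, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data            # first copy: store the bytes
        self.catalog[(user, filename)] = digest  # every copy: store a pointer

    def restore(self, user, filename):
        # The user gets back *a* copy with matching contents - possibly
        # the bytes originally uploaded by a completely different user.
        return self.blobs[self.catalog[(user, filename)]]

store = DedupStore()
store.backup("alice", "report.doc", b"quarterly numbers")
store.backup("bob", "report.doc", b"quarterly numbers")  # identical file
print(len(store.blobs))  # 1 -- Bob's "copy" is just a pointer to Alice's
```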
When an end user requests a restore, instead of restoring HIS files, he might be sent copies of somebody else’s files. Likewise, other end users might be sent copies of his files, as long as the files were identical at the time of backup.
Disk space is saved by forcing end users to share files (behind the scenes, of course – they don’t know this is going on).
Here are some problems with Deduplication.
Problem 1 – You can’t have it both ways.
Any online backup software worth its salt compresses and encrypts backup files with a strong encryption method and a key known only to the end user. It removes filenames and folder names, and stores files that way on the server, completely safe from anyone who doesn’t know the key.
This procedure is required by data privacy regulations like HIPAA, SOX, and GLB, which dictate the way many businesses must back up their data.
Obviously, if the Server contains only encrypted data protected with different keys, it cannot (for security reasons) be shared among users because users don’t share their keys. Doing so would break the privacy regulations.
Encrypting files changes their “signature” or “hash” value, which is used by deduplication technology to find duplicate files. So, encrypted files with different unknown keys, even if they contain the same information, will have different hash values and will not deduplicate.
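You can demonstrate this in a few lines of Python. The sketch below uses the cryptography package’s Fernet recipe purely as a stand-in – I’m not claiming any particular backup product uses it – but the result is the same for any strong cipher: identical plaintexts encrypted under different user keys hash to completely different values, so the deduplicator never finds a match.

```python
import hashlib
from cryptography.fernet import Fernet  # pip install cryptography

plaintext = b"The exact same file contents, byte for byte."

# Two end users, each holding a key known only to them.
alice = Fernet(Fernet.generate_key())
bob = Fernet(Fernet.generate_key())

h_alice = hashlib.sha256(alice.encrypt(plaintext)).hexdigest()
h_bob = hashlib.sha256(bob.encrypt(plaintext)).hexdigest()

# Identical files, yet no duplicate is detectable on the Server.
print(h_alice == h_bob)  # False
# (Fernet also randomizes each encryption, so even one user backing up
# the same file twice would produce two different ciphertext hashes.)
```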
Regardless, some Online Backup software companies claim their software does it. Well, you can’t have it both ways. Either your software is secure and compliant, or it does global deduplication. You get only one of those features – not both.
Problem 2 – It doesn’t do any good.
Deduplication saves space on an unencrypted file system, like a standard hard drive containing a set of files that you can open and edit. You know, regular files. However, good online backup software like RBackup uses a different sort of file system which is far more secure and reliable, custom designed for efficiently storing and retrieving secure backups.
Running deduplication on such a file system, where files are encrypted with many different keys, does not recover enough disk space to make it worthwhile. I tested this myself: I ran a test against a set of 50,000 backup files on an RBackup Server, looking for duplicate files using an SHA-256 hash.
I found exactly 10 duplicate files. They were all either 0-byte files or unencrypted 8-byte configuration files used by the RBackup Server itself. I found no identical backup files. Deduplicating this file set would have saved a total of maybe 64 BYTES of drive space (out of 120GB tested), while using up 30% of CPU time, 23% of RAM, and 20 hours to calculate file hashes.
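If you want to repeat that kind of test on your own storage, the scan itself is simple. Here is a rough sketch of the idea (the path is hypothetical, and a production scan would batch the work rather than hash everything in one pass):

```python
import hashlib
import os
from collections import defaultdict

def sha256_of(path, bufsize=1 << 20):
    """Hash a file in chunks so large files don't exhaust RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

seen = defaultdict(list)  # hash -> list of paths with that hash
for root, _, files in os.walk("/path/to/backup/store"):  # hypothetical path
    for name in files:
        full = os.path.join(root, name)
        seen[sha256_of(full)].append(full)

dupes = {h: paths for h, paths in seen.items() if len(paths) > 1}
print(f"{len(dupes)} duplicate groups found")
```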
My conclusion is this: The space-saving technology already built into the RBackup file system is more than sufficient to save more disk space than deduplication, at ¼ the cost in dollars, and even more than that in computer resources and time, while maintaining 100% compliance and security.
UPDATE: 17 November 2011 – I ran this same test with a much larger file set – 631,275 backup files from many different end users, taking up 810GB on the drive array. As with the first test, there were no duplicate backup files – just a small handful of config files used by the RBS Server, and a few zero-byte files.
Problem 3 – Customers won’t like it if they know what’s going on.
There are many ways to do deduplication, as long as your end users agree that some of their personal data might be disclosed on the Server and might be shared with other users as a result of the deduplication process (after all, deduplication is a form of file sharing).
They would also have to agree that it’s OK that a file they store in their space (and are being charged for) might not really be there; instead, they might be sharing a copy of someone else’s file.
They would also have to be OK with sharing a single copy of a file with other end users. If that only copy somehow became corrupted on the Server’s hard drive, or was accidentally erased, they would have to be OK with not being able to restore it – and so would everyone else who shares that file.
In many cases customers really do want to store duplicate copies of files, for the safety of redundancy. After all, that’s what backup is all about – having several copies of critical files stored offsite. Some are even required by procedure to have multiple copies.
The “standard” file rotation protocol used by many companies (because it’s a good idea) requires no fewer than 5 copies of files, even if they are exact duplicates.
Deduplication technology lies to customers. In the file restore interface they see what appear to be their files, but what they are really looking at might be just pointers to someone else’s files.
So, customers would have to be OK with having only the appearance of redundancy and safety. They would have to be happy being lied to.
Problem 4 – It is expensive.
At the top of this article I said “reasonable price” and “global.” There is software that can do local deduplication (not global) within a single account. It is done by installing an appliance at each customer’s site.
The appliance stages backup data within the secure environment of the customer’s network, performs deduplication, and then sends it up to the online backup server. This is secure and compliant, but it is expensive, and it still does not do “global” deduplication, which is deduplication across multiple accounts.
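The difference between local and global deduplication is simply the scope of the lookup table. Returning to the toy store sketched earlier (again with hypothetical names), a local deduplicator keeps one table per account, so duplicates collapse only within a single customer’s data:

```python
import hashlib
from collections import defaultdict

class LocalDedupStore:
    """Per-account dedup: duplicates collapse only within one customer."""

    def __init__(self):
        self.blobs = defaultdict(dict)  # account -> {hash: data}

    def backup(self, account, data):
        digest = hashlib.sha256(data).hexdigest()
        # Store the bytes only if this *account* hasn't seen them before.
        self.blobs[account].setdefault(digest, data)

store = LocalDedupStore()
store.backup("acme", b"same bytes")
store.backup("acme", b"same bytes")    # collapsed within the account
store.backup("globex", b"same bytes")  # stored again: no cross-account sharing
print(sum(len(t) for t in store.blobs.values()))  # 2 copies kept, not 1
```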
So, while it saves the customer some storage space, it doesn’t really save very much space on the online backup server. And there are the additional problems caused by the maintenance and energy usage of yet another on-site appliance.
Conclusion
Deduplication saves drive space only in a trusted environment within a single organization. Global deduplication does not save a reasonable amount of space on a proper online backup server like RBackup, which already has adequate space-saving technology built into its file system. Anyone who says otherwise needs to be deduplicated himself.
Rob Cosgrove is the President of Remote Backup Systems, founder of the Online Backup Industry, and a vocal advocate for maintaining the highest standards in Online Backup software. His latest book, the Online Backup Guide for Service Providers: How to Start and Operate an Online Backup Service, is available online now, on Amazon.com, and at bookstores.
Remote Backup Systems provides brandable, scalable software and solutions to MSPs and VARs enabling them to offer Online Backup Services.