Deduplication – The Latest Buzzword in Online Backup, and Why it is Not Always Possible
I hear the term “deduplication” more and more lately. I just got another email from someone asking if our software supports deduplication. Here’s the answer.
Our software supports deduplication at the local level, within files and databases, as a function of our compression technology. Pieces of files that are duplicated will be replaced with smaller pointers, reducing the file size. But, that is just a function of the compression methods we use. It isn’t anything to brag about.
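To make the "pointers inside a compressed file" idea concrete, here is a minimal sketch using Python's standard zlib module. zlib is only a stand-in chosen for illustration, not the compression method our software uses, and the sample record is made up.

```python
import zlib

# Hypothetical sample data: one short record repeated 1,000 times.
record = b"customer=ACME;status=active;region=us-east;notes=none;pad=0000\n"
data = record * 1000

# DEFLATE replaces repeated byte sequences with short back-references,
# which is the "smaller pointers" effect described above.
compressed = zlib.compress(data, 9)
restored = zlib.decompress(compressed)

print(len(data), "bytes before,", len(compressed), "bytes after")
```

The repeated pieces collapse to a small fraction of the original size, and decompression gives back exactly the bytes that went in.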
True file level deduplication will never be possible in a cost effective, mass market, secure Online Backup system with a globally centralized data store. Deduplication isn’t even an Online Backup term. It comes from the world of Data Warehousing.
Here’s the basic idea. Before a file is stored on a Server, a deduplication system looks to see if there is already an identical file on the Server. If there is, the deduplication software writes a “pointer” instead of another copy of the same file. The pointer is a very small record that contains just a reference to the location of the original file. This saves a lot of space. The pointer might weigh 128 bytes, but it can represent a file that weighs many gigs.
If many users want to store exactly the same file, deduplication saves a lot of space, because instead of storing multiple copies of the same file, it stores only one copy of the same file, and many small pointers. This is great in a file system where users all work for the same company and trust one another, or where files are not encrypted.
But this is never true in an Online Backup system where files are always encrypted with different encryption keys, and where they belong to many different people, none of whom trust one another. When I want to restore a file, I want MY copy of that file, not yours. Yours might have a virus. (It isn’t the job of an Online Backup service to purge your files of viruses.)
In a secure Online Backup system files are compressed and encrypted with an encryption key known only to the user before they are sent to the Server. They show up at the Server in a form that cannot be opened or used in any way. Filenames are stripped, as are dates, folder names, permissions, and any personally identifiable information as per Data Security and Privacy Regulations. If your Online Backup software doesn’t do this, change software vendors immediately.
For an Online Backup service to do deduplication at the file level, the Server would have to have access to the unencrypted files. Files would have to be decompressed and decrypted by the Server, and that would violate HIPAA, GLB, SOX, and most other privacy regulations.
For the past twenty some-odd years I have been at the forefront of Online Backup innovation. Believe me, I have tried many ways to do deduplication in secure Online Backup, and I have not yet found a way that can do both global deduplication AND proper security that doesn’t end up costing the end user a fortune.
You can very easily do deduplication in Online Backup if you don’t need security. You just run a checksum of each file, store it in a database along with that file’s location, and do a lookup for the checksum before you store subsequent files, storing a pointer if you find a duplicate. This will automatically build a public database of shared files that can be restored transparently, just as though you had many copies of the same file – except you don’t. Everyone restores from the same file. But, it isn’t secure.
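The checksum-and-lookup approach just described can be sketched in a few lines of Python. This is an illustration only, assuming an in-memory store and SHA-256 as the checksum. The class name DedupStore, the user names, and the filenames are all hypothetical.

```python
import hashlib


class DedupStore:
    """Toy in-memory store: one copy per unique checksum, pointers for the rest."""

    def __init__(self):
        self.blobs = {}     # checksum -> file contents, stored once
        self.pointers = {}  # (user, filename) -> checksum

    def put(self, user, filename, data):
        """Store a file; return True if new data was written, False if deduplicated."""
        checksum = hashlib.sha256(data).hexdigest()
        is_new = checksum not in self.blobs
        if is_new:
            self.blobs[checksum] = data
        # Either way, the user only gets a small pointer to the shared copy.
        self.pointers[(user, filename)] = checksum
        return is_new

    def get(self, user, filename):
        """Restore by following the pointer to the single shared copy."""
        return self.blobs[self.pointers[(user, filename)]]


store = DedupStore()
installer = b"pretend this is a large installer image" * 1000

first = store.put("alice", "setup.exe", installer)
second = store.put("bob", "installer-copy.exe", installer)
```

Here the second upload writes only a pointer, and both users restore from the same stored bytes. Notice that the Server has to see the unencrypted contents to compute a checksum that matches across users, which is exactly the security problem.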
In proper Online Backup, where all files are encrypted with different encryption keys unknown to the Server or to the other users, there is no way to determine if files are identical. Even if they were identical before compression and encryption, the encryption process changes the checksum, so to the Server they look unique. So, there’s no way to do file-level data deduplication.
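A small sketch shows why the Server cannot match encrypted files. Python's standard library has no real cipher, so the toy_encrypt function below is a made-up XOR keystream built from SHA-256, used purely for illustration. It is NOT secure and is not what any backup product should use. The document contents and key names are also hypothetical.

```python
import hashlib
import secrets


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Illustration only, NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        keystream = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, keystream))
    return bytes(out)


# Two users back up the very same file, each with a private random key.
document = b"Same quarterly report, byte for byte." * 100
alice_key = secrets.token_bytes(32)
bob_key = secrets.token_bytes(32)

alice_upload = toy_encrypt(alice_key, document)
bob_upload = toy_encrypt(bob_key, document)

# These checksums are all the Server could ever compute on its own.
plain_sum = hashlib.sha256(document).hexdigest()
alice_sum = hashlib.sha256(alice_upload).hexdigest()
bob_sum = hashlib.sha256(bob_upload).hexdigest()

print(alice_sum == bob_sum)
```

The two uploads started as identical files, yet their checksums differ from each other and from the checksum of the original, so a checksum lookup on the Server finds nothing to deduplicate.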
“OK, calculate the checksum pre-encryption and send it along as metadata with each file!” Like I haven’t thought the same thing about a thousand times. Sure, that will identify files without disclosing private information. BUT remember, the files are encrypted with different encryption keys not known to the Server. So, on restore, we cannot decrypt them. OUCH! And that’s where I always get brain freeze and have to go out for a walk to try to get unstuck.
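Here is a rough sketch of that dead end, under the same assumptions as any toy example: toy_cipher is a made-up SHA-256 XOR keystream standing in for real encryption (NOT secure), and the Server class, keys, and document are hypothetical. The Server deduplicates on the pre-encryption checksum sent as metadata and keeps only the first encrypted copy it receives.

```python
import hashlib


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR with a SHA-256 counter keystream; same call encrypts and decrypts. NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        keystream = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], keystream))
    return bytes(out)


class Server:
    """Keeps one ciphertext per plaintext checksum and never sees any key."""

    def __init__(self):
        self.blobs = {}  # plaintext checksum -> first ciphertext received

    def upload(self, plain_checksum, ciphertext):
        self.blobs.setdefault(plain_checksum, ciphertext)  # later copies are dropped

    def download(self, plain_checksum):
        return self.blobs[plain_checksum]


document = b"Same quarterly report, byte for byte." * 100
checksum = hashlib.sha256(document).hexdigest()  # computed before encryption
alice_key = b"alice-secret-key"
bob_key = b"bob-secret-key"

server = Server()
server.upload(checksum, toy_cipher(alice_key, document))
server.upload(checksum, toy_cipher(bob_key, document))  # deduplicated away

alice_restore = toy_cipher(alice_key, server.download(checksum))
bob_restore = toy_cipher(bob_key, server.download(checksum))

print(alice_restore == document, bob_restore == document)
```

The Server ends up with one stored copy, as intended, but it is encrypted with the first uploader's key. That user restores fine. The second user gets back bytes their key cannot decrypt.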
“OK, then deduplicate only within each user’s account. It isn’t perfect, but it’s SOMETHING.” Nah, as it turns out, that’s nothing too. I won’t go into the reasons. My brain would seize up again and I wouldn’t be able to finish this article.
You can deduplicate at the binary level – blocks of data on the hard drive instead of files. There are a number of software utilities, appliances and OS drivers that attempt to do that, but those are well outside the realm of Online Backup, like virus protection. You can add it on if you want, but it isn’t usually built into the software itself.
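For a feel of how block-level deduplication works, here is a minimal sketch assuming fixed-size 4096-byte blocks and SHA-256 fingerprints. Real utilities and appliances may use other block sizes or variable-size chunking. The function names and the two sample "disks" are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def dedup_blocks(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks and store each unique block once.

    Returns the "recipe": the ordered list of block fingerprints.
    """
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)  # keep only the first copy of each block
        recipe.append(digest)
    return recipe


def rebuild(recipe: list, store: dict) -> bytes:
    """Reassemble the original data by following the recipe."""
    return b"".join(store[digest] for digest in recipe)


store = {}
header = bytes(range(256)) * 16          # exactly one 4096-byte block
disk_a = header + b"A" * BLOCK_SIZE * 3  # 4 blocks
disk_b = header + b"B" * BLOCK_SIZE * 3  # 4 blocks, shares the header with disk_a

recipe_a = dedup_blocks(disk_a, store)
recipe_b = dedup_blocks(disk_b, store)

print(len(recipe_a) + len(recipe_b), "blocks written,", len(store), "blocks stored")
```

In this example eight blocks are written but only three distinct blocks are stored: the shared header, one block of A's, and one block of B's. Like the file-level version, this only works where the software can see the unencrypted blocks.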
“But Rob, what about claims by companies like Asigra, Avamar, and NetBackup that they have built-in deduplication? What about that, huh, mister smarty pants?” OK, remember back up at the top of this article when I said “cost effective,” “mass market,” and “globally centralized”?
YES, these guys can do it. But, they are expensive solutions, not designed for providing affordable online backup services to the mass market. They do much of their deduplication within a trusted environment like a local appliance (expensive) or a private cloud (ditto). There’s a price for that kind of technology, not just in cost of ownership, but in flexibility, administration, and end user pricing.
Yes, I could add deduplication to my products. But, if I did it right I wouldn’t be able to sell them to you at $2250 for software that can back up 25 Windows Servers, and you wouldn’t be able to sell it to end users for $89 a month.
So, here’s the bottom line on deduplication in Online Backup. No commercial mass market Online Backup solution can provide global deduplication at a low price to the Service Provider and the end user. It cannot be done securely and still maintain the requisite flexibility in changing encryption keys.
External utilities and appliances CAN do it at a binary block level. Deduplication CAN be done if end users are willing to install an expensive appliance in a trusted environment, or if they stage backups in a private cloud. All these options are expensive.
I’m not giving up, though.
There is a newer article on deduplication here: http://blog.remote-backup.com/deduplication-redux/
Rob Cosgrove is the President of Remote Backup Systems, founder of the Online Backup Industry, and a vocal advocate for maintaining the highest standards in Online Backup software. His latest book, the Online Backup Guide for Service Providers: How to Start and Operate an Online Backup Service, is available online now, and on Amazon.com and bookstores after April 30, 2010.