Demystifying Deduplication

If deduplication is not only a great way to cut your storage costs but also a good way to save on WAN costs, is that reduplication? Either way, deduplication has been around for a while and is pretty mature now, yet there have been a couple of reminders recently that it is still not as widely understood or used as perhaps it ought to be.

The first was a survey of 1,100 European IT resellers by backup company Acronis. It asked them what they thought were the top IT enablers for companies to cut costs and become more efficient, and one of the top answers by a large margin was dedupe. (The others were archiving and the cloud, both of which can also benefit from dedupe.)

Then there was the news that HP has expanded its StoreOnce range of backup appliances to take federated dedupe technology to small and midsize customers. Federated dedupe removes duplicate data at the branch or remote office before replicating it to HQ for backup, which lightens the load on both the central storage and the WAN. It seems HP at least has realized that it is not just large enterprises that need to manage the load that their growing data storage requirements put on their WAN.

Of course, if you have WAN optimizers in place, the chances are you already have dedupe — it is a standard part of Silver Peak’s Network Memory technology, for example. It may well be a bit of a mystery to you, however — in which case, read on.

Deduplication looks for repeated patterns in data, and stores or transmits each pattern only once, with subsequent copies replaced by a pointer. Obviously, some data sets contain more duplication than others — for example, virtual machines created from templates will be almost identical. Depending on the kind of data being handled, commercial users have reported it giving an effective compression ratio as high as 20:1 or even 50:1.
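To make the store-once-and-point idea concrete, here is a minimal sketch in Python. It assumes fixed 4 KB blocks and SHA-256 fingerprints, and the names dedupe, rebuild and dedupe_ratio are invented for illustration; commercial products use their own chunking and indexing schemes, which this does not try to reproduce.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


def dedupe(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep each distinct block once.

    Returns (store, pointers): store maps a fingerprint to the block bytes,
    pointers is the ordered list of fingerprints needed to rebuild the data.
    """
    store = {}
    pointers = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # first copy is stored in full
        pointers.append(fingerprint)          # every copy becomes a pointer
    return store, pointers


def rebuild(store, pointers) -> bytes:
    """Follow the pointers to reconstruct the original data."""
    return b"".join(store[fingerprint] for fingerprint in pointers)


def dedupe_ratio(data: bytes, store) -> float:
    """Original size divided by stored size (pointer overhead is ignored)."""
    stored = sum(len(block) for block in store.values())
    return len(data) / stored if stored else 1.0


# Twenty "virtual machines" cloned from one template of four distinct blocks.
template = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
data = template * 20

store, pointers = dedupe(data)
ratio = dedupe_ratio(data, store)
print(f"{len(pointers)} blocks written, {len(store)} stored, ratio {ratio:.0f}:1")
```

Twenty identical clones are a best case, which is why this toy data lands on exactly 20:1. Real data sets will have fewer exact repeats, and the pointer table itself takes up some space.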

And now dedupe is being applied to live storage too, whether it be email data where the same message or attached file is sent to – and stored by – several different people, photos and documents which contain repeated elements, or revisions of CAD files, each a variation on the last.

Dedupe is possible at several different levels, most notably file, block, and byte. The technology can also be general, in which case it is broadly applicable, or more finely defined, which gives higher data reduction ratios but is more processor-intensive. And for storage it can be a post-process technology, where the data is stored in full and then deduped off-line, or an in-line technology where it is deduped as it is stored.
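The difference between file-level and block-level dedupe shows up as soon as a file changes slightly. The sketch below compares the two on three hypothetical files: an original, a revision with one byte changed, and an exact copy. The function names and the 4 KB block size are assumptions made for the example, and byte-level dedupe is left out for brevity.

```python
import hashlib


def file_level_stored(files) -> int:
    """Bytes kept when only whole, identical files are deduplicated."""
    unique = {hashlib.sha256(f).digest(): len(f) for f in files}
    return sum(unique.values())


def block_level_stored(files, block_size: int = 4096) -> int:
    """Bytes kept when identical fixed-size blocks are deduplicated."""
    unique = {}
    for f in files:
        for offset in range(0, len(f), block_size):
            block = f[offset:offset + block_size]
            unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())


# A 32 KB file of eight distinct blocks, a revision that differs
# in its final byte, and an exact copy of the original.
original = b"".join(bytes([i]) * 4096 for i in range(8))
revised = original[:-1] + b"\xff"
exact_copy = original
files = [original, revised, exact_copy]

raw_bytes = sum(len(f) for f in files)
file_bytes = file_level_stored(files)
block_bytes = block_level_stored(files)
print(raw_bytes, file_bytes, block_bytes)
```

File-level dedupe catches only the exact copy, so the revision is stored in full. Block-level dedupe stores just the one block that changed, at the cost of fingerprinting and indexing many more items.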

On top of all that, dedupe is not the only way to do storage optimization. The main alternative is random-access data compression, although the two are not always in competition — most dedupe storage systems also compress the files and blocks that make it through the dedupe process and onto disk. WAN optimizers will compress data, too.
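As a rough illustration of how the two techniques stack, the sketch below deduplicates fixed-size blocks and then compresses each surviving block with zlib from the Python standard library. The function names, block size and sample data are assumptions for the example, not a description of any vendor's pipeline.

```python
import hashlib
import zlib


def dedupe_and_compress(data: bytes, block_size: int = 4096):
    """Keep each distinct block once, and compress it before storing."""
    store = {}
    pointers = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint not in store:
            store[fingerprint] = zlib.compress(block)  # only new blocks cost CPU
        pointers.append(fingerprint)
    return store, pointers


def restore(store, pointers) -> bytes:
    """Decompress each referenced block and reassemble the data."""
    return b"".join(zlib.decompress(store[fp]) for fp in pointers)


# Two text-like blocks, repeated ten times over.
block_a = (b"alpha " * 700)[:4096]
block_b = (b"bravo " * 700)[:4096]
data = (block_a + block_b) * 10

store, pointers = dedupe_and_compress(data)
stored_bytes = sum(len(blob) for blob in store.values())
print(f"{len(data)} bytes in, {len(store)} blocks kept, {stored_bytes} bytes on disk")
```

Dedupe removes the repeats between blocks, and compression then squeezes the redundancy left inside each block that remains.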

Dedupe’s maturity is highest in the backup and archiving sector, where it has been used for some years now and is offered by most, if not all, of the key players. An example of its importance was the battle fought in 2009 between EMC and NetApp (formerly Network Appliance) for control of dedupe pioneer Data Domain; the winner was EMC, which paid around $2.1 billion to acquire Data Domain and its disk-based dedupe backup boxes.

The technology is also being used on WANs for disaster recovery between data centers and for backups between branch offices and headquarters, as it reduces the amount of data that must be sent over the WAN. For this to work, however, the dedupe must be done at the source, before the data crosses the link (as WAN optimizers and those HP boxes do), and not off-line after the data has already arrived, as some storage systems do it.
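A small sketch shows why doing the work at the source matters for the WAN. Here a hypothetical Sender remembers which block fingerprints the far end already holds and ships only a 32-byte reference for those, while a hypothetical Receiver rebuilds the data from its cache. The message format and class names are invented for illustration and are not how Silver Peak, HP or any other vendor actually implements it.

```python
import hashlib

DIGEST_LEN = 32  # size of a SHA-256 fingerprint in bytes


class Sender:
    """Source-side dedupe: send a block in full only the first time."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.known = set()  # fingerprints the receiver is assumed to hold

    def encode(self, data: bytes):
        messages, wire_bytes = [], 0
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            fingerprint = hashlib.sha256(block).digest()
            if fingerprint in self.known:
                messages.append(("ref", fingerprint, None))
                wire_bytes += DIGEST_LEN
            else:
                self.known.add(fingerprint)
                messages.append(("data", fingerprint, block))
                wire_bytes += DIGEST_LEN + len(block)
        return messages, wire_bytes


class Receiver:
    """Far end: cache new blocks and resolve references from the cache."""

    def __init__(self):
        self.cache = {}

    def receive(self, messages) -> bytes:
        out = []
        for kind, fingerprint, payload in messages:
            if kind == "data":
                self.cache[fingerprint] = payload
            out.append(self.cache[fingerprint])
        return b"".join(out)


# Monday's backup, then Tuesday's with only the last block changed.
monday = b"".join(bytes([i]) * 4096 for i in range(8))
tuesday = monday[:-4096] + bytes([99]) * 4096

branch, hq = Sender(), Receiver()
msgs_mon, wire_mon = branch.encode(monday)
restored_mon = hq.receive(msgs_mon)
msgs_tue, wire_tue = branch.encode(tuesday)
restored_tue = hq.receive(msgs_tue)
print(f"Monday: {wire_mon} bytes on the wire, Tuesday: {wire_tue}")
```

If the same dedupe were done only after arrival at HQ, both backups would cross the WAN in full, and only the disk would benefit.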

Dedupe is not a panacea, however, and each type of dedupe will best suit certain application cases. Adding dedupe to a disk-based backup subsystem will save disk capacity but will not reduce the amount of data that must be sent to the subsystem for storage, for instance. Conversely, deduplicating backup software will reduce the load on the network but increase the workload of the backup server.

And while pretty much any data is going to have some degree of duplication as far as backup and archiving are concerned, it’s a challenge to get dedupe working in high-performance primary applications, such as storage for transaction processing. There are also perfectly good reasons why an organization might want to have duplicated data, for example where it relies on creating multiple copies to guarantee the protection and availability of critical information.

Image credit: cooper.gary (flickr) – CC-BY-NC-ND