Dupe, Dedupe

November 6

No, this is not an article about Miley Cyrus’ latest song.  This is an article about data deduplication, often referred to as “dedupe”.  The intent of this article is to briefly discuss what data deduplication is and how it might be employed in your current BDR (backup and disaster recovery) plan.

Data deduplication is a specialized data compression technique.  In its simplest form, the deduplication process compares chunks of data intended for storage against an internal index of unique byte patterns already stored.  Whenever a match occurs, the redundant chunk is replaced with a small reference that points to the previously stored copy.
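
To make the idea concrete, here is a minimal sketch of fixed-size chunk deduplication. It is an illustration only, not StorageCraft code, and the 4 KB chunk size and SHA-256 hashing are arbitrary choices for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks; many real systems use variable-size chunking

def dedupe(data: bytes, store: dict) -> list:
    """Store each unique chunk once and return the list of references
    (hashes) needed to reconstruct the original data."""
    refs = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:      # new byte pattern: keep one copy
            store[key] = chunk
        refs.append(key)          # duplicates cost only a small reference
    return refs

def restore(refs: list, store: dict) -> bytes:
    """Rebuild the original data from the stored chunks."""
    return b"".join(store[key] for key in refs)

# Two "backups" with mostly identical content share the same stored chunks.
store = {}
refs_a = dedupe(b"A" * 100_000 + b"unique to machine 1", store)
refs_b = dedupe(b"A" * 100_000 + b"unique to machine 2", store)
print(len(refs_a) + len(refs_b), "references,", len(store), "chunks actually stored")
```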

Another way to think about data deduplication is by where it occurs.  A deduplication process that runs close to where the data is created is referred to as “source deduplication,” whereas one that runs close to where the data is stored is referred to as “target deduplication.”

Data deduplication carries many of the same benefits and drawbacks as other compression processes.  For example, whenever data is transformed there is some risk of losing or corrupting it, and the deduplication process itself consumes computational resources.  Ideally the benefit of an optimized storage footprint outweighs those costs, and where large amounts of data are concerned it very often can.

However, given the low cost of drive space today, a small business might do well to buy additional storage capacity rather than purchase and implement a deduplication process.  One study using IBM disk manufacturing data suggests that the cost per gigabyte has been dropping by roughly 37.5 percent each year.
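
To put that rate in perspective, here is a quick back-of-the-envelope calculation; the $0.05-per-gigabyte starting point is only a placeholder, not a quoted price.

```python
# A 37.5% annual decline compounds quickly: after three years the cost
# per gigabyte is roughly a quarter of what it was (0.625**3 ≈ 0.24).
cost_per_gb = 0.05  # placeholder starting price in dollars per GB
for year in range(1, 6):
    cost_per_gb *= 1 - 0.375
    print(f"year {year}: ${cost_per_gb:.4f} per GB")
```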

So, before you “throw down” that pile of cash, consider whether integrating low-cost data storage is a safer and easier solution than implementing data deduplication processes.  Meanwhile, check back here at StorageCraft often for more backup and data recovery solutions.

  • Comments

    1. Mark Hicks on

      What I’d like to know is if Shadow Protect can be setup in such a way as to be “compatible” with storage-side (or “target”) dedupe?

      I understand that SP does not do source dedupe at this point and that turning on compression and/or encryption in SP basically creates a set of backup files that won’t really dedupe.

      However it seems at least “possible” that if we were to configure backups in SP to not use either “compression” or “encryption” we could then “in theory” store all of the backup files on an array where storage side dedupe could take place.

      To cite a specific scenario: we have 100 workstations backing up with SP. Those backups are stored on a 3TB array which does not have enough space to hold more than 1 full plus 3-4 weeks of incremental images for all those stations. It seems to me that 90% of what is contained within the “full” images should, in theory, be “identical” across those 100 stations, since they are all new Win7 stations built from the same image.

      I would not expect there to be a lot of duplicate “blocks” within the incremental images of those stations, but cutting down the storage used by the full backups would allow us to store more backup history and not have to “nuke” the entire backup set before creating a new full each month. Not to mention we are constantly on the edge of running out of space in the current scenario.

      I expect that what we lose by turning off “compression” of the individual partition images would be far outweighed by the space “recovered” if 80% or 90% of the full-image data dedupes across 100 stations.

      Yes, we could buy more storage space, but going past 3TB makes it very difficult to carry the backups offsite economically. And Windows servers have the ability to dedupe a volume. So can I just turn on Windows dedupe on the NTFS partition that holds the backup images, configure the backup jobs for no-compress/no-encrypt, and gain back a big chunk of storage space?

      Is this scenario “supported” by Storage Craft?

      -/\/\ark

    2. Steven Snyder on

      Hello Mark,

      First of all, thank you for your post. We consider StorageCraft’s ShadowProtect technology a powerful tool in your Disaster Recovery and Business Continuity planning. You are not limited in how you use this tool; in fact, we expect you to find creative ways of using ShadowProtect that best meet your unique needs. In this example you are looking for a way to use ShadowProtect to provide reliable backups while at the same time using Microsoft’s server deduplication functionality to work within your storage limitations.

      For those who are not familiar with Microsoft’s data deduplication, it is intended to optimize file data on a volume by performing the following steps (a rough sketch of the chunking step follows the list):

      1. Segment the data in each file into small variable-sized chunks.
      2. Identify duplicate chunks.
      3. Maintain a single copy of each chunk.
      4. Compress the chunks.
      5. Replace redundant copies of each chunk with a reference to a single copy.
      6. Replace each file with a reparse point containing references to its data chunks.
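
      To illustrate step 1 in rough terms, here is a toy content-defined chunking sketch. It is not Microsoft’s actual algorithm; it only shows the idea that chunk boundaries follow the content itself, so inserting a few bytes early in a file does not shift every later chunk the way fixed-size chunking would. The window size and boundary mask are arbitrary values for the example.

      ```python
      import hashlib

      def chunks(data: bytes, window: int = 48, mask: int = 0x0FFF):
          """Toy content-defined chunking: cut a chunk wherever a hash of the
          trailing window matches a fixed bit pattern (average chunk ~4 KB)."""
          start = 0
          for i in range(window, len(data)):
              digest = hashlib.sha1(data[i - window:i]).digest()
              if int.from_bytes(digest[:4], "big") & mask == 0:
                  yield data[start:i]
                  start = i
          if start < len(data):
              yield data[start:]          # final partial chunk

      # Identical regions in two files produce identical chunks, which step 2
      # can then detect by hashing each chunk and comparing against an index.
      ```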

      As Mark has pointed out, Microsoft allows deduplication on NTFS data volumes on Windows Server operating systems beginning with Windows Server 2012. The idea, then, is to combine ShadowProtect backups to an NTFS volume with Windows deduplication in an effort to minimize required storage space. Mark has 100 Windows 7 workstations, all built from the same image, and expects to see substantial duplication across those workstations and the backup images they produce.

      The key here is knowing how much redundancy exists between the backup images, so you can estimate how much storage space can be saved. ShadowProtect is an image-based backup system, which means it backs up at the byte/sector level of a Windows system rather than at the file/folder level. With 100 workstations all built from the same image, I would expect to see a large amount of file and folder redundancy. On the other hand, these workstations may have dissimilar hardware, different use patterns, different software installed, varying degrees of disk fragmentation, and other changes that make the byte and sector patterns on these systems diverge from the original image. This variation will decrease the amount of chunk duplication between image files and reduce the space savings Microsoft’s deduplication can deliver.

      At the same time, this variation can also create patterns that offer new opportunities for deduplication within each image file. Over time, variation across the pool of 100 workstations should statistically normalize to an average measure.

      I would suggest that your first step is to use Microsoft’s command line utility “ddpeval.exe” (which is found in the \Windows\System32 directory on Server 2012 installations) to estimate possible space savings. This executable can be run against an NTFS volume to provide a report showing how much storage space can be freed using Microsoft’s deduplication. Running this utility will give you a baseline using current images. Keep in mind that this estimate may change over time as new incrementals are added to your repository. If you decide to use the Server 2012 deduplication you should be able to monitor future data growth in your storage pool from the admin console.
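
      If it helps to script the evaluation step, something like the following would work; it simply shells out to ddpeval.exe and prints its report. The volume path is a placeholder for wherever your ShadowProtect images live.

      ```python
      import subprocess

      # Run Microsoft's deduplication evaluation tool against the volume that
      # holds the backup images. The path below is a placeholder.
      result = subprocess.run(
          [r"C:\Windows\System32\ddpeval.exe", r"E:\ShadowProtect_Backups"],
          capture_output=True, text=True,
      )
      print(result.stdout)  # the report includes the estimated space savings
      ```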

      You mentioned leaving your backup chains unencrypted and uncompressed in order to maximize deduplication. That may help, but as I pointed out earlier, Microsoft’s deduplication is based on similarities between chunks of data, and with an image-based backup you may still see a reasonable amount of data pattern duplication within image files even when you compress and encrypt your data. As long as duplication exists between data chunks, Windows will attempt to maximize storage space by removing the duplicates and leaving pointers in their place.

      The theory looks plausible, but there are some unknowns as to how much space in your data pool will actually be freed. StorageCraft supports backup chains that are uncompressed and unencrypted, though we recommend that you encrypt your data for security and take advantage of compression, which averages around a 60% size reduction in storage. Giving up these two advantages may not be worth whatever savings deduplication provides; you will need to measure your environment to decide.

      In addition, your scenario seems to exclude the use of ImageManager to set retention policies, verify backup images, and create consolidated daily, weekly, and monthly files. If you are counting on data pattern redundancy, then ImageManager will only stir the pot (so to speak) as it constantly verifies and consolidates your backup image chains. Over time, these modifications mean you may see pattern duplication between backup image chains drop as system use and image consolidation alter the data patterns in your stored backup images. I would expect your data pattern duplication to normalize as a percentage of your total data requirements. Again, I’m not sure that the storage savings you may or may not find outweigh the advantages ImageManager provides.

      OK, I’ve painted this answer in broad strokes. My intent has been to discuss the forces at play and offer some recommendations. I believe that if you measure your expected storage pool space savings you will be in a better position to make a decision. I’ve checked with our QA team and they have this scenario on their list of items to test; however, at this time combining Microsoft’s Server 2012 deduplication with ShadowProtect is unsupported. In theory it should work; it just hasn’t been tested. I’ve mentioned my concerns, namely foregoing the benefits of ImageManager and ShadowProtect’s encryption and compression in exchange for space savings from deduplication. It may be that you can use all of these features and still benefit from deduplication. That would be ideal.

      In closing, I’m currently setting up a 2012 server to play around with your idea. If you would like to continue this discussion I would welcome your correspondence.

      Cheers!

    3. Octavian Grecu on

      Hello,

      I’m just wondering if any of you have actually tested this scenario in the end and come to any conclusion since this article was published.

      Thank you!

    4. Steven Snyder on

      Hello Octavian,

      Thank you for asking. To be honest I haven’t tested this theory, though it’s been on my “to do” list since the question first came up. Have any of our other readers tried storing backup images on a Server 2012 deduplicated volume? I would be interested in at least two qualities of this test: 1) how much storage can be freed using this process (as a percentage of the original data size), and 2) is there any discernible difference in I/O speed compared with a data volume that isn’t deduplicated? I’m interested in your comments.
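
      For anyone who does run this test, a crude way to get at the second question is to time a sequential read of the same backup image file from a deduplicated volume and from a plain volume. This is only a sketch; the file paths are placeholders, and you would want to read from a cold cache (or a file larger than RAM) for a fair comparison.

      ```python
      import time

      def read_throughput(path: str, block: int = 4 * 1024 * 1024) -> float:
          """Return sequential read throughput in MB/s for one file."""
          total = 0
          start = time.perf_counter()
          with open(path, "rb") as f:
              while True:
                  data = f.read(block)
                  if not data:
                      break
                  total += len(data)
          elapsed = time.perf_counter() - start
          return total / elapsed / (1024 * 1024)

      # Placeholder paths: the same image file copied to each volume.
      print("dedup volume: %.1f MB/s" % read_throughput(r"D:\Dedup\backup.spf"))
      print("plain volume: %.1f MB/s" % read_throughput(r"E:\Plain\backup.spf"))
      ```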

      Cheers!

    5. tommcg on

      I think you are missing the point entirely here. I have a home with 5 PCs all running the same Windows OS version and the same versions of Office. MOST of the file data on the machines is copies of the same files on other machines: the Windows OS files and Office binaries. I want to back up full system snapshot images (not just photos and music) daily to a NAS on my LAN, or even a headless Windows machine acting as a NAS (like the old Windows Home Server product). I want the bandwidth savings of laptops backing up over wifi: the backup should notice that those Windows files are already stored and not transmit them over wifi again. I also want the total NAS storage of all combined backups reduced so that I can copy the NAS storage either to an external drive for offsite storage or, more interestingly, up to the cloud for redundancy. ISP bandwidth caps, limited upstream bandwidth, and the annual cost per GB of cloud storage mean that deduplicated backup storage is essential. The cost of additional local storage is NOT the only consideration.

      I don’t care about Windows Server’s integrated deduplication. The deduplication has to be part of the backup system itself, especially if you are doing cluster or sector level deduplication, to avoid sending the duplicate data over the wire to the data storage in the first place.

      I’ve been looking at different backup solutions to replace Windows Home Server (a decade-old product that offered deduplication), and your product looked very interesting, but unfortunately the lack of built-in deduplication rules it out for me. I can only imagine how this affects 100-desktop customers when I won’t even consider it for 5-desktop home use.

    6. Steven Snyder on

      Thank you for your comments. We appreciate all points of view on this topic.

      I agree that ISP bandwidth caps, limited upstream bandwidth, and cloud storage cost per GB show how critical it is to minimize the data transmitted offsite. I also believe that, much as modems and Beta video tapes gave way to newer technologies, today’s bandwidth constraints are giving way to faster access everywhere. For example, Google Fiber is now available to some of my peers at the office, and cellular LTE and satellite technologies are also increasing bandwidth for small businesses and home offices. At the same time, our data consumption and data creation are increasing at a rate that may outpace this increased supply of bandwidth. Either way, there are ways to work around data transmission limits.

      One way we help with data transmission over slower networks is by incorporating WAN acceleration and bandwidth scheduling technologies into our offsite replication tools. These allow you not only to make the most efficient use of available bandwidth but also to schedule your data replication during off-peak hours. Another way we help is through compression: deduplication is, after all, simply another form of data compression that reduces data on the near side (the source) before it is transmitted over the wire to the far side (the target).

      In your case, you could use our product to store images on a local volume that has deduplication enabled, and then replicate that data over the wire to offsite storage using ImageManager or another tool. Many of our customers do this very thing.

      Keep in mind that the deduplication process has to occur at some point: either at the source or at the target. If you wanted to deduplicate your 5 PCs, you would be best served by a BDR solution that can read each of those PCs, see the duplicate files on each, and avoid copying those files to storage. In this example, deduplication occurs on your BDR, but you are still reading data from each PC over the wire to the BDR. In addition, your BDR would control the index for data stored on a separate volume, or perhaps have the storage volume incorporated into the BDR itself. This creates a single point of failure: if your BDR crashes, the backup images for your 5 PCs aren’t recoverable and current backup processes cease.
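
      To make the source-side idea concrete, here is a toy sketch (not a description of any StorageCraft protocol) of how a source-side scheme avoids resending chunks the storage side already holds: the client sends a hash first and transmits the full chunk only when the storage side reports it as missing.

      ```python
      import hashlib

      def backup(chunks, server_index):
          """Toy source-side dedup: a chunk crosses the wire only if the
          "server" does not already have a chunk with the same hash."""
          manifest, sent_bytes = [], 0
          for chunk in chunks:
              h = hashlib.sha256(chunk).hexdigest()
              manifest.append(h)                 # needed later to restore this PC
              if h not in server_index:          # server says it lacks this chunk
                  server_index[h] = chunk        # only now is the chunk transmitted
                  sent_bytes += len(chunk)
          return manifest, sent_bytes

      server_index = {}
      pc1 = [b"windows-system-files" * 1000, b"documents from PC 1"]
      pc2 = [b"windows-system-files" * 1000, b"documents from PC 2"]
      _, sent1 = backup(pc1, server_index)
      _, sent2 = backup(pc2, server_index)
      print(sent1, sent2)   # the second PC sends only its unique data
      ```

      The trade-off still applies: the data must still be read and hashed on each PC, and the index itself becomes a critical piece of the recovery picture.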

      At StorageCraft we focus on the recovery. Our philosophy means that we take the smallest, fastest backup images we can, and then we give you ways to automatically test those images for reliability, consolidate them into daily/weekly/monthly files according to your retention policy, and replicate those images locally and offsite. This gives you a solid foundation from which to recover those images quickly to almost any new environment. I have yet to see a faster, more reliable solution among our competitors.

      Cheers,
      Steven
