Deduplication of data: 10 tips for effective solution evaluation

To deduplicate data and cut down storage costs, enterprises are looking at integrating dedupe hardware and software with their existing infrastructure. In parallel, vendors have come up with various data deduplication offerings for primary storage or backups.

Rather than deploying a data deduplication solution straightaway, first find out what technology the different vendors offer and which solution meets your requirements. The following best practices will help you decide upon and deploy a data deduplication solution in your environment.

• Backup solutions with data deduplication features come in source-based and target-based varieties. In source-based solutions, deduplication takes place at the source (the backup client), and only the deduped data flows over the network to the backup device. In target-based solutions, deduplication takes place at the target, that is, on the backup appliance and not on the client.

• Choose a target-based solution for deduplication of data if you can't afford to spend host processing cycles on dedupe processes.

• Source-based data deduplication backup software is a good candidate for WAN backups, because far less backup data crosses the WAN than it would with a target-based data deduplication product.

• Nowadays, companies such as DataDomain have come up with software components (like DD Boost) that reside on the backup server, storage nodes or media servers. These components communicate with the DataDomain appliance to check whether each segment to be backed up is unique, and send only the unique data segments. This feature can reduce backup traffic by up to 80%.

• Always choose a backup application that is intelligent enough to identify which data is a good candidate for deduplication and compression, and which is not. Databases are generally not good candidates for deduplication, but file systems and virtual environments are excellent ones. In virtual environments the commonality factor is much higher: if we have deployed Windows-based virtual machines, for example, the base OS is common to all of them, so the commonality across their backups will be much higher.

• Data deduplication solutions use either fixed-length or variable-length deduplication. In fixed-length dedupe solutions, the data stream is broken down into fixed-size chunks, so inserting or deleting data in one block shifts the boundaries of all the blocks that follow it. Those blocks then look new, which can lead to the whole file being backed up again in the next backup cycle. In variable-length dedupe solutions, the chunk boundaries are adjusted to the content, so a change marks only the affected segment as unique, and only the changed segment is backed up. Therefore, variable-length solutions typically achieve considerably higher dedupe ratios than fixed-length solutions.

• Look at the granularity at which a solution performs deduplication. The smaller the chunks, the greater the chances of finding commonality.

• Hardware solutions perform either post-process or inline data deduplication. In post-process data deduplication, you need storage equal to the size of the data being backed up, as the deduplication process runs afterward. In inline dedupe solutions, the data is deduped as it arrives at the backup appliance. Hence, inline dedupe is generally the better option for backup solutions.

• As of today, most backup software vendors support multiplexing, that is, backup streams from different clients being backed up simultaneously. With multiplexing, data from various clients and heterogeneous platforms is interleaved as it is fed to the backup appliance, which lowers the chances of finding commonality. Set the multiplexing factor to 1, so that only one stream is backed up at a time.

• If you use the appliance as a NAS for CIFS or NFS shares, avoid the inline data deduplication process, since it may affect response times for applications or users accessing the shares. Schedule post-process deduplication cycles instead.
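The difference between fixed-length and variable-length chunking, and the idea of sending only unique segments, can be illustrated with a small Python sketch. This is a toy model and not any vendor's implementation: the function names, chunk sizes, window size and the choice of BLAKE2/SHA-256 as hashes are all assumptions made for illustration. It backs up a block of random data, inserts a single byte, and counts how many bytes a second backup would have to send under each chunking scheme.

```python
import hashlib
import random
from typing import Callable, List, Set

Chunker = Callable[[bytes], List[bytes]]


def fixed_chunks(data: bytes, size: int = 64) -> List[bytes]:
    """Fixed-length chunking: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                    min_size: int = 16, max_size: int = 256) -> List[bytes]:
    """Variable-length (content-defined) chunking, simplified.

    A boundary is placed wherever a hash of the last `window` bytes matches
    a bit pattern, so boundaries follow the content instead of byte offsets.
    Assumes min_size >= window so the window never reaches before index 0.
    """
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        # Hash recomputed at every position for clarity; a real product
        # would use a rolling hash here (illustrative assumption).
        tail = data[i - window + 1:i + 1]
        h = int.from_bytes(hashlib.blake2b(tail, digest_size=4).digest(), "big")
        if (h & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def backup(data: bytes, chunker: Chunker, index: Set[bytes]) -> int:
    """Return how many bytes are new, i.e. not already in the fingerprint index."""
    new_bytes = 0
    for chunk in chunker(data):
        fingerprint = hashlib.sha256(chunk).digest()
        if fingerprint not in index:
            index.add(fingerprint)
            new_bytes += len(chunk)
    return new_bytes


def second_backup_cost(chunker: Chunker, first: bytes, second: bytes) -> int:
    """Back up `first`, then report the new bytes needed to back up `second`."""
    index: Set[bytes] = set()
    backup(first, chunker, index)
    return backup(second, chunker, index)


# Hypothetical test data: 20,000 random bytes, then one byte inserted at offset 1000.
rng = random.Random(42)
ORIGINAL = bytes(rng.getrandbits(8) for _ in range(20000))
MODIFIED = ORIGINAL[:1000] + b"X" + ORIGINAL[1000:]

for name, chunker in (("fixed", fixed_chunks), ("variable", variable_chunks)):
    print(name, second_backup_cost(chunker, ORIGINAL, MODIFIED), "new bytes")
```

With fixed-size chunks, every chunk after the insertion point shifts by one byte and no longer matches the index, so almost the whole file is sent again. With content-defined boundaries, only the chunks around the inserted byte change. The `backup` function also mirrors the source-side check described above: a fingerprint is looked up first, and only segments that are not already known are counted as traffic.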

To wind up, every deduplication system has certain checks specified by the vendor. Perform these checks at regular intervals after the rollout to maintain data consistency.

Anuj Sharma

About the author: Anuj Sharma is an EMC Certified and NetApp accredited professional. Sharma has experience in handling implementation projects related to SAN, NAS and BURA. He also has to his credit several research papers published globally on SAN and BURA technologies.

This was first published in May 2010
