Data deduplication technology is especially important in virtual environments, which generate data very quickly. It allows you to back up only the data that has changed: a file is divided into segments, and when a change is made, only the new or modified segments are sent for backup; segments already recorded in the deduplication database are not resent, reducing the amount of data sent from the client to the media server. However, an organization needs to consider some important parameters before deciding to adopt data deduplication technology:
- Although it can be implemented in any storage environment, an organization with long data retention periods accumulates more duplicated data, so data deduplication technology yields greater benefits there.
- Data deduplication can be done at the source, on the media server or at the target. Deduplicating the data at the source means only new segments are sent over the network, reducing bandwidth usage and allowing the data to be encrypted before transmission. Source deduplication uses the central processing unit (CPU) of the client, so it works well where significant CPU capacity is available. However, if you are an information technology organization backing up data for another department, you might consider target or media server deduplication instead.
- The nature of the data should also be considered, as some data deduplicates better than others. For instance, a set of Word files will have a high dedupe ratio, whereas databases deduplicate poorly because updating even one record changes the file's segments. Hence, it should be decided whether to back up the data as it is or to delete some of it at the source: temporary data can be deleted and permanent data backed up.
- Another consideration is the availability of deduplication-friendly hardware, such as a virtual tape library, which makes data deduplication technology easier to deploy.
- Bandwidth is another important factor, as it is expensive. Hence, it is important to determine where you want to implement data deduplication technology (front end or back end), which data is suitable for deduplication, and other such issues. The duplication rate in primary data is not as high as in secondary data. In a traditional environment you are doing full backups every 12 weeks or so, so implementing deduplication at the front end becomes a little harder to justify, since the frequency of duplication is relatively low.
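The segment-based mechanism described above can be sketched in a few lines. This is a minimal illustration, not NetBackup's or PureDisk's actual implementation: it assumes fixed-size segments (real products often use variable-size chunking) and uses a hypothetical in-memory set as the deduplication database. Only segments whose hash has not been seen before are "sent" to the media server.

```python
import hashlib

SEGMENT_SIZE = 4096  # hypothetical fixed segment size in bytes


def backup(data: bytes, dedupe_db: set) -> tuple[int, int]:
    """Split data into fixed-size segments and 'send' only segments
    whose SHA-256 hash is not already in the deduplication database.
    Returns (total_segments, segments_sent)."""
    total = sent = 0
    for i in range(0, len(data), SEGMENT_SIZE):
        segment = data[i:i + SEGMENT_SIZE]
        digest = hashlib.sha256(segment).hexdigest()
        total += 1
        if digest not in dedupe_db:
            dedupe_db.add(digest)
            sent += 1  # new segment: would be transferred to the media server
    return total, sent


db = set()
# First full backup: eight distinct segments, all must be sent.
original = b"".join(bytes([i]) * SEGMENT_SIZE for i in range(8))
total, sent = backup(original, db)

# Second backup after modifying one segment: only that segment is sent.
changed = b"Z" * SEGMENT_SIZE + original[SEGMENT_SIZE:]
total2, sent2 = backup(changed, db)
```

The ratio of total segments to segments actually sent is the dedupe ratio discussed above: repetitive office documents score high, while a database file whose segments shift on every update scores low.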
About the author: Oussama El-Hilali is the vice president for NetBackup Engineering at Symantec. He is responsible for planning, development and release of Symantec’s NetBackup range, including PureDisk and RealTime Protection. He has also led the Product Management team for NetBackup.
(As told to Jasmine Desai)
This was first published in December 2010