For over 50 years, IT organizations have performed nightly backups of all their data to tape, chosen primarily for its low cost. Most organizations retain backups for 12 weeks onsite and for as long as seven years offsite. The average organization keeps 40 to 100 copies of its backup data, each from a different point in time, for regulatory compliance, SEC audits, legal discovery, and various business reasons. If disk had been used instead of tape, the cost of backup would have been 40 to 100 times the cost of the primary storage. Given the number of copies kept in backup, tape was the only economical approach.
Over the last 10 years, the majority of IT organizations have placed some disk in front of tape libraries. This is called “disk staging” and allows for faster and more reliable backups and restores, as the latest backups are kept on disk. However, due to the cost of disk, most organizations keep only one to two weeks of retention on disk and put longer-term retention on tape.
The challenge is the amount of backup data stored due to keeping multiple copies. However, the data from one backup to another is highly redundant. If you have 50TB of data to be backed up, only about 1TB, or 2%, changes from backup to backup. Instead of backing up 50TB over and over, why not back up the 50TB once and then only the changes from then on? This would drastically reduce the amount of disk required.
Data deduplication solves the challenge by storing only the unique bytes and not storing bytes or blocks that have already been stored. This approach can reduce the amount of disk required by a ratio of about 20:1. For example, if 50TB of data is kept for 20 weeks, one petabyte of storage would be required. However, if the 50TB were compressed 2:1 to 25TB and only the 2% change between backups were kept after that, you would store 25TB plus 19 copies at 1TB each, or about 44TB of data.
In this very simplistic example, the amount of storage required would be reduced to about 1/20th. By storing only unique bytes and blocks, data deduplication uses far less disk and brings the cost of disk to about the cost of tape. Disk is faster and more reliable for both backups and restores. With the advent of data deduplication, about 50% of IT organizations have already moved to disk and eliminated tape backup.
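The arithmetic behind the simplified example can be reproduced in a few lines. This is only a sketch of the article's own figures (50TB of data, 20 weekly copies, 2:1 compression, 2% change); the variable names are illustrative.

```python
# Reproduces the simplified retention example; all figures come from the text.
full_backup_tb = 50      # size of one full backup
weeks_retained = 20      # number of weekly copies kept
compression_ratio = 2    # 2:1 compression applied to the first full copy
change_rate = 0.02       # about 2% of the data changes between backups

# Without deduplication, every retained copy is a full copy.
raw_tb = full_backup_tb * weeks_retained

# With deduplication: one compressed full copy, then only the changes
# for each of the remaining backups.
first_copy_tb = full_backup_tb / compression_ratio
changes_tb = (weeks_retained - 1) * full_backup_tb * change_rate
dedup_tb = first_copy_tb + changes_tb

ratio = raw_tb / dedup_tb
print(f"Without deduplication: {raw_tb} TB")
print(f"With deduplication:    {dedup_tb:.0f} TB")
print(f"Reduction:             about {ratio:.1f}:1")
```

The exact result is 1,000TB versus 44TB, a little under 23:1, which is in line with the rounded "about 20:1" figure.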
As with all advancements, there is always a warning label. Data deduplication is a compute-intensive process as all the data has to be split and compared. Depending on how deduplication is implemented, backup and restore speed can be greatly impacted. In some cases, data is only stored as bytes and blocks and needs to be rehydrated for every restore request.
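To make the "split and compare" and "rehydrate" steps concrete, here is a minimal sketch of a toy deduplicating store. It assumes fixed-size 4KB chunks and SHA-256 fingerprints held in memory; the class and method names are hypothetical, and commercial products use their own chunking and indexing schemes.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


class DedupStore:
    """Toy in-memory store that keeps each unique chunk exactly once."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes
        self.backups = {}  # backup name -> ordered list of fingerprints

    def write_backup(self, name, data):
        recipe = []
        # Split step: cut the backup stream into fixed-size chunks.
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Compare step: store the chunk only if it has not been seen before.
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        self.backups[name] = recipe

    def restore(self, name):
        # Rehydration: reassemble the full backup from its stored chunks.
        return b"".join(self.chunks[fp] for fp in self.backups[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


# Build a first backup of 100 distinct chunks, then a second backup
# in which only 2 of the 100 chunks have changed.
monday = b"".join(i.to_bytes(4, "big") * 1024 for i in range(100))
changed = bytearray(monday)
changed[10 * CHUNK_SIZE:11 * CHUNK_SIZE] = b"\xff" * CHUNK_SIZE
changed[50 * CHUNK_SIZE:51 * CHUNK_SIZE] = b"\xee" * CHUNK_SIZE
tuesday = bytes(changed)

store = DedupStore()
store.write_backup("monday", monday)
store.write_backup("tuesday", tuesday)

logical_bytes = len(monday) + len(tuesday)
print(f"Logical data written: {logical_bytes} bytes")
print(f"Physically stored:    {store.stored_bytes()} bytes")
```

Every chunk must be hashed and looked up on the way in, and every restore has to walk the fingerprint list and reassemble the data, which is where the compute cost and the rehydration delay come from.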
There are three major areas of impact.
The first is backup performance. One approach is inline deduplication, where the data is deduplicated on the way to the disk. This approach can slow down backups, as it is a very compute-intensive and cumbersome process. The alternative is to write directly to disk and perform deduplication after the data is committed to disk. This allows for the fastest backup performance.
The second area of impact is the storage architecture that the deduplication is deployed on. A scale-up architecture has a front-end controller with disk shelves. As data grows, disk shelves need to be added. Using this approach, the backup window will invariably get longer as data grows. The more data there is, the longer it takes to deduplicate, as no additional processor, memory, or bandwidth is added. The backup window will ultimately grow to a point where the controller has to be replaced with a bigger, faster controller, which adds cost.
The alternative is a scale-out architecture where appliances are added into a grid. As data grows, the backup window stays fixed in length as each appliance comes with processor, memory, and bandwidth as well as disk. Therefore, both compute and capacity resources are added, allowing for additional deduplication resources, resulting in a fixed-length backup window.
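The difference between the two architectures can be illustrated with a deliberately simple model in which the backup window is the data volume divided by the available deduplication throughput. All throughput and capacity numbers below are assumptions chosen for illustration, not vendor figures, and the function names are hypothetical.

```python
import math

# Illustrative assumptions only; real systems vary widely.
CONTROLLER_TB_PER_HOUR = 5.0   # fixed rate of a single scale-up controller
APPLIANCE_TB_PER_HOUR = 5.0    # rate contributed by each scale-out appliance
APPLIANCE_CAPACITY_TB = 50.0   # backup data one appliance is sized to handle


def scale_up_window_hours(data_tb):
    # Only disk shelves are added, so throughput stays constant as data grows.
    return data_tb / CONTROLLER_TB_PER_HOUR


def scale_out_window_hours(data_tb):
    # Each added appliance brings its own processor, memory, and bandwidth,
    # so total throughput grows along with capacity.
    appliances = math.ceil(data_tb / APPLIANCE_CAPACITY_TB)
    return data_tb / (appliances * APPLIANCE_TB_PER_HOUR)


for data_tb in (50, 100, 200, 400):
    print(f"{data_tb:>4} TB  "
          f"scale-up: {scale_up_window_hours(data_tb):5.1f} h  "
          f"scale-out: {scale_out_window_hours(data_tb):5.1f} h")
```

Under these assumptions the scale-up window grows in proportion to the data, while the scale-out window stays at the length set by a single appliance. The model ignores network limits, deduplication ratios, and everything else that affects real backup windows.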
The third area of impact revolves around restores, VM boots, and offsite tape copies. With an inline approach, the stored data is 100% deduplicated, and for each restore request the data needs to be rehydrated, which takes time. If the alternate approach is used, writing directly to disk and then deduplicating the data, the most recent backups are kept in their complete, undeduplicated form and are ready for fast restores, instant VM boots, and fast tape copies. The difference in restore time, boot time, and copy time between the two approaches can be a matter of minutes versus hours.
With data deduplication, architecture matters.
You cannot simply buy disk with deduplication. Understanding the different architectural approaches, and how each affects backup performance, backup window length as data grows, and the speed of restores, VM boots, and tape copies, will save you a lot of work over time.