I recently responded to an emergency at a customer site. An important database had become corrupted, and the customer set out to restore it from backup tapes. The customer has a good backup policy, yet it took six days to get back online.
Naturally, this raises three questions. First, what caused the corruption? Second, why did it take so long to restore the database? And finally, what could be done to improve the restoration time in the future?
These three issues need to be addressed as part of any backup/restore strategy. The incident also reminded me of a point I've often made: backups are only as good as your ability to restore (see Testing Backup and Restore: Why It Matters).
I still believe that the problem is not backup but restoration, and system designers ought to be architecting for restoration of data, not backup of that data. So let's proceed to answer the three questions.
What Caused the Corruption?
Data can get corrupted in a number of ways: a problem in the database software, the file system, a device driver, or RAID or disk firmware could each be the cause.
On a couple of occasions, I have seen files corrupted by a Fibre Channel switch port and by a Fibre Channel cable. You would expect this to be caught, either by a higher-level protocol such as SCSI, because the command itself would be corrupted, or by the Fibre Channel CRC, which should force a retransmit, but on both occasions this did not happen as expected.
This has led me to try to better understand the issues surrounding the undetectable bit error rate (UDBER). An undetectable error occurs when two or more errors strike at the same time in a pattern that the error-detection code (such as Reed-Solomon encoding) does not pick up.
Here's a simple example from early in my career. The Cray-1A used something called SECDED (Single bit Error Correction, Double bit Error Detection). One time, out of the blue, programs started aborting randomly, but the operating system was still up and running. On the half hour, low-level diagnostics were run on each memory location, and lo and behold, the system was experiencing a triple bit error.
The system was never designed to correct or report triple bit errors. This was my first exposure to undetectable errors, but unfortunately not my last.
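The SECDED behavior described above can be reproduced with a toy code. The sketch below is an illustration only: it uses an 8-bit extended Hamming code protecting 4 data bits, which is an assumption chosen for brevity and not the Cray-1A's actual memory word layout. The function names encode, flip and decode are hypothetical.

```python
# Toy SECDED code: extended Hamming(8,4). Index 0 holds the overall parity
# bit; indices 1..7 form a Hamming(7,4) code with check bits at 1, 2 and 4.
# This is an illustrative assumption, not the Cray-1A's real word format.

def encode(nibble):
    """Encode 4 data bits (0..15) into an 8-bit SECDED codeword (list of bits)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Check bit p makes the parity of all positions containing bit p even.
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    code[0] = sum(code[1:]) % 2  # overall parity over positions 1..7
    return code

def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out

def decode(code):
    """Return (status, nibble) as a SECDED decoder would report them."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i          # XOR of the positions of all set bits
    odd_parity = sum(code) % 2     # 1 if an odd number of bits flipped
    fixed = list(code)
    if odd_parity:
        # The decoder assumes exactly one flipped bit and "repairs" it.
        fixed[syndrome] ^= 1       # syndrome 0 means the parity bit itself
        status = "corrected"
    elif syndrome:
        status = "double-error detected"
    else:
        status = "ok"
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble

word = encode(0b1011)
print(decode(flip(word, 5)))        # one flipped bit
print(decode(flip(word, 2, 6)))     # two flipped bits
print(decode(flip(word, 1, 2, 3)))  # three flipped bits
```

With one flipped bit the decoder returns the original data, and with two it reports a detected error. With three, it reports a successful correction while handing back the wrong data, because the decoder cannot tell a triple error from a single one.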
Good data on the UDBER for tapes is available on the Internet. LTO is documented at 10E-27, while Sun's T10000 is listed at 10E-33. Both of these are very small numbers, and the likelihood of getting an undetected error on either of these devices is low.
On the other hand, the bit error rate is 10E-12 for Fibre Channel, 10E-14 for SATA and 10E-15 for Fibre Channel disk drives. I do not know what the UDBER is for any of these devices, since it is not published, documented or even talked about in whispers. Believe me, I have asked until I am blue in the face and gotten nowhere.
It is always easy to blame software for corruption, and I am sure that more often than not it is a good place to start. But channels keep getting faster and faster, while error encoding for disk drives has not changed in a long time. Also, keep in mind that to calculate the UDBER you have to add up the contributions along the whole data path from the CPU to the device. So who knows what happened at this site, but corruptions do happen.
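To put the published bit error rates in perspective, here is a rough back-of-the-envelope sketch. It assumes that "10E-12" means one error per 10^12 bits and that terabytes are decimal. It uses the ordinary bit error rates quoted above, not the UDBER, which is not published. The function names and the two-hop path are hypothetical.

```python
BITS_PER_TB = 8 * 10**12  # decimal terabytes (assumption)

# Errors per bit, reading the article's "10E-12" as 1 error in 10^12 bits.
BIT_ERROR_RATE = {
    "Fibre Channel": 1e-12,
    "SATA": 1e-14,
    "Fibre Channel disk drive": 1e-15,
}

def expected_bit_errors(terabytes, rate):
    """Expected number of bit errors when moving `terabytes` at a per-bit rate."""
    return terabytes * BITS_PER_TB * rate

def path_error_rate(rates):
    """Approximate end-to-end rate: for tiny rates the per-hop rates simply add."""
    return sum(rates)

for name, rate in BIT_ERROR_RATE.items():
    print(f"{name}: {expected_bit_errors(6, rate):.3g} expected bit errors in 6 TB")

# Hypothetical path: one Fibre Channel link in front of one Fibre Channel drive.
path = path_error_rate([BIT_ERROR_RATE["Fibre Channel"],
                        BIT_ERROR_RATE["Fibre Channel disk drive"]])
print(f"path: {path:.4g} errors per bit")
```

Under these assumptions, moving 6 TB across a Fibre Channel link works out to roughly 48 expected bit errors, and the slowest-to-fail hop barely matters because the worst hop dominates the sum. The same addition applies to undetected errors once those rates are known.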
Why Did It Take So Long To Restore?
This was a large environment, and backups were being done via a backup client and server with the tape drives attached to the backup server. Since the clients were connected by at most GigE, the fastest transfer rate a client could achieve was about 60 MB/sec.
As you might remember from your LTO-3 specifications, the tape drive can run up to 80 MB/sec uncompressed, and with compression the drive can run almost twice as fast. At this site, with average compression, the tapes were being written at about 140 MB/sec on average. This is far faster than an uncontended GigE. To optimize backup performance, the site allowed multiple client streams to be combined and written to the same tape.
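The mismatch between client and drive speed can be worked out directly. This is a minimal sketch using only the rates quoted above; the function name is hypothetical, and it assumes every client delivers the full 60 MB/sec.

```python
import math

CLIENT_MB_S = 60        # about the best a GigE-attached client could deliver
TAPE_NATIVE_MB_S = 80   # LTO-3 uncompressed
TAPE_ACTUAL_MB_S = 140  # observed at this site with average compression

def streams_to_feed_drive(drive_mb_s, client_mb_s=CLIENT_MB_S):
    """Smallest number of client streams whose combined rate keeps the drive busy."""
    return math.ceil(drive_mb_s / client_mb_s)

print(streams_to_feed_drive(TAPE_NATIVE_MB_S))  # streams needed at the native rate
print(streams_to_feed_drive(TAPE_ACTUAL_MB_S))  # streams needed at the observed rate
```

At the observed 140 MB/sec it takes three full-speed clients to keep a single drive streaming, which is why combining streams looks so attractive when only the backup window is being measured.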
While this certainly improves backup performance, it does quite the opposite for restoration performance. Because each tape holds data combined from multiple machines, many more tapes have to be mounted to restore a given set of files. In fact, restoring just 6 TB at this site took more than 140 tape mounts, even though the 6 TB of data could have fit on a little more than 15 LTO-3 tapes.
The mount and position time alone for all of these extra tapes was more than three hours. You might be able to reduce this time slightly with a different tape technology, but this will not make a significant difference. Products such as Copan's MAID device and Imation's Ulysses (disk in a tape cartridge) would significantly improve this time.
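The tape counts above can be checked with a little arithmetic. This sketch assumes the LTO-3 native capacity of 400 GB (from the LTO-3 specification, not stated in the article) and treats "more than 140" mounts and "more than three hours" as exact figures, so the per-mount time is only a rough estimate. The function name is hypothetical.

```python
import math

LTO3_NATIVE_GB = 400   # LTO-3 native (uncompressed) capacity; not from the article

def min_tapes(data_tb, tape_gb=LTO3_NATIVE_GB):
    """Fewest tapes that could hold the data if it were written contiguously."""
    return math.ceil(data_tb * 1000 / tape_gb)

DATA_TB = 6
MOUNTS_NEEDED = 140    # "more than 140" at this site
EXTRA_MOUNT_HOURS = 3  # "more than three hours" for the extra tapes

ideal_mounts = min_tapes(DATA_TB)
extra_mounts = MOUNTS_NEEDED - ideal_mounts
seconds_per_extra_mount = EXTRA_MOUNT_HOURS * 3600 / extra_mounts

print(f"ideal mounts: {ideal_mounts}")
print(f"extra mounts caused by combined streams: {extra_mounts}")
print(f"average mount and position time: {seconds_per_extra_mount:.1f} s")
```

Under these assumptions the combined streams cost about 125 extra mounts at something like a minute and a half each, all of it time during which no data is being restored.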
All of this extra time does not help in meeting service level agreements for restoration, even though combining streams does improve backup time.
What Can Be Done
Part of understanding restoration is understanding why it is needed in the first place. From what I have seen, restoration is very important in the desktop environment because careless users delete files, so there is an almost constant demand for restoration. You aren't restoring a great deal of data, but you are doing it almost constantly.
On the other hand, if a mission critical database goes bye bye and you need to restore the whole thing, this becomes a critical event, and your business depends on how fast you can restore the database. So what could this customer site have done differently?
In my opinion, they were using a one-size-fits-all backup/restore policy that was not very good for the desktop environment and even worse for the mission-critical environment because they were trying to optimize the backup problem rather than the restoration problem.
I'm not trying to point fingers here, but this often happens when the staff explains the problem in terms of how fast backup can be done instead of how long it would take to restore the data. When they do explain the problem and feel the pain of restoration, it is most often for the desktop environment and not the mission-critical data, and unfortunately management budgets based on that type of restoration.
So what could the site have done? In the short term, not much, because the problem was an architectural problem, not something intrinsic to their procedures. They did not have to combine client tape streams, but without doing so the backups would not have completed within the backup window. The site had all kinds of options, but not with the architecture they had developed.
The point is not only that backup is often less important than restoration but, more important, that a single backup architecture is unlikely to solve the multitude of issues with backup and restore. You need to consider many things, such as:
I have said time and time again that working with storage is just plain hard: storage is not scaling with CPU performance, so we are having trouble keeping up with data generation. This trend is not going to change. If we do not want to get yelled at by management when things go bad (and they will, sooner or later), what is needed is a clear explanation to management and the user community of what can and cannot be done with the software and hardware you have, so they know what they are getting. If they want something different, tell them to send money and you can create a different architecture to meet the requirements. All too often I hear expectations that backup/restore for desktops should work exactly the same as backup/restore for mission-critical applications. This cannot happen without a great deal of work and money, because they are not the same.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 26 years of experience in high-performance computing and storage.
This column was first published on Enterprise Storage Forum.