It seems like just about every day brings with it a new cloud storage product announcement from vendors big and small, but the reality is that beyond enterprise firewalls, cloud storage’s potential is limited.
There are two reasons for this: bandwidth limitations and the data integrity issues posed by the commodity drives that are typically used in cloud services. Together those two issues will limit what enterprise data storage users can do with external clouds.
The ideal for cloud storage is to be self-managed, self-balanced and self-replicated, with regular data checksums to account for the undetectable or mis-corrected error rates of various storage technologies. Cloud storage depends on being able have multiples copies of files managed and checksummed and verified regularly, distributed across the storage cloud for safekeeping.
It’s a great idea, but it faces more than a few challenges, such as reliability, security, data integrity, power, and replication time and costs. But one of the biggest issues is simply that hardware is going to break. There are two ways disk and tape drives break:
- Hitting the hard error rate of the media, which is expressed in average number of bits before an error occurs
- Hitting the Annualized Failure Rate (AFR) of a device based on the number of hours used
The most common type of failure is known as the vendor’s bit error rate. The bit error rate is the expected failures per number of bits moved. The following is generally what is published by vendors:
These seem like good values, but it is important to note that they haven’t improved much in the last 10 years, maybe by an order of magnitude, while densities have soared and performance has increased moderately. This will begin to cause problems as the gaps get worse in the future (see RAID’s Days May Be Numbered). So using vendors’ best-case number, how many errors will we see from moving data around, which is needed for replication in clouds?
Clearly, moving 100PB on even 1TB enterprise drives can potentially cause significant loss of data, especially as many clouds I am familiar with do not use RAID and maintain data protection via mirroring. Remember, this is a perfect world and does not include channel failures, memory corruptions and all the other types of hardware failures and silent corruption. What happens if the world is not perfect and failure rates are an order of magnitude worse?
With current technology, you could lose 900TB of data, which is not trivial and would take some time to replicate.
Bandwidth Limits Replication
Now let’s look at the time to replication with various Internet connection speeds and data volumes.
Clearly, no one has an OC-768 connection, nor are they going to get one anytime soon, and very few have 100PB of data to replicate into a cloud, but the point is that data densities are growing faster than network speeds. There are already people talking about 100PB archives, but they don’t talk about OC-384 networks. It would take 10 months to replicate 100PB with OC-384 in the event of a disaster, and who can afford OC-384? That’s why, at least for the biggest enterprise storage environments, a centralized disaster recovery site that you can move operations to until everything is restored will be a requirement for the foreseeable future.
Consumer Problem Looming?
The bandwidth problem isn’t limited to enterprises. In the next 12 to 24 months, most of us will have 10Gbit/sec network connections at work (see Falling 10GbE Prices Spell Doom for Fibre Channel), while at home the fastest connect available as the current backbone of the Internet is OC-768, and each of us internally is going to have a connection that is 6.5 percent of OC-768. That will be limited, of course, by our DSL and cable connections, but their performance is going to grow and use up the backbone bandwidth. This is pretty shocking when you consider how much data we create and how long it takes to move it around. I have a home Internet backup service and about 1TB of data at home. It took me about three months to get all of the data copied off site via my cable connection, which was the bottleneck. If I had a crash before the off-site copy was created, I would have lost data.
Efforts like Internet2 may help ease the problem, but I worry that we are creating data faster than we can move it around. The issue becomes critical when data is lost — through human error, natural disaster or something more sinister — and must be re-replicated. You’d have all this data in a cloud, with two copies in two different places, and if one copy goes poof for whatever reason, you’d need to restore it from the other copy or copies, and as you can see that is going to take a very long time. During that time, you might only have one copy, and given hard error rates, you’re at risk. You could have more than two copies — and that’s probably a good idea with mission-critical data — but it starts getting very costly.
Google, Yahoo and other search engines for the most part use the cloud method for their data, but what about all the archival sites that already have 10, 20, 40 or more petabytes of storage data that is not used very often? There are already a lot of these sites, whether it is medical data, medical images or genetic data, or sites that have large images such as climate sites or the seismic data that is used for oil and gas exploration, and, of course, all the Sarbanes-Oxley data that is required to be kept in corporate America. Does it make sense to have all of this data online? Probably not, and the size of the data and the cost of power will likely be the overriding issues.
Henry Newman, CTO of Instrumental Inc. and a regular Enterprise Storage Forum contributor, is an industry consultant with 28 years experience in high-performance computing and storage.