You can’t dig into Big Data storage without first discussing Big Data in general. Big Data is a concept that any IT professional or knowledge worker understands almost by instinct, as the trend has been covered so extensively.
Data has been growing exponentially in recent years, yet much of it is locked in application and database silos. If you could drill into all of that data, if you could share it, if you could cross-pollinate, say, a CRM system with information from your marketing analytics tools, your organization would benefit. Easier said than done.
That, essentially, is the Big Data challenge.
Arguably, the concept of Big Data entered the public imagination with the publication of Michael Lewis’ Moneyball in 2003. Of course, the term “Big Data” is nowhere to be found in the book, but that’s what the book was about – finding hidden patterns and insights within the reams of data collected during each and every major league baseball game.
One statistic that has been buried – well, buried isn’t right; ignored is more accurate – was about drafting college players over high school players. College players have a track record. They have statistics that can be measured, and they played against at least a half-decent level of competition:
"[Bill James] looked into the history of the draft and discovered that “college players are a better investment than high school players by a huge, huge, laughably huge margin.” The conventional wisdom of baseball insiders – that high school players were more likely to become superstars – was also demonstrably false. What James couldn’t understand was why baseball teams refused to acknowledge that fact."
Pushing past gathering raw information and onto challenging preconceptions is at the heart of Big Data. So, too, is discovering truths that no one would have ever suspected before.
However, in order to gain these new insights and to challenge our misconceptions, we must find ways to access all of that data, hidden away in all of those proprietary applications and databases.
That’s not just a Big Data problem. It’s also a management problem, and it’s most certainly a storage problem.
Just how much data is out there? No one knows for sure, of course, but IBM’s Big Data estimates conclude that “each day we create 2.5 quintillion bytes of data.” The exponential growth of data means that 90 percent of the data that exists in the world today has been created in the last two years. “This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, e-commerce transaction records, and cell phone GPS coordinates, to name a few.”
To put the data explosion in context, consider this. Every minute of every day we create:
• More than 204 million email messages
• Over 2 million Google search queries
• 48 hours of new YouTube videos
• 684,000 pieces of content shared on Facebook
• More than 100,000 tweets
• 3,600 new photos shared on Instagram
• Nearly 350 new WordPress blog posts
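A quick back-of-the-envelope conversion puts IBM’s “2.5 quintillion bytes per day” figure into more familiar units (the figures below are derived arithmetic, not additional statistics from the source):

```python
# Rough scale check on IBM's "2.5 quintillion bytes per day" figure.
BYTES_PER_DAY = 2.5e18   # 2.5 quintillion bytes
EXABYTE = 1e18
PETABYTE = 1e15

per_day_eb = BYTES_PER_DAY / EXABYTE                  # exabytes per day
per_minute_pb = BYTES_PER_DAY / (24 * 60) / PETABYTE  # petabytes per minute

print(f"{per_day_eb} EB per day, or about {per_minute_pb:.1f} PB every minute")
```

In other words, roughly 2.5 exabytes a day works out to well over a petabyte of new data every single minute.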
This volume of data could not be collected and stored were it not for the fact that data storage is so incredibly cheap. Today, everything from tablets to desktops is sold with ever-bigger hard drives. Why would you bother deleting anything when it’s so cheap and easy to store it?
Between 2000 and today, the cost of storage has plummeted from about $9/GB to a mere $0.08/GB, and by the time you read this, you can bet that downward price pressure has already made even those numbers obsolete.
If you are a highly paid knowledge worker, it’s probably cheaper to store data than delete it, since the productivity lost while purging old files may well cost your organization more than the storage costs -- unless you have to find something lost in this data maze for, say, regulatory compliance.
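To make the price drop concrete, here is the same comparison worked out per terabyte, using only the $9/GB and $0.08/GB figures cited above:

```python
# Back-of-the-envelope: what a terabyte of storage cost in 2000 vs. today,
# based on the $9/GB and $0.08/GB figures cited in the text.
GB_PER_TB = 1000

cost_2000_per_tb = 9.00 * GB_PER_TB   # dollars per TB circa 2000
cost_now_per_tb = 0.08 * GB_PER_TB    # dollars per TB today

print(f"2000: ${cost_2000_per_tb:,.0f}/TB, today: ${cost_now_per_tb:,.0f}/TB, "
      f"roughly {cost_2000_per_tb / cost_now_per_tb:.0f}x cheaper")
```

A terabyte that once cost thousands of dollars to keep now costs less than a nice dinner, which is why “store everything” became the default.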
Data is collected from everywhere, but where is it stored? That’s the crux of the problem. It’s stored everywhere, as well. Typically, these data repositories – “data silos”– are application specific.
Big Data storage, then, is as much about managing data as about storing it.
This is not a new problem. Way back in prehistoric times when the only way we humans stored information was inside our heads, you could say that each person was a data silo. We broke those silos down through language, conversation, storytelling, the oral tradition and eventually books.
In Big Data storage management, we’re encountering a problem we’ve dealt with many times before.
We haven’t yet figured out a workable Dewey Decimal system for data. We’re moving in the right direction, with such tools as hyperlinks and wikis. But most data in enterprise applications, email servers and social networks is not structured for easy sharing with other applications.
1. Unstructured data. There are two types of data in storage: structured and unstructured. Structured data has a high degree of organization and is typically stored in a relational database that can be easily searched.
Unstructured data is, obviously, not organized in any meaningful way; it includes such things as photographs, videos and MP3 files. Unstructured data is difficult to search and analyze.
2. I/O barriers. If you’re dealing with something like mapping genomes, gathering information from the Mars Rover or running sophisticated weather simulations, the transaction volumes of these data sets challenge traditional storage systems, which don’t have enough processing power to keep up with the huge number of I/O requests.
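The structured-versus-unstructured distinction above can be sketched in a few lines of code. This is an illustrative example (the table, customer names and email text are invented for the sketch, not taken from the text): structured data answers a precise question with one query, while unstructured data can only be scanned for keywords.

```python
# Structured vs. unstructured data, in miniature.
import sqlite3

# Structured: rows conform to a schema, so SQL can answer a precise question.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (name TEXT, region TEXT, revenue REAL)")
db.executemany("INSERT INTO customers VALUES (?, ?, ?)",
               [("Acme", "West", 120.0), ("Globex", "East", 75.0)])
west_total = db.execute(
    "SELECT SUM(revenue) FROM customers WHERE region = 'West'"
).fetchone()[0]

# Unstructured: free text has no schema; without extra processing, the best
# we can do is a crude keyword scan across every item.
emails = ["Acme renewed their contract last week",
          "Lunch order for the team meeting"]
acme_mentions = [msg for msg in emails if "Acme" in msg]

print(west_total, len(acme_mentions))
```

The SQL query returns an exact answer because the schema gives every value a meaning; the email scan can only tell you which messages contain a string, which is why unstructured data is so much harder to search and analyze.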