You can’t dig into Big Data storage without first discussing Big Data in general. Big Data is a concept that any IT professional or knowledge worker understands almost by instinct, as the trend has been covered so extensively.
Data has been growing exponentially in recent years, yet much of it is locked in application and database siloes. If you could drill into all of that data, if you could share it, if you could cross-pollinate, say, a CRM system with information from your marketing analytics tools, your organization would benefit. Easier said than done.
That, essentially, is the Big Data challenge.
Arguably, the concept of Big Data entered the public imagination with the publication of Michael Lewis’ Moneyball in 2003. Of course, the term “Big Data” is nowhere to be found in the book, but that’s what the book was about – finding hidden patterns and insights within the reams of data collected during each and every major league baseball game.
One statistic that had been buried – well, buried isn’t right; ignored is more accurate – concerned drafting college players over high school players. College players have a track record. They have statistics that can be measured, and they played against at least a half-decent level of competition:
“[Bill James] looked into the history of the draft and discovered that “college players are a better investment than high school players by a huge, huge, laughably huge margin.” The conventional wisdom of baseball insiders – that high school players were more likely to become superstars – was also demonstrably false. What James couldn’t understand was why baseball teams refused to acknowledge that fact.”
Pushing past gathering raw information and onto challenging preconceptions is at the heart of Big Data. So, too, is discovering truths that no one would have ever suspected before.
However, in order to gain these new insights and to challenge our misconceptions, we must find ways to access all of that data, hidden away in all of those proprietary applications and databases.
That’s not just a Big Data problem. It’s also a management problem, and it’s most certainly a storage problem.
Exponential data growth and cheap storage – the scale of the problem
Just how much data is out there? No one knows for sure, of course, but IBM’s Big Data estimates conclude that “each day we create 2.5 quintillion bytes of data.” The exponential growth of data means that 90 percent of the data that exists in the world today has been created in the last two years. “This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, e-commerce transaction records, and cell phone GPS coordinates, to name a few.”
To put the data explosion in context, consider this. Every minute of every day we create:
• More than 204 million email messages
• Over 2 million Google search queries
• 48 hours of new YouTube videos
• 684,000 bits of content shared on Facebook
• More than 100,000 tweets
• 3,600 new photos shared on Instagram
• Nearly 350 new WordPress blog posts
This volume of data could not be saved, collected and stored were it not for the fact that data storage is so incredibly cheap. Today, everything from tablets to desktops is sold with ever bigger hard drives. Why would you bother deleting anything when it’s so cheap and easy to store it?
Between 2000 and today, the cost of storage has plummeted from about $9/GB to a mere $0.08/GB – and you can bet that continued downward price pressure has already made even that figure obsolete.
If you are a highly paid knowledge worker, it’s probably cheaper to store data than delete it, since the productivity lost while purging old files may well cost your organization more than the storage costs — unless you have to find something lost in this data maze for, say, regulatory compliance.
If data growth is exploding, where do you store it all?
Data is collected from everywhere, but where is it stored? That’s the crux of the problem. It’s stored everywhere, as well. Typically, these data repositories – “data silos”– are application specific.
Big Data storage, then, is as much about managing data as about storing it.
This is not a new problem. Way back in prehistoric times when the only way we humans stored information was inside our heads, you could say that each person was a data silo. We broke those silos down through language, conversation, storytelling, the oral tradition and eventually books.
In Big Data storage management, we’re encountering a problem we’ve dealt with many times before.
We haven’t yet figured out a workable Dewey Decimal system for data. We’re moving in the right direction, with such tools as hyperlinks and wikis. But most data in enterprise applications, email servers and social networks is not structured for easy sharing with other applications.
5 storage barriers that impede Big Data progress
1. Unstructured data. There are two broad types of data in storage: structured and unstructured. Structured data has a high degree of organization and is typically stored in a relational database that can be easily searched.
Unstructured data – photographs, videos, MP3 files and the like – lacks any such predefined organization, which makes it difficult to search and analyze.
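To make the distinction concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table name, columns and sample records are hypothetical, invented purely for illustration: a precise SQL query against structured rows returns an exact answer, while “searching” an unstructured block of text amounts to scanning it for patterns.

```python
import sqlite3

# Structured data: rows with a fixed schema, queryable with SQL.
# (The "customers" table and its contents are made-up sample data.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Acme", "West", 120000.0), ("Globex", "East", 45000.0)],
)

# A precise question gets a precise answer.
west = conn.execute(
    "SELECT name FROM customers WHERE region = 'West'"
).fetchall()

# Unstructured data: free text with no schema. "Searching" it means
# scanning the raw bytes and guessing at patterns.
support_email = "Hi, this is Jane from Acme out west -- our invoice looks wrong."
mentions_acme = "acme" in support_email.lower()
```

The query knows exactly which field means “region”; the text scan can only tell us that the word “acme” appears somewhere, with no notion of who sent the message or what it concerns – which is precisely why unstructured data is hard to analyze at scale.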
2. I/O barriers. If you’re dealing with something like mapping genomes, gathering information from the Mars Rover or running sophisticated weather simulations, the transaction volumes of these data sets challenge traditional storage systems, which don’t have enough processing power to keep up with the huge number of I/O requests.
3. Management. There are a million and one storage management tools out there, from the most basic – believe it or not, a simple Excel spreadsheet is still in wide use, even in business – to solid storage management solutions from vendors such as EMC, Hitachi Data Systems and NetApp. The trouble, though, is that data-sharing standards are still lacking, and escaping vendor lock-in is a never-ending challenge.
4. The WAN. As cloud computing becomes mainstream, the simplest way to break down data silos is to leverage the cloud to help with everything from search to backups to raw processing. However, the more storage moves into the cloud, the more the WAN will impede Big Data progress. The WAN, unfortunately, isn’t keeping up with Moore’s Law, nor with its storage-specific analog, Kryder’s Law. Any Big Data storage solution must include some combination of redundant MPLS links, WAN optimization and CDN services.
5. Security. As you break down data barriers, certain people may get access to data (say HR records) that they should never, ever see. Thus, authentication, access and security in general are a major Achilles heel of Big Data storage.
Hadoop helps tame data
Since traditional relational (SQL) databases weren’t designed with Big Data in mind, a Big Data alternative eventually emerged: Apache Hadoop.
According to the Apache Software Foundation, Hadoop is a “framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
“Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
Key modules to consider, as far as Big Data storage is concerned, include HDFS, a distributed file system to access application data; the Hive data warehouse infrastructure; and Chukwa, a data collection system for managing large distributed systems.
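The “simple programming models” the Apache description refers to are, most famously, MapReduce: map each input record to key/value pairs, shuffle the pairs by key, then reduce each group to a result. The sketch below simulates those three phases in plain, single-process Python – a toy illustration of the model, not real Hadoop, which distributes each phase across a cluster and reads its input from HDFS.

```python
from collections import defaultdict

# Toy simulation of the MapReduce model Hadoop popularized:
# map -> shuffle (group by key) -> reduce. The classic example
# is counting word occurrences across many lines of text.

def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum each word's list of counts into a total."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big storage", "storage is cheap"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

Because the map and reduce functions are independent per record and per key, Hadoop can run thousands of copies of them in parallel, one per block of data, which is what lets the same simple model scale from a laptop to a cluster.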
The Future of Big Data storage: the cloud
Big Data storage is quickly becoming a subset of cloud storage. As data centers are virtualized, and as more data is moved into third-party data centers, Big Data and cloud storage challenges (and opportunities) will begin to merge.
Granted, not all applications will move off-site, nor will every single application be one open to Big Data sharing. However, as security and access rights solutions strengthen, don’t be surprised if nearly every application under the sun is able to share data with nearly every other one – in an ideal world. Of course, standards fights will brew, vendors will do their best to lock customers into their solutions and problems like data loss and IP theft will undermine this ideal world, but it will be feasible from a technical standpoint.
Most enterprises will get their cloud storage feet wet with data backups. Eventually, they will use APIs to connect their on-premise data repositories with cloud and SaaS services, such as Salesforce.com, and cloud storage will evolve into Big Data storage. Further out, as most infrastructure moves into the cloud, various cloud providers will offer an array of Big Data storage options as a service.
Along the way, flash memory and SSDs (solid-state drives) may make spinning disk drives obsolete; in-memory storage could break out of its Java purgatory, and biologists working with the human genome may well provide storage insights derived from DNA and gene sequencing.
Top Big Data storage vendors
Note: This is by no means an exhaustive list, and placing vendors in one category or another is largely subjective.
EMC (Key Big Data acquisitions: Greenplum and Isilon)
IBM (Key acquisitions: Cognos, Netezza, OpenPages, Algorithmics, Texas Memory Systems)
Oracle (Key acquisition: Endeca)
HP (Key acquisitions: Autonomy and Vertica)
Cisco (Key acquisition: Truviso)
Dell (Key acquisition: Compellent)