Hadoop and Big Data are practically synonymous these days, but as the Big Data hype machine gears up, there's a lot of confusion about where Hadoop actually fits into the overall Big Data landscape.
Hadoop is an open-source software framework that stores and analyzes large data sets distributed across multiple off-the-shelf servers. Responsible for much of the heavy lifting involved in analyzing data from such varied sources as mobile phones, email, social media, sensor networks and pretty much anything that can offer up actionable data, Hadoop is often considered the operating system of Big Data.
And that's where the first myth creeps in: that Hadoop is a complete Big Data solution.
It's not. Whether you prefer to call it a "framework" or a "platform," just don't think Hadoop will solve all of your Big Data problems.
"There is no standard Hadoop stack," said Phil Simon, author of Too Big to Ignore: The Business Case for Big Data. "It's not like going to IBM or SAP to get a standard database."
However, Simon doesn't think that will be a long-term problem. First, since Hadoop is an open-source project, many other Hadoop-related projects, such as Cassandra and HBase, can address specific needs. HBase, for instance, offers a distributed database that supports structured data storage for large tables.
Moreover, just as Red Hat, IBM and plenty of other vendors packaged Linux into a variety of user-friendly products, Big Data startups are emerging to do the same with Hadoop.
So, while Hadoop isn't a complete solution in and of itself, most enterprises will actually encounter it as something packaged in larger Big Data suites.
Hadoop is often talked about as if it were a database, but it isn't. "There’s nothing in the core Hadoop platform like a query or an index," said Marshall Bockrath-Vandegrift, a software engineer with Damballa, a security company. Damballa uses Hadoop to analyze real-time security threats.
"We use HBase to give our threat analysts the ability to run real-time queries against passive DNS data. HBase and the other real-time technologies are not only complementary to Hadoop, but most depend on the core Hadoop distributed storage technology (HDFS) to provide performant access to distributed datasets," he added.
Or, as Prateek Gupta, a data scientist with marketing analytics firm BloomReach, said: "Hadoop is not a replacement for a database system, but you can use it to build one."
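To make the "build one" point concrete, here is a minimal sketch in plain Python of the map, shuffle and reduce pattern that Hadoop-style batch jobs follow, used to build a simple lookup index. It does not use any Hadoop API. The sample records are made up, and the names `map_record`, `reduce_domains` and `run_job` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Made-up sample records of (domain, resolved IP); not real data.
RECORDS = [
    ("example.com", "192.0.2.10"),
    ("example.org", "192.0.2.10"),
    ("example.net", "198.51.100.7"),
    ("example.com", "198.51.100.7"),
]


def map_record(record):
    """Map step: emit (key, value) pairs, here IP -> domain."""
    domain, ip = record
    yield ip, domain


def reduce_domains(ip, domains):
    """Reduce step: collapse every value seen for one key into one result."""
    return ip, sorted(set(domains))


def run_job(records, mapper, reducer):
    """Run a single-process imitation of a map/shuffle/reduce job."""
    # Shuffle: group the mapper output by key, the work a real
    # framework would do between the map and reduce phases.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# The "index" is something the job produces, not something built in.
index = run_job(RECORDS, map_record, reduce_domains)
print(index["192.0.2.10"])  # ['example.com', 'example.org']
```

The job itself has no notion of a query. It scans every record once and writes out a structure that can be looked up afterward, which is roughly the division of labor between batch processing and a serving layer such as HBase that the quotes above describe.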
Many organizations fear that Hadoop is too new and untested to be suited for the enterprise. Nothing could be further from the truth.
Remember, Hadoop was modeled on the Google File System (GFS) distributed storage platform and Google MapReduce, a data analytics tool running on top of GFS. Yahoo actually put the time and money behind Hadoop, and in 2008 launched its first major Hadoop application, a search "webmap," which indexed all known webpages and the corresponding metadata needed to search those pages.
Today, Hadoop is used by everyone from Netflix to Twitter to eBay, and major vendors including Microsoft, IBM and Oracle all sell Hadoop tools.
It's too early to call Hadoop a "mature" technology – which is the case with any Big Data platform – but it has been adopted and tested by major enterprises.
That doesn't mean it's a risk-free platform. Security, for instance, is a sticking point. But businesses shouldn't be scared off by Hadoop's youthful veneer.