Hadoop and Big Data are practically synonymous these days, but as the Big Data hype machine gears up, there’s a lot of confusion about where Hadoop actually fits into the overall Big Data landscape.
Hadoop is an open-source software framework that stores and analyzes large data sets distributed across multiple off-the-shelf servers. Responsible for much of the heavy lifting involved in analyzing data from such varied sources as mobile phones, email, social media, sensor networks and pretty much anything that can offer up actionable data, Hadoop is often considered the operating system of Big Data.
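The "heavy lifting" Hadoop does boils down to a simple idea: split a huge dataset across many machines, let each machine summarize its own slice, then merge the small summaries into one answer. A toy sketch of that divide-and-combine pattern, in plain Python with no Hadoop involved (the function names and sample data are illustrative only):

```python
def partition(data, n_nodes):
    """Split the dataset into n_nodes slices, one per (hypothetical) server."""
    return [data[i::n_nodes] for i in range(n_nodes)]

def local_summary(slice_):
    """Each node computes a small summary of its own slice -- here, count and sum."""
    return {"count": len(slice_), "sum": sum(slice_)}

def merge(summaries):
    """Combine the tiny per-node summaries into one global result."""
    count = sum(s["count"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    return {"count": count, "mean": total / count}

readings = [3, 1, 4, 1, 5, 9, 2, 6]  # e.g. a batch of sensor values
result = merge([local_summary(s) for s in partition(readings, 3)])
print(result)  # {'count': 8, 'mean': 3.875}
```

The point of the pattern is that only the small summaries travel over the network; the bulk data stays put on the machines that store it, which is what lets Hadoop scale across racks of off-the-shelf servers.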
And that’s where the first myth creeps in:
1. Hadoop is a complete solution.
It’s not. Whether you prefer to call it a “framework” or a “platform,” just don’t think Hadoop will solve all of your Big Data problems.
“There is no standard Hadoop stack,” said Phil Simon, author of Too Big to Ignore: The Business Case for Big Data. “It’s not like going to IBM or SAP to get a standard database.”
However, Simon doesn’t think that will be a long-term problem. For one thing, because Hadoop is an open-source project, a constellation of related projects, such as Cassandra and HBase, has grown up to address specific needs. HBase, for instance, offers a distributed database that supports structured data storage for large tables.
Moreover, just as Red Hat, IBM and plenty of others vendors packaged Linux into a variety of user-friendly products, Big Data startups are emerging to do the same with Hadoop.
So, while Hadoop isn’t a complete solution in and of itself, most enterprises will actually encounter it as something packaged in larger Big Data suites.
2. Hadoop is a database.
Hadoop is often talked about like it’s a database, but it isn’t. “There’s nothing in the core Hadoop platform like a query or an index,” said Marshall Bockrath-Vandegrift, a software engineer with Damballa, a security company. Damballa uses Hadoop to analyze real-time security threats.
“We use HBase to give our threat analysts the ability to run real-time queries against passive DNS data. HBase and the other real-time technologies are not only complementary to Hadoop, but most depend on the core Hadoop distributed storage technology (HDFS) to provide performant access to distributed datasets,” he added.
Or, as Prateek Gupta, a data scientist with marketing analytics firm BloomReach, put it: “Hadoop is not a replacement for a database system, but you can use it to build one.”
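The distinction Bockrath-Vandegrift and Gupta are drawing is easy to show in miniature. Core Hadoop gives you distributed files: answering a point query means scanning everything, batch-style. A database layer such as HBase earns its keep by maintaining the index that turns a point query into a single lookup. A hypothetical sketch in plain Python (the record data and function names are invented for illustration):

```python
# Hadoop-style storage is "just files": a flat list of records.
records = [
    ("alice", "10.0.0.1"),
    ("bob", "10.0.0.2"),
    ("carol", "10.0.0.3"),
]

def full_scan(key):
    """Without an index, every query reads the whole dataset (batch-style)."""
    return [v for k, v in records if k == key]

# What a database layer adds: an index it maintains for you.
index = {k: v for k, v in records}

def indexed_lookup(key):
    """With an index, a point query is a single O(1) lookup."""
    return index.get(key)

print(full_scan("bob"))       # ['10.0.0.2'] -- touches every record
print(indexed_lookup("bob"))  # '10.0.0.2'  -- one hash lookup
```

On three records the difference is invisible; across petabytes on HDFS it is the difference between a batch job and an interactive query, which is exactly the gap HBase fills.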
3. Hadoop is too risky for enterprise use.
Many organizations fear that Hadoop is too new and untested for enterprise use. Nothing could be further from the truth.
Remember, Hadoop was modeled on Google’s published designs for the Google File System (GFS), a distributed storage platform, and Google MapReduce, a data analytics framework that runs on top of GFS. Yahoo put the time and money behind Hadoop, and in 2008 launched its first major Hadoop application: a search “webmap” that indexed all known webpages and the corresponding metadata needed to search those pages.
Today, Hadoop is used by everyone from Netflix to Twitter to eBay, and major vendors including Microsoft, IBM and Oracle all sell Hadoop tools.
It’s too early to call Hadoop, or any Big Data platform, a “mature” technology, but it has been adopted and tested by major enterprises.
That doesn’t mean it’s a risk-free platform. Security, for instance, is still a sticking point, but businesses shouldn’t be scared off by Hadoop’s youthful veneer.
4. We’ll need to hire a bunch of programmers to use Hadoop.
Depending on what you plan to do, this myth may prove true. If you plan to build the next great Hadoop-based Big Data suite, you’ll need programmers who can write Java and understand specialized MapReduce programming.
However, if you’re content to build on the work of others, programming shouldn’t scare you off. Data integration vendor Syncsort recommends leaning on Hadoop-compatible data integration tools that let analysts run advanced queries without doing any coding.
Most data integration tools will have GUIs that abstract MapReduce programming complexity, and many come with pre-built templates.
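Even without a GUI, the "specialized MapReduce programming" bar is lower than it sounds. Hadoop’s Streaming interface lets you write the map and reduce steps as ordinary scripts that read and write tab-separated lines, rather than Java classes. A minimal sketch of that line protocol in Python (a real streaming job would read stdin and write stdout; here the same logic runs on in-memory lists so it is self-contained):

```python
def mapper(lines):
    """Map step: emit 'key<TAB>1' for each word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Reduce step: sum counts per key. Hadoop sorts mapper output by key
    before the reducer sees it, so equal keys arrive adjacent."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

mapped = sorted(mapper(["big data", "big deal"]))  # stand-in for Hadoop's shuffle/sort
print(list(reducer(mapped)))  # ['big\t2', 'data\t1', 'deal\t1']
```

In a real cluster the same two scripts would be passed to the hadoop-streaming jar via its `-mapper` and `-reducer` options, and Hadoop would handle the partitioning, sorting, and fault tolerance between them.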
Moreover, startups including Alpine Data Labs, Continuuity and Hortonworks offer tools to simplify Big Data in general, and Hadoop in particular.
5. Hadoop isn’t suitable for SMBs.
Many SMBs fear that they’ll be locked out of the Big Data trend. The big vendors, the IBMs and Oracles of the world, predictably peddle big, expensive solutions, but that doesn’t mean there aren’t SMB-friendly tools out there.
Cloud computing is rapidly democratizing access to sophisticated technologies. “The cloud is turning Capex into Opex,” Big Data author Phil Simon notes. “You can take advantage of the same cloud services that Netflix does, and the same thing is starting to happen with Big Data. A company of five can use Kaggle.”
Kaggle calls itself a “marketplace that bridges the gap between data problems and data solutions.” For instance, startup Jetpac offered $5,000 to someone who could come up with an algorithm that would identify compelling vacation photographs. Most vacation photos are pretty awful, after all, and separating the wheat from the chaff is a tedious, time-consuming process.
Jetpac had people manually rate 30,000 photos, then sought an algorithm that would rank photos the same way actual humans did, just by analyzing metadata (photo size, captions, descriptions, etc.). Had Jetpac tried to develop this in-house, the company would have spent far more than $5,000, and it would have ended up with a single solution rather than its pick of several.
In fact, Jetpac’s image processing tool helped them land $2.4 million in VC funding from Khosla Ventures and Yahoo co-founder Jerry Yang.
6. Hadoop is cheap.
This is a common misconception about anything open source. Just because you can reduce or eliminate the upfront cost of the software doesn’t mean you’ll necessarily save money. One problem with the cloud, for instance, is that it’s so easy to spin up a science project on Amazon Web Services that developers of all sorts launch projects, forget about them, and keep paying for them.
And virtual server sprawl already makes physical server sprawl look quaint.
While Hadoop helps you store and analyze data, how will you get legacy data into the system? How will you visualize the data? How will you share it? How will you secure data as it is shared more often across the enterprise?
A Hadoop solution is actually a patchwork of solutions. You can turn to a company like Cloudera for a complete enterprise solution, or you can start putting together a highly customized solution yourself. Whatever route you choose, you’ll need to budget carefully because free software is never really free.
Jeff Vance is a Santa Monica-based writer. He’s the founder of Startup50, a site devoted to emerging tech startups. Connect with him on Twitter @JWVance.