Friday, March 29, 2024

Hadoop: Can the Tortoise be a Hare?


As early as 2012, writers, industry critics, and big data companies such as Cloudera were predicting Hadoop’s demise as the de facto standard for big data analytics. Hadoop’s future as a viable real-time big data analytics platform was being questioned even at the height of its hype and adoption.

And indeed, many businesses that manage large data sets have looked elsewhere for something better. In the view of some, Hadoop’s complexity and management requirements make it a technology that cannot survive long-term in business.

The need for real-time analytics and the push toward smaller, cheaper, and more agile systems and solutions are strong driving forces in big data. Businesses need to do more with less, and with fewer people. Hadoop pulls 180 degrees against this trend. The current business quest, underway across industries, is to find something better than Hadoop for big data analytics.

Hadoop’s Requirements

Although Hadoop’s production hardware requirements can seem daunting, the Apache Foundation points out that you can install it on a single computer for testing and that there is no single hardware requirement set for installing Hadoop. That said, Cloudera’s blog supplies the following information for those wishing to explore a Hadoop-based big data analytics cluster of their very own.

Hardware recommendations for DataNodes/TaskTrackers in a balanced Hadoop cluster:

  • 12-24 1-4TB hard disks in a JBOD (Just a Bunch Of Disks) configuration
  • 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
  • 64-512GB of RAM
  • Bonded Gigabit Ethernet or 10Gigabit Ethernet (the more storage density, the higher the network throughput needed)

For NameNode/JobTracker/Standby NameNodes, the recommendations are:

  • 4-6 1TB hard disks in a JBOD configuration (1 for the OS, 2 for the FS image [RAID 1], 1 for Apache ZooKeeper, and 1 for the JournalNode)
  • 2 quad-/hex-/octo-core CPUs, running at least 2-2.5GHz
  • 64-128GB of RAM
  • Bonded Gigabit Ethernet or 10Gigabit Ethernet

The article provides more recommendations based on growth expectations, but you get the idea: Hadoop’s requirements are substantial.

But because Hadoop runs well on commodity hardware, hardware costs are seen as less important in the overall build of a Hadoop cluster environment. For research institutions and smaller businesses with limited project budgets, however, hardware is only one consideration. Other problems facing would-be Hadoop adopters include high management overhead, long learning curves, and limited staff.

Hadoop “Takes Time”

The Apache Foundation’s Hadoop Wiki explains what Hadoop is and is not; it is an excellent resource that everyone considering Hadoop should read. One passage from the Wiki is especially enlightening and should give pause to anyone who requires real-time analytics or believes that Hadoop’s information retrieval powers are anything but batch-oriented:

Hadoop stores data in files, and does not index them. If you want to find something, you have to run a MapReduce job going through all the data. This takes time, and means that you cannot directly use Hadoop as a substitute for a database. Where Hadoop works is where the data is too big for a database (i.e. you have reached the technical limits, not just that you don’t want to pay for a database license). With very large datasets, the cost of regenerating indexes is so high you can’t easily index changing data. With many machines trying to write to the database, you can’t get locks on it. Here the idea of vaguely-related files in a distributed filesystem can work.

If that paragraph alone doesn’t deter potential converts, the following entries may provide more insight for those who have limited staff resources and non-existent training budgets:

  • Hadoop and MapReduce are not a place to learn Java programming
  • Hadoop is not an ideal place to learn networking error messages
  • Hadoop clusters are not a place to learn Unix/Linux system administration

The Wiki author doesn’t mince words in warning against installing and attempting to maintain a Hadoop cluster without the proper skills already in hand. Additionally, someone who is skilled in administering a few Linux systems might not be prepared to manage 80+ systems, because one-off administration skills do not scale.

As the Wiki author clearly states above, you should consider Hadoop because, “you have reached the technical limits [of a database], not just that you don’t want to pay for a database license.”
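
To make that batch orientation concrete, consider what “finding something” actually looks like. Even a simple filter over stored records means writing, packaging, and submitting a full MapReduce job that reads every input split. The sketch below uses the standard Hadoop Java MapReduce API; the input path, output path, and the “ERROR” search term are hypothetical stand-ins:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // A "find" in Hadoop: a map-only MapReduce job that scans every record.
    public class GrepJob {

        public static class MatchMapper
                extends Mapper<LongWritable, Text, NullWritable, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Hypothetical search term; every input split is read in full.
                if (line.toString().contains("ERROR")) {
                    context.write(NullWritable.get(), line);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "grep-error");
            job.setJarByClass(GrepJob.class);
            job.setMapperClass(MatchMapper.class);
            job.setNumReduceTasks(0); // map-only: just filter and write matches
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a log directory on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Contrast that with a one-line WHERE clause against an indexed table, and the batch-versus-database divide the Wiki author describes becomes obvious.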

Hadoop Primary Roadblocks

The two primary real-time analysis roadblocks for Hadoop are that its query engines are not as fast as those of mainstream relational databases, and that Hadoop data is not easily manipulated or modified once it’s written to the Hadoop Distributed File System (HDFS).

Projects such as Apache Hive let users write regular SQL queries against Hadoop data, but those queries are translated into MapReduce jobs, which slows the returned results; newer engines such as HAWQ and Impala bypass MapReduce with their own execution engines, yet they still operate on data at rest in HDFS rather than in a database tuned for fast lookups. Hadoop remains a batch processing mechanism for huge data sets, generally run in overnight number-crunching scenarios.

If you peruse the Hadoop Wiki PoweredBy page, a list of companies that use Hadoop for their own big data processing tasks, you’ll notice that none of them use it for real-time analytics or any current-data processing. It’s mostly log analysis, data mining, trend analysis, and similar Hadoop-appropriate tasks.

Hadoop wasn’t designed as a real-time data engine. It does what it does well, albeit slower than many would prefer. It was designed as a big data distributed processing engine across commodity computer clusters. It is highly available and fault tolerant. It is not fast.

It’s true that Hadoop is free and open source software, but most businesses that deploy it do so using a commercial distribution; commercial support is very important for businesses that place valuable data in the hands of free software. The attitude mirrors Linux adoption: most businesses that have adopted Linux also tend to use commercial distributions.

Hadoop Alternatives

What is the alternative for those who need real-time data processing with large data sets? One promising project is Apache Storm. Storm is to real-time data processing as Hadoop is to batch processing. In the words of the Storm project:

Apache Storm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.

Storm has many use cases: real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Storm integrates with the queuing and database technologies you already use. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.

Storm, like Hadoop, is a distributed system. It can run on top of Hadoop YARN and can be used with Flume to store data on HDFS. Although it isn’t yet commercially supported, well-known adopters include Spotify, Yelp, and WebMD.
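
To give a feel for the programming model, here is a minimal word-count topology sketched against Storm’s Java API (assuming a Storm 2.x release with the org.apache.storm packages; the spout is a toy stand-in for a real queue-backed source):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class WordCountTopology {

        // Toy spout: emits one random word per call instead of reading a queue.
        public static class WordSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final String[] words = {"storm", "hadoop", "spark"};
            private final Random random = new Random();

            @Override
            public void open(Map<String, Object> conf, TopologyContext context,
                             SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                collector.emit(new Values(words[random.nextInt(words.length)]));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word"));
            }
        }

        // Keeps a running count per word and emits the updated total downstream.
        public static class WordCountBolt extends BaseBasicBolt {
            private final Map<String, Long> counts = new HashMap<>();

            @Override
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                String word = tuple.getStringByField("word");
                long count = counts.merge(word, 1L, Long::sum);
                collector.emit(new Values(word, count));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("word", "count"));
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("words", new WordSpout());
            // fieldsGrouping routes identical words to the same bolt instance.
            builder.setBolt("count", new WordCountBolt(), 4)
                   .fieldsGrouping("words", new Fields("word"));

            try (LocalCluster cluster = new LocalCluster()) { // in-process test cluster
                cluster.submitTopology("word-count", new Config(), builder.createTopology());
                Thread.sleep(10_000); // let the topology run briefly, then shut down
            }
        }
    }

Each incoming tuple updates a count the moment it arrives; there is no batch job to schedule and no overnight wait.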

Apache Spark is an alternative that currently enjoys quite a bit of fame. It is a clustered, in-memory solution that runs up to 100 times faster than Hadoop MapReduce in memory and up to ten times faster on disk, which makes it an obvious choice for real-time analytics. Notably, Hadoop is not a requirement: Spark can run standalone rather than on top of some other technology. And as with most of these Apache projects, you can use a variety of programming languages and databases with it.
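
A sketch of a comparable job in Spark’s Java API (assuming a Spark 2.x release; the input path is hypothetical) shows both the brevity and the in-memory caching behind those speedups:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkLogCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("log-word-count").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical input path; Spark reads local files, HDFS, S3, and more.
                // cache() pins the RDD in memory so later operations skip the disk.
                JavaRDD<String> lines = sc.textFile("data/access.log").cache();

                JavaPairRDD<String, Integer> counts = lines
                        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                        .mapToPair(word -> new Tuple2<>(word, 1))
                        .reduceByKey(Integer::sum);

                counts.take(10).forEach(pair ->
                        System.out.println(pair._1() + "\t" + pair._2()));
            }
        }
    }

The cache() call marks the key difference from MapReduce: intermediate data stays in memory across operations instead of being written back to disk between stages.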

Apache Samza is a distributed stream processing framework that uses Apache Kafka for messaging and Hadoop YARN for fault tolerance, security, and resource management. While Samza works with other Apache Foundation products such as Kafka and Hadoop YARN by default, you can run it with other messaging and execution environments.

To help you start quickly with Samza, there is a separate Hello Samza project, which assumes that you’re going to use YARN, Kafka, and ZooKeeper. Samza’s developers and users are a passionate lot who believe that their project has many superior qualities that differentiate it from competing technologies, such as:

  • Samza supports fault-tolerant local state. State can be thought of as tables that are split up and co-located with the processing tasks. State is itself modeled as a stream. If the local state is lost due to machine failure, the state stream is replayed to restore it.
  • Streams are ordered, partitioned, replayable, and fault tolerant.
  • YARN is used for processor isolation, security, and fault tolerance.
  • Jobs are decoupled: if one job goes slow and builds up a backlog of unprocessed messages, the rest of the system is not affected.

You can check out a full list of Samza’s attributes on its Comparisons page.
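
The fault-tolerant local state in the first bullet above is easiest to see in code. Here is a minimal sketch against Samza’s classic low-level StreamTask API; the task name, store name, and message layout are hypothetical, and the store and its changelog must be declared in the job’s configuration:

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    // Counts page views per user, keeping running totals in Samza's
    // fault-tolerant local state (backed by a changelog stream).
    public class PageViewCounterTask implements StreamTask, InitableTask {
        private KeyValueStore<String, Integer> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(Config config, TaskContext context) {
            // "page-view-counts" must be declared as a store in the job config.
            store = (KeyValueStore<String, Integer>) context.getStore("page-view-counts");
        }

        @Override
        public void process(IncomingMessageEnvelope envelope,
                            MessageCollector collector,
                            TaskCoordinator coordinator) {
            String userId = (String) envelope.getKey();
            Integer count = store.get(userId);
            store.put(userId, count == null ? 1 : count + 1);
            // If this container dies, the store is rebuilt by replaying its changelog.
        }
    }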

Amazon Kinesis is the only non-Apache solution in this list of alternatives, but it’s a substantial one. Amazon doesn’t do anything halfway, and Kinesis is no exception. Kinesis is Amazon’s service entry into the real-time processing game, and much as the Apache projects integrate with one another, it integrates naturally with Amazon’s other offerings, such as DynamoDB, Redshift, S3, and Elasticsearch.

But Amazon doesn’t stop at simply providing users with an API and an instruction manual; it also provides Firehose, an easy way to load streaming data into its other services, and the ability to build custom streaming data applications. And the best part of Kinesis is that Hadoop is nowhere in sight.
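
Producing into a stream is a single SDK call. Here is a minimal sketch using the AWS SDK for Java (v1 client shown; the stream name, partition key, and payload are hypothetical, and the stream must already exist):

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    import com.amazonaws.services.kinesis.AmazonKinesis;
    import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
    import com.amazonaws.services.kinesis.model.PutRecordRequest;
    import com.amazonaws.services.kinesis.model.PutRecordResult;

    public class KinesisProducerExample {
        public static void main(String[] args) {
            // Uses credentials and region from the default provider chain.
            AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

            // "clickstream" is a hypothetical stream; create it first via console or CLI.
            PutRecordRequest request = new PutRecordRequest()
                    .withStreamName("clickstream")
                    .withPartitionKey("user-42") // same key always lands on the same shard
                    .withData(ByteBuffer.wrap(
                            "{\"page\":\"/home\",\"ts\":1711700000}"
                                    .getBytes(StandardCharsets.UTF_8)));

            PutRecordResult result = kinesis.putRecord(request);
            System.out.println("Stored in shard " + result.getShardId()
                    + " at sequence " + result.getSequenceNumber());
        }
    }

Consumers on the other side read records shard by shard, or hand that bookkeeping to Firehose and let it deliver the stream into S3 or Redshift.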

Overnight processing is fine for logs or trends that don’t require answers today from data gathered today. Hadoop is a workhorse, but it was never designed as a real-time data engine, and many adopters and would-be adopters believe that this design flaw has doomed Hadoop to the big data scrap heap.

Hadoop has its place and its own job in big data, but adopters should be prepared to spend money and time building and supporting its infrastructure. Several alternatives not only process data in real time but also offer easy data ingestion and simple application construction, with a far shorter learning curve than Hadoop’s. However, Hadoop’s critics should hold off on declaring it dead for a few more years, because major companies have bet the data farm on it.
