Hadoop is an open source software framework for Big Data, managed by the Apache Software Foundation. Hadoop stores huge data sets across distributed server and storage clusters, and supports distributed analysis across the clusters. Hadoop is primarily a data warehouse for massively growing semi-structured and unstructured data – a huge part of its Big Data toolset. It offers unmatched scaling and performance for this very demanding data segment. It is not a replacement for RDBMS warehouses, although there are products that ease structured data queries within Hadoop.
Hadoop keeps down the cost of big clusters by running on commodity servers and storage. High performance is built in, as the Hadoop architecture maintains high throughput. High availability is also native: HDFS replicates data across the nodes of a cluster, enabling continuous operation even if a node goes down.
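The availability claim above rests on keeping more than one copy of each data block on different nodes. The following is a minimal sketch of that idea in plain Python, not HDFS code: the function names, the node names, and the round-robin placement are all assumptions made for illustration (real HDFS placement is more sophisticated, for example it is rack-aware). The replication factor of 3 matches the HDFS default.

```python
from itertools import cycle, islice


def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical, simplified stand-in for a real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        start = block % len(nodes)
        # Take `replication` consecutive nodes, wrapping around the list.
        placement[block] = list(islice(cycle(nodes), start, start + replication))
    return placement


def readable_blocks(placement, failed):
    """Return the blocks that still have a replica on at least one live node."""
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    }


nodes = ["node1", "node2", "node3", "node4"]  # illustrative node names
placement = place_blocks(num_blocks=8, nodes=nodes)

# Losing a single node leaves every block readable from another replica.
survivors = readable_blocks(placement, failed={"node2"})
print(len(survivors) == len(placement))
```

With three replicas per block, data only becomes unavailable when every node holding a given block is down at the same time.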
Hadoop’s scale is breathtaking, managing large-scale Big Data and data processing across many thousands of distributed hosts. Computations occur locally on the nodes that store the data, and performance and storage scale close to linearly as commodity servers and storage are added. At present, enterprise Hadoop implementations are overwhelmingly found in large enterprises and in industries where Big Data is critical to survival, such as transportation and healthcare.
Smaller deployments are more common in other industries and smaller companies, and are often restricted to a single on-premises cluster. Hadoop's complexity and intensive I/O are part of the reason. To increase adoption in these smaller companies, Apache and third-party Hadoop developers are concentrating on making Hadoop simpler and less resource-intensive. Even so, Hadoop is a growing multi-billion dollar market, thanks to its scalability and open source structure.
From its inception, Apache architected Hadoop for modularity and flexibility. Many Hadoop components are swappable for different Apache and third-party software. This encourages experimentation on the customers’ part without lock-in, and encourages partners to actively develop for Hadoop.
At its most basic, Hadoop is made up of a distributed file system for data and the Hadoop framework that enables data processing. The Hadoop Distributed File System (HDFS) is by far the most common Hadoop file system, although Hadoop will support other file systems if they meet set architectural requirements. These file systems fall under the Apache designation Hadoop Compatible FileSystems (HCFS), and include file systems located on local clusters, some distributed file systems, and cloud-based storage such as Microsoft Azure Blob Storage and Amazon S3.
· Hadoop Distributed File System (HDFS). HDFS is Hadoop’s highly scalable, distributed file system that enables Hadoop to process data across multi-PB clusters. HDFS is made up of block storage and namespaces. Block storage functionality manages clusters, replication and block processes. The namespace manages file and directory operations. HDFS stores metadata on the NameNode, a dedicated server. The NameNode manages the application data blocks stored on clustered servers, or DataNodes. All clustered servers connect with each other over TCP-based protocols. The replication architecture natively provides high availability. Performance and storage scaling are linear since the entire cluster has access to new processors and storage.
· MapReduce. MapReduce (Map + Reduce, or MR) is a Java-based engine integrated with YARN that performs parallel processing of large data sets. In response to requests, MR launches Java processes as individual jobs accessing data in the clusters. In the original Hadoop 1 design, MR works using two components within each cluster: a master process called JobTracker resides on a master node, and worker processes called TaskTrackers reside on each node. Under YARN, those roles are handled by the ResourceManager, the NodeManagers and a per-job ApplicationMaster. In either case the flow is the same: MR reads input from HDFS, splits the job into tasks and assigns them to individual nodes, the nodes return intermediate results, and the reduce step combines those results into the final data set, which is written back to HDFS.
· YARN. The YARN layer handles job scheduling and cluster resource management. YARN sits underneath the MR layer, and with HDFS makes up Hadoop’s data management layer. The YARN framework separates scheduling and resource management tasks from MapReduce, which enables multiple processing types to run on a shared dataset simultaneously. YARN significantly improved application-specific performance over Hadoop v. 1.
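The map, shuffle and reduce steps described in the MapReduce item can be sketched with the classic word-count example. This is a single-process illustration in plain Python rather than the Hadoop Java API; the function names and the sample input splits are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    # On a cluster, each split would be mapped on a node that stores it.
    # Here the splits are simply processed one after another.
    emitted = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


splits = ["big data big clusters", "data nodes store data"]
counts = word_count(splits)
print(counts["data"], counts["big"])  # prints: 3 2
```

Because each map call only sees its own split and each reduce call only sees one key, both phases can run in parallel across nodes. That independence is what lets the model scale out.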
Open Source Alternatives to MapReduce
Although MR is a powerful and flexible data tool, it is more complex to use than query languages. It also carries the overhead of single-job batch processing. Other computation models that are compatible with Hadoop can handle more complex operations requiring multiple steps. Changes will not occur overnight: MapReduce is widely used, and many Hadoop customers and Hadoop partners have complex processes based on MapReduce. Note that replacing MapReduce does not necessarily replace Hadoop, which still provides the massively scaled distributed clusters.
· Apache Spark. Apache Spark is compatible with Hadoop data but does not require Hadoop to run. However, Spark does not include its own distributed file system, so users who want to expand Spark’s data processing to large-scale clusters still need shared storage such as HDFS.
· Apache Hive and Pig. Apache offers additional focused alternatives to MapReduce, including Pig and Hive. Neither offering matches MapReduce’s performance, but both are useful for specialized queries against semi-structured or structured data. Commercial Hadoop vendor Cloudera offers packages containing all three offerings.
· Apache Tez. Tez does not replace MapReduce but adds a more flexible processing technology that does not require data writes after every reduce. Apache designed Tez as a library of processes that adds processing power to MapReduce and accelerates development for Hadoop partners. Built on YARN, Tez is a data-flow programming engine that is ideal for intensive data operations like Extract/Transform/Load (ETL). Tez requires Hadoop YARN to run.
· Google Cloud Dataflow. Google Cloud Dataflow supports Hadoop and MapReduce but also provides its own managed service for streaming analytics. Like Apache Tez, Dataflow is well suited to large-scale ETL, and its SDK is open source.
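The data-flow style mentioned for Tez and Dataflow, where several processing steps are chained without writing an intermediate data set after each one, can be illustrated with a small ETL pipeline. This sketch uses ordinary Python generators and is not the Tez or Dataflow API; the stage names, the record format and the threshold are hypothetical.

```python
def parse(lines):
    """Extract: turn 'user,amount' text lines into (user, amount) records."""
    for line in lines:
        user, amount = line.split(",")
        yield user, float(amount)


def keep_large(records, threshold):
    """Transform: pass through only records at or above the threshold."""
    for user, amount in records:
        if amount >= threshold:
            yield user, amount


def total_by_user(records):
    """Load/aggregate: sum the remaining amounts per user."""
    totals = {}
    for user, amount in records:
        totals[user] = totals.get(user, 0.0) + amount
    return totals


raw = ["ann,120.0", "bob,15.5", "ann,30.0", "bob,200.0"]  # sample input

# The stages are chained lazily: each record flows through
# parse -> keep_large -> total_by_user, and no intermediate
# data set is materialized between the steps.
totals = total_by_user(keep_large(parse(raw), threshold=25.0))
print(totals)  # prints: {'ann': 150.0, 'bob': 200.0}
```

A chain of separate batch jobs would instead write the output of each step to storage before the next step could start, which is the overhead these engines aim to avoid.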
Hadoop is the focus of a highly active development community, and these are just a few of many open source alternatives and extensions of MapReduce. The commercial side also develops very actively for Hadoop. Hortonworks and Cloudera successfully built their business models on Hadoop. Other large vendors also develop for Hadoop, including Microsoft, IBM and HP, as do public cloud vendors Google, AWS and Azure.
Given all of this activity, plus massively growing big data stores, look for increasing enterprise adoption of Hadoop. Standardization (or the lack thereof) is an issue with widespread vendor development, and many companies will choose to stay with Hadoop’s traditional infrastructure to avoid integration problems. Expert consultants should be able to smooth over these issues but require additional investment. Nevertheless, Hadoop is not going anywhere and is worth the investment for large distributed data needs.