Also see: Hadoop and Big Data
Hadoop is an open source software framework for Big Data, managed by the Apache Software Foundation. Hadoop stores huge data sets across clusters of distributed servers and storage, and supports distributed analysis across those clusters. It serves primarily as a repository for massively growing semi-structured and unstructured data – a core part of the Big Data toolset – and offers unmatched scaling and performance for this very demanding data segment. It is not a replacement for RDBMS warehouses, although there are products that ease structured data queries within Hadoop.
Hadoop keeps down the cost of big clusters by running on commodity servers and storage. High performance is built in, since the Hadoop architecture maintains high throughput. High availability is also native: HDFS replicates data blocks across the cluster, enabling continuous operation even if a node goes down.
Hadoop’s scale is breathtaking, managing Big Data storage and processing across many thousands of distributed hosts. Computations occur locally on cluster nodes, and performance and storage scale linearly as commodity servers and storage are added. At present, enterprise Hadoop implementations are overwhelmingly found in large enterprises and in industries where Big Data is critical to survival, such as transportation and healthcare.
Smaller deployments are more common in other industries and in smaller companies, often restricted to a single on-premises cluster. Complexity and intensive I/O are part of the problem. To increase adoption among these smaller companies, Apache and third-party Hadoop developers are concentrating on making Hadoop simpler and less resource-intensive. Even so, Hadoop is a multi-billion dollar market – and a growing one – thanks to its scalability and open source structure.
Flexibility
From its inception, Apache architected Hadoop for modularity and flexibility. Many Hadoop components are swappable for different Apache and third-party software. This encourages experimentation on the customers’ part without lock-in, and encourages partners to actively develop for Hadoop.
At its most basic, Hadoop is made up of a distributed file system for data and the Hadoop framework that enables data processing. The Hadoop Distributed File System (HDFS) is by far the most common Hadoop file system, although Hadoop supports other file systems that meet its architectural requirements. These fall under the Apache designation Hadoop Compatible File Systems (HCFS), and include file systems located on local clusters, some distributed file systems, and cloud object stores such as Microsoft Azure Blob Storage and Amazon S3.
Core Modules
· Hadoop Distributed File System (HDFS). HDFS is Hadoop’s highly scalable, distributed file system that enables Hadoop to process data across multi-PB clusters. HDFS is made up of block storage and a namespace. The block storage layer manages data blocks, replication and block operations across the cluster; the namespace layer manages file and directory operations. HDFS stores metadata on the NameNode, a dedicated server, which tracks the application data blocks stored on the clustered servers, or DataNodes. All clustered servers connect with each other. The replication architecture natively provides high availability, and fast TCP connections sustain bandwidth. Performance and storage scale linearly, since the entire cluster gains access to new processors and storage as they are added.
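The NameNode/DataNode split described above can be illustrated with a toy, single-process model. This is a sketch of the concepts only, not the real HDFS implementation (which is a distributed Java system); the class names mirror HDFS roles, and the tiny block size is purely for demonstration.

```python
# Toy model of HDFS concepts: the NameNode holds metadata (which blocks make
# up a file, and which DataNodes hold each block); DataNodes hold block data.
BLOCK_SIZE = 8        # real HDFS defaults to 128 MB blocks; tiny here for demo
REPLICATION = 3       # HDFS default replication factor

class DataNode:
    def __init__(self):
        self.blocks = {}            # block id -> block data

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}          # filename -> ordered list of block ids

    def write(self, filename, data):
        chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
        block_ids = []
        for n, chunk in enumerate(chunks):
            block_id = f"{filename}:blk_{n}"
            # replicate each block to REPLICATION distinct DataNodes
            for dn in self.datanodes[:REPLICATION]:
                dn.store(block_id, chunk)
            block_ids.append(block_id)
        self.metadata[filename] = block_ids

    def read(self, filename):
        out = []
        for block_id in self.metadata[filename]:
            # any replica will do; use the first DataNode that still has it
            dn = next(d for d in self.datanodes if block_id in d.blocks)
            out.append(dn.blocks[block_id])
        return "".join(out)

cluster = [DataNode() for _ in range(4)]
nn = NameNode(cluster)
nn.write("demo.txt", "hello hadoop distributed file system")
cluster[0].blocks.clear()      # simulate a DataNode failure
print(nn.read("demo.txt"))     # still readable from surviving replicas
```

The last two lines show why replication gives HDFS its native high availability: losing one DataNode leaves intact replicas elsewhere.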
· MapReduce. MapReduce (Map + Reduce, or MR) is a Java-based engine that performs parallel processing of large data sets; in Hadoop 2 it runs as an application on YARN. In response to requests, MR launches Java processes as individual jobs accessing data in the clusters. In Hadoop 1, MR worked through two components within each cluster: a master process called the JobTracker on a master node, and node trackers called TaskTrackers on each worker node (roles that YARN later absorbed). MR reads from HDFS, parses jobs and assigns them to individual nodes; the nodes return intermediate results, which MR reduces back to the master node, which writes the final data set.
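The map/shuffle/reduce flow above can be sketched in a few lines of single-process Python. This is the classic word-count example of the MapReduce programming model, not Hadoop's Java API: real MapReduce runs the same three phases in parallel across the cluster's nodes.

```python
# Minimal sketch of the MapReduce model: map emits (key, value) pairs, the
# framework shuffles (groups) them by key, and reduce folds each group into
# a single result.
from collections import defaultdict

def map_phase(record):
    # word count: emit (word, 1) for every word in an input line
    for word in record.split():
        yield word, 1

def shuffle(pairs):
    # group all values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

lines = ["big data big clusters", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)   # {'big': 3, 'data': 2, 'clusters': 1}
```

In Hadoop, each map task reads a local HDFS block, and the shuffle moves intermediate pairs across the network to the reducers, which is why the model scales linearly with the cluster.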
· YARN. The YARN layer handles job scheduling and cluster resource management. YARN sits underneath the MR layer and, with HDFS, makes up Hadoop’s data management layer. The YARN framework enables multiple processing types to run on a shared data set simultaneously, and separates scheduling and resource management from MapReduce itself. YARN significantly improved application-specific performance over Hadoop 1.
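YARN's separation of concerns can be sketched with a toy scheduler. The names and the first-fit policy below are illustrative assumptions, not YARN's real API: the point is only that a central ResourceManager grants "containers" of cluster capacity to applications, while the applications (MapReduce, Spark, Tez, etc.) decide what runs inside them.

```python
# Toy sketch of YARN-style resource management: the ResourceManager tracks
# free capacity per node and grants containers on request.
class ResourceManager:
    def __init__(self, node_memory_gb):
        self.free = dict(node_memory_gb)   # node name -> free memory (GB)

    def request_container(self, app, memory_gb):
        # simple first-fit allocation; real YARN schedulers (Capacity, Fair)
        # are far richer, handling queues, locality and preemption
        for node, free in self.free.items():
            if free >= memory_gb:
                self.free[node] = free - memory_gb
                return {"app": app, "node": node, "memory_gb": memory_gb}
        return None   # no capacity: in a real cluster the request would wait

rm = ResourceManager({"node1": 8, "node2": 8})
c1 = rm.request_container("mapreduce-job", 6)
c2 = rm.request_container("spark-job", 6)   # node1 is nearly full, so node2
print(c1["node"], c2["node"])               # node1 node2
```

Because scheduling lives in this layer rather than inside MapReduce, several engines can share one cluster's capacity simultaneously.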
Open Source Alternatives to MapReduce
Although MR is a powerful and flexible data tool, it is more complex to use than query languages, and it adds the overhead of single-job batch processing. Other Hadoop-compatible computation models can achieve more complex operations requiring multiple steps. Changes will not occur overnight: MapReduce is widely used, and many Hadoop customers and partners have complex processes built on it. Note that replacing MapReduce does not necessarily mean replacing Hadoop, which still provides the massively scaled distributed clusters.
· Apache Spark. Apache Spark is compatible with Hadoop data but does not require Hadoop to run. However, Spark provides no distributed storage of its own, so users who want to expand Spark’s data processing to large-scale clusters still need a shared file system like HDFS.
· Apache Hive and Pig. Apache offers additional, more focused alternatives to writing raw MapReduce, including Pig and Hive. Neither offering matches MapReduce’s flexibility and performance, but both are useful for specialized semi-structured or structured data queries. Commercial Hadoop vendor Cloudera offers packages containing all three offerings.
· Apache Tez. Tez does not replace MapReduce but adds a more flexible processing model that does not require a data write after every reduce. Apache designed Tez as a library of processing primitives that adds power to MapReduce-style jobs and accelerates development for Hadoop partners. Built on YARN, Tez is a data-flow programming engine that is ideal for intensive data operations such as Extract/Transform/Load (ETL). Tez requires Hadoop YARN to run.
· Google Cloud Dataflow. Google Cloud Dataflow supports Hadoop and MapReduce-style pipelines, and Google has also developed it into a managed streaming analytics service. Like Apache Tez, Dataflow is well suited to large-scale ETL, and its programming model is open source.
Hadoop is the focus of a highly active development community, and these are just a few of the many open source alternatives to and extensions of MapReduce. The commercial side also develops very actively for Hadoop: Hortonworks and Cloudera successfully built their business models on it, and other large vendors develop for Hadoop as well, including Microsoft, IBM and HP, along with public cloud vendors Google, AWS and Azure.
Given all of this activity, plus massively growing big data stores, look for increasing enterprise adoption of Hadoop. Standardization (or the lack thereof) is an issue amid widespread vendor development, and many companies will choose to stay with Hadoop’s traditional infrastructure to avoid integration problems. Expert consultants can smooth over these issues, but they require additional investment. Nevertheless, Hadoop is not going anywhere, and it is worth the investment for large distributed data needs.