Apache Hadoop is an open source software framework that enables high-throughput processing of large data sets across distributed clusters. Its core modules are Hadoop Common, a set of shared utilities that support the other modules; the Hadoop Distributed File System (HDFS); Hadoop YARN, which schedules jobs and manages cluster resources; and Hadoop MapReduce, a YARN-based system for parallel processing of large data sets.
The Apache Software Foundation also hosts additional open source projects that run atop Hadoop, such as the analysis engine Spark (which can also run independently) and Pig, a data analysis platform whose scripting language is called Pig Latin.
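To make the MapReduce model concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It is an illustration only: it does not use the Hadoop API, the function names are hypothetical, and on a real cluster the map and reduce calls would run in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster each split would be mapped on a different node;
    # here the splits are processed one after another in one process.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = ["big data on Hadoop", "Hadoop runs MapReduce", "big clusters"]
    print(word_count(splits))
```

Because each map call sees only its own split and each reduce call sees only one key, the work can be spread across as many machines as there are splits and keys, which is what lets the model scale.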
Hadoop is popular because it provides a nearly limitless environment for big data processing using commodity hardware. Adding nodes is a simple process with no negative impact on the framework. Hadoop scales from a single server to thousands of machines, each contributing its own local compute and storage. Because Hadoop detects and handles failures at the application layer rather than relying on hardware for high availability, cluster hardware can be off-the-shelf.
Real-world use cases include online travel (Hadoop claims to be the go-to big data platform for 80% of online travel bookings), batch analytics, social media application serving and analysis, supply chain optimization, mobile data management, healthcare, and more.
The downside? Hadoop is complex and requires significant staff time and expertise, which has dampened its adoption rate in businesses lacking specialized IT staff. It can also be a challenge to derive business value given the need for expert administrators and the capital expenditures on widely distributed clusters.
Cluster management can also be tricky. Hadoop unifies distributed clusters, but equipping and managing additional data centers, not to mention working with remote staff, adds complexity and cost. The upshot is that Hadoop clusters can be far more isolated than they should be.
Cloud to the Rescue?
Going to the cloud is not an either/or proposition for Hadoop owners. Some businesses with Hadoop expertise will choose Infrastructure as a Service (IaaS) for better cluster management and will continue to manage Hadoop in-house. This article will discuss going all the way to a fully managed Hadoop deployment online. We refer to this as Hadoop-as-a-Service (HaaS), a sub-category of Platform-as-a-Service (PaaS).
Running Hadoop as a managed cloud-based service is not cheap, but it does save money over buying and operating large clusters in-house. It also reduces the need for in-house Hadoop experts and avoids long learning curves. Most HaaS deployments give customers a self-service portal for analytics and other data operations, while the provider handles all infrastructure, management and processing operations.
Delivering such a service is not easy. Hadoop architecture requires a highly scalable and dynamic compute environment, and Hadoop experts are necessary for complex configuration and software integration. A business that chooses a managed service will not have to hire those experts itself, but the service provider will. The more expertise, customized configuration and capacity the customer requires, the more expensive the service.
Nevertheless, the expense is usually less than running large Hadoop deployments on-site, and it does cut down on complexity. Instead of spending staff time and high dollar amounts on cluster and workload management, IT can set policies and enable data operations from a web-based console. The provider will manage day-to-day tasks and automatic provisioning for dynamic workloads. The service will also handle data and process distribution.
Some Drawbacks
Of course, nothing is perfect, HaaS included. To begin with, the business will be moving big data in and out of the cloud. This creates latency that IT must address by buying fatter pipes and/or investing in data movement acceleration. IT must also carry out due diligence on the HaaS provider’s performance levels and Quality of Service. Here are a few top capabilities to look for:
· The provider should store data persistently in HDFS. Hadoop does not require HDFS as a persistent data store, but there are clear advantages to using it. Granted, before In-Memory Cache, using HDFS as a persistent store caused performance problems. Now active processes run in HDFS’s In-Memory Cache and Hadoop uses write-behind to store data on disk. This capability positions HDFS as a data warehouse, with no need to purchase third-party warehouses or ETL tools. Queries hit the entire store, including the cache and HDFS. And since HDFS is native to Hadoop, it works seamlessly with YARN and MapReduce.
· Highly elastic compute environment. Hadoop’s core ability is maintaining elastic clusters for widely varying workload types. This is even more critical when running cloud-based Hadoop instances. You are already dealing with remote connectivity over the Internet and cannot afford to add another layer of latency. The Hadoop cloud provider must maintain highly dynamic and scalable environments. The service should also be able to support mixed workloads such as data ingestion and customer data analysis. Server and storage capacity should be capable of on-the-fly, highly automated provisioning.
· Non-stop operations. Another consideration is the ability to recover from processing failures without having to restart an entire process. The Hadoop provider should be capable of non-stop operations, which is a non-trivial matter. Confirm that the provider supports non-stop operations, meaning that a failed job restarts from the beginning of the failed sub-service rather than from the beginning of the entire job.
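The non-stop behavior described in the last item can be sketched in a few lines. The example below is a simplified illustration, not how any particular provider or Hadoop itself implements recovery: the `run_job` function, its parameters and the stage names are all hypothetical. It keeps the output of each finished stage and, when a stage fails, retries only that stage instead of starting the whole job over.

```python
def run_job(stages, completed=None, max_attempts=3):
    """Run named stages in order, retrying only the stage that failed.

    stages:    list of (name, callable) pairs; each callable receives the
               previous stage's output and returns its own output.
    completed: optional dict of stage name -> output saved by an earlier run.
    """
    completed = {} if completed is None else completed
    data = None
    for name, stage in stages:
        if name in completed:
            data = completed[name]      # reuse saved output, do not rerun
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                data = stage(data)
                break
            except Exception:
                if attempt == max_attempts:
                    raise               # give up after the last attempt
        completed[name] = data          # checkpoint this stage's output
    return data, completed


if __name__ == "__main__":
    calls = {"ingest": 0, "transform": 0, "report": 0}

    def ingest(_):
        calls["ingest"] += 1
        return [3, 1, 2]

    def transform(rows):
        calls["transform"] += 1
        if calls["transform"] == 1:
            raise RuntimeError("simulated node failure")
        return sorted(rows)

    def report(rows):
        calls["report"] += 1
        return sum(rows)

    result, _ = run_job(
        [("ingest", ingest), ("transform", transform), ("report", report)]
    )
    print(result, calls)
```

In the demo the transform stage fails once and is retried, while the ingest stage runs only a single time. Passing the returned `completed` dictionary into a later call would likewise skip stages that already finished.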
Hadoop-as-a-Service Providers
Many large cloud vendors, including HP Helion, Google, Amazon, Rackspace and Microsoft Azure, offer infrastructure services to Hadoop service providers. However, the cloud vendors may or may not offer their own managed Hadoop services. This vendor section covers managed Hadoop service providers, not simply the infrastructure on which Hadoop runs.
Qubole’s core offering is Hadoop-as-a-Service (HaaS). Qubole Data Service offers fully managed, on-demand clusters that scale up or down depending on data size. Qubole partners with Google Cloud using Google Compute Engine (GCE). Speaking of Google, the Google Cloud Storage connector for Hadoop lets users run MapReduce jobs directly on data stored in GCS, which eliminates the need to copy data on-premises and process it in a local Hadoop cluster. Additional data connectors enable GCS users to run MapReduce on data stored in Google Datastore and Google BigQuery.
Hortonworks Data Platform offers managed enterprise-level HaaS, using Hadoop YARN to process multiple workloads through a variety of operations. Altiscale has made a big splash with a purpose-built Hadoop cloud service. The company stresses robust native security and compliance, sophisticated management services, a high degree of automation, and extensive data and language integration.
Amazon offers Amazon Elastic MapReduce (EMR) as a Hadoop web service. EMR distributes client data and processes across dynamic EC2 instances. Microsoft Azure HDInsight is also a cloud-based Hadoop distribution. HDInsight is Hadoop-only and does not contain additional Microsoft software. The service processes both unstructured and semi-structured data from multiple data locations.
IBM BigInsights on Cloud is based on Hadoop, integrating Hadoop core offerings and modules with IBM management consoles, analytics, and query engines. The cloud version runs BigInsights as a Hadoop service on IBM SoftLayer.
Frankly, Hadoop adoption has not lived up to the hype. Enterprises with massive big data needs have widely adopted it because they have the computing budgets to match. But many more mid-market and even enterprise-level companies have not adopted Hadoop because of its complexity and ongoing optimization process.
We believe that managed Hadoop services will bring many more business users into the fold as long as Hadoop managed service providers optimize their data centers for performance, and users know to accelerate data transfer.
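To see why data transfer deserves attention, a rough back-of-the-envelope estimate helps. The sketch below computes how long it takes to move a data set over a network link. The function name and the `efficiency` factor (the share of nominal bandwidth actually achieved, set here to 0.7) are assumptions for illustration, not measured values.

```python
def transfer_hours(data_terabytes, link_gbps, efficiency=0.7):
    """Estimate the hours needed to move a data set over a network link.

    efficiency is an assumed fraction of the nominal bandwidth that the
    transfer actually achieves.
    """
    bits = data_terabytes * 1e12 * 8               # decimal terabytes to bits
    effective_bps = link_gbps * 1e9 * efficiency   # usable bits per second
    return bits / effective_bps / 3600


if __name__ == "__main__":
    for gbps in (1, 10):
        hours = transfer_hours(10, gbps)
        print(f"10 TB over {gbps} Gbps: about {hours:.1f} hours")
```

Under these assumptions, moving 10 TB takes roughly 32 hours on a 1 Gbps link and a little over 3 hours on a 10 Gbps link, which is why bandwidth and transfer acceleration figure so heavily in a HaaS decision.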