Apache Hadoop is an open source software framework that enables high-throughput processing of big data sets across distributed clusters. Its core modules include Hadoop Common, a set of shared utilities that support the other modules; the Hadoop Distributed File System (HDFS); Hadoop YARN, which schedules jobs and manages cluster resources; and Hadoop MapReduce, a YARN-based system for parallel processing of large data sets.
Apache also offers additional open source software that runs atop Hadoop, such as the Spark analysis engine (which can also run independently) and Pig, a data-flow platform whose scripting language is called Pig Latin.
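To make the MapReduce model concrete, here is a minimal word-count example in Python, written in the style of a Hadoop Streaming mapper and reducer. This is a local sketch only: in a real cluster the two phases run as separate scripts across many nodes, with the framework handling the sort-and-shuffle between them.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input split
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Reduce phase: in Hadoop the shuffle delivers pairs grouped by key;
    # here we sort locally to simulate that, then sum counts per word
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    docs = ["big data big clusters", "data lakes"]
    print(dict(reducer(mapper(docs))))
```

The same mapper and reducer logic, packaged as standalone scripts, is what a Hadoop Streaming job distributes across the cluster.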
Hadoop is popular because it provides a nearly limitless environment for big data processing on commodity hardware. Adding nodes is a simple process with no negative impact on the framework, and Hadoop scales from a single server to thousands of machines, each contributing its own compute and storage. Because Hadoop provides high availability at the application layer, cluster hardware can be off-the-shelf.
Real-life use cases include online travel (the Hadoop project claims to be the go-to big data platform for 80% of online travel bookings), batch analytics, social media application serving and analysis, supply chain optimization, mobile data management, healthcare, and more.
The downside? Hadoop is complex and requires significant staff time and expertise, which has dampened its adoption rate in businesses lacking specialized IT staff. It can also be a challenge to derive business value in the face of expert administrator requirements and capital expenditures on widely distributed clusters.
Cluster management can also be tricky. Hadoop unifies distributed clusters, but equipping and managing additional data centers, not to mention working with remote staff, adds to complexity and cost. The upshot is that Hadoop clusters can be far more isolated than they should be.
Cloud to the Rescue?
Going to the cloud is not an either/or proposition for Hadoop owners. Some businesses with Hadoop expertise will choose Infrastructure as a Service (IaaS) for better cluster management and will continue to manage Hadoop in-house. This article will discuss going all the way to a fully managed Hadoop deployment online. We refer to this as Hadoop-as-a-Service (HaaS), a sub-category of Platform-as-a-Service (PaaS).
Running Hadoop as a managed cloud-based service is not a cheap proposition, but it does save money over buying large numbers of clusters. It also reduces the need for in-house Hadoop management expertise and avoids long learning curves. Most managed deployments give customers a self-service portal for analytics and other data operations while the provider handles all infrastructure, management and processing operations.
This is not an easy thing to do. Hadoop architecture requires a highly scalable and dynamic compute environment, and Hadoop experts are necessary for complex configuration and software integration. If a business decides to go with a managed service, it will not have to hire staff experts, but the provider will. The more expertise, customized configuration and capacity the customer requires, the more expensive the service.
Nevertheless, the expense is usually less than running large Hadoop deployments on-site, and it does cut down on complexity. Instead of spending staff time and high dollar amounts on cluster and workload management, IT can set policies and enable data operations from a web-based console. The provider will manage day-to-day tasks and automatically provision for dynamic workloads. The service will also handle data and process distribution.
Of course, nothing is perfect including HaaS. To begin with, the business will be moving big data in and out of the cloud. This creates latency that IT must redress by buying fatter pipes and/or investing in data movement acceleration. IT must also carry out due diligence on the HaaS provider’s performance levels and Quality of Service. Here are a few top capabilities to look for:
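On the latency point, even back-of-the-envelope arithmetic shows why pipe size matters when moving big data into the cloud. The Python sketch below estimates bulk-transfer time; the efficiency factor is an assumption standing in for protocol overhead and contention, not a measured value.

```python
def transfer_hours(dataset_gb, link_mbps, efficiency=0.7):
    """Rough bulk-transfer time estimate over a WAN link.

    efficiency discounts protocol overhead and contention (assumed value).
    """
    effective_mbps = link_mbps * efficiency
    seconds = dataset_gb * 8 * 1024 / effective_mbps
    return seconds / 3600

# Moving 10 TB over a 1 Gbps link at 70% efficiency takes over a day
print(round(transfer_hours(10240, 1000), 1))  # roughly 33 hours
```

Numbers like these are what justify fatter pipes or WAN acceleration before committing to a cloud Hadoop deployment.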
· The provider should store data persistently in HDFS. Hadoop does not require using HDFS as a persistent data store, but there are clear advantages to doing so. Before HDFS gained in-memory caching, there were performance problems with using it as a persistent store; now active processes work against HDFS' in-memory cache while Hadoop uses write-behind to persist data to disk. This capability positions HDFS as a data warehouse in its own right, with no need to purchase third-party warehouses or ETL tools. Queries hit the entire store, cache and disk alike. And since HDFS is native to Hadoop, it works seamlessly with YARN and MapReduce.
· Highly elastic compute environment. Hadoop's core ability is maintaining elastic clusters for widely varying workload types, which is even more critical when running cloud-based Hadoop instances. You are already dealing with remote connectivity over the Internet and cannot afford to add another layer of latency. The Hadoop cloud provider must maintain highly dynamic and scalable environments, and the service should support mixed workloads such as data ingestion and customer data analysis. Server and storage capacity should be capable of on-the-fly, highly automated provisioning.
· Non-stop operations. Another consideration is the ability to recover from processing failures without having to restart an entire job. The Hadoop provider should be capable of non-stop operations, which is a non-trivial matter. Clarify that the provider's non-stop support restarts an operation from the beginning of the failed sub-task, not the entire job.
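On the HDFS persistence point above, one practical consequence is that warehouse data stays reachable through standard Hadoop interfaces such as the WebHDFS REST API, which serves reads and directory listings over plain HTTP. The sketch below only composes the request URL; the hostname and the default NameNode HTTP port are illustrative, and a managed provider's endpoint will differ.

```python
def webhdfs_url(namenode, path, op, port=50070):
    # Compose a WebHDFS REST URL, e.g. op=OPEN for a read or
    # op=LISTSTATUS for a directory listing. Host and port here
    # are illustrative defaults, not a specific provider's endpoint.
    return "http://{}:{}/webhdfs/v1{}?op={}".format(namenode, port, path, op)

# A read request against a hypothetical warehouse path:
print(webhdfs_url("namenode.example.com", "/warehouse/events", "OPEN"))
```

An HTTP GET against that URL (with appropriate authentication) is all a client needs to pull data out of the store.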
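The elasticity requirement can also be made concrete with a toy scaling policy. The function below sizes a cluster to its pending work while respecting minimum and maximum node counts; every name, default and bound is illustrative, not any provider's actual algorithm.

```python
import math

def target_nodes(pending_tasks, tasks_per_node=8, min_nodes=3, max_nodes=100):
    # Naive elastic-scaling policy: provision enough nodes for the queued
    # tasks, clamped to configured bounds. Real services weigh many more
    # signals (memory pressure, SLAs, spot pricing, scale-down cooldowns).
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))
```

Under this toy policy a burst of 200 pending tasks grows the cluster to 25 nodes, while an idle cluster shrinks back to the three-node floor.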
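And the non-stop point can be illustrated in miniature: the essential behavior is retrying only the failed sub-task, never rerunning the work that already succeeded. A sketch, with stage structure and retry counts chosen purely for illustration:

```python
def run_stages(stages, max_retries=2):
    # "Non-stop" recovery: when a stage fails, retry that stage alone
    # rather than restarting the whole pipeline. Each stage is a callable
    # returning its result; completed results are never recomputed.
    results = []
    for stage in stages:
        for attempt in range(max_retries + 1):
            try:
                results.append(stage())
                break  # stage succeeded; move on without redoing prior work
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted; surface the failure
    return results
```

A provider that only offers whole-job restarts would, in these terms, rerun every earlier stage on any failure, which is exactly the behavior to screen out.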
Many large cloud vendors offer services to Hadoop service providers, including HP (Helion), Google, Amazon, Rackspace and Microsoft (Azure). However, these cloud vendors may or may not offer their own managed Hadoop services. This vendor section covers managed Hadoop service providers, not simply the infrastructure on which Hadoop runs.
Qubole's core offering is Hadoop-as-a-Service (HaaS). Qubole Data Service offers fully managed, on-demand clusters that scale up or down depending on data size. Qubole partners with Google Cloud using Google Compute Engine (GCE). Speaking of Google, the Google Cloud Storage connector for Hadoop lets users run MapReduce jobs directly on data stored in GCS, which eliminates copying data on-premises and running a local Hadoop instance. Additional connectors enable GCS users to run MapReduce on data stored in Google Cloud Datastore and Google BigQuery.
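As a sketch of what enabling the GCS connector involves, a Hadoop deployment registers the gs:// filesystem in core-site.xml roughly as follows. Property names should be verified against the connector version actually deployed, and the project ID is a placeholder.

```xml
<!-- core-site.xml fragment: register the gs:// scheme for the GCS connector -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <!-- placeholder; substitute your own GCP project ID -->
  <value>my-gcp-project</value>
</property>
```

With the scheme registered, MapReduce jobs can take gs:// paths as input and output locations just as they would hdfs:// paths.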
Hortonworks Data Platform offers managed enterprise-level HaaS, with Hadoop YARN enabling multiple workloads across a variety of operations. Altiscale has made a big splash with a purpose-built Hadoop cloud service; it stresses robust native security and compliance, sophisticated management services, a high degree of automation, and extensive data and language integration.
Amazon offers Amazon Elastic MapReduce (EMR) as a Hadoop web service. EMR distributes client data and processing across dynamic EC2 instances. Microsoft Azure HDInsight is also a cloud-based Hadoop distribution. HDInsight is Hadoop-only and does not bundle additional Microsoft software. The installation processes both unstructured and semi-structured data from multiple data locations.
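To give a flavor of what launching an EMR cluster programmatically involves, the sketch below builds the kind of request dictionary that boto3's `run_job_flow` call accepts. The release label, instance types and IAM role names are illustrative assumptions; consult the boto3 EMR documentation before relying on any of them.

```python
def emr_cluster_request(name, core_nodes):
    # Sketch of a run_job_flow request for a boto3 EMR client. Values are
    # illustrative assumptions, not verified defaults for any EMR release.
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",          # assumed release label
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 1 + core_nodes,    # one master plus core nodes
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }
```

Passing such a dictionary to an EMR client (e.g. `boto3.client("emr").run_job_flow(**request)`) is what turns a batch workload into a transient, pay-per-use cluster.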
IBM BigInsights on Cloud is based on Hadoop, integrating Hadoop core offerings and modules with IBM management consoles, analytics, and query engines. The cloud version runs BigInsights as a Hadoop service on IBM SoftLayer.
Frankly, Hadoop adoption has not lived up to the hype. Enterprises with massive big data needs have widely adopted it because they have the computing budgets to match, but many more mid-market and even enterprise-level companies have not adopted Hadoop because of its complexity and ongoing optimization requirements.
We believe that managed Hadoop services will bring many more business users into the fold, as long as providers optimize their data centers for performance and users take steps to accelerate data transfer.