When it comes to tools for working with Big Data, open source solutions in general and Apache Hadoop in particular dominate the landscape. Forrester analyst Mike Gualtieri recently predicted that "100 percent of large companies" would adopt Hadoop over the next couple of years. A report from Market Research forecasts that the Hadoop market will grow at a compound annual growth rate (CAGR) of 58 percent through 2022 and will be worth more than $1 billion by 2020. And IBM believes so strongly in open source Big Data tools that it has assigned 3,500 researchers to work on Apache Spark, a tool that is part of the Hadoop ecosystem.
This month, we've updated our list of top open source Big Data tools. This area has seen a lot of activity recently, with the launch of many new projects. Many of the most noteworthy projects are managed by the Apache Software Foundation and are closely related to Hadoop.
Please note that this is not a ranking; instead, projects are organized by category. And as always, if you know of additional open source Big Data and/or Hadoop tools that should be on our list, please feel free to note them in the Comments section below.
Apache's Hadoop project has become nearly synonymous with Big Data. It has grown to become an entire ecosystem of open source tools for highly scalable distributed computing. Operating System: Windows, Linux, OS X.
Part of the Hadoop ecosystem, this Apache project offers an intuitive Web-based interface for provisioning, managing, and monitoring Hadoop clusters. It also provides RESTful APIs for developers who want to integrate Ambari's capabilities into their own applications. Operating System: Windows, Linux, OS X.
Apache Avro provides a data serialization system with rich data structures and a compact binary format. Schemas are defined in JSON, and Avro integrates easily with dynamic languages. Operating System: OS Independent.
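To give a flavor of the JSON schema format, here is a hypothetical Avro schema for a simple user record (the record and field names are invented for illustration):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": ["null", "int"], "default": null}
  ]
}
```

The union type `["null", "int"]` is Avro's idiom for an optional field, one reason schemas can evolve without breaking older readers.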
Cascading is an application development platform based on Hadoop. Commercial support and training are available. Operating System: OS Independent.
Based on Hadoop, Chukwa collects data from large distributed systems for monitoring purposes. It also includes tools for analyzing and displaying the data. Operating System: Linux, OS X.
Flume collects log data from other applications and delivers it into Hadoop. The website boasts, "It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms." Operating System: Linux, OS X.
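Flume agents are wired together in a properties file as sources, channels, and sinks. A minimal, hypothetical configuration that tails a log file into HDFS (agent name, file paths, and component names are all invented) might look like this:

```properties
# One agent (a1) with one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail an application log
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory

# Sink: write events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
```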
Designed for very large tables with billions of rows and millions of columns, HBase is a distributed database that provides random real-time read/write access to big data. It is somewhat similar to Google's Bigtable, but built on top of Hadoop and HDFS. Operating System: OS Independent.
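HBase's data model (tables of rows keyed by a row key, with columns grouped into column families) is easiest to see from its interactive shell. A hypothetical session, with made-up table and column names:

```
create 'metrics', 'cf'                     # table with one column family
put 'metrics', 'row1', 'cf:visits', '42'   # write a single cell
get 'metrics', 'row1'                      # random read by row key
scan 'metrics'                             # range scan over rows
```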
HDFS is the file system for Hadoop, but it can also be used as a standalone distributed file system. It's Java-based, fault-tolerant, highly scalable and highly configurable. Operating System: Windows, Linux, OS X.
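The `hdfs dfs` command exposes familiar file operations against the distributed store; the paths and filenames below are purely illustrative:

```shell
hdfs dfs -mkdir -p /data/logs          # create a directory in HDFS
hdfs dfs -put access.log /data/logs/   # copy a local file in
hdfs dfs -ls /data/logs                # list its contents
hdfs dfs -cat /data/logs/access.log    # read a file back
```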
Apache Hive is the data warehouse for the Hadoop ecosystem. It allows users to query and manage big data using HiveQL, a language that is similar to SQL. Operating System: OS Independent.
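HiveQL reads much like SQL. As a sketch, a query against a hypothetical web-log table (the table name and columns are invented for illustration); Hive compiles statements like these into distributed jobs behind the scenes:

```sql
CREATE TABLE logs (ip STRING, url STRING, ts TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT url, COUNT(*) AS hits
FROM logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```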
Hivemall is a collection of machine learning algorithms for Hive. It includes highly scalable algorithms for classification, regression, recommendation, k-nearest neighbor, anomaly detection and feature hashing. Operating System: OS Independent.
According to its website, the Mahout project's goal is "to build an environment for quickly creating scalable performant machine learning applications." It includes a variety of algorithms for doing data mining on Hadoop MapReduce, as well as some newer algorithms for Scala and Spark environments. Operating System: OS Independent.
An integral part of Hadoop, MapReduce is a programming model that provides a way to process large distributed datasets. It was originally developed by Google, and it is also used by several other big data tools on our list, including CouchDB, MongoDB and Riak. Operating System: OS Independent.
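Hadoop exposes MapReduce through a Java API, but the model itself is easy to sketch in plain Python: a map function emits key/value pairs, a shuffle groups values by key, and a reduce function folds each group into a result. This toy word count runs locally and illustrates only the programming model, not Hadoop's distributed runtime:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each key's values into a single result.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data tools", "big data big clusters"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])  # 3
```

Because map emits independent pairs and reduce only sees one key at a time, both phases can be spread across many machines, which is the whole point of the model.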
Oozie is a workflow scheduler specifically designed to manage Hadoop jobs. It can trigger jobs by time or by data availability, and it integrates with MapReduce, Pig, Hive, Sqoop and many other related tools. Operating System: Linux, OS X.
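Oozie workflows are declared in XML as a graph of actions. A bare-bones, hypothetical workflow that runs a single Pig script (the workflow name, script name, and transitions are invented) gives the idea:

```xml
<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
  <start to="clean"/>
  <action name="clean">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>clean.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>ETL step failed</message></kill>
  <end name="end"/>
</workflow-app>
```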
Apache Pig is a platform for distributed big data analysis. It relies on a programming language called Pig Latin, which boasts simplified parallel programming, optimization and extensibility. Operating System: OS Independent.
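As a sketch of what Pig Latin looks like, here is the canonical word count, with an invented input filename; each statement transforms the whole dataset, and Pig turns the pipeline into parallel jobs:

```pig
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
STORE counts INTO 'output';
```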
Enterprises frequently need to transfer data between their relational databases and Hadoop, and Sqoop is one tool that gets the job done. It can import data to Hive or HBase and export from Hadoop to RDBMSes. Operating System: OS Independent.
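A pair of hypothetical Sqoop invocations shows both directions; the connection string, table names, and paths are placeholders:

```shell
# Import a MySQL table into Hive
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl -P \
  --table orders \
  --hive-import

# Export Hadoop output back to the relational database
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --table order_summaries \
  --export-dir /results/summaries
```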
An alternative to MapReduce, Spark is a data-processing engine. It claims to be up to 100 times faster than MapReduce when used in memory or 10 times faster when used on disk. It can be used alongside Hadoop, with Apache Mesos, or on its own. Operating System: Windows, Linux, OS X.
Built on top of Apache Hadoop YARN, Tez is "an application framework which allows for a complex directed-acyclic-graph of tasks for processing data." It allows Hive and Pig to simplify complicated jobs that would otherwise take multiple steps. Operating System: Windows, Linux, OS X.
ZooKeeper is an administrative big data tool that describes itself as "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." It allows nodes within a Hadoop cluster to coordinate with each other. Operating System: Linux, Windows (development only), OS X (development only).
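ZooKeeper itself is configured through a small properties file; a hypothetical `zoo.cfg` for a three-node ensemble (hostnames are placeholders) looks like this:

```properties
tickTime=2000                         # base time unit in ms
initLimit=10                          # ticks a follower may take to connect
syncLimit=5                           # ticks a follower may lag the leader
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888    # peer and leader-election ports
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```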
Originally developed by Nokia, Disco is a distributed computing framework that, like Hadoop, is based on MapReduce. It includes a distributed filesystem and a database that supports billions of keys and values. Operating System: Linux, OS X.
An alternative to Hadoop, HPCC is a big data platform that promises very fast speeds and exceptional scalability. In addition to the free community version, HPCC Systems offers a paid enterprise version, paid modules, training, consulting and other services. Operating System: Linux.