Hadoop and Big Data: 60 Top Open Source Tools

These Hadoop and Big Data applications are helping enterprises manage and analyze large stores of data.

When it comes to tools for working with Big Data, open source solutions in general and Apache Hadoop in particular dominate the landscape. Forrester Analyst Mike Gualtieri recently predicted that "100 percent of large companies" would adopt Hadoop over the next couple of years. A report from Market Research forecasts that the Hadoop market will grow at a compound annual growth rate (CAGR) of 58 percent through 2022 and that it will be worth more than $1 billion by 2020. And IBM believes so strongly in open source Big Data tools that it assigned 3,500 researchers to work on Apache Spark, a tool that is part of the Hadoop ecosystem.

This month, we've updated our list of top open source Big Data tools. This area has seen a lot of activity recently, with the launch of many new projects. Many of the most noteworthy projects are managed by the Apache Software Foundation and are closely related to Hadoop.

Please note that this is not a ranking; instead, projects are organized by category. And as always, if you know of additional open source Big Data and/or Hadoop tools that should be on our list, please feel free to note them in the Comments section below.

Hadoop-Related Tools

1. Hadoop

Apache's Hadoop project has become nearly synonymous with Big Data. It has grown to become an entire ecosystem of open source tools for highly scalable distributed computing. Operating System: Windows, Linux, OS X.

2. Ambari

Part of the Hadoop ecosystem, this Apache project offers an intuitive Web-based interface for provisioning, managing, and monitoring Hadoop clusters. It also provides RESTful APIs for developers who want to integrate Ambari's capabilities into their own applications. Operating System: Windows, Linux, OS X.
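As a rough sketch of that RESTful side, the snippet below lists the clusters an Ambari server manages by calling its v1 API with the third-party requests library; the hostname and credentials are placeholders for illustration:

```python
import requests

# Minimal sketch: query Ambari's v1 REST API for the clusters it manages.
# Hostname and credentials are placeholders; 8080 is Ambari's default port.
AMBARI = "http://ambari-host:8080/api/v1"

response = requests.get(f"{AMBARI}/clusters", auth=("admin", "admin"))
response.raise_for_status()

for cluster in response.json().get("items", []):
    print(cluster["Clusters"]["cluster_name"])
```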

3. Avro

This Apache project provides a data serialization system with rich data structures and a compact binary format. Schemas are defined in JSON, and the system integrates easily with dynamic languages. Operating System: OS Independent.
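To make that concrete, the sketch below defines a small record schema in JSON and round-trips a couple of records through Avro's binary container format. It uses the third-party fastavro library, one of several Avro implementations available for Python:

```python
from io import BytesIO
from fastavro import parse_schema, reader, writer

# An Avro schema is plain JSON: here, a record with two typed fields.
schema = parse_schema({
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

# Serialize to Avro's compact binary container format, then read it back.
buf = BytesIO()
writer(buf, schema, records)
buf.seek(0)
for record in reader(buf):
    print(record)
```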

4. Cascading

Cascading is an application development platform for building Big Data applications on Hadoop. Commercial support and training are available. Operating System: OS Independent.

5. Chukwa

Based on Hadoop, Chukwa collects data from large distributed systems for monitoring purposes. It also includes tools for analyzing and displaying the data. Operating System: Linux, OS X.

6. Flume

Flume collects log data from other applications and delivers it into Hadoop. The website boasts, "It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms." Operating System: Linux, OS X.

7. HBase

Designed for very large tables with billions of rows and millions of columns, HBase is a distributed database that provides random real-time read/write access to big data. It is somewhat similar to Google's Bigtable, but built on top of Hadoop and HDFS. Operating System: OS Independent.
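A quick way to see that random read/write model from Python is the third-party happybase client, which talks to HBase through its Thrift gateway. The host, table, and column names below are hypothetical:

```python
import happybase

# Connect to an HBase Thrift gateway (hostname is a placeholder).
connection = happybase.Connection("hbase-thrift-host")

# Rows are keyed byte strings; columns live inside column families ("cf" here).
table = connection.table("users")
table.put(b"user-1001", {b"cf:name": b"Ada", b"cf:city": b"London"})

# Random real-time read of a single row by key.
row = table.row(b"user-1001")
print(row[b"cf:name"])

connection.close()
```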

8. Hadoop Distributed File System

HDFS is the file system for Hadoop, but it can also be used as a standalone distributed file system. It's Java-based, fault-tolerant, highly scalable and highly configurable. Operating System: Windows, Linux, OS X.
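For instance, HDFS can be reached from outside the JVM over its WebHDFS REST interface. The sketch below uses the third-party Python hdfs package; the NameNode address, port, and user are placeholders:

```python
from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (host, port and user are
# placeholders; 9870 is the WebHDFS default on recent Hadoop releases).
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file, list the directory, then read the file back.
client.write("/tmp/hello.txt", data=b"hello, hdfs", overwrite=True)
print(client.list("/tmp"))
with client.read("/tmp/hello.txt") as reader:
    print(reader.read())
```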

9. Hive

Apache Hive is the data warehouse for the Hadoop ecosystem. It allows users to query and manage big data using HiveQL, a language that is similar to SQL. Operating System: OS Independent.
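HiveQL will look familiar to anyone who knows SQL. As a sketch, the query below is submitted through the third-party PyHive client; the server address and the page_views table are hypothetical:

```python
from pyhive import hive

# Connect to a HiveServer2 instance (host is a placeholder;
# 10000 is the default HiveServer2 port).
connection = hive.connect(host="hiveserver-host", port=10000)
cursor = connection.cursor()

# HiveQL reads like SQL but is compiled into distributed jobs over Hadoop data.
cursor.execute("""
    SELECT city, COUNT(*) AS visits
    FROM page_views
    GROUP BY city
    ORDER BY visits DESC
    LIMIT 10
""")
for city, visits in cursor.fetchall():
    print(city, visits)
```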

10. Hivemall

Hivemall is a collection of machine learning algorithms for Hive. It includes highly scalable algorithms for classification, regression, recommendation, k-nearest neighbor, anomaly detection and feature hashing. Operating System: OS Independent.

11. Mahout

According to its website, the Mahout project's goal is "to build an environment for quickly creating scalable performant machine learning applications." It includes a variety of algorithms for doing data mining on Hadoop MapReduce, as well as some newer algorithms for Scala and Spark environments. Operating System: OS Independent.

12. MapReduce

An integral part of Hadoop, MapReduce is a programming model that provides a way to process large distributed datasets. It was originally developed by Google, and it is also used by several other Big Data tools on our list, including CouchDB, MongoDB and Riak. Operating System: OS Independent.
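The model boils down to a map function that emits key/value pairs and a reduce function that aggregates the values sharing a key. The sketch below is a miniature, purely local simulation of that flow in plain Python, not Hadoop's own API; in a real cluster the map and reduce steps would run in parallel across many machines:

```python
# word_count.py -- the MapReduce model in miniature, runnable locally.
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of input."""
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    """Reduce: sum the counts that the shuffle grouped under one key."""
    return word, sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map every input record, then sort -- the "shuffle" that groups equal keys.
pairs = sorted(kv for line in lines for kv in map_phase(line))

# Reduce each group of identical keys.
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reduce_phase(word, (count for _, count in group)))
```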

13. Oozie

This workflow scheduler is specifically designed to manage Hadoop jobs. It can trigger jobs by time or by data availability, and it integrates with MapReduce, Pig, Hive, Sqoop and many other related tools. Operating System: Linux, OS X.

14. Pig

Apache Pig is a platform for distributed big data analysis. It relies on a programming language called Pig Latin, which boasts simplified parallel programming, optimization and extensibility. Operating System: OS Independent.

15. Sqoop

Enterprises frequently need to transfer data between their relational databases and Hadoop, and Sqoop is one tool that gets the job done. It can import data to Hive or HBase and export from Hadoop to RDBMSes. Operating System: OS Independent.

16. Spark

An alternative to MapReduce, Spark is a data-processing engine. It claims to be up to 100 times faster than MapReduce when used in memory or 10 times faster when used on disk. It can be used alongside Hadoop, with Apache Mesos, or on its own. Operating System: Windows, Linux, OS X.
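A minimal PySpark sketch shows the in-memory style behind those speed claims; caching keeps the dataset resident between the two actions. The input path is a placeholder and could just as well be an hdfs:// URI:

```python
from pyspark.sql import SparkSession

# Start a local Spark session (Spark can also run on YARN or Mesos clusters).
spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

# Read lines of text (path is a placeholder) and count word occurrences.
lines = spark.read.text("logs.txt").rdd.map(lambda row: row.value)

counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
         .cache()          # keep results in memory across the actions below
)

print(counts.count())      # first action computes and caches the result
print(counts.take(5))      # second action reuses the in-memory data

spark.stop()
```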

17. Tez

Built on top of Apache Hadoop YARN, Tez is "an application framework which allows for a complex directed-acyclic-graph of tasks for processing data." It allows Hive and Pig to simplify complicated jobs that would otherwise take multiple steps. Operating System: Windows, Linux, OS X.

18. ZooKeeper

This administrative big data tool describes itself as "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." It allows nodes within a Hadoop cluster to coordinate with each other. Operating System: Linux, Windows (development only), OS X (development only).
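From Python, the third-party kazoo client gives a feel for the service. The sketch below stores and reads back a small piece of shared configuration; the server address, znode path, and setting are all hypothetical:

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is a placeholder).
zk = KazooClient(hosts="zk-host:2181")
zk.start()

# Znodes form a filesystem-like tree of small data blobs.
zk.ensure_path("/app/config")
zk.set("/app/config", b"max_workers=8")

# Reads return the data plus metadata (version, timestamps, etc.).
data, stat = zk.get("/app/config")
print(data.decode(), "version:", stat.version)

zk.stop()
```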

Big Data Analysis Platforms and Tools

19. Disco

Originally developed by Nokia, Disco is a distributed computing framework that, like Hadoop, is based on MapReduce. It includes a distributed filesystem and a database that supports billions of keys and values. Operating System: Linux, OS X.
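Disco jobs are written in Python. The word count below follows the pattern from the project's tutorial; the input URL is a placeholder, and Disco can also read from its own distributed filesystem, DDFS:

```python
from disco.core import Job, result_iterator

def map(line, params):
    # Emit (word, 1) for every word in a line of input.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # kvgroup groups the sorted stream by key, like Hadoop's shuffle phase.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    # Input URL is a placeholder for any line-oriented text source.
    job = Job().run(input=["http://example.com/sample.txt"],
                    map=map, reduce=reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```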

20. HPCC

An alternative to Hadoop, HPCC is a big data platform that promises very fast speeds and exceptional scalability. In addition to the free community version, HPCC Systems offers a paid enterprise version, paid modules, training, consulting and other services. Operating System: Linux.


