When it comes to tools for working with Big Data, open source solutions in general and Apache Hadoop in particular dominate the landscape. Forrester Analyst Mike Gualtieri recently predicted that “100 percent of large companies” would adopt Hadoop over the next couple of years. A report from Market Research forecasts that the Hadoop market will grow at a compound annual growth rate (CAGR) of 58 percent through 2022 and that it will be worth more than $1 billion by 2020. And IBM believes so strongly in open source Big Data tools that it assigned 3,500 researchers to work on Apache Spark, a tool that is part of the Hadoop ecosystem.
This month, we’ve updated our list of top open source Big Data tools. This area has seen a lot of activity recently, with the launch of many new projects. Many of the most noteworthy projects are managed by the Apache Foundation and are closely related to Hadoop.
Please note that this is not a ranking; instead, projects are organized by category. And as always, if you know of additional open source big data and/or Hadoop tools that should be on our list, please feel free to note them in the Comments section below.
Hadoop-Related Tools
1. Hadoop
Apache’s Hadoop project has become nearly synonymous with Big Data. It has grown to become an entire ecosystem of open source tools for highly scalable distributed computing. Operating System: Windows, Linux, OS X.
2. Ambari
Part of the Hadoop ecosystem, this Apache project offers an intuitive Web-based interface for provisioning, managing, and monitoring Hadoop clusters. It also provides RESTful APIs for developers who want to integrate Ambari’s capabilities into their own applications. Operating System: Windows, Linux, OS X.
3. Avro
This Apache project provides a data serialization system with rich data structures and a compact format. Schemas are defined with JSON and it integrates easily with dynamic languages. Operating System: OS Independent.
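To give a rough idea of what a JSON-defined schema looks like, the sketch below builds a record schema as a Python dictionary and serializes it with the standard json module. The record name "User", the namespace and the fields are hypothetical, and the snippet does not use the Avro library itself; it only shows the general shape of a schema document.

```python
import json

# Hypothetical record schema; the name, namespace and fields are invented for illustration.
user_schema = {
    "type": "record",
    "name": "User",
    "namespace": "example.avro",
    "fields": [
        {"name": "name", "type": "string"},
        # A union with "null" first makes the field optional, with null as its default.
        {"name": "favorite_number", "type": ["null", "int"], "default": None},
    ],
}

# The schema is plain JSON text, so it can be stored in a file or shipped with the data.
schema_json = json.dumps(user_schema, indent=2)

# Round-trip through the JSON parser to confirm the text is well formed.
parsed = json.loads(schema_json)
field_names = [field["name"] for field in parsed["fields"]]
print(field_names)
```

Because the schema is ordinary JSON, any language with a JSON parser can read it, which is part of why it works well with dynamic languages.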
4. Cascading
Cascading is an application development platform based on Hadoop. Commercial support and training are available. Operating System: OS Independent.
5. Chukwa
Based on Hadoop, Chukwa collects data from large distributed systems for monitoring purposes. It also includes tools for analyzing and displaying the data. Operating System: Linux, OS X.
6. Flume
Flume collects log data from other applications and delivers it into Hadoop. The website boasts, “It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.” Operating System: Linux, OS X.
7. HBase
Designed for very large tables with billions of rows and millions of columns, HBase is a distributed database that provides random real-time read/write access to big data. It is somewhat similar to Google’s Bigtable, but built on top of Hadoop and HDFS. Operating System: OS Independent.
8. Hadoop Distributed File System
HDFS is the file system for Hadoop, but it can also be used as a standalone distributed file system. It’s Java-based, fault-tolerant, highly scalable and highly configurable. Operating System: Windows, Linux, OS X.
9. Hive
Apache Hive is the data warehouse for the Hadoop ecosystem. It allows users to query and manage big data using HiveQL, a language that is similar to SQL. Operating System: OS Independent.
10. Hivemall
Hivemall is a collection of machine learning algorithms for Hive. It includes highly scalable algorithms for classification, regression, recommendation, k-nearest neighbor, anomaly detection and feature hashing. Operating System: OS Independent.
11. Mahout
According to its website, the Mahout project’s goal is “to build an environment for quickly creating scalable performant machine learning applications.” It includes a variety of algorithms for doing data mining on Hadoop MapReduce, as well as some newer algorithms for Scala and Spark environments. Operating System: OS Independent.
12. MapReduce
An integral part of Hadoop, MapReduce is a programming model that provides a way to process large distributed datasets. It was originally developed by Google, and it is also used by several other big data tools on our list, including CouchDB, MongoDB and Riak. Operating System: OS Independent.
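The programming model itself is easy to sketch on a single machine. The example below is a minimal word count written in plain Python, not Hadoop's API: the function names are hypothetical, and a real framework would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of text records."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data tools", "open source big data"])
print(counts)
```

The key property is that each map call looks at one record and each reduce call looks at one key, so both steps can be spread across machines without coordination.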
13. Oozie
This workflow scheduler is specifically designed to manage Hadoop jobs. It can trigger jobs by time or by data availability, and it integrates with MapReduce, Pig, Hive, Sqoop and many other related tools. Operating System: Linux, OS X.
14. Pig
Apache Pig is a platform for distributed big data analysis. It relies on a programming language called Pig Latin, which boasts simplified parallel programming, optimization and extensibility. Operating System: OS Independent.
15. Sqoop
Enterprises frequently need to transfer data between their relational databases and Hadoop, and Sqoop is one tool that gets the job done. It can import data to Hive or HBase and export from Hadoop to RDBMSes. Operating System: OS Independent.
16. Spark
An alternative to MapReduce, Spark is a data-processing engine. It claims to be up to 100 times faster than MapReduce when used in memory or 10 times faster when used on disk. It can be used alongside Hadoop, with Apache Mesos, or on its own. Operating System: Windows, Linux, OS X.
17. Tez
Built on top of Apache Hadoop YARN, Tez is “an application framework which allows for a complex directed-acyclic-graph of tasks for processing data.” It allows Hive and Pig to simplify complicated jobs that would otherwise take multiple steps. Operating System: Windows, Linux, OS X.
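To make the idea of a directed acyclic graph of tasks more concrete, here is a small sketch that orders a set of dependent tasks using Kahn's algorithm. It is plain Python rather than Tez's API, and the task names and the run_order helper are hypothetical.

```python
from collections import deque

# Hypothetical job: each task maps to the set of tasks it depends on.
dag = {
    "load_orders": set(),
    "load_customers": set(),
    "join": {"load_orders", "load_customers"},
    "aggregate": {"join"},
    "write_report": {"aggregate"},
}

def run_order(graph):
    """Kahn's algorithm: return tasks in an order that respects every dependency."""
    remaining = {task: set(deps) for task, deps in graph.items()}
    # Start with the tasks that have no dependencies at all.
    ready = deque(sorted(task for task, deps in remaining.items() if not deps))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        # Mark this task as done for everything that was waiting on it.
        for other in sorted(remaining):
            deps = remaining[other]
            if task in deps:
                deps.remove(task)
                if not deps:
                    ready.append(other)
    if len(order) != len(graph):
        raise ValueError("graph contains a cycle")
    return order

order = run_order(dag)
print(order)
```

Expressing the whole job as one graph lets a scheduler see every dependency up front instead of chaining separate jobs together.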
18. Zookeeper
This administrative big data tool describes itself as “a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.” It allows nodes within a Hadoop cluster to coordinate with each other. Operating System: Linux, Windows (development only), OS X (development only).
Big Data Analysis Platforms and Tools
19. Disco
Originally developed by Nokia, Disco is a distributed computing framework that, like Hadoop, is based on MapReduce. It includes a distributed filesystem and a database that supports billions of keys and values. Operating System: Linux, OS X.
20. HPCC
An alternative to Hadoop, HPCC is a big data platform that promises very fast speeds and exceptional scalability. In addition to the free community version, HPCC Systems offers a paid enterprise version, paid modules, training, consulting and other services. Operating System: Linux.
21. Lumify
Owned by Altamira, which is known for its national security technologies, Lumify is an open source big data integration, analytics and visualization platform. You can see it in action by trying the demo at Try.Lumify.io. Operating System: Linux.
22. Pandas
The Pandas project includes data structures and data analysis tools based on the Python programming language. It allows organizations to use Python as an alternative to R for big data analysis projects. Operating System: Windows, Linux, OS X.
23. Storm
Now an Apache project, Storm offers real-time processing of big data (unlike Hadoop, which only provides batch processing). Its users include Twitter, The Weather Channel, WebMD, Alibaba, Yelp, Yahoo! Japan, Spotify, Groupon, Flipboard and many other companies. Operating System: Linux.
Databases/Data Warehouses
24. Blazegraph
Formerly known as “Bigdata,” Blazegraph is a highly scalable, high-performance database. It is available under an open source or a commercial license. Operating System: OS Independent.
25. Cassandra
Originally developed by Facebook, this NoSQL database is used by more than 1,500 organizations, including Apple, CERN, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit and others. It can support incredibly large clusters; for example, Apple’s deployment includes more than 75,000 nodes with more than 10 PB of data. Operating System: OS Independent.
26. CouchDB
“A database that completely embraces the Web,” CouchDB stores data in JSON documents that can be queried through a Web browser and manipulated with JavaScript. It’s easy to use, highly available and highly scalable across distributed systems. Operating System: Windows, Linux, OS X, Android.
27. FlockDB
Developed by Twitter, FlockDB is a very fast, very scalable graph database that is good at storing social networking data. While it is still available for download, the open source version of this project has not been updated in quite a while. Operating System: OS Independent.
28. Hibari
This Erlang-based project describes itself as “a distributed, ordered key-value store with strong consistency guarantee.” It was first developed by Gemini Mobile Technologies and is used by several telecommunications carriers in Europe and Asia. Operating System: OS Independent.
29. Hypertable
Used by eBay, Baidu, Groupon, Yelp and many other Internet companies, Hypertable is a Hadoop-compatible big data database that promises fast performance. Commercial support is available. Operating System: Linux, OS X.
30. Impala
Cloudera claims that its SQL-based Impala database is “the leading open source analytic database for Apache Hadoop.” It can be downloaded as a standalone product and is also part of Cloudera’s commercial big data products. Operating System: Linux, OS X.
31. MongoDB
Downloaded more than 10 million times, MongoDB is an extremely popular NoSQL database. An enterprise version, support, training and related products and services are available at MongoDB.com. Operating System: Windows, Linux, OS X, Solaris.
32. Neo4j
Calling itself the “fastest and most scalable native graph database,” Neo4j promises massive scalability, fast Cypher query performance and improved developer productivity. Users include eBay, Pitney Bowes, Walmart, Lufthansa and CrunchBase. Operating System: Windows, Linux.
33. OrientDB
This multi-model database combines some of the capabilities of a graph database with some of the capabilities of a document database. Paid support, training and consulting are available. Operating System: OS Independent.
34. Pivotal Greenplum Database
Pivotal boasts that Greenplum is a “best-in-class, enterprise-grade analytical database” that can perform powerful analytics on very large volumes of data very quickly. It’s part of the Pivotal Big Data Suite. Operating System: Windows, Linux, OS X.
35. Riak
“Full of great stuff,” Riak comes in two versions: KV is the distributed NoSQL database, and S2 provides object storage for the cloud. It’s available in open source or commercial editions, with add-ons for Spark, Redis and Solr. Operating System: Linux, OS X.
36. Redis
Now sponsored by Pivotal, Redis is a key-value cache and store. Paid support is available. Note that while the project doesn’t officially support Windows, Microsoft has a Windows fork on GitHub. Operating System: Linux.
37. SQLite
This public-domain software claims to be “the most used database engine in the world” because it is included on every Android, iOS, Mac, and Windows 10 device, as well as being integrated into popular applications like Firefox, Chrome, Skype, iTunes, Dropbox, TurboTax, QuickBooks and others. Its development is sponsored by a consortium of companies that includes Bloomberg, Mozilla, Expensify and others. Operating System: Windows, Linux, OS X, Android.
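Part of SQLite's reach comes from being embedded directly in applications rather than run as a server. Python, for instance, ships a sqlite3 module in its standard library, so the short sketch below runs with no extra installation. The table name and rows are hypothetical.

```python
import sqlite3

# An in-memory database needs no server process and no file on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tools (name TEXT PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO tools (name, category) VALUES (?, ?)",
    [("HBase", "database"), ("Lucene", "search"), ("Cassandra", "database")],
)

# Parameterized queries keep values separate from the SQL text.
(database_count,) = conn.execute(
    "SELECT COUNT(*) FROM tools WHERE category = ?", ("database",)
).fetchone()
names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM tools WHERE category = ? ORDER BY name", ("database",)
    )
]
conn.close()
print(database_count, names)
```

Replacing ":memory:" with a file path would persist the same database to a single file on disk.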
Business Intelligence
38. Talend
Downloaded more than 2 million times, Talend’s open source software offers data integration capabilities. The company also makes paid big data, cloud, data integration, application integration and master data management tools. It counts organizations like AIG, Comcast, eBay, GE, Samsung, Ticketmaster and Verizon among its users. Operating System: Windows, Linux, OS X.
39. Jaspersoft
Used by organizations like Groupon, CA Technologies, USDA, Ericsson, Time Warner Cable, Olympic Steel, The University of Nebraska and General Dynamics, Jaspersoft offers flexible, embeddable BI tools. In addition to the open source community edition, it comes in paid reporting, AWS, professional and enterprise versions. Operating System: OS Independent.
40. Pentaho
Owned by Hitachi Data Systems, Pentaho offers a variety of data integration and business analytics tools. The link above will take you to the free community version; see Pentaho.com for information on paid, supported versions. Operating System: Windows, Linux, OS X.
41. SpagoBI
Called an “open source leader” by market analysts, Spago offers BI, middleware and quality assurance software, as well as a Java EE application development framework. The software is all 100% free and open source, but paid support, consulting, training and other services are available. Operating System: OS Independent.
42. KNIME
Short for “Konstanz Information Miner,” KNIME is an open source analytics and reporting platform. Several commercial and open source extensions are available to increase its capabilities. Operating System: Windows, Linux, OS X.
43. BIRT
BIRT stands for “Business Intelligence and Reporting Tools.” It offers a platform for creating visualizations and reports that can be embedded into applications and websites. It is part of the Eclipse community and is supported by Actuate, IBM and Innovent Solutions. Operating System: OS Independent.
Data Mining
44. DataMelt
The successor to jHepWork, DataMelt can do mathematical computation, data mining, statistical analysis and data visualization. It supports Java and related programming languages including Jython, Groovy, JRuby and Beanshell. Operating System: OS Independent.
45. KEEL
Short for “Knowledge Extraction based on Evolutionary Learning,” KEEL is a Java-based machine learning tool that provides algorithms for a variety of big data tasks. It’s also helpful for assessing the effectiveness of algorithms for regression, classification, clustering, pattern mining and similar tasks. Operating System: OS Independent.
46. Orange
Orange believes data mining should be “fruitful and fun,” whether you have years of experience or are just getting started in the discipline. It offers visual programming and Python scripting tools for data visualizations and analysis. Operating System: Windows, Linux, OS X.
47. RapidMiner
RapidMiner boasts more than 250,000 users, including PayPal, Deloitte, eBay, Cisco and Volkswagen. It offers a wide range of open source and paid versions, but note that the free, open source versions only support data in CSV or Excel formats. Operating System: OS Independent.
48. Rattle
Rattle stands for “R Analytical Tool To Learn Easily.” It provides a graphical interface for the R programming language, simplifying the processes of creating statistical or visual summaries of data, creating models and performing data transformations. Operating System: Windows, Linux, OS X.
49. SPMF
SPMF now includes 93 algorithms for sequential pattern mining, association rule mining, itemset mining, sequential rule mining and clustering. It can be used on its own or incorporated into other Java-based programs. Operating System: OS Independent.
50. Weka
The Waikato Environment for Knowledge Analysis, or Weka, is a set of Java-based machine-learning algorithms for data mining. It can perform data pre-processing, classification, regression, clustering, association rules and visualization. Operating System: Windows, Linux, OS X.
Query Engines
51. Drill
This Apache project allows users to query Hadoop, NoSQL databases and cloud storage services using SQL-based queries. It can be used for data mining and ad hoc queries, and it supports a wide variety of databases, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage and Swift. Operating System: Windows, Linux, OS X.
Programming Languages
52. R
Similar to the S language and environment, R was designed to handle statistical computing and graphics. It includes an integrated suite of big data tools for manipulation, calculation and visualization. Operating System: Windows, Linux, OS X.
53. ECL
Enterprise Control Language, or ECL, is the language developers use for creating big data applications on the HPCC platform. An IDE, tutorials and a variety of related tools for working with the language are available on the HPCC Systems website. Operating System: Linux.
Big Data Search
54. Lucene
Java-based Lucene performs full-text searches very quickly. According to the website, it can index more than 150GB per hour on modern hardware, and it includes powerful and efficient search algorithms. Development is sponsored by the Apache Software Foundation. Operating System: OS Independent.
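Full-text engines such as Lucene get their speed from an inverted index, which maps each term to the documents that contain it. The toy version below is a plain Python sketch, not Lucene's API; the function names and sample documents are hypothetical, and it leaves out the text analysis, relevance scoring and on-disk storage that a real engine provides.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "open source search platform",
    2: "distributed in-memory data grid",
    3: "open source data warehouse",
}
index = build_index(docs)
hits = search(index, "open source")
print(sorted(hits))
```

A query only touches the entries for its own terms, so lookup cost depends on how many documents match rather than on the size of the whole collection.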
55. Solr
Based on Apache Lucene, Solr is a highly reliable and scalable enterprise search platform. Well-known users include eHarmony, Sears, StubHub, Zappos, Best Buy, AT&T, Instagram, Netflix, Bloomberg and Travelocity. Operating System: OS Independent.
In-Memory Technology
56. Ignite
This Apache project describes itself as “a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real-time, orders of magnitude faster than possible with traditional disk-based or flash technologies.” The platform includes data grid, compute grid, service grid, streaming, Hadoop acceleration, advanced clustering, file system, messaging, events and data structure capabilities. Operating System: OS Independent.
57. Terracotta
Calling its BigMemory technology “the world’s premier in-memory data management platform,” Terracotta boasts 2.1 million developers and 2.5 million deployments of its software. The company also offers commercial versions of its software, plus support, consulting and training services. Operating System: OS Independent.
58. Pivotal GemFire/Geode
Earlier this year, Pivotal announced that it would be open-sourcing key components of its Big Data Suite, including the GemFire in-memory NoSQL database. It has submitted a proposal to the Apache Software Foundation to manage the core engine for the GemFire database under the name “Geode.” A commercial version of the software is also available. Operating System: Windows, Linux.
59. GridGain
Powered by Apache Ignite, GridGain offers an in-memory data fabric for fast processing of big data and a Hadoop Accelerator based on the same technology. It comes in a paid enterprise version and a free community edition, which includes free basic support. Operating System: Windows, Linux, OS X.
60. Infinispan
A Red Hat JBoss project, Java-based Infinispan is a distributed in-memory data grid. It can be used as a cache, as a high-performance NoSQL database, or to add clustering capabilities to frameworks. Operating System: OS Independent.