50 Top Open Source Tools for Big Data

Written By

Cynthia Harvey

Jun 4, 2012

Whenever analysts or journalists assemble lists of the top trends for this year, “big data” is almost certain to be on the list. While the catchphrase is fairly new, in one sense, big data isn’t really a new concept. Computers have always worked with large and growing sets of data, and we’ve had databases and data warehouses for years.

What is new is how much bigger that data is, how quickly it is growing and how complicated it is. Enterprises understand that the data in their systems represents a gold mine of insights that could help them improve their processes and their performance. But they need tools that will allow them to collect and analyze that data.

Not surprisingly, the big data market is growing very quickly in response to rising demand from enterprises. According to IDC, the market for big data products and services was worth $3.2 billion in 2010, and the firm predicts it will hit $16.9 billion by 2015. That’s a 39.4 percent annual growth rate, which is seven times higher than the growth rate IDC expects for the IT market as a whole.
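The growth rate quoted above can be checked with the standard compound annual growth rate formula. The following is a minimal sketch in Python; the function name `cagr` is made up for illustration, and it assumes a five-year span from 2010 to 2015 with the two IDC dollar figures as endpoints.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate: the constant yearly rate that
    turns start_value into end_value over the given number of years."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the IDC forecast cited in the article, in billions of dollars.
# The five-year span (2010 to 2015) is an assumption about how the rate
# was derived.
rate = cagr(3.2, 16.9, 5)
print(f"Implied annual growth rate: {rate:.1%}")
```

Run as written, the result lands between 39 and 40 percent, in line with the figure attributed to IDC; small differences can come from rounding in the published numbers.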

Interestingly, many of the best and best-known big data tools available are open source projects. The best known of these is Hadoop, which is spawning an entire industry of related services and products. This month, we’re profiling Hadoop, as well as 49 other big data projects. Here you’ll find a lot of Apache projects related to Hadoop, as well as open source NoSQL databases, business intelligence tools, development tools and much more.

If we’ve overlooked any important open source big data tools, please feel free to note them in the comments section below.

Big Data Analysis Platforms and Tools
Databases
Business Intelligence Tools
Data Mining Tools
Big Data File Systems and Programming Languages
Transfer and Aggregate Tools
Miscellaneous Big Data Tools

Big Data Analysis Platforms and Tools

Perhaps the most interesting aspect of this list of open source Big Data analytics tools is how it suggests the future. It starts with Hadoop, of course, and yet Hadoop is only the beginning. Open source, with its distributed model of development, has proven to be an excellent ecosystem for developing today’s Hadoop-inspired distributed computing software. So take a look at the entries, all of which are to some degree influenced by Hadoop, and realize that these products represent the infancy of what promises to be a very long, and very advanced, development cycle of open source Big Data products.

Databases

Databases and data warehouses are among the cornerstones of open source software in the enterprise. So it’s no surprise that the sixteen open source databases on these pages run the gamut in approach and in the sheer number of tools they offer, not to mention the list of prestigious companies that deploy them. Indeed, as this list clearly shows, there’s no lack of expertise among open source developers when it comes to designing and building advanced database products.

Business Intelligence Tools

A good business intelligence tool makes all the difference to a manager or executive looking to run an efficient business. A top BI tool offers extensive reporting, big data analytics and integration with Hadoop and other platforms, all typically viewable on an intuitive, user-customizable dashboard. Consequently, the open source business intelligence tools seen on these pages are used by many key personnel across all business sectors to make critical decisions.

Data Mining Tools

This array of open source data mining tools is as diverse as the open source community itself. Some are sponsored by companies with the resources for marketing and constant upgrades – and the benefit of constant feedback from customers – while others are classic open source projects, perhaps with an eye toward becoming the next Hadoop or Spark over time. Whatever the case, these pages contain an impressive level of development expertise in the service of Big Data.

Big Data File Systems and Programming Languages

This is a roundup of some of the brightest lights in the Big Data world, a list you’ll certainly be familiar with if you work in Big Data. These open source file systems and open source programming languages are the very foundation of Big Data, the software workhorses that enable IT professionals to turn a vast data set into a source of actionable information and insight. Perhaps most interesting: as advanced as these tools are, they are just the beginning, and the open source community will certainly have a lot more to offer Big Data in the years ahead.

Transfer and Aggregate Tools

When IT professionals need to transfer and aggregate huge data sets for Big Data purposes, they require some heavy-duty tools. They need software that can quickly sift through and index structured and unstructured data, tools that speak the diverse data languages of today’s highly complex Big Data platforms. The fact that some of the leaders in this area are open source file transfer and open source aggregation tools certainly showcases the ever-growing influence of open source in enterprise environments.

Miscellaneous Big Data Tools

Terracotta

Terracotta’s BigMemory technology allows enterprise applications to store and manage big data in server memory, dramatically speeding performance. The company offers both open source and commercial versions of its Terracotta platform, BigMemory, Ehcache and Quartz software. Operating System: OS Independent.

Avro

Apache Avro is a data serialization system based on JSON-defined schemas. APIs are available for Java, C, C++ and C#. Operating System: OS Independent.
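To illustrate what a JSON-defined schema looks like, here is a minimal sketch in Python that builds one using only the standard library. The record name `User`, the namespace and the fields are hypothetical, chosen only for illustration. Actually serializing data against the schema would require an Avro library, which is not shown here.

```python
import json

# Hypothetical Avro record schema, invented for illustration.
# Avro schemas are plain JSON: a record has a name and a list of typed fields.
user_schema = {
    "type": "record",
    "name": "User",
    "namespace": "example.avro",
    "fields": [
        {"name": "name", "type": "string"},
        # A union type: the field may be null or an int. Because the
        # default is null, "null" is listed first in the union.
        {"name": "favorite_number", "type": ["null", "int"], "default": None},
    ],
}

# The JSON text is what would typically be saved in a schema file.
schema_json = json.dumps(user_schema, indent=2)
print(schema_json)
```

Because the schema is ordinary JSON, it can be stored, versioned and inspected with the same tools used for any other JSON document.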

Oozie

This Apache project is designed to coordinate the scheduling of Hadoop jobs. It can trigger jobs at a scheduled time or based on data availability. Operating System: Linux, OS X.

ZooKeeper

Formerly a Hadoop sub-project, ZooKeeper is “a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.” APIs are available for Java and C, with Python, Perl, and REST interfaces planned. Operating System: Linux, Windows (development only), OS X (development only).
