12 Leading Open Source Data Tools

    Open source data analytics tools – sometimes called open source Big Data tools – are used by a wide array of professionals and researchers.
  • The R Project for Statistical Computing

    The R Project for Statistical Computing
    Well known and well respected in the data analytics community, R is a free software environment for statistical computing and graphics. Also helpful: it compiles and runs on a wide variety of Unix platforms, Windows and macOS. https://www.r-project.org/
  • RapidMiner

    RapidMiner
    A multifaceted data science platform built on an open core, RapidMiner integrates machine learning, predictive analytics, text mining and other data functions. It offers both a commercial license and the free Studio Edition. https://rapidminer.com/
  • Gephi

    Gephi
    Open source and free, the Gephi Open Graph Viz Platform is (as the name suggests) visualization software for graphs of many kinds. It helps users understand the underlying structure of relationships between objects. Key point: no programming skills are needed. https://gephi.org/
  • Pentaho

    Pentaho
    Think of Pentaho as an “open source success story”: it was acquired by Hitachi Data Systems (now Hitachi Vantara) in 2015. Like many open source tools owned by commercial companies, Pentaho offers both enterprise and community editions of its data analytics application. The platform is full-featured, supporting data mining, ETL, information dashboards and OLAP. https://community.hitachivantara.com/s/article/downloads
  • NodeXL

    NodeXL
    A free, open source template for Excel, NodeXL makes it simple to work with network graphs. All you have to do is enter your network edge list in Excel, and NodeXL will display the graph. Given that Excel is often referred to as “the leading analytics tool,” NodeXL has a key place in data analytics. https://archive.codeplex.com/?p=nodexl
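    For readers unfamiliar with the term, an edge list is simply a two-column table in which each row names the two nodes that a connection joins. The Python sketch below is not NodeXL code; it is a minimal illustration, using made-up names, of how such a list describes a graph and how a simple metric like node degree falls out of it.

```python
from collections import defaultdict

# Hypothetical edge list: each pair is one row of a two-column sheet
# (source, target). The names are invented for illustration.
edges = [
    ("Ann", "Bob"),
    ("Ann", "Cara"),
    ("Bob", "Cara"),
    ("Cara", "Dan"),
]


def build_adjacency(edge_list):
    """Turn an undirected edge list into a node -> set-of-neighbors mapping."""
    adjacency = defaultdict(set)
    for source, target in edge_list:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return dict(adjacency)


def degrees(adjacency):
    """Count how many neighbors each node has."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


adjacency = build_adjacency(edges)
node_degrees = degrees(adjacency)

# Print each node with its degree and its sorted neighbors.
for node in sorted(node_degrees):
    print(node, node_degrees[node], sorted(adjacency[node]))
```

    A graph tool takes the same kind of input and handles the layout and drawing for you.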
  • KNIME

    KNIME
    Here’s the KNIME open source philosophy: “There’s a general agreement that opening up previously closed or exclusive platforms, processes, tools, organizational boundaries, idea sourcing or funding can speed up innovation.” Truer words were never spoken. The KNIME Analytics Platform enables data mining, collaboration, and predictive analytics. Plus: a large toolbox of commercial extensions allows a more robust platform. https://www.knime.com/
  • OpenRefine

    OpenRefine
    Previously known as Google Refine, OpenRefine is a free tool that helps clean and transform data to different formats. Users can explore large data sets, and reconcile and match data. Key point: OpenRefine is offered in 15 languages. http://openrefine.org/download.html
  • Alluxio

    Alluxio
    Formerly known as Tachyon, Alluxio describes itself as a “developer of open source data orchestration software for the cloud.” It moves data closer to machine learning and AI workloads. Alluxio works with tools like Spark and Hadoop to speed performance on big data queries. Operating System: Linux, OS X. https://www.alluxio.io/
  • Lumify

    Lumify
    Created by Altamira Technologies, Lumify describes itself as an "open source big data analysis and visualization platform." It makes it easy to create 2D or 3D graphs that show the relationships between entities or to overlay data on maps. The website offers several videos that show Lumify in action, along with a demo site where users can upload their own data and try out the software. Operating System: Linux. https://www.altamiracorp.com/lumify-slick-sheet/
  • Hadoop

    Hadoop
    Hadoop has had its ups and downs over the years. But this Apache-sponsored project is one of the best-known data tools available. Numerous companies, including Amazon Web Services, Cloudera, Hortonworks, IBM, Pivotal, SyncSort and VMware, offer related products or commercial support for Hadoop. Well-known users include Alibaba, AOL, eBay, Facebook, Google, Hulu, LinkedIn, Spotify, Twitter and Yahoo. Operating System: Windows, Linux, OS X. http://hadoop.apache.org/
  • Hypertable

    Hypertable
    Popular with Web companies, Hypertable is modeled on Google’s Bigtable design as a way to make databases more scalable. Its users include Baidu, eBay, Groupon and Yelp. It is compatible with Hadoop, and commercial support and training are available. Operating System: Linux, OS X. https://www.hypertable.com/
  • Pig

    Pig
    Apache Pig is a platform for distributed big data analysis. It relies on a programming language called Pig Latin, which boasts simplified parallel programming, optimization and extensibility. Operating System: OS Independent. http://pig.apache.org/
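    Pig Latin scripts typically describe a data flow as a chain of steps such as load, group and count. The sketch below is plain Python, not Pig Latin, and runs on a single machine; it only mirrors the shape of a group-and-count flow. The records and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: (user, page) rows standing in for a loaded data set.
records = [
    ("ann", "home"),
    ("bob", "home"),
    ("ann", "search"),
    ("cara", "home"),
]


def group_by(rows, key_index):
    """Group rows by the value found at position key_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


def count_per_group(groups):
    """Replace each group of rows with the number of rows it holds."""
    return {key: len(rows) for key, rows in groups.items()}


# Group the rows by page (column 1), then count the rows in each group.
page_counts = count_per_group(group_by(records, 1))
print(page_counts)
```

    In Pig, a flow of this kind is written declaratively and the platform decides how to spread the work across a cluster, which is where the simplified parallel programming comes from.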

For any number of reasons, open source software is embraced by data analytics researchers and professionals. This might be because many top researchers work in the education sector, where the emphasis is on cutting costs, hence the appeal of a free open source download. Or it might be because the mindset required for the deep exploration of data is similar to the love of software development common among open source developers. Whatever the case, the data tools on this list are open source leaders as data analytics becomes ever more important.