Big Data Tools

Big Data tools enable businesses to glean insight from their storehouses of data, providing a critical competitive advantage in a data-driven business landscape.

Big Data tools are proliferating quickly in response to major demand. In the decade since Big Data emerged as a concept and business strategy, thousands of tools have emerged to perform various tasks and processes, all of them promising to save you time and money and to uncover business insights that will make you money. Big Data analytics tools are clearly enjoying a growing market.

Many of them started out as open source projects, like Hadoop, the initial Big Data software framework, but commercial entities have sprung up rapidly to provide either new tools or commercial support and development for the open source products.

Weeding through them all can be a challenge, especially since many Big Data tools have a single purpose and you can do many different things with Big Data, so your analytics toolbox can fill up quickly. We’ll run down a list of major Big Data analytics tools and then cover three major categories to keep in mind, as recommended by an expert consultant in this field.

Major Big Data Tools

As noted earlier, Big Data tools tend to fall into a single use category, and there are multiple ways to use Big Data. So we will break things down by category, then list the analytics tools in each.

Data Storage and Management

Big Data all starts with the data store. That means starting with Hadoop, the Big Data framework. It’s an open-source software framework, run by the Apache Software Foundation, for distributed storage and processing of very large datasets on clusters of commodity computers. Major players in this field are:

Cloudera: essentially Hadoop with some extra services added on, which you will need because Big Data is not a trivial exercise. Cloudera’s services team can not only help you build your Big Data cluster but also train your people to better access the data.

MongoDB: One of the most popular databases for Big Data because it’s good for managing unstructured data, which Big Data often is, or data that changes frequently.

Talend: a company with a broad array of solutions, Talend’s offering is built around its Integration Platform, which combines big data, cloud, application, and real-time data integration, data preparation and master data management.


Talend Big Data integration includes data quality and governance features. 

Data Cleaning

Before you can really process the data for insights, you need to clean it up, transform it, and turn it into something remotely searchable. Big Data sets tend to be unstructured and unorganized, so some kind of cleaning or transformation is necessary.

OpenRefine: an easy-to-use open source tool for cleaning up messy data by removing duplicates, empty fields and other errors. It has a sizable community around it that can help.

DataCleaner: Like OpenRefine, DataCleaner transforms semi-structured data sets into clean, readable data sets that data visualization tools can read. The company also offers data warehousing and data management services.

Microsoft Excel: Seriously, it has its uses. You can import data from a wide variety of data sources. Excel is particularly good with manual data entry and copy/paste operations. It can remove duplicates, do find and replace, spell check, and has a number of formulas for transforming data. But it gets bogged down quickly and is not ideal for large data sets.
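To make the cleaning step concrete, here is a minimal sketch in Python of the kind of work these tools automate: trimming whitespace, dropping rows with empty fields, and removing duplicates. The sample data, the `clean_rows` function and the choice of required fields are assumptions made for illustration; none of this is how OpenRefine, DataCleaner or Excel works internally.

```python
import csv
import io

# Hypothetical sample data: a duplicate row, an empty field, stray whitespace.
RAW_CSV = """name,email,city
Ada Lovelace,ada@example.com,London
Ada Lovelace,ada@example.com,London
Grace Hopper,,New York
  Alan Turing ,alan@example.com,Manchester
"""


def clean_rows(csv_text, required=("name", "email")):
    """Trim whitespace, drop rows missing a required field, drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Normalize whitespace so "  Alan Turing " becomes "Alan Turing".
        row = {key: (value or "").strip() for key, value in row.items()}
        if any(not row[field] for field in required):
            continue  # skip rows with an empty required field
        fingerprint = tuple(row.items())
        if fingerprint in seen:
            continue  # skip rows that were already kept
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


cleaned = clean_rows(RAW_CSV)
for row in cleaned:
    print(row)
```

A script like this is fine for a few thousand rows. The dedicated tools earn their place when the data is larger or messier, or when non-programmers need to do the cleaning.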

Data Mining

Once data is cleaned and prepared for examination, you begin the search process through data mining. This is where the actual discovery happens and where you make decisions and predictions.

RapidMiner: An easy-to-use predictive analysis tool with a very user-friendly visual interface that means you don’t have to write code to run the analytics products.

IBM SPSS Modeler: A suite of five products for data mining meant for enterprise-scale advanced analytics. Plus IBM services and consulting are second to none.

Teradata: Offers end-to-end solutions for data warehousing, Big Data and analytics, and marketing applications, along with business services, consulting, training and support. All of this is meant to help you truly become a data-driven business.

Like many current Big Data tools, the RapidMiner solution embraces the cloud.

Data Visualization

Data visualization is how your data is displayed in a readable, usable format. It’s where you see charts and graphs and other images that put data into perspective.

Tableau: the leader in this field, its data visualization tools focus on business intelligence, creating all kinds of maps, charts, plots and more without the need to know programming. The company has five products overall, with a free version called Tableau Public for potential customers to experiment with.

Silk: A simpler version of Tableau, Silk lets you visualize data as maps and charts without requiring any programming. It even tries to visualize your data automatically when you first load it. It also makes it easy to publish results online.

Chartio: Chartio uses its own visual query language to create powerful dashboards with just a few clicks without having to know SQL or other modeling languages. Its main difference from others is that you connect directly to databases, so no data warehouse is needed in between.

IBM Watson Analytics: A combination of machine learning (ML) and artificial intelligence (AI) helps provide a smart data science assistant, which acts as a guide for users with a wide range of data science skill sets, from business analyst to data scientist.

Three Levels of Big Data Tools

Big Data tools break down into a three-level pyramid, says Ritesh Ramesh, CTO for the mobile data and analytics program at PwC. The first layer, the largest, is a bunch of open source tools. Every company started this way, like Cloudera and Hortonworks. There is very little value other than the basic infrastructure and servers and storage. Most of the cloud players have commoditized that layer.

The second layer is where most of these vendors have realized that, to increase their market share, they have to build some proprietary apps on top of the open source tools to separate themselves from the rest. Cloudera, for example, built a bunch of things like the data science platform that sits on the Hadoop core.

The third layer is vertical-specific apps. Most of these companies are working with system integrators like PwC, Cognizant or Accenture. That’s where the real value is.

Ramesh said there are three major areas of need in tools, beyond the basic functions. The first is data wrangling tools. “Data learning tools are a great tool in the toolkit for clients to do data quality and profiling, to process through 50 million rows of data to find insights,” he said.

He said the leading vendors include Trifacta, Paxata, and Talend.
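As a rough illustration of what "data quality and profiling" means in practice, the sketch below counts missing values, distinct values and the most common value for each column of a small data set. The records, the column names and the `profile` function are all hypothetical, and the approach is a generic one rather than anything specific to Trifacta, Paxata or Talend.

```python
from collections import Counter

# Hypothetical records; None and "" both count as missing.
records = [
    {"customer_id": 1, "country": "US", "plan": "pro"},
    {"customer_id": 2, "country": "", "plan": "free"},
    {"customer_id": 3, "country": "US", "plan": None},
    {"customer_id": 4, "country": "DE", "plan": "free"},
]


def profile(rows):
    """Return per-column missing counts, distinct counts, and the most common value."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for column in columns:
        values = [row.get(column) for row in rows]
        present = [v for v in values if v not in (None, "")]
        most_common = Counter(present).most_common(1)
        report[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "top": most_common[0][0] if most_common else None,
        }
    return report


report = profile(records)
for column, stats in report.items():
    print(column, stats)
```

At 50 million rows the same idea would have to run on a distributed engine instead of a Python list, which is part of what the commercial wrangling tools provide.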

The second major category of apps is governance, such as how you manage metadata definitions. “A lot of people struggle with that. People dump a lot of junk into the data lake. There are not many tools in the market that can effectively work in the lake. Since a lot of this work is done by IT people they are more interested in pumping data into the lake and not putting a governance structure around it,” he said.

Top vendors: Waterline Data, Tamr’s data cataloging tool, and Collibra.

The third major need, which shows up frequently, is security, said Ramesh. “People want a single product with all layers of security access, column, row, and objects. They want one product that supports user access and security for different data objects. That space is also very green,” he said.

Major vendors in this space are WANdisco and FireEye.
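The layered access Ramesh describes (object, row and column) can be sketched in a few lines of Python. This is only an illustration of the concept: the `Policy` class, the `read` function and the sample `orders` table are hypothetical and do not reflect how any vendor's product works.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Policy:
    """Hypothetical access policy for one role on one table (the object)."""
    table: str                          # object the role may read
    columns: frozenset                  # columns the role may read
    row_filter: Callable[[dict], bool]  # rows the role may read


def read(table_name, rows, policy):
    """Apply the object-level check, then the row and column filters."""
    if policy.table != table_name:
        raise PermissionError(f"no access to table {table_name!r}")
    return [
        {col: row[col] for col in row if col in policy.columns}
        for row in rows
        if policy.row_filter(row)
    ]


orders = [
    {"order_id": 1, "region": "EU", "amount": 120, "card_number": "4111-xxxx"},
    {"order_id": 2, "region": "US", "amount": 80, "card_number": "5500-xxxx"},
]

# An analyst who may see EU orders only, and never the card number.
eu_analyst = Policy(
    table="orders",
    columns=frozenset({"order_id", "region", "amount"}),
    row_filter=lambda row: row["region"] == "EU",
)

visible = read("orders", orders, eu_analyst)
print(visible)
```

The appeal of a single product, as the quote suggests, is that all three checks live in one policy instead of being scattered across separate tools.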
