Call it a perfect storm. Cheap storage and a huge influx of structured and unstructured data have led to the development of numerous Big Data tools designed to help companies ‘unlock the value’ of the giant stores of data they’ve accumulated, from customer records and product performance results to social media feeds and more.
Like traditional business intelligence (BI), these new big data tools can analyze past trends to help companies identify important patterns such as specific sales trends. Many big data tools now also offer a new generation of predictive and prescriptive insights drawn from all that data buried deep in your data center or off-premises.
The challenge, as Gartner analyst Doug Laney tells Datamation, is less about scaling infrastructure to handle all this data than about the variety of the data itself.
“The real challenge that vendors have only scratched the surface of is dealing with, integrating, co-structuring, and making sense of inputs from one’s own transaction and customer data, plus data from partners and suppliers, then also the expanding world of exogenous data such as social media, open data and syndicated data (from data brokers),” Laney said in an email.
And even though Gartner clients by a 2-to-1 margin say data variety is a greater issue for them than either volume or velocity of data, “vendors continue to peddle ‘bigger and faster’ solutions instead.”
Constellation Research analyst Doug Henschen says big data solutions are definitely evolving.
“In my book, 2014 was the year of SQL-on-Hadoop announcements, but this year companies and vendors started to realize that the opportunity with big data isn’t just to scale up traditional BI and Data Warehousing,” says Henschen. “Thus the Apache Spark open source framework and other analytical options that go beyond SQL have been hot in 2015. Spark was embraced by scores of vendors and hundreds of big companies in 2015. IBM was the most visible vendor advocate, but plenty of other data-integration and big data platform companies joined the bandwagon.”
In fact, it seems the big data bandwagon is getting bigger every day with vendors big and small joining in with solutions ranging from the relatively comprehensive to those designed to glean a specific range of insights. While hardly a comprehensive list, these four tools should be on your radar.
H2O for data scientists
H2O.ai is an open source machine learning system for data scientists that the company says will pull data from any source (e.g. Hadoop, SQL) and let you process it for analysis on commodity hardware across even thousands of nodes on a network or Amazon’s AWS cloud. You can start experimenting and continue using H2O.ai for free. The company charges for enterprise support.
“A lot of companies use Spark instead of Hadoop for its really fast short term memory, it’s like the RAM of big data,” says Oleg Rogynskyy, H2O’s VP of Marketing and Growth. “H2O.ai can sit on top of Spark and reads from your short term memory and basically provides superfast analytical capabilities.”
Rogynskyy says H2O.ai is part of a new breed of data tools that aims to provide predictive analysis. He notes that SQL helped drive an earlier phase of descriptive data analysis, or “tell me what happened,” followed by the more recent “predictive phase” of products that look at what’s happened and try to help you predict what’s next: for example, when inventory will run out or when products will break.
“The third phase we’ll see played out in the next couple of years is the prescriptive phase – where the system says ‘Here is what I learned, what I think will happen and the future set of actions you should take to maximize whatever your goal is’,” says Rogynskyy. He points to Google Maps’ ability to proactively suggest alternate routes as an example of a prescriptive solution.
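The three phases Rogynskyy describes can be illustrated with his own inventory example. The following is only a minimal sketch under stated assumptions: the function names, the sales figures and the naive "average daily rate continues" forecast are all hypothetical and have nothing to do with how H2O.ai actually builds its models.

```python
from statistics import mean

def describe(daily_sales):
    """Descriptive: summarize what already happened."""
    return {"total_sold": sum(daily_sales), "avg_per_day": mean(daily_sales)}

def predict_days_until_stockout(stock_on_hand, daily_sales):
    """Predictive: naive estimate that assumes the average daily rate continues."""
    rate = mean(daily_sales)
    if rate <= 0:
        return None  # no sales, so no run-out date can be estimated
    return stock_on_hand / rate

def prescribe(stock_on_hand, daily_sales, lead_time_days):
    """Prescriptive: turn the prediction into a recommended action."""
    days_left = predict_days_until_stockout(stock_on_hand, daily_sales)
    if days_left is not None and days_left <= lead_time_days:
        return "reorder now"
    return "no action needed"

# Hypothetical data: units sold on each of the last four days
sales = [10, 12, 8, 10]
summary = describe(sales)
days_left = predict_days_until_stockout(50, sales)
action = prescribe(50, sales, lead_time_days=7)
```

The point of the sketch is the layering: each phase reuses the output of the one before it, and only the last one recommends an action.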
H2O.ai positions itself as a predictive tool and a kind of “data scientist in a box” that is being used across a range of industries. Networking giant Cisco, for example, has 60,000 models it uses to predict purchasing decisions and uses H2O.ai to score those models. Lou Carvalheira, Cisco Principal Data Scientist, said “The results are fantastic…we see anywhere from three to seven times better results with the models that we have. For the modeling and scoring alone, the H2O.ai environment is upwards of 10 to 15 times faster.”
ThoughtSpot 3.0 – A Big Data Appliance
Back in the early days of Google, the company offered a hardware appliance for enterprises to enable superfast search capabilities behind the firewall. ThoughtSpot borrows a page from that history with its ThoughtSpot appliance, which can pull in data from virtually any source. The company also plans to offer a cloud-based service.
Priced starting at $90,000, ThoughtSpot is a serious tool for companies looking to speed the accessibility of big data insights beyond data scientists. “We’ve seen a rise of data scientists in organizations, but that process of having to go through them for reports slows things down,” says ThoughtSpot Vice President of Marketing Scott Holden. “Two billion humans do search, but at work we still rely on data experts.”
In a demo at company headquarters in Palo Alto, Calif., Holden showed how the system works using the familiar search bar interface. The just-released ThoughtSpot 3.0 has a raft of new features, including “DataRank,” which works similarly to Google’s Typeahead and PageRank. The software uses machine learning algorithms to suggest keywords as you search to speed up the process.
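To make the idea of suggest-as-you-type concrete, here is a minimal sketch that ranks earlier search terms by how often they were used. This is an assumption-laden stand-in, not DataRank: ThoughtSpot says it uses machine learning, whereas this hypothetical `suggest` function uses a plain frequency count, and the query history is invented.

```python
from collections import Counter

def suggest(prefix, past_queries, limit=3):
    """Return up to `limit` past terms starting with `prefix`, most frequent first."""
    counts = Counter(term.lower() for term in past_queries)
    prefix = prefix.lower()
    matches = [(term, n) for term, n in counts.items() if term.startswith(prefix)]
    # Sort by descending frequency, then alphabetically so ties are stable
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in matches[:limit]]

# Hypothetical search history
history = ["sales", "sales", "sales", "salary", "salary", "sales region", "shipping"]
suggestions = suggest("sal", history)
```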
Popcharts is easily the coolest new feature. As you type, say, “Sales by East Coast …” in the search box, ThoughtSpot instantly creates a relevant chart based on the query and uses machine learning to present the best type of chart from more than a dozen that you can also select from.
Another “instant” feature is AutoJoins, which is designed to navigate the hundreds of data sources enterprises typically have. AutoJoins uses ThoughtSpot’s data index, schema index and machine learning to understand how your tables are related and joins them on the fly, presenting the result in under a second.
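As a rough illustration of what joining tables on the fly involves, the sketch below guesses the join key from column names that two tables share and then performs an inner join. It is a deliberately simple stand-in: the `shared_columns` and `auto_join` helpers and the sample tables are hypothetical, and ThoughtSpot's AutoJoins relies on its own indexes and machine learning rather than name matching.

```python
def shared_columns(left_rows, right_rows):
    """Columns present in both tables, used here as a guess at the join key."""
    if not left_rows or not right_rows:
        return []
    return sorted(set(left_rows[0]) & set(right_rows[0]))

def auto_join(left_rows, right_rows):
    """Inner-join two lists of dict rows on every column name they share."""
    keys = shared_columns(left_rows, right_rows)
    if not keys:
        raise ValueError("no shared column to join on")
    # Index the right-hand table by its key values for fast lookup
    index = {}
    for row in right_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**row, **match})
    return joined

# Hypothetical tables
orders = [{"customer_id": 1, "amount": 250}, {"customer_id": 2, "amount": 90}]
customers = [{"customer_id": 1, "region": "East"}, {"customer_id": 3, "region": "West"}]
result = auto_join(orders, customers)
```

Matching on column names breaks down quickly in real schemas, where related columns often have different names, which is presumably why a product in this space needs more than this.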
While ThoughtSpot is more focused on traditional BI analysis of historical data (albeit in a superfast, very accessible way), Holden says predictive and prescriptive analysis features will come in future software revs.
Gartner analyst Doug Laney says Connotate and BrightPlanet are on his list of interesting big data tools because they help harvest and structure a variety of content from within an organization’s own data coffers, plus the Internet itself.
“As organizations realize that navel-gazing at their own data is no longer a sure-fire recipe for innovation, digitalization and growth, they’re increasingly looking to exogenous data (i.e. from outside the company),” says Laney.
Connotate says its patented approach to Web content extraction goes well beyond web scraping or custom scripts. Instead, it combines a visual understanding of how websites work with machine learning, which the company says makes content extraction “scalable, precise and reliable.”
The Connotate platform “easily handles hundreds of thousands of websites and terabytes of data,” according to the company, and delivers targeted information relevant to your business. Connotate says it offers content acquisition that on average costs 55% less than traditional approaches.
In one use case, Connotate helped a sales intelligence provider create a nationwide database of physician profiles by extracting contact data (name, position, phone, email and affiliation) from thousands of hospital websites. Connotate says the big data solution was sold to several large pharmaceutical firms that didn’t have to spend on additional hardware or IT resources. The big data extraction scaled up to provide data on a half-million physicians.
BrightPlanet also extracts data from the Internet, and it touts its ability to scour the so-called “Deep Web” for insights. The deep web includes password-protected websites and other sites typically not indexed by conventional search engines.
BrightPlanet says it harvests data from millions of entries, ranging from tweets and news stories to bankruptcy databases and medical journals, which can be filtered based on the specific needs and criteria of the business.
The company offers a free Data-as-a-Service (DaaS) consultation with one of its Data Acquisition Engineers to see if its services are a good fit. The consultation is designed to help you find the right data to collect for your project and get it in the right format so you can get a good idea of the process and results.
The end user or customer can choose what sites to harvest content from. BrightPlanet in turn “enriches” that content so that, for example, unstructured data like comments at a social media site is presented in a custom format designed to make it more usable to the client or customer.
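What “enriching” an unstructured comment might look like can be sketched in a few lines. Everything here is an assumption made for illustration: the `enrich_comment` helper, the choice of fields and the sample comment are hypothetical, and BrightPlanet does not describe its actual output format in this article.

```python
import re

def enrich_comment(raw_text, source):
    """Turn one free-text comment into a structured record."""
    text = " ".join(raw_text.split())  # collapse stray whitespace and newlines
    return {
        "source": source,
        "text": text,
        "hashtags": re.findall(r"#(\w+)", text),
        "mentions": re.findall(r"@(\w+)", text),
        "word_count": len(text.split()),
    }

# Hypothetical social media comment
record = enrich_comment("Loving the new  phone from @AcmeCorp!\n#gadgets #review", "social")
```

Once comments are in a uniform record like this, they can be filtered or counted with ordinary database or BI tools, which is the practical benefit of structuring them.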