The rapid proliferation of the Internet, the surge in social media activity and the digitization of oceans of information have presented the IT industry with a challenge: how to deal with it all? How to sift through it and make sense of it?
The various technologies and solutions to these questions are lumped under the hot buzzword Big Data.
And as the industry works through this challenge, experts believe 2012 could be the year in which more companies try out non-relational models for data management.
A key indicator of this trend is the move away from the previously accepted application-centric approach to one that is more data-centric. Applications in the past drove the analytics, and then proceeded to implementation. The approach was to work backward to the respective information sources.
However, companies now recognize the value of data from different sources and the demand is for applications to integrate with them. The ability to work with all kinds of data now determines the success of an application. “Partly encouraged by new technologies like Hadoop, data is now the prime component,” explains Eli Collins, Software Engineer at Cloudera.
Management Leveraging Big Data
This is a recognition that management practices based on analytics have a positive impact on businesses and can, in some circumstances, even create competitive advantage. Data drives analytics, which in turn forms the basis of a good decision. And therefore the ability to capture, manage, store, quickly analyze and disseminate it to the right sources becomes vitally important.
“This change has been coming about over the past five years – some of the best companies always knew about it but they may not have had all the tools at an affordable cost,” says Dan Vesset, VP, Business Analytics, IDC.
Big Data, and Hadoop in particular, gets a lot of attention from all of the industry big guns – IBM, Oracle, EMC, and Microsoft to name a few – but there is still a lot of hype surrounding it. The early adopters have already distanced themselves from others through their use of Big Data analytics.
However, for all “the cutting edge work undertaken by these companies, the majority of the enterprise customers are still in the tire kicking phase,” says Herain Oberoi, Director, Product Management, Microsoft.
The industry is optimistic, nonetheless.
Already all the major players have started to evaluate how to make IT simpler, to manage the data and to ensure that end users such as business analysts and data scientists derive value from it.
Last year Microsoft shipped Hadoop connectors for its SQL Server database and Parallel Data Warehouse appliance, and announced Hadoop-based services for Windows Azure, Hadoop on Windows as an on-premises offering, and integration with its Business Intelligence tools. The company is currently testing a second Community Technology Preview of a Hadoop-based service for Windows Azure.
Jack Norris, VP, Marketing, MapR Technologies, added that Hadoop allows data access across commodity hardware and scales linearly, which makes it easier to handle the fast-growing machine-generated data sources. Furthermore, the inclusion of enterprise-grade capabilities like snapshots and mirroring, and integration with standard protocols like Network File System (NFS), makes Hadoop easier to integrate into the existing IT environment.
Merging Big Data With Earlier Solutions
Additionally, efforts by the vendors to merge the newer Big Data solutions with traditional platforms, both in data management and in data warehousing and Business Intelligence, are also under way.
For instance, unstructured data requires complex algorithms. This code does not run inside SQL relational databases, but it can be parallelized efficiently using a MapReduce programming model. The twist here, explains Martin Willcox, Director of Product & Solutions Marketing at Teradata, is that MapReduce is a parallel programming model and the most common implementation is Hadoop.
Hadoop, despite its many strengths, also has some weaknesses. It is a batch-oriented environment, and it is difficult to maintain a high level of user concurrency on it. In this case, Teradata employs the Aster MapReduce appliance, which combines Aster’s relational database and the SQL MapReduce framework that allows users to store complex unstructured data through a simple SQL interface.
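The MapReduce model mentioned above can be illustrated with a minimal word-count sketch in plain Python. This is not Hadoop or Aster code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a local thread pool stands in for the cluster nodes that a real implementation would use. The point is only that each document is mapped independently, which is what makes the work easy to spread across machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(mapped_outputs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key."""
    return key, sum(values)


def word_count(documents, workers=4):
    # Each document is mapped independently of the others; a thread pool
    # stands in here for the worker nodes of a real cluster (assumption).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, documents))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

Calling `word_count(["big data big", "data hadoop"])` counts each word across both documents. In a framework such as Hadoop, the map and reduce functions are the parts the developer writes, while the framework handles distribution, shuffling and recovery from failures.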
“If we are going to really bring this MapReduce programming model and the ability to extract really interesting information from all this new and complex data, we need to industrialize that whole process. In the same way that we have industrialized traditional BI in the past,” added Willcox.
Oracle, too, with its Oracle Big Data Appliance, enables its customers to jump-start their big data projects through a comprehensive and pre-integrated solution. “This is well integrated with our database machine Exadata, which the customer can readily deploy and support,” explains Nick Whitehead, Business Analytics Senior Director, Oracle.
The IT department would also need to understand a separate hardware model, because the deployment of the open source Hadoop on Direct-Attached Storage (DAS) does not fit into any of its traditional IT practices around backup, replication or security policy. Here, EMC integrated the technologies from Isilon and Greenplum to offer an enterprise solution, where the Isilon storage system is an IT storage play with replication, snapshots and security integration – everything an organization is familiar with on top of the Greenplum Hadoop infrastructure.
Big Data, Extended
There are other concepts in dealing with massive data sets, like Data Policies, Data Protection, and Access to Data, which, according to Michael Chui, Senior Fellow, McKinsey Global Institute, organizations need to address to capture the full potential of Big Data.
For instance, data policies for multinationals could prove to be a challenge because regulations differ from country to country. Then there is data security and privacy. Organizations will have to consider and address how to protect sensitive data and tackle questions around intellectual property rights attached to an item of data and other legal questions around liability. “Addressing these issues will be the real enabler towards capturing value,” said Chui.
Vendors have taken note of this and have in place solutions that make addressing these problems easier. IBM, for instance, offers comprehensive data integration, data quality and governance capabilities, which ensure that information delivered after analysis is trusted. IBM extends these concepts, which are implemented at the database level and application level, to its big data solutions. These policies are mapped into its Hadoop-based platform InfoSphere BigInsights, which allows an organization to audit who accesses the data, when and from where.
“With this we are able to provide a complete picture,” says Anjul Bhambhri, VP, Big Data Products, IBM.
The Confusion of the New
Still, there is some confusion over what organizations can achieve with new technologies as they grapple with which solutions would be appropriate to deal with a given problem.
Traditional relational databases, for instance, have the ability to scale, but may be inappropriate for housing the growing volume of unstructured data. “Organization may have to think of scaling out and scaling up differently to find a cost effective mechanism for extracting benefit from low value unstructured data,” says David Rajan, Director of Technology, Oracle. New technologies like Hadoop and NoSQL not only help organizations derive commercial advantage from a very low value data set but also complement existing technologies such as database infrastructure and data warehousing platforms.
Organizations might also endure start-up costs. While Hadoop is a good scale-out analytics platform, it does have a single point of failure, is inefficient from a storage perspective and requires a whole new set of tools, training and processes. “In addition, enterprises are yet to see, post deployment, the way ahead for an open source technology to become mainstream in their architecture,” says Nick Kirsch, Director of Product Management, Isilon.
Hadoop has yet to be deployed on a wider scale and, in the view of some, is difficult to use. Using a more established technology where one can simply push a button to make it work – one backed by a service and support system – can relieve headaches for a CIO. “This is the continual challenge faced by freeware in the enterprise, but innovation is not confined to the open source community,” stated Fernando Lucini, Chief Architect, Autonomy.
There is already a maturation of the Hadoop mindset, and this year the players expect a lot more budget to be allocated towards these projects.
This exponential data growth shows no signs of slowing down. Vanessa Alvarez, Analyst, Infrastructure & Operations, Forrester Research points to a 50% data growth in organizations. The spotlight, consequently, is shifting to current storage architecture, which could limit the potential of forward-looking Big Data solutions. The need is immediate. “Storage is expensive and accounts for 17% of the IT budget,” she says.
The two main concerns here are whether organizations can afford the uncontrollable demand that data growth places on the storage infrastructure, and the need to harness this growth.
Indeed, high-end storage remains underutilized and increasingly takes the form of solid state disk (SSD). “But customers don’t want to have their rarely used or cold data on the expensive SSDs. And depending on the Service Level Agreement, [they] are uncomfortable about placing their data streams on tapes,” says Steve Wojtowecz, VP Software Storage, IBM.
In this scenario, customers would need to pick a storage approach that would work with the software tools they want to use. Carter George, Executive Director, Dell Storage Strategy, urges caution. Some of the software tools that come with their own built-in storage solutions could lock their users into a particular storage architecture. “This market is yet to mature and the best software for this may not have been written yet, or it may just now be getting prototyped. So I’d be hesitant to commit to a storage strategy that is tied to a specific tool,” added George.
Data analytics workloads will become critical and organizations will prefer to move them away from commodity hardware. The focus will be on reliability, whether at the hardware layer or at the file layer that provides RAID or mirroring for recovery after a software failure.
“Going forward, workloads will get to a point where the level of functionality and reliability of a NAS and a SAN will be very important,” added Wojtowecz.