Hadoop and Big Data Without Storage Headaches

Written By

Paul Rubens

Jun 23, 2015

Given the storage needs of Big Data, many things can cause storage headaches, but few are likely to cause more acute ones than Hadoop. That’s because it’s so difficult to set up Big Data-style Hadoop storage with the right combination of speed, capacity, connectivity and enterprise features that meets immediate requirements and can also scale to meet likely future needs.

So it’s perhaps not surprising that a new breed of hyper-converged system is beginning to emerge which is designed to help avoid headaches by keeping Hadoop infrastructure simple.

One of the major headaches that Hadoop can bring on is caused by the vast and rapidly growing amounts of data that many organizations are generating and storing away for possible future analysis – a typical Big Data scenario. Extracting value from Big Data entails getting it into some sort of Hadoop environment where it can be analyzed.

There are a number of options, including a classic Hadoop system made up of individual server nodes with direct attached storage clustered together. Or systems can be built using compute nodes and a separate enterprise storage array that presents itself as Hadoop Distributed File System (HDFS) storage, like an EMC Isilon.

On the face of it this may be simpler because of the enterprise storage management features an array like this offers, and because compute and storage resources can be increased or decreased separately to create the right balance between the two. But a storage array is not cheap, and network connections between a storage array and compute nodes can introduce a bottleneck.

It’s also possible to use virtual nodes that provide both storage and compute. Virtualized compute nodes can also access external storage systems, including cheap storage provided by the likes of VMware’s Virtual SAN (VSAN).

But a hyper-converged system that includes compute and storage nodes, networking and virtualization with Hadoop built in holds out the promise of being far easier to implement and of avoiding most storage headaches. In theory it should just be a matter of dropping it into place and populating it with the data needed for your Big Data plans.

“We want to help customers to get a Hadoop environment up and running in three months, not the nine to thirteen months that it is currently taking,” says Fred Oh, a senior product marketing manager at Hitachi Data Systems.

Nine out of ten times, companies testing Hadoop find that it takes them far more time to build the infrastructure they need than they anticipated, claims Oh. “Getting Hadoop working on four nodes is doable in a few months, but when you start to scale to ten or fifty nodes – which is often the long term goal – teams are learning as they scale.”

The company recently unveiled its Hyper-Scale-Out Platform (HSP) for Big Data Analysis that has been designed to work with Hadoop. Hitachi’s HSP nodes each have 48TB of data storage capacity and twin Xeon processors, and with a maximum of 100 nodes the largest systems have a storage capacity of 4PB (4 million GB) of raw data. (In theory the system can scale beyond 100 nodes, but larger setups have not been tested.) Integrated 40GbE (backend) and 10GbE (front end) network interfaces are designed to minimize the chances of data bottlenecks.

Other systems and reference architectures include the Federation Business Data Lake, IBM Solution for Hadoop, Cisco’s UCS Integrated Infrastructure for Big Data, and Teradata’s Appliance for Hadoop.

Unlike standard Hadoop setups, the HSP doesn’t use HDFS itself, which normally takes responsibility for organizing and marshalling all this storage. Instead it uses a scale-out file system that supports the HDFS API and also presents a POSIX-compliant file system.

There are two advantages to this, according to Oh. First, avoiding HDFS also avoids the data bottleneck and single point of failure that an HDFS NameNode represents. Instead, metadata is distributed to every node in the cluster, so each node knows where every bit of data is.
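The difference between a single metadata server and metadata held on every node can be illustrated with a toy model. The sketch below is not Hitachi's file system or HDFS code. The class names `SingleMetadataServer` and `ReplicatedMetadata` and the block IDs are hypothetical, chosen only to show why a lookup survives a node failure when every node carries its own copy of the block map.

```python
class SingleMetadataServer:
    """Toy NameNode-style design: one server answers every lookup."""

    def __init__(self, block_map):
        self.block_map = dict(block_map)
        self.up = True

    def locate(self, block_id):
        # Every client depends on this one server being reachable.
        if not self.up:
            raise RuntimeError("metadata server is down")
        return self.block_map[block_id]


class ReplicatedMetadata:
    """Toy design in which every node keeps a full copy of the block map."""

    def __init__(self, block_map, node_ids):
        self.copies = {node: dict(block_map) for node in node_ids}

    def fail(self, node):
        # Losing a node removes only that node's copy.
        self.copies.pop(node, None)

    def locate(self, block_id, asking_node):
        # The asking node consults its own local copy.
        return self.copies[asking_node][block_id]


block_map = {"b1": 3, "b2": 5}  # hypothetical block id -> storing node

central = SingleMetadataServer(block_map)
replicated = ReplicatedMetadata(block_map, node_ids=[1, 2, 3])
replicated.fail(1)
print(replicated.locate("b1", asking_node=2))
```

In the replicated model the cost is keeping every copy up to date, which this sketch deliberately ignores.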

Offering a standard POSIX-compliant file system also means that other data analytics tools besides Hadoop can access and analyze the data stored in the HSP – there’s no need to move it out to another storage system.

(This is reminiscent of MapR’s Hadoop distribution, which eschews HDFS and enables NFS reading and writing, providing fine-grained access to data and allowing other applications, such as SQL applications, to use data stored in the Hadoop environment.)

But Oh believes it’s too early to start thinking about using these sorts of appliances for more than analytics storage.

“Companies are likely to move just a subset of their data into an HSP to analyze it,” he says. “Over time this could become a primary data lake – but not overnight. That’s because performance is not as high as an enterprise SAN. So performance (or lack of it) is the driver not to do that overnight.”

But once data is in the system, Oh says that data analytics performance can be very high indeed. That’s because instead of moving data over the network to where the Hadoop application is running, it can carry out data-in-place processing with the application coming to the data.

“Say the data you want to analyze physically resides on nodes 3, 5, and 7, but the application (that you want to use to carry out the analysis) is on node 50. Normally there would have to be data movement, because the data has to get to the application,” says Oh. “But what we do is spin up a VM with the relevant application on one of the nodes which is storing the data, so very little data movement is required.”
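The placement decision Oh describes can be sketched in a few lines. This is an illustration only, not the HSP scheduler: the function names `pick_compute_node` and `blocks_moved`, the block IDs, and the rule "run on the node holding the most blocks" are all assumptions made for the example.

```python
from collections import Counter


def pick_compute_node(block_locations, app_node):
    """Choose where to run the analysis.

    block_locations maps each data block to the node that stores it.
    The node holding the most blocks wins; with no data to be local to,
    the application stays where it is.
    """
    if not block_locations:
        return app_node
    counts = Counter(block_locations.values())
    # Most blocks first; ties go to the lowest node id so the result is stable.
    best_node, _ = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return best_node


def blocks_moved(block_locations, compute_node):
    """Count the blocks that would have to cross the network."""
    return sum(1 for node in block_locations.values() if node != compute_node)


# Data on nodes 3, 5 and 7; the application currently lives on node 50.
blocks = {"b1": 3, "b2": 3, "b3": 5, "b4": 7}
chosen = pick_compute_node(blocks, app_node=50)
print(chosen, blocks_moved(blocks, 50), blocks_moved(blocks, chosen))
```

With these made-up blocks, running on node 50 would move all four blocks, while running on the chosen data-holding node moves only the two stored elsewhere.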

So how likely are converged big data systems like this to succeed in the market as an easy way to get a high performance Hadoop system with adequate storage up and running?

Hadoop and Big Data: Headaches Built-In?

Mike Matchett, a senior analyst at Taneja Group, points out that attempts have been made to make easy-to-implement Hadoop systems before, without much apparent success.

“There was the DDN hScaler project some time ago, and that made the argument about ease of use. But I don’t think that it sold well at all,” he says.

And he adds that the difficulty of getting some form of Hadoop system up and running – perhaps using a software system such as Cloudera Enterprise – can be overstated.

“I have rolled up a (small) Hadoop cluster in a matter of minutes, and if you go to Amazon Elastic MapReduce (EMR) you can get going in less than a minute,” he says.

“But I am not saying that if you have to build something with your own hardware and interface cards and switches and have to make it work it would be that quick – that could take six months,” he adds. So a pre-built, converged system is bound to make things simpler for many organizations looking to get a Big Data analysis system up and running in a hurry, he agrees.

And that means the success of these systems is likely to come down to cost and convenience considerations. At around $25,000 – $35,000 per node, Hitachi’s HSP is not particularly expensive, but there are undoubtedly lower-cost ways of getting a Hadoop environment up and running.
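For a rough sense of what those per-node prices mean for a whole cluster, the arithmetic can be sketched as follows. It uses only figures quoted in this article (48TB of raw capacity per node and $25,000 to $35,000 per node). The function name `size_cluster` and the 10-node example are assumptions, and the result ignores replication overhead, networking, support and any discounts.

```python
TB_PER_NODE = 48                        # raw capacity per HSP node, as quoted
PRICE_PER_NODE_USD = (25_000, 35_000)   # low and high per-node price, as quoted


def size_cluster(nodes):
    """Return raw capacity and a hardware cost range for a cluster."""
    raw_tb = nodes * TB_PER_NODE
    low, high = (nodes * price for price in PRICE_PER_NODE_USD)
    return {
        "nodes": nodes,
        "raw_tb": raw_tb,
        "cost_low_usd": low,
        "cost_high_usd": high,
        # Cost per raw terabyte at each end of the price range.
        "usd_per_raw_tb": (low / raw_tb, high / raw_tb),
    }


estimate = size_cluster(10)
print(estimate["raw_tb"], estimate["cost_low_usd"], estimate["cost_high_usd"])
```

On these assumptions a 10-node system offers 480TB of raw capacity for between $250,000 and $350,000, or roughly $520 to $730 per raw terabyte.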

But for companies in a hurry to unlock the value of their data, a hyper-converged system is certainly a solution worth considering for a Big Data analysis environment without the storage headaches.
