Scaling systems on a distributed basis to handle petabytes of information is no easy task, though it's one that the open source Apache Hadoop project delivers for such big names as Facebook, Google and Yahoo.
Now, the Hadoop framework for running applications across large clusters could see a boost -- thanks to the first official commercial distribution from its lead backer, Cloudera.
With today's release of the Cloudera for Hadoop distribution, Cloudera is aiming to push Hadoop into wider usage by making it easier and more flexible to deploy.
It's an important development for one of the key players in a closely watched open source effort that powers projects and products at some of the biggest Internet firms.
For instance, Cloudera founder Christophe Bisciglia, who managed Google's Hadoop cluster before setting up Cloudera in 2008, said that the search engine leader uses Hadoop to power its academic datacenter. The program, in partnership with the National Science Foundation, makes thousands of CPUs and lots of storage available for data research across different disciplines.
The launch of Cloudera's first release coincides with news from the company, which makes its money by providing commercial support, training and consulting for Hadoop, that it has closed $5 million in venture funding to grow its commercial offerings.
Even so, Bisciglia said Cloudera would continue to work with the other big names involved in Hadoop's development -- a project in which his company actively participates.
"We work closely with developers at Yahoo!, Google and Facebook, and we expect that to continue," Bisciglia told InternetNews.com. "They have solutions for some of the deployment problems we are addressing for regular users, but it's obvious that we all see the value in converging the code that runs on production systems. I'd rather not speculate on specifics, but I am excited to continue working with [those] organizations."
The idea of a clustered file system is not unique to Hadoop. Oracle has its own Oracle Clustered File System (OCFS) and Red Hat has its Global File System (GFS). Yet Bisciglia argued that what Hadoop does is somewhat different, noting that OCFS and GFS are designed to implement the same requirements as regular filesystems, but in a distributed manner.
"Hadoop and HDFS throw out the past requirements, and are optimized for working with very large data sets -- many terabytes to petabytes," Bisciglia said. "What this means is things like accessing a small chunk -- a few KB like an individual Web page or document -- of data from a random file is rather slow in comparison, but Hadoop excels at using many processors and disks to store and process exceedingly large volumes of data."
According to Bisciglia, Cloudera's distribution for Hadoop sweetens the deal for using Hadoop -- lowering the barrier to entry for enterprise users by including a number of tweaks and common, important tools.
As a result, while Cloudera's version of Hadoop is based on the most recent stable version of the core open source project, its distribution may not be exactly the same as the open source project.
"We sometimes include code that we developed to resolve customer issues or feature requests, and we may include that code while we are in the process of contributing it back to Apache," Bisciglia said. "The core will always be the same, but our packaging and user experience will be recognizable."
For instance, Cloudera includes the Hadoop Distributed File System (HDFS), one of the key file systems supported under Hadoop, and one that Cloudera claims can support tens of millions of files in a single instance.
The distribution also includes MapReduce, the open source processing framework in Hadoop that lets an application divide its work into many pieces that run in parallel across a cluster. Meanwhile, data summary analysis is provided by way of the Hive data warehousing infrastructure, another open source tool included in the distribution.
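For readers unfamiliar with the MapReduce model, the idea can be sketched in a few lines of plain Python. This is only an illustration of the concept, run in a single process; it does not use Hadoop's own APIs, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for this example. On a real cluster, each chunk would typically be handled by a different machine.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(chunk: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one chunk of input."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as happens between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values collected for one key into a single total."""
    return key, sum(values)


def word_count(chunks: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input chunks."""
    # Each chunk is mapped independently, which is what makes the
    # map step easy to spread across many machines.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data cluster"]))
```

Because the map step treats each chunk on its own and the reduce step treats each key on its own, both can in principle be spread over as many processors and disks as are available.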
Currently, Cloudera's Hadoop distribution is being made available for Red Hat Enterprise Linux and its variants, though Bisciglia said wider support is on the roadmap.
Moving forward, Bisciglia said he expects that companies in the Web 2.0 space will adopt Hadoop as well as those in biotech, financial services and retail. He added that the key challenge to wider adoption is all about ease of use and deployment, which is what Cloudera is trying to fix.
"Hadoop needs to be just as easy to deploy and use as any other piece of enterprise software," Bisciglia said.
"We're taking steps in that direction by using standard tools for packaging and deployment, and you can expect to see similar improvements and standardizations for developers and users of enterprise Hadoop clusters," he added. "The primary way we overcome these challenges is by giving our distribution away for free, and actively engaging the community in solving problems."
This article was first published on InternetNews.com.