The open source Hadoop project is all about providing the ability to manage and understand large datasets. Yahoo, which uses Hadoop to manage 120 terabytes of data per day, released a new version of its edition of Hadoop this week, but it wasn’t the only one with a new Hadoop release.
Commercial Hadoop vendor Cloudera this week announced Cloudera’s Distribution for Hadoop (CDH) version 3, including some technologies that were previously closed source. In addition to the new version of CDH, Cloudera is announcing a new Enterprise version of its Hadoop distribution, providing additional usability and management features for enterprise users.
CDH is a version of the Apache Hadoop project that bundles additional projects and technologies to make Hadoop more usable for enterprises. CDH includes the Yahoo-developed open source Oozie workflow engine as well as projects originated by Cloudera. Among the Cloudera-originated projects is one called HUE (Hadoop User Experience), which began its life as the closed source Cloudera Desktop.
“Cloudera Desktop was a desktop based user interface for people building apps for Hadoop,” Cloudera CEO Mike Olson told InternetNews.com. “That was always available for free, but it wasn’t open source. We believe that the platform has got to be open source in order to succeed.”
Olson added that Cloudera has rebranded the desktop product as HUE and it has now also evolved. He explained that HUE has become a collection of APIs
Additionally, Olson noted that Cloudera developed the open source Flume project. The Flume project, which is included as part of CDH, is all about getting various data sources into a Hadoop cluster in a continual, reliable and fault-tolerant way. Flume is a complement to the Sqoop project, also developed and open-sourced by Cloudera, which is a tool for importing database tables into Hadoop.
With the HBase project included in CDH, Cloudera is also aiming to expand beyond just SQL types of database inputs.
“HBase is a NoSQL layer on top of HDFS (Hadoop’s filesystem),” Olson said.
To date, Cloudera has built its business around offering services for Hadoop, but with Cloudera Enterprise, it is now aiming to monetize software as well. Cloudera Enterprise includes deployment management tools as well as support and legal indemnification.
As to where Cloudera draws the line between what is an open source feature for CDH versus what is an Enterprise feature for paying customers, it’s all about the platform.
“If it is a platform feature, it belongs in the open source platform,” Olson said. “Platform features include ways to store data reliably — basically any of the plumbing that is required to make data storage and analysis work well.”
Olson explained that the enterprise features are the tools required to integrate Hadoop clusters with existing infrastructure and the dashboards that IT staff need to manage thousands of nodes in a cluster.
While Yahoo is a big contributor and backer of Hadoop, Olson doesn’t see Yahoo’s version of Hadoop as being competitive with Cloudera’s corporate efforts. Olson noted that Cloudera benefits from the work that is done in the open source Hadoop community, including Yahoo’s contributions. That said, in his view the Yahoo version of Hadoop isn’t necessarily the right fit of services for enterprise deployments.
“Yahoo has built a Hadoop distro that runs well on its own infrastructure,” Olson said. “Not all enterprises have the same compute infrastructure as Yahoo does and Yahoo does not provide support for that software.”
Sean Michael Kerner is a senior editor at InternetNews.com, the news service of Internet.com, the network for technology professionals.