Scaling systems on a distributed basis to handle petabytes of information is no easy task, though it’s one that the open source Apache Hadoop project delivers for such big names as Facebook, Google and Yahoo.
Now, the Hadoop framework for running applications across large clusters could see a boost — thanks to the first official commercial distribution from its lead backer, Cloudera.
With today’s release of the Cloudera for Hadoop distribution, Cloudera is aiming to push Hadoop into wider usage by making it easier and more flexible to deploy.
It’s an important development for one of the key players in a closely watched project responsible for powering projects and products at some of the biggest Internet firms.
For instance, Cloudera founder Christophe Bisciglia, who managed Google’s Hadoop cluster before setting up Cloudera in 2008, said that the search engine leader uses Hadoop to power its academic datacenter. The program, in partnership with the National Science Foundation, makes thousands of CPUs and lots of storage available for data research across different disciplines.
The launch of Cloudera’s first release coincides with an announcement from the company, which makes its money by providing commercial support, training and consulting for Hadoop, that it has closed $5 million in venture funding to grow its commercial offerings.
However, Bisciglia said Cloudera would continue to work with the big names involved in Hadoop’s development, a project in which his company also actively participates.
“We work closely with developers at Yahoo!, Google and Facebook, and we expect that to continue,” Bisciglia told InternetNews.com. “They have solutions for some of the deployment problems we are addressing for regular users, but it’s obvious that we all see the value in converging the code that runs on production systems. I’d rather not speculate on specifics, but I am excited to continue working with [those] organizations.”
The idea of a clustered file system is not unique to Hadoop. Oracle has its own Oracle Cluster File System (OCFS) and Red Hat has its Global File System (GFS). Yet Bisciglia argued that what Hadoop does is somewhat different, noting that OCFS and GFS are designed to implement the same requirements as regular filesystems, but in a distributed manner.
“Hadoop and HDFS throw out the past requirements, and are optimized for working with very large data sets (many terabytes to petabytes),” Bisciglia said. “What this means is things like accessing a small chunk (a few KB, like an individual Web page or document) of data from a random file is rather slow in comparison, but Hadoop excels at using many processors and disks to store [and] process exceedingly large volumes of data.”
Making Hadoop harder to resist
According to Bisciglia, Cloudera’s distribution for Hadoop sweetens the deal for using Hadoop — lowering the barrier to entry for enterprise users by including a number of tweaks and common, important tools.
As a result, while Cloudera’s version of Hadoop is based on the most recent stable version of the core open source project, its distribution may not be exactly the same as the open source project.
“We sometimes include code that we developed to resolve customer issues or feature requests, and we may include that code while we are in the process of contributing it back to Apache,” Bisciglia said. “The core will always be the same, but our packaging and user experience will be recognizable.”
For instance, Cloudera includes the Hadoop Distributed File System (HDFS), one of the key file systems supported under Hadoop, and one that Cloudera claims can support tens of millions of files in a single instance.
The distribution also includes MapReduce, the open source programming model at the core of Hadoop that lets applications divide their work into many blocks that are processed in parallel. Meanwhile, data summarization and analysis are provided by the Hive data warehousing infrastructure, another open source tool included in the distribution.
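To give a rough sense of how MapReduce divides work, here is a minimal single-machine sketch in Python. It does not use Hadoop or any Cloudera API; the function names (`map_block`, `shuffle`, `reduce_counts`, `word_count`) and the sample data are hypothetical and chosen only for illustration. Each block of text is mapped independently, the results are grouped by key, and a reduce step combines them.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_block(block):
    """Map step: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped_blocks):
    """Shuffle step: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(blocks, workers=4):
    # Blocks do not depend on each other, so the map calls can be handed
    # to a pool of workers. A real cluster would spread them over machines;
    # here a local thread pool stands in for that.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_block, blocks))
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input: three small "blocks" of a larger data set.
    blocks = ["hadoop stores data", "hadoop processes data", "data at scale"]
    print(word_count(blocks))
```

In an actual Hadoop deployment the blocks would be stored in HDFS and the map and reduce steps would be scheduled across the cluster; the sketch only shows the shape of the computation.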
Currently, Cloudera’s Hadoop distribution is being made available for Red Hat Enterprise Linux and its variants, though Bisciglia said wider support is on the roadmap.
Looking ahead, Bisciglia said he expects companies in the Web 2.0 space, as well as those in biotech, financial services and retail, to adopt Hadoop. He added that the key challenge to wider adoption is ease of use and deployment, which is what Cloudera is trying to fix.
“Hadoop needs to be just as easy to deploy and use as any other piece of enterprise software,” Bisciglia said.
“We’re taking steps in that direction by using standard tools for packaging and deployment, and you can expect to see similar improvements and standardizations for developers and users [of] enterprise Hadoop clusters,” he added. “The primary way we overcome these challenges is by giving our distribution away for free, and actively engaging the community in solving problems.”
This article was first published on InternetNews.com.