
Spark: Lighting a Fire Under Hadoop

Andy Patrizio

Aug 5, 2015

Hadoop has come a long way since its introduction as an open source project from Yahoo. It is moving into production from pilot/test stages at many firms. And the ecosystem of companies supporting it in one way or another is growing daily.

It has some flaws, however, that are hampering the kinds of Big Data projects people can do with it. The Hadoop ecosystem uses a specialized distributed storage file system, called HDFS, to store large files across multiple servers and keep track of everything.

While this helps manage terabytes of data, processing it at the speed of hard drives makes Hadoop prohibitively slow for handling anything exceptionally large or anything in real time. Unless you are prepared to go to an all-SSD array – and who has that kind of money? – you are at the mercy of your 7,200 RPM hard drives.

The power of Hadoop is all centered around distributed computing, but Hadoop has primarily been used for batch processing. It uses the framework MapReduce to execute a batch process, oftentimes overnight, to get your answer. Because of this slow process, Big Data might have promised real-time analytics but it often couldn’t deliver.
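To make the batch model concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It does not use Hadoop's actual API; the function names and sample lines are assumptions chosen for illustration. On a real cluster each phase runs across many machines and writes its intermediate results to disk.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Illustrative input only; not data from the article.
counts = word_count(["big data big cluster", "data at rest"])
```

Even this toy version shows the shape of the problem: nothing can be answered until every phase has finished for the whole data set.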

Enter Spark. It moves the processing that MapReduce does on disk into memory, giving Hadoop a massive speed boost. Its developers claim it runs Hadoop jobs up to 100 times faster in certain applications. That speed, and the potential for real-time processing, opens up Hadoop to many more types of Big Data projects.
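The difference between the two approaches can be sketched in a few lines of plain Python. This is an analogy, not Spark code: the function names and the sample file are assumptions, and the "disk" here is just a temporary file. One version re-reads its input for every pass, as a chain of disk-based batch jobs would; the other reads it once and keeps it in memory.

```python
import os
import tempfile

def passes_from_disk(path, n_passes):
    """Re-read the file for every pass, like chained disk-based batch jobs."""
    totals = []
    for _ in range(n_passes):
        with open(path) as f:
            totals.append(sum(int(line) for line in f))
    return totals

def passes_from_memory(path, n_passes):
    """Read the file once, then run every pass over the cached list."""
    with open(path) as f:
        cached = [int(line) for line in f]
    return [sum(cached) for _ in range(n_passes)]

# Build a small sample file (hypothetical data) and run both versions.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "numbers.txt")
    with open(sample_path, "w") as f:
        f.write("\n".join(str(i) for i in range(1000)))
    disk_totals = passes_from_disk(sample_path, 3)
    memory_totals = passes_from_memory(sample_path, 3)
```

Both versions return the same totals. The second simply avoids going back to storage on every pass, which matters more as the number of passes over the same data grows.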

Spark started as a project at the University of California, Berkeley's AMPLab in 2009 and was donated as an open source project to the Apache Software Foundation in 2013. A company called Databricks was spun out of AMPLab to lead development of Spark.

Patrick Wendell, co-founder and engineering manager at Databricks, was a part of the team that made Spark at Berkeley. He says that Spark was focused on three things:

1) Speed: MapReduce was based on an old Google technology and is disk-based, while Spark runs in memory.

2) Ease of use: “MapReduce was really hard to program. Very few people wrote programs against it. Developers spent so much time trying to write their program in MapReduce and it was a huge waste of time. Spark has a developer-friendly API,” he said. It offers APIs in several languages, including Scala, Python, Java, and R.

3) Broad compatibility: Spark can run on Amazon EC2, Apache Mesos, and various cloud environments. It can read and write data to a variety of databases, such as PostgreSQL, Oracle, and MySQL, and it works with all Hadoop file formats.

“Many people have moved to Spark because they are performance-sensitive and time is money for them,” said Wendell. “So this is a key selling point. A lot of original Hadoop code was focused on offline batch processing, often run overnight. There, latency and performance don’t matter much.”

Because Spark is not a storage system, you can keep your existing storage network; Spark plugs right into Hadoop and gets going. Governance and security are taken care of. “We just speed up the actual crunching of what you are trying to do,” said Wendell. Of course, that is predicated on giving your distributed servers all the memory they need to run everything in memory.

Prakash Nanduri, CEO of the analytics firm Paxata, said that Spark made Hadoop feasible for working in real time. “Now you have the ability to focus on real-time analytics at scale. The huge implication is suddenly you go from 10 use cases to 100 use cases and do it at a cost that is significantly lower than for traditional interactive analytic use cases,” he said.

Many of the vendors that offer some kind of Hadoop distribution, like Cloudera, Hortonworks, and MapR, now bundle Spark with Hadoop as a standard offering, said Wendell.

At a recent Spark Summit, Toyota Motor offered an example of the speed Spark offers. The company uses social media to watch for repair issues in addition to customer inquiries. The problem with the latter is that people don’t care about surveys, so it shifted its emphasis to Twitter and Facebook. The company built an entire system on Spark that monitors social media for keywords.
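Keyword monitoring of this kind can be sketched in plain Python. This is an illustration only, not Toyota's system: the watch list, function names, and sample posts below are all hypothetical, and a production pipeline would run the same filter in parallel across a cluster.

```python
import re

# Hypothetical watch list; the article does not say which keywords were used.
WATCH_KEYWORDS = {"brake", "recall", "stall", "leak"}

def matched_keywords(post, keywords=WATCH_KEYWORDS):
    """Return the watched keywords that appear as whole words in a post."""
    words = set(re.findall(r"[a-z']+", post.lower()))
    return words & set(keywords)

def flag_posts(posts, keywords=WATCH_KEYWORDS):
    """Keep only posts that mention at least one watched keyword."""
    flagged = []
    for post in posts:
        hits = matched_keywords(post, keywords)
        if hits:
            flagged.append((post, sorted(hits)))
    return flagged

# Made-up sample posts for illustration.
sample_posts = [
    "My brake pedal feels soft after the service",
    "Big weekend sale at the dealership!",
    "Is there a recall for the oil leak?",
]
flagged = flag_posts(sample_posts)
```

Here the dealership promotion is dropped and the two posts that mention a watched keyword are kept, along with the keywords that triggered the match.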

Its original customer experience app, run as a regular Hadoop batch job, took 160 hours, or nearly seven days. The same job rewritten for Spark completes in just four hours. The company also parsed the flood of input from social media and was able to filter out things like dealer promos, irrelevant material and incident reports involving Toyota products, reducing the amount of data to process by 50%.
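A quick back-of-the-envelope check, using only the two run times reported above, puts the improvement in perspective. The variable names are ours.

```python
batch_hours = 160   # Hadoop batch job, as reported
spark_hours = 4     # same job rewritten for Spark, as reported

speedup = batch_hours / spark_hours       # how many times faster the Spark job is
batch_days = batch_hours / 24             # the batch run expressed in days
hours_saved = batch_hours - spark_hours   # wall-clock time saved per run
```

That works out to a 40-fold speedup: a job that used to take the better part of a week now fits inside half a working day.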

Another use case is log processing and fraud detection, where speed is of the utmost importance, as banks, businesses and other financial and sales institutions need to move fast to catch fraudulent activity and act on the warnings.

“The business value you achieve is fundamentally derived through the apps. In the case of financial services, you need to be able to detect money laundering cases. You cannot find money laundering signals by running a batch process at night; it has to be in real time,” said Nanduri. “An app built on Spark can do the entire data set in real time and at interactive speeds and get to the answer much faster.”

But Spark isn’t just about in-memory processing. Wendell said half of the performance gains come from running in memory and the other half come from optimizations. “The other systems weren’t designed for latency so we improved on that a lot,” he said.

There is still more work to be done. Wendell said there is a big initiative underway with Databricks and Apache to further improve Spark performance, but he would not elaborate.

“While it offers a standardized way to build highly distributed and interactive analytical apps, it still has a long way to go,” said Nanduri. “Spark lacks security and needs enhanced support for multiple concurrent users, so there is still some work to do.”

Andy Patrizio is a freelance journalist based in southern California who has covered the computer industry for 20 years and has built every x86 PC he’s ever owned, laptops not included.
