Hadoop has come a long way since its introduction as an open source project from Yahoo. It is moving into production from pilot/test stages at many firms. And the ecosystem of companies supporting it in one way or another is growing daily.
It has some flaws, however, that are hampering the kinds of Big Data projects people can do with it. The Hadoop ecosystem uses a specialized distributed storage file system, called HDFS, to store large files across multiple servers and keep track of everything.
While this helps manage the terabytes of data, processing data at the speed of hard drives makes it prohibitively slow for handling anything exceptionally large or anything in real time. Unless you were prepared to go to an all-SSD array – and who has that kind of money? – you were at the mercy of your 7,200 RPM hard drives.
The power of Hadoop is all centered around distributed computing, but Hadoop has primarily been used for batch processing. It uses the framework MapReduce to execute a batch process, oftentimes overnight, to get your answer. Because of this slow process, Big Data might have promised real-time analytics but it often couldn’t deliver.
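For readers unfamiliar with the model, the map, shuffle and reduce steps can be sketched in a few lines of plain Python. This is only an illustrative, single-machine sketch and not Hadoop's actual API: the function names and the sample input are hypothetical, and a real MapReduce job would spread each phase across many servers and read and write its data through HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, standing in for files stored on a cluster.
sample_lines = ["spark runs in memory", "hadoop runs on disk"]
counts = word_count(sample_lines)
print(counts)
```

In a real cluster each phase reads its input from disk and writes its output back to disk, which is a large part of why such jobs tend to run as overnight batches.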
Enter Spark. It moved the processing part of MapReduce to memory, giving Hadoop a massive speed boost. Developers claim it runs Hadoop up to 100 times faster in certain applications, and in the process opens up Hadoop to many more Big Data types of projects, due to the speed and potential for real-time processing.
Spark started as a project in the University of California, Berkeley AMPLab in 2009 and was donated as an open source project to the Apache Software Foundation in 2013. A company was spun out of AMPLab, called Databricks, to lead development of Spark.
Patrick Wendell, co-founder and engineering manager at Databricks, was a part of the team that made Spark at Berkeley. He says that Spark was focused on three things:
1) Speed: MapReduce was based on an old Google technology and is disk-based, while Spark runs in memory.
2) Ease of use: “MapReduce was really hard to program. Very few people wrote programs against it. Developers spent so much time trying to write their program in MapReduce and it was a huge waste of time. Spark has a developer-friendly API,” he said. It supports eight different languages, including Python, Java, and R.
3) Broad compatibility: Spark can run on Amazon EC2, Apache Mesos, and various cloud environments. It can read and write data to a variety of databases, like PostgreSQL, Oracle, MySQL and all Hadoop file formats.
“Many people have moved to Spark because they are performance-sensitive and time is money for them,” said Wendell. “So this is a key selling point. A lot of original Hadoop code was focused on offline batch processing, often run overnight. There, latency and performance don’t matter much.”
Because Spark is not a storage system, you can use your existing storage network and Spark will plug right into Hadoop and get going. Governance and security are taken care of. “We just speed up the actual crunching of what you are trying to do,” said Wendell. Of course, that’s also predicated on giving your distributed servers all the memory they will need to run everything in memory.
Prakash Nanduri, CEO of the analytics firm Paxata, said that Spark made Hadoop feasible for working in real time. “Now you have the ability to focus on real-time analytics at scale. The huge implication is suddenly you go from 10 use cases to 100 use cases and do it at a cost that is significantly lower than for traditional interactive analytic use cases,” he said.
Many of the cloud vendors that offer some kind of Hadoop solution, like Cloudera, Hortonworks, and MapR, are bundling Spark with Hadoop as a standard offering now, said Wendell.
At a recent Spark Summit, Toyota Motor offered an example of the speed Spark offers. It uses social media to watch for repair issues in addition to customer inquiries. The problem with the latter is people don’t care about surveys, so it shifted its emphasis to Twitter and Facebook. The company built an entire system on Spark to monitor social media to watch for keywords.
Its original customer experience app, done as a regular Hadoop batch job, would take 160 hours, or nearly seven days. The same job rewritten for Spark is completed in just four hours, roughly a 40-fold speedup. The company also parsed the flood of input from social media and was able to filter out things like dealer promos, irrelevant material and incident reports involving Toyota products and reduced the amount of data to process by 50%.
Another use case is log processing and fraud detection, where speed is of the utmost importance, as banks, businesses and other financial and sales institutions need to move fast to catch fraudulent activity and act on the warnings.
“The business value you achieve is fundamentally derived through the apps. In the case of financial services, you need to be able to detect money laundering cases. You cannot find money laundering signals by running a batch process at night; it has to be in real time,” said Nanduri. “An app built on Spark can do the entire data set in real time and at interactive speeds and get to the answer much faster.”
But Spark isn’t just about in-memory processing. Wendell said half of the performance gains come from running in memory and the other half comes from optimizations. “The other systems weren’t designed for latency so we improved on that a lot,” he said.
There is still more work to be done. Wendell said there is a big initiative underway with Databricks and Apache to further improve Spark performance, but he would not elaborate.
“While it offers a standardized way to build highly distributed and interactive analytical apps, it still has a long way to go,” said Nanduri. “Spark lacks security and needs enhanced support for multiple concurrent users, so there is still some work to do.”