For example, Spark has no file management of its own and therefore must rely on the Hadoop Distributed File System (HDFS) or some other storage solution. It is wiser to compare Hadoop MapReduce to Spark, because both are data processing engines.
As data science has matured over the past few years, so has the need for a different approach to data and its “bigness.” There are business applications where Hadoop outperforms the newcomer Spark, but Spark has its place in the big data space because of its speed and its ease of use. This analysis examines a common set of attributes for each platform including performance, fault tolerance, cost, ease of use, data processing, compatibility, and security.
The most important thing to remember about Hadoop and Spark is that their use is not an either-or scenario because they are not mutually exclusive. Nor is one necessarily a drop-in replacement for the other. The two are compatible with each other and that makes their pairing an extremely powerful solution for a variety of big data applications.
Hadoop is an Apache project: a software library and framework that allows for distributed processing of large data sets (big data) across computer clusters using simple programming models. Hadoop can scale from a single computer up to thousands of commodity systems that each offer local storage and compute power. Hadoop, in essence, is the ubiquitous 800-lb gorilla of the big data analytics space.
Hadoop is composed of modules that work together to create the Hadoop framework. The primary Hadoop framework modules are:
· Hadoop Common
· Hadoop Distributed File System (HDFS)
· Hadoop YARN
· Hadoop MapReduce
Although the above four modules comprise Hadoop’s core, there are several other modules. These include Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop, which further enhance and extend Hadoop’s power and reach into big data applications and large data set processing.
Many companies that use big data sets and analytics use Hadoop. It has become the de facto standard in big data applications. Hadoop originally was designed to handle crawling and searching billions of web pages and collecting their information into a database. The result of the desire to crawl and search the web was Hadoop’s HDFS and its distributed processing engine, MapReduce.
Hadoop is useful to companies when data sets become so large or so complex that their current solutions cannot process the information in what their users consider a reasonable amount of time.
MapReduce is an excellent text processing engine, and rightly so, since crawling and searching the web (its first job) are both text-based tasks.
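To make the text-processing idea concrete, here is a minimal word-count sketch of the map, shuffle, and reduce phases in plain Python. It is a single-process toy, not the Hadoop API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and a real job would spread these phases across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in order over a list of text documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two tiny "web pages".
pages = ["the web is big", "the crawl is slow"]
print(word_count(pages))
```

Each phase only needs the output of the one before it, which is what lets a framework run many mappers and reducers in parallel on different machines.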
The Apache Spark developers bill it as “a fast and general engine for large-scale data processing.” By comparison, and sticking with the analogy, if Hadoop’s big data framework is the 800-lb gorilla, then Spark is the 130-lb big data cheetah.
Although critics of Spark’s in-memory processing admit that Spark is very fast (up to 100 times faster than Hadoop MapReduce), they might not be so ready to acknowledge that it runs up to ten times faster on disk. Spark can also perform batch processing; however, it really excels at streaming workloads, interactive queries, and machine learning.
Spark’s big claim to fame is its real-time data processing capability as compared to MapReduce’s disk-bound, batch processing engine. Spark is compatible with Hadoop and its modules. In fact, on Hadoop’s project page, Spark is listed as a module.
Spark has its own page because, while it can run in Hadoop clusters through YARN (Yet Another Resource Negotiator), it also has a standalone mode. The fact that it can run both as a Hadoop module and as a standalone solution makes it tricky to compare the two directly. However, as time goes on, some big data scientists expect Spark to diverge from and perhaps replace Hadoop, especially in instances where faster access to processed data is critical.
Spark is a cluster-computing framework, which means that it competes more with MapReduce than with the entire Hadoop ecosystem. For example, Spark doesn’t have its own distributed filesystem, but can use HDFS.
Spark uses memory and can use disk for processing, whereas MapReduce is strictly disk-based. The primary difference between MapReduce and Spark is that MapReduce uses persistent storage and Spark uses Resilient Distributed Datasets (RDDs), which are covered in more detail under the Fault Tolerance section.
There’s no lack of information on the Internet about how fast Spark is compared to MapReduce. The problem with comparing the two is that they perform processing differently, which is covered in the Data Processing section. Spark is so fast because it processes data in memory; it can also use disk for data that doesn’t all fit into memory.
Spark’s in-memory processing delivers near real-time analytics for data from marketing campaigns, machine learning, Internet of Things sensors, log monitoring, security analytics, and social media sites. MapReduce, by contrast, uses batch processing and was never really built for blinding speed. It was originally set up to continuously gather information from websites, and there were no requirements for this data in or near real time.
Ease of Use
Spark is well known for its performance, but it’s also somewhat well known for its ease of use in that it comes with user-friendly APIs for Scala (its native language), Java, and Python, as well as Spark SQL. Spark SQL is very similar to SQL-92, so there’s almost no learning curve required in order to use it.
Spark also has an interactive mode so that developers and users alike can have immediate feedback for queries and other actions. MapReduce has no interactive mode, but add-ons such as Hive and Pig make working with MapReduce a little easier for adopters.
Both MapReduce and Spark are Apache projects, which means that they’re open source and free software products. While there’s no cost for the software, there are personnel and hardware costs associated with running either platform. Both products are designed to run on commodity hardware, such as low-cost, so-called white box server systems.
MapReduce and Spark run on the same hardware, so where are the cost differences between the two solutions? MapReduce uses standard amounts of memory because its processing is disk-based, so a company will have to purchase faster disks and a lot of disk space to run MapReduce. MapReduce also requires more systems to distribute the disk I/O over multiple systems.
Spark requires a lot of memory, but can deal with a standard amount of disk that runs at standard speeds. Some users have complained about temporary files and their cleanup. Typically these temporary files are kept for seven days to speed up any processing on the same data sets. Disk space is a relatively inexpensive commodity and, since Spark does not use disk I/O for processing, the disk space used can be provided by SAN or NAS.
It is true, however, that Spark systems cost more because of the large amounts of RAM required to run everything in memory. But what’s also true is that Spark’s technology reduces the number of required systems. So, you have significantly fewer systems that cost more. There’s probably a point at which Spark actually reduces costs per unit of computation even with the additional RAM requirement.
To illustrate, “Spark has been shown to work well up to petabytes. It has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on one-tenth of the machines.” This feat won Spark the 2014 Daytona GraySort Benchmark.
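The arithmetic behind that quote is worth spelling out. The sketch below assumes that “3X faster” refers to wall-clock time and that “one-tenth of the machines” refers to cluster size, and it treats work as scaling linearly with machine count, which is a simplification. The function names are hypothetical.

```python
from fractions import Fraction


def per_machine_throughput_gain(time_speedup, machine_reduction):
    """How much more work each machine does per unit of time.

    time_speedup: 3 means the job finished three times faster.
    machine_reduction: 10 means one-tenth as many machines were used.
    """
    return time_speedup * machine_reduction


def relative_machine_time(time_speedup, machine_reduction):
    """Total machine-time consumed, as a fraction of the baseline run."""
    return Fraction(1, time_speedup) * Fraction(1, machine_reduction)


print(per_machine_throughput_gain(3, 10))  # 30
print(relative_machine_time(3, 10))        # 1/30
```

Under those assumptions, each Spark machine in the benchmark did roughly 30 times the work per unit of time, which is the kind of ratio that could offset a higher per-system price.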
MapReduce and Spark are compatible with each other, and Spark shares all of MapReduce’s compatibilities for data sources, file formats, and business intelligence tools via JDBC and ODBC.
MapReduce is a batch-processing engine. MapReduce operates in sequential steps by reading data from the cluster, performing its operation on the data, writing the results back to the cluster, reading updated data from the cluster, performing the next data operation, writing those results back to the cluster, and so on. Spark performs similar operations, but it does so in a single step and in memory. It reads data from the cluster, performs its operations on the data, and then writes the results back to the cluster.
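The difference in storage round trips can be sketched with a toy model. The code below is plain Python, not Hadoop or Spark code: ClusterStorage is a hypothetical stand-in that simply counts reads and writes, and the two runner functions mimic the two processing styles as described above.

```python
class ClusterStorage:
    """Stand-in for cluster storage that counts how often it is touched."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self.data)

    def write(self, data):
        self.writes += 1
        self.data = list(data)


def run_step_by_step(storage, steps):
    """MapReduce-style: every step reads from and writes back to storage."""
    for step in steps:
        storage.write([step(x) for x in storage.read()])
    return storage.data


def run_in_memory(storage, steps):
    """Spark-style, as described above: one read, chained steps, one write."""
    data = storage.read()
    for step in steps:
        data = [step(x) for x in data]
    storage.write(data)
    return storage.data


# Three hypothetical processing steps applied to the same input.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
disk_bound = ClusterStorage([1, 2, 3])
in_memory = ClusterStorage([1, 2, 3])
print(run_step_by_step(disk_bound, steps), disk_bound.reads, disk_bound.writes)
print(run_in_memory(in_memory, steps), in_memory.reads, in_memory.writes)
```

Both runs produce the same result, but the step-by-step version touches storage once per step in each direction, while the in-memory version touches it once in total. With real disks and networks behind those calls, that gap is where much of the speed difference comes from.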
Spark also includes its own graph computation library, GraphX. GraphX allows users to view the same data as graphs and as collections. Users can also transform and join graphs with Resilient Distributed Datasets (RDDs), discussed in the Fault Tolerance section.
For fault tolerance, MapReduce and Spark resolve the problem from two different directions. MapReduce uses TaskTrackers that provide heartbeats to the JobTracker. If a heartbeat is missed, then the JobTracker reschedules all pending and in-progress operations to another TaskTracker. This method is effective in providing fault tolerance; however, it can significantly increase the completion times for operations that have even a single failure.
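The heartbeat mechanism can be pictured with a small simulation. This is a toy model in plain Python, not Hadoop’s implementation: the class ToyJobTracker, its methods, the integer clock, and the least-loaded placement rule are all assumptions made for illustration.

```python
class ToyJobTracker:
    """Tracks heartbeats and moves tasks away from trackers that go silent."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = {}  # tracker name -> time of last heartbeat
        self.assignments = {}     # tracker name -> list of task names

    def register(self, tracker, now):
        self.last_heartbeat[tracker] = now
        self.assignments.setdefault(tracker, [])

    def heartbeat(self, tracker, now):
        self.last_heartbeat[tracker] = now

    def assign(self, tracker, task):
        self.assignments[tracker].append(task)

    def reschedule_failed(self, now):
        """Move tasks from silent trackers to the least-loaded live tracker."""
        dead = [t for t, seen in self.last_heartbeat.items()
                if now - seen > self.timeout]
        live = [t for t in self.last_heartbeat if t not in dead]
        if not live:
            return []  # nowhere to move the work
        moved = []
        for tracker in dead:
            orphaned = self.assignments.pop(tracker)
            del self.last_heartbeat[tracker]
            for task in orphaned:
                target = min(live, key=lambda t: len(self.assignments[t]))
                self.assignments[target].append(task)
                moved.append((task, tracker, target))
        return moved


tracker = ToyJobTracker(timeout=3)
tracker.register("node-a", now=0)
tracker.register("node-b", now=0)
tracker.assign("node-a", "task-1")
tracker.assign("node-a", "task-2")
tracker.assign("node-b", "task-3")
tracker.heartbeat("node-b", now=4)       # node-a stays silent
print(tracker.reschedule_failed(now=5))  # node-a's tasks move to node-b
```

The rescheduled tasks start over on the new node, which is why a single failure can noticeably stretch a job’s completion time.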
Spark uses Resilient Distributed Datasets (RDDs), which are fault-tolerant collections of elements that can be operated on in parallel. RDDs can reference a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat. Spark can create RDDs from any storage source supported by Hadoop, including local filesystems or one of those listed previously.
An RDD possesses five main properties:
· A list of partitions
· A function for computing each split
· A list of dependencies on other RDDs
· Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
· Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
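The optional Partitioner property is the easiest of the five to picture in code. The following is a minimal sketch of hash partitioning in plain Python, not Spark’s own partitioner: the function names are hypothetical, and CRC32 is used only because it gives the same answer on every run.

```python
import zlib


def hash_partition(key, num_partitions):
    """Pick a partition index for a string key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_pairs(pairs, num_partitions):
    """Distribute (key, value) pairs so equal keys share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash_partition(key, num_partitions)].append((key, value))
    return partitions


# Hypothetical key-value data split three ways.
pairs = [("spark", 1), ("hadoop", 2), ("spark", 3), ("yarn", 4)]
for index, contents in enumerate(partition_pairs(pairs, 3)):
    print(index, contents)
```

Because every record with the same key lands in the same partition, operations that combine values by key can work on each partition independently.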
RDDs can be persisted in order to cache a dataset in memory across operations. This allows future actions to be much faster, by as much as ten times. Spark’s cache is fault-tolerant in that if any partition of an RDD is lost, it will automatically be recomputed by using the original transformations.
By definition, both MapReduce and Spark are scalable using HDFS. So how big can a Hadoop cluster grow?
Yahoo reportedly has a 42,000-node Hadoop cluster, so perhaps the sky really is the limit. The largest known Spark cluster is 8,000 nodes, but as big data grows, it’s expected that cluster sizes will increase to maintain throughput expectations.
Hadoop supports Kerberos authentication, which is somewhat painful to manage. However, third-party vendors have enabled organizations to leverage Active Directory Kerberos and LDAP for authentication. Those same third-party vendors also offer encryption for data in flight and data at rest.
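Recomputing a lost partition from its transformations can be sketched in a few lines. The ToyRDD class below is a plain-Python illustration and not Spark’s RDD API: its methods, its single-parent dependency model, and the computations counter are assumptions made to keep the example small.

```python
class ToyRDD:
    """A toy dataset that remembers how to rebuild each partition."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self.compute = compute   # function(index, parent_rows) -> list
        self.parent = parent     # dependency on another ToyRDD, if any
        self.persisted = False
        self.cache = {}          # partition index -> cached rows
        self.computations = 0    # how many partitions were (re)built

    @classmethod
    def from_partitions(cls, partitions):
        source = [list(p) for p in partitions]
        return cls(len(source), lambda i, _parent_rows: list(source[i]))

    def map(self, fn):
        """Record a transformation; nothing is computed yet."""
        return ToyRDD(self.num_partitions,
                      lambda _i, rows: [fn(r) for r in rows],
                      parent=self)

    def persist(self):
        self.persisted = True
        return self

    def get_partition(self, i):
        if i in self.cache:
            return self.cache[i]
        parent_rows = self.parent.get_partition(i) if self.parent else None
        rows = self.compute(i, parent_rows)
        self.computations += 1
        if self.persisted:
            self.cache[i] = rows
        return rows

    def collect(self):
        return [row for i in range(self.num_partitions)
                for row in self.get_partition(i)]

    def lose_partition(self, i):
        """Simulate a failure that wipes one cached partition."""
        self.cache.pop(i, None)


base = ToyRDD.from_partitions([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2).persist()
print(doubled.collect(), doubled.computations)  # both partitions built
print(doubled.collect(), doubled.computations)  # served from cache
doubled.lose_partition(1)
print(doubled.collect(), doubled.computations)  # only partition 1 rebuilt
```

The second collect does no new work because both partitions are cached. After the simulated loss, only the missing partition is rebuilt from its parent, and the result is unchanged.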
Hadoop’s Distributed File System supports access control lists (ACLs) and a traditional file permissions model. For user control in job submission, Hadoop provides Service Level Authorization, which ensures that clients have the right permissions.
Spark’s security is a bit sparse: it currently supports authentication only via a shared secret (password authentication). The security bonus that Spark can enjoy is that if you run Spark on HDFS, it can use HDFS ACLs and file-level permissions. Additionally, Spark can run on YARN, giving it the capability of using Kerberos authentication.
Hadoop vs. Spark Summary
At first glance, it seems that using Spark would be the default choice for any big data application. However, that’s not the case. MapReduce has made inroads into the big data market for businesses that need huge datasets brought under control by commodity systems. Spark’s speed, agility, and relative ease of use are perfect complements to MapReduce’s low cost of operation.
The truth is that Spark and MapReduce have a symbiotic relationship with each other. Hadoop provides features that Spark does not possess, such as a distributed file system, and Spark provides real-time, in-memory processing for those data sets that require it. The perfect big data scenario is exactly what the designers intended: for Hadoop and Spark to work together on the same team.