What Spark Is and Why It Is Getting Popular

Thomas H. Davenport, who wrote the book "Competing on Analytics", once said "data scientist is the sexiest job of the 21st century" in this article. It was published 5 years ago, and these days I do often meet people who work in data science. Data science really is getting hot, and that is a big part of why Spark is getting popular.

He described data science as "sexy", and from a certain point of view he is right. Analyzing and visualizing data, then using it to convince people what to do next, is cool. However, most of a data science job still consists of time-consuming tasks such as formatting the input data for analysis and waiting for the model to produce its output. That can take hours.

What does Spark change for data scientists? Data formatting and analysis tasks involve a substantial number of iterative operations, and Spark can greatly reduce the time you spend just waiting for your computer to show the output. This is why Spark has become so widely used recently.

Spark vs Hadoop

According to the blog post https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html (posted 3 years ago, admittedly), Spark beat the benchmark record set by Hadoop, running "3 times faster with 10 times fewer machines". In-memory processing is one of the capabilities that contribute to Spark's performance. However, the new record was set with data on disk (HDFS), without using the in-memory cache at all.

"Spark can perform even better when supporting interactive queries of data stored in memory", this page says, "In those situations, there are claims that Spark can be 100 times faster than Hadoop’s MapReduce."

I often hear people say things like "Spark is much faster than Hadoop. This is the reason people use it. It can be 100 times faster". I didn't know where the "100 times faster" figure came from. On the official Apache Spark top page https://spark.apache.org/, I found the performance difference highlighted with the following performance data, so the figure at least has an official source.


Spark's In-Memory Capability

In-memory processing is one of Spark's most distinctive features. Thanks to it, Spark does not have to spend time reading and writing data on disk and instead works on data that is loaded into memory. To make this feature easier to understand, let's compare it with the data processing mechanism of Hadoop MapReduce and HDFS.

In Hadoop, MapReduce is widely used for processing large amounts of data distributed across a cluster. MapReduce lets us process data in parallel on a cluster, but it has to write data to external storage (HDFS) in order to share it among cluster nodes. According to the following link, "Regarding storage system, most of the Hadoop applications, they spend more than 90% of the time doing HDFS read-write operations."

As shown in the image below, an iteration loop using MapReduce has to read and write data on disk every time an iterative operation occurs (indicated by red arrows). This results in significant overhead.


Spark, on the other hand, loads data into memory instead of going back to disk. As you can see in the image below, Spark needs to read the data from disk the first time, but during the iteration loop all iterative read/write operations happen in memory. Data is written to disk only at the end. This reduces the overhead caused by disk reads and writes and improves performance considerably.
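To make the difference concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark or Hadoop code: the CountingDisk class, the step function and the two strategy functions are hypothetical names I made up for illustration. It only counts how many times each approach touches the disk during an iteration loop.

```python
# Toy model only: plain Python, not the Spark or Hadoop API.
import json
import os
import tempfile


class CountingDisk:
    """A file on disk that counts how often it is read and written."""

    def __init__(self, path):
        self.path = path
        self.reads = 0
        self.writes = 0

    def write(self, data):
        with open(self.path, "w") as f:
            json.dump(data, f)
        self.writes += 1

    def read(self):
        with open(self.path) as f:
            data = json.load(f)
        self.reads += 1
        return data


def step(values):
    """One iteration of a hypothetical iterative job (add 1 to each value)."""
    return [v + 1 for v in values]


def mapreduce_style(disk, iterations):
    """Read the input from disk and write the result back on every iteration."""
    result = None
    for _ in range(iterations):
        result = step(disk.read())
        disk.write(result)
    return result


def spark_style(disk, iterations):
    """Read once, keep intermediate results in memory, write once at the end."""
    data = disk.read()
    for _ in range(iterations):
        data = step(data)
    disk.write(data)
    return data


def run(strategy, iterations=10):
    """Run a strategy on fresh input and return (result, reads, writes)."""
    with tempfile.TemporaryDirectory() as tmp:
        disk = CountingDisk(os.path.join(tmp, "data.json"))
        disk.write([1, 2, 3])          # initial input
        disk.reads = disk.writes = 0   # count only the job's own I/O
        result = strategy(disk, iterations)
        return result, disk.reads, disk.writes


if __name__ == "__main__":
    print("disk on every iteration:", run(mapreduce_style))
    print("in-memory iterations:   ", run(spark_style))
```

With 10 iterations, the first strategy performs 10 reads and 10 writes, while the second performs 1 read and 1 write for the same result. Real Spark is of course more involved than this, but the shape of the saving is the same.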


Spark Cluster

The following image shows the architecture of a Spark cluster.


The driver program is the process that runs the main() method of the Spark program. The driver coordinates all the processes on the cluster through the SparkContext object.

SparkContext connects to a cluster manager (e.g. YARN, Mesos or Spark's standalone cluster manager), which allocates resources across the cluster and manages scheduling.

Kubernetes is also supported as a cluster manager, although this is still experimental.
Spark on Kubernetes: https://github.com/apache-spark-on-k8s/spark

A worker node is a node that actually executes Spark jobs and provides in-memory storage for cached RDDs. Once processing is finished, the results are returned to the driver.

These components work together as follows.

  1. A driver program creates a SparkContext object when the process starts
  2. SparkContext connects to a cluster manager to allocate resources
  3. Executors on worker nodes are acquired
  4. The driver program sends application code to executors
  5. SparkContext sends tasks to executors to run
  6. After the computation, executors return the result to the driver
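
The steps above can be sketched with a small toy model in plain Python. This is not Spark's API: ToyClusterManager and ToyContext are hypothetical stand-ins I made up, and each thread in the pool plays the role of an executor. It is only meant to show the order in which things happen.

```python
# Toy model only: the class names below are hypothetical, not Spark's API.
from concurrent.futures import ThreadPoolExecutor


class ToyClusterManager:
    """Stands in for YARN, Mesos or the standalone cluster manager."""

    def __init__(self, total_slots):
        self.free_slots = total_slots

    def allocate(self, requested):
        # Grant as many executor slots as are still free.
        granted = min(requested, self.free_slots)
        self.free_slots -= granted
        return granted

    def release(self, slots):
        self.free_slots += slots


class ToyContext:
    """Stands in for SparkContext; each pool thread plays an executor."""

    def __init__(self, manager, requested_executors):
        # Steps 2-3: connect to the cluster manager and acquire executors.
        self.manager = manager
        self.num_executors = manager.allocate(requested_executors)
        if self.num_executors == 0:
            raise RuntimeError("no executors available")
        self.pool = ThreadPoolExecutor(max_workers=self.num_executors)

    def run_job(self, func, partitions):
        # Steps 4-5: ship the application code (func) and one task per partition.
        futures = [self.pool.submit(func, part) for part in partitions]
        # Step 6: executors return their results to the driver.
        return [f.result() for f in futures]

    def stop(self):
        self.pool.shutdown()
        self.manager.release(self.num_executors)


def main():
    # Step 1: the driver program creates the context when it starts.
    manager = ToyClusterManager(total_slots=4)
    ctx = ToyContext(manager, requested_executors=2)
    try:
        partitions = [[1, 2, 3], [4, 5], [6]]
        partial_sums = ctx.run_job(sum, partitions)
        return sum(partial_sums)  # the driver combines the partial results
    finally:
        ctx.stop()


if __name__ == "__main__":
    print(main())
```

Here the driver splits the data into three partitions, the two "executors" sum each partition, and the driver adds up the partial sums it gets back.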

According to the document http://spark.apache.org/docs/latest/cluster-overview.html, there are a few things to note.

  • Each driver program gets its own executor processes, which run tasks in multiple threads. Because of this isolation, each driver only needs to schedule its own tasks. On the other hand, because each driver process is isolated, data cannot be shared among drivers (that is, among SparkContext objects) without writing it to external storage, which may hurt Spark's in-memory performance.

  • Because Spark works on a cluster, it is sensitive to the network. The closer the cluster components are to each other, the better the performance, and vice versa. According to the document, the driver should preferably run on the same local area network as the worker nodes. If you need to send requests to the cluster remotely, it recommends opening an RPC to the driver and having it submit operations from nearby.

  • After completing the computation, each executor returns its result to the driver. The driver therefore needs to be reachable from the executors, with the proper port open; the port is configured by spark.driver.port.
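
For the last point, a quick way to check reachability is to try opening a TCP connection to the driver from a worker node. The sketch below uses only Python's standard socket module. The can_reach helper is my own hypothetical function, and the host and port values are placeholders, so replace them with your driver's address and whatever you set for spark.driver.port.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder values (assumptions): use the driver's real address and
    # the port you configured through spark.driver.port.
    driver_host = "127.0.0.1"
    driver_port = 7078
    print(can_reach(driver_host, driver_port, timeout=1.0))
```

A False result from a worker node would suggest that a firewall or a wrong port setting stands between the executors and the driver. It only tests that the port accepts connections, not that a Spark driver is actually listening there.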