Apache Spark is a unified, open-source, distributed data processing engine for big data. In this article, we will discuss about the Spark architecture, its distributed nature and how it achieves processing of high volume data.
Apache Spark is one of the active open source project used for developing data processing applications. It is the de facto tool for many software engineers, data scientist every day.
Spark is developed using
Scala, and it has apis in
R programming languages. It includes libraries for diverse tasks like batch processing, stream processing, machine learning and graph processing. And most importantly it can run anywhere from laptop to a cluster with thousands of nodes. This makes easy to start small with apache spark and increase the cluster size as the data grows.
Apache Spark Usecases
- Batch Processing
- Real-time processing (aka stream processing)
- Machine Learning
- Graph processing
- Unified APIs for different languages
- Unified APIs for batch and real-time processing
Apache Spark Architecture
Apache Spark includes mainly three distributed components. They are listed below
Apache Spark driver program is the controller of the execution of a Spark Application and maintains all the states of the Spark cluster like, the state and tasks of the executors processes. It interacts with the cluster manager in order to actually get physical resources and launch executors in the cluster.
In a nutshell, it is a process on a physical node which is responsible for maintaining the state of the application running on the cluster.
Executors are worker processes which takes the assigned tasks from driver program and executes as driver program commands.
Executors have one core responsibility - take the tasks assigned by the driver program, execute them, and report back their state (success or failure) and results. Each Spark Application has its own separate executor processes.
Before apache spark commands executors to run the tasks, it should also make sure the required resources are allocated for executor to run the task. But apache spark driver is not designed to allocate the resources. There are special applications to manage the cluster resources as and when needed. These applications are cluster managers.
When spark driver needs the resources to run a job, it requests the cluster manager for resources. The cluster manager continuously monitors the available resources and assigns to the right executor to run the task.
Over the course of Spark Application execution, the cluster manager will be responsible for managing the underlying machines that the application is running on.
Cluster managers which can be used with apache spark are,
Apache Spark Working
As we got around with the basic architectural of Apache Spark in the above section, now let’s dive deeper into its working.
In master node, you have the driver program, which drives your application. The application code implemented behaves as a driver program or if you are using the interactive shell (like,
pyspark), the shell acts as the driver program.
Inside the driver program, the first thing to do is the Spark Context creation. We can say, Spark context is a gateway to all the Spark functionalities. It is like to a database connection. Any command you execute in the database goes through the database connection. Similarly, anything we perform on Spark goes through Spark context.
Spark context works with the cluster manager to manage jobs and resources. The driver program and Spark context takes care of the job execution within the cluster. A job is split into multiple tasks which are distributed over the worker node. Anytime an RDD is created in Spark context, it can be distributed across various nodes and can be cached.
Worker nodes are the slave processes whose job is to execute the tasks as instructed by driver program. These tasks are then executed on the partitioned
RDDs in the worker node and returns the result to the Spark Context.
Spark Context takes the job, breaks the job in tasks and distribute them to the worker nodes. These tasks work on the partitioned RDD, perform operations, collect the results and return to the main Spark Context.
Upon increasing the number of workers, the jobs will be divided into more partitions and get executed in parallel over distributed system. It makes processing faster.
Increase in the number of workers, increases memory size. With increase in memory, we can cache the jobs to execute it faster.
Apache Spark Ecosystem
Spark comes packed with high-level libraries, including support for Python, Scala, Java, SQL, etc. These standard libraries increase the integrations in a complex workflow. Over this, it also allows various sets of services to integrate with it like SQL - Data Frames, MLlib, GraphX, Streaming and batch services etc. to increase its capabilities.
The Spark Core is the base engine for large-scale distributed parallel data processing. The additional libraries which are built on the top of the core allows diverse workloads for batch processing, streaming, SQL, and machine learning. It is responsible for memory management and fault recovery, scheduling, distributing and monitoring jobs on a cluster & interacting with storage systems.
Spark Streaming is the component of Spark which is used in real-time streaming data processing. It is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams.
Spark SQL is a module in Spark which integrates relational processing with Spark’s functional programming API. It supports querying data either via ANSI SQL. Using Spark SQL will be an easy transition from RDBMS tools where you can extend the boundaries of traditional relational data processing.
GraphX is the Spark API for graphs and graph-parallel computation. Thus, it extends the Spark RDD with a Resilient Distributed Property Graph. At a high-level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph (a directed multigraph with properties attached to each vertex and edge).
MLlib (Machine Learning)
MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
It is an R package that provides a distributed data frame implementation. It also supports operations like selection, filtering, aggregation but on large data-sets.
Fundamental Data Structure In Spark
RDD (Resilient Distributed Dataset) is the fundamental data structure in Spark Framework.
Resilient Distributed Dataset (RDD)
RDDs are the building blocks of any Spark application.
RDD Stands for:
- Resilient: Data model which is fault-tolerant and is capable of rebuilding data on failure
- Distributed: Spark Distributes data among the multiple nodes in a cluster
- Dataset: Collection of partitioned data with values
RDD is a layer of abstracted data over the distributed collection. And
RDDs are immutable in nature. This data model uses lazy transformations to processes data efficiently.
Working Of RDD
The data in an
RDD is split into chunks based on a key.
RDDs are highly resilient, which means, they can recover quickly from any failures as the same data chunks are replicated across multiple executor nodes. Even if one executor node fails, another will still process the data. This allows you to perform your functional calculations against your dataset very quickly by harnessing the power of multiple nodes.
RDD is created, it becomes immutable, which means, the object whose state cannot be modified after it is created, but they can surely be transformed.
Talking about the distributed environment, each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Due to this, you can perform transformations and actions on the complete data in parallel. Also, spark takes care of data distribution effectively.
There are two ways to create
- By parallelizing an existing collection in your driver program
- By referencing a dataset in an external storage system, such as a shared file system, HDFS, etc
Spark Processing Model
Spark data processing model include
Transformations are the operations that are applied to create a new RDD. Further, transformations can be categorized into two
Narrow Transformations. Examples of narrow transformations are,
Wide Transformations. Examples of wide transformations are,
Actions are applied on an RDD to instruct Apache Spark to apply computation and pass the result back to the driver.
- Examples of Spark Actions are,