Spark Architecture

  • Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. 
  • It's designed to handle large-scale data processing and analytics tasks efficiently. Spark's architecture is structured to maximize performance and scalability.

Architecture

Driver Program

  • The driver program is the entry point of any Spark application. 
  • The driver program is responsible for translating the user's code into tasks and distributing them to the executors. 
  • The Driver Program is the process that runs the main() function of the application and creates the SparkContext object, which represents the connection to the Spark cluster (a minimal driver sketch follows this list).
  • When a user submits a Spark application, the Driver Program receives the application code and any associated configurations; the rest of the setup and coordination is handled through the SparkContext.
  • The Driver Program interacts with the cluster manager (such as Standalone, Hadoop YARN, Apache Mesos, or Kubernetes) to request resources for executing tasks. It negotiates with the cluster manager to allocate executors on worker nodes based on the application's resource requirements.
  • Throughout the execution of the Spark application, the Driver Program monitors the progress of tasks and collects metrics such as execution time, resource usage, and task failures. It provides feedback to the user and reports the status of the application upon completion.
  • It acts as the central control unit that coordinates the distributed execution of tasks across the Spark cluster.
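
A minimal sketch of such a driver program in Scala, assuming a word-count style job; the class name and the HDFS input path are hypothetical. Building the SparkSession (and its underlying SparkContext) and triggering the final action are the driver's own work, while the transformations run on the executors:

    import org.apache.spark.sql.SparkSession

    object WordCountDriver {
      def main(args: Array[String]): Unit = {
        // The driver process starts here; building a SparkSession also creates the
        // underlying SparkContext, which connects to the cluster manager.
        val spark = SparkSession.builder()
          .appName("word-count-driver")
          .getOrCreate()
        val sc = spark.sparkContext

        // Transformations are only planned on the driver; executors run them.
        val counts = sc.textFile("hdfs:///data/input.txt")   // hypothetical input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        // Action: the driver collects a small sample of results and reports back.
        counts.take(10).foreach(println)
        spark.stop()
      }
    }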

SparkContext

  • SparkContext is the main entry point to any Spark functionality in applications/jobs. 
  • The purpose of the SparkContext is to coordinate a Spark application and the execution of its operations, which run as an independent set of processes on the cluster.
  • The SparkContext establishes a connection to the Spark cluster, whether it is running in standalone mode, on Apache Mesos, on Kubernetes, or on Hadoop YARN. It coordinates with the cluster manager to allocate resources and execute tasks.
  • It can be used to create RDDs, accumulators, and broadcast variables (see the sketch after this list).
  • SparkContext is responsible for submitting Spark jobs to the cluster. It translates user code into a directed acyclic graph (DAG) of stages representing the computation to be performed. The DAG is then executed on the cluster by Spark executors.
  • To run on a cluster, the SparkContext connects to one of several types of cluster managers and then performs the following tasks:
    • It acquires executors on nodes in the cluster.
    • It then sends your application code to the executors. The application code can be the JAR or Python files passed to the SparkContext.
    • Finally, the SparkContext sends tasks to the executors to run.
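
A minimal sketch of these SparkContext facilities in Scala, assuming a local master purely for illustration; it creates an RDD, an accumulator, and a broadcast variable as described above:

    import org.apache.spark.{SparkConf, SparkContext}

    // Assumes a local master for illustration; a cluster deployment would use a
    // different master URL (see the Cluster Manager section below).
    val conf = new SparkConf().setAppName("sparkcontext-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // RDD from an in-memory collection
    val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // Accumulator: executors add to it, the driver reads the final value
    val sum = sc.longAccumulator("sum")
    rdd.foreach(x => sum.add(x))

    // Broadcast variable: read-only data shipped once to every executor
    val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
    val named = rdd.map(x => lookup.value.getOrElse(x, "other")).collect()

    println(s"sum = ${sum.value}, named = ${named.mkString(", ")}")
    sc.stop()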

Cluster Manager

  • The role of the cluster manager is to allocate resources across applications. Spark is capable of running on several different cluster managers.
  • The supported cluster managers are Hadoop YARN, Apache Mesos, Kubernetes, and the Standalone Scheduler (see the sketch after this list for how one is selected).
  • The Standalone Scheduler is Spark's own standalone cluster manager, which makes it easy to install Spark on an empty set of machines.
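
A brief sketch of how the cluster manager is chosen in practice: the master URL set on the SparkConf (or passed via spark-submit --master) determines which manager the driver negotiates with. The host names and ports below are placeholders:

    import org.apache.spark.SparkConf

    val onStandalone = new SparkConf().setMaster("spark://master-host:7077")    // Standalone Scheduler
    val onYarn       = new SparkConf().setMaster("yarn")                        // Hadoop YARN
    val onKubernetes = new SparkConf().setMaster("k8s://https://k8s-api:6443")  // Kubernetes
    val onMesos      = new SparkConf().setMaster("mesos://mesos-master:5050")   // Apache Mesos
    val localRun     = new SparkConf().setMaster("local[*]")                    // no cluster manager (local testing)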

Worker Node

  • The worker node is a slave node.
  • Its role is to run the application code in the cluster.
  • Worker machines are the machines where the actual work happens.
  • Each worker reports its available resources to the master node so that tasks can be scheduled on it.
  • Normally, one Spark worker daemon is started per worker node.
  • The worker node starts and monitors executors for each Spark application.

Executor

  • An executor is a process launched for an application on a worker node.
  • It reads data from and writes data to external sources.
  • The master allocates the resources; based on that allocation, the workers create executors, and the driver can then use these executors to run its tasks (a sketch of the executor-sizing settings follows this list).
  • Executors are only launched when job execution starts on a worker node. Executors are responsible for running tasks and keeping data in memory or on disk.
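
A hedged sketch of the settings through which an application asks the cluster manager for executors; the numbers are placeholders, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("executor-sizing-demo")
      .config("spark.executor.instances", "4")   // how many executors to launch
      .config("spark.executor.cores", "2")       // task slots per executor
      .config("spark.executor.memory", "4g")     // memory per executor for tasks and cached data
      .getOrCreate()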

Task

  • A task is a unit of work that is sent to an executor. Normally, it is a command sent from the driver program to an executor, which then executes the task on its allocated partition.
  • Partitions are important because Spark will run one task for each partition. Spark attempts to set the number of partitions automatically unless you specify the number manually.
    • e.g. sc.parallelize(data, numPartitions) — expanded into a runnable sketch below.
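
Expanding that call into a small sketch (assuming an existing SparkContext sc as in the earlier examples): Spark runs one task per partition, so the job below executes as four tasks per stage:

    // Explicitly request 4 partitions, and therefore 4 tasks per stage.
    val rdd = sc.parallelize(1 to 100, 4)
    println(rdd.getNumPartitions)                                   // 4

    // One value per partition, i.e. one result per task.
    val partitionSums = rdd.mapPartitions(it => Iterator(it.sum)).collect()
    println(partitionSums.mkString(", "))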

Spark Driver & How it works

The Spark Driver is a separate process that executes the user application/code. It creates the SparkContext to schedule job execution and to negotiate resources with the cluster manager. Below are the components of the Spark Driver:

          • RDD graph
          • DAGScheduler
          • TaskScheduler
          • SchedulerBackend


How Spark Driver works

User code interacts with the Spark Driver to create RDDs and perform a series of transformations that produce the final result. These RDD transformations are translated into a DAG and submitted to the scheduler. The DAGScheduler splits the DAG into stages of tasks and sends the stages to the TaskScheduler. The TaskScheduler is responsible for sending tasks to the cluster, running them, and retrying them if there are failures. Tasks run on workers, and the results are then returned to the client. If there are multiple RDDs, transformations create dependencies between them; a dependency means applying a function to a dataset, using functions such as map, filter, union, join, groupBy, shuffle, and sortBy. The SchedulerBackend is a back-end interface that allows plugging in different implementations (Mesos, YARN, Standalone, local).
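
A small sketch of that flow in Scala (assuming an existing SparkContext sc): the chain of transformations builds the RDD lineage/DAG, the DAGScheduler splits it into stages at the shuffle introduced by reduceByKey, and the action submits the job:

    val counts = sc.parallelize(Seq("a b", "b c", "c a"))
      .flatMap(_.split(" "))       // narrow dependency: stays in the same stage
      .map(w => (w, 1))            // narrow dependency: stays in the same stage
      .reduceByKey(_ + _)          // shuffle dependency: new stage boundary

    println(counts.toDebugString)  // shows the lineage and stage structure
    counts.collect()               // action: the driver submits the job to the scheduler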

The actual data is then stored on the worker nodes. The BlockManager is used for putting and retrieving blocks, both locally and remotely, into various stores (memory, disk, and off-heap).
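
For example, persist() is the user-facing way to tell the BlockManager on each worker where to keep an RDD's blocks (a minimal sketch, again assuming an existing SparkContext sc):

    import org.apache.spark.storage.StorageLevel

    val cached = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()   // first action computes the blocks and stores them via the BlockManager
    cached.count()   // later actions read the stored blocks instead of recomputing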
