Job, Stage, Task in Spark

Job

  • A job in Spark represents a complete computation triggered by an action (e.g., collect(), count(), saveAsTextFile()).
  • When you call an action on a Spark RDD (Resilient Distributed Dataset) or DataFrame, Spark starts to execute the transformations defined in your code.
  • A job consists of one or more stages, which are formed based on the DAG (Directed Acyclic Graph) of transformations that need to be executed to fulfill the action's requirements.
  • Spark breaks the job into stages at shuffle boundaries, so that narrow transformations within a stage can be pipelined together; this minimizes data movement and improves performance (see the sketch below).
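  • As a quick illustration (a minimal sketch; the SparkContext sc is standard, but the file name logs.txt is only an assumed example), the transformations below do nothing on their own, and the count() action at the end is what triggers a job:
            // Transformations are lazy: no job is started yet.
            val lines  = sc.textFile("logs.txt")                      // assumed input file
            val errors = lines.filter(line => line.contains("ERROR"))
            // The action triggers a job: Spark builds the DAG of transformations,
            // splits it into stages, and schedules tasks on the executors.
            val numErrors = errors.count()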

Stage

  • In Apache Spark, a stage is a logical division of a Spark job's execution plan.
  • Stages are formed during the process of executing a Spark job, which involves transforming data through a series of RDD (Resilient Distributed Dataset) or DataFrame operations and actions. When an action is triggered, Spark analyzes the DAG (Directed Acyclic Graph) of transformations that need to be executed and breaks it down into stages.
  • There are two main types of stages in Spark (illustrated in the sketch at the end of this list):
    • Narrow (or pipelined) stages/transformations
      • Narrow stages are formed when the transformations can be performed without shuffling data across the network.
      • Each partition of the input RDD or DataFrame is processed independently: the output partitions are computed by applying a function such as map() or filter() to one or more input partitions.
      • These transformations can be chained (pipelined) and executed in a single stage.
      • Narrow stages are usually faster to execute because they run entirely within each executor, without exchanging data between nodes.
    • Wide (or shuffle) stages/transformations
      • Wide stages are formed when the transformations require shuffling data across the network.
      • They typically involve operations that group, redistribute, or sort the data, such as groupByKey(), sortByKey(), or join().
      • Spark cuts the DAG at these shuffle boundaries and starts a new stage after each shuffle operation.
      • Wide stages tend to be slower to execute because the data exchange between nodes incurs network overhead.
  • Once the stages are formed, Spark's scheduler determines the optimal execution plan based on factors like data locality (doing computation on the node where data resides), task dependencies, and available resources. The stages are then scheduled for execution on the cluster, and tasks are assigned to individual executors to perform the computations.
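  • For example (a minimal sketch; the RDD names and the partition count are assumed for illustration), map() and filter() below are narrow transformations that stay in one stage, while reduceByKey() forces a shuffle and therefore starts a new stage:
            val nums  = sc.parallelize(1 to 100, 4)                    // 4 partitions
            val pairs = nums.map(n => (n % 10, n))                     // narrow: no shuffle
            val evens = pairs.filter { case (_, n) => n % 2 == 0 }     // narrow: pipelined with map()
            val sums  = evens.reduceByKey(_ + _)                       // wide: shuffle => new stage
            sums.collect()                                             // action: triggers the job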

Task

  • A task in Spark represents a unit of work that is executed on a single partition of the data by an executor in the cluster.
  • Tasks are the smallest executable units in Spark's execution model and are responsible for processing a portion of the input data.
  • Each task processes a single partition of the input RDD or DataFrame, applying the transformations defined in the code to produce its share of the output (see the sketch after this list).
  • Tasks are scheduled and executed by Spark's scheduler on the available executors in the cluster, and they run in parallel to achieve high throughput and performance.
  • In summary, a job represents a complete computation triggered by an action, which is broken down into stages consisting of tasks that are executed in parallel on the data partitions. This hierarchical structure allows Spark to optimize the execution of distributed data processing tasks and achieve efficient and scalable performance.
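  • As a rough illustration (the explicit partition counts below are assumptions for the example, not something prescribed above), the number of tasks in a stage equals the number of partitions, so changing the partitioning changes how many tasks run:
            val data = sc.textFile("data.txt", 4)      // ask for at least 4 partitions
            println(data.getNumPartitions)             // number of tasks the first stage will launch
            val wider = data.repartition(8)            // shuffle into 8 partitions => 8 tasks in later stages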
In general:
  • The DAGScheduler is the component that divides each stage into a set of tasks. It submits these task sets to the task scheduler, which runs them on executors provided by the cluster manager (YARN or Spark standalone). The Spark driver converts the logical plan into a physical execution plan, and the narrow transformations within a stage are pipelined, so consecutive transformation tasks are combined and executed together (see the sketch below).
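  • To see the physical plan the driver produces (a small sketch using the DataFrame API; the input path events.json and the column userId are assumed for illustration), explain() prints the plan, and the Exchange operator in its output marks a shuffle, and hence a stage boundary:
            val df     = spark.read.json("events.json")    // assumed input
            val counts = df.groupBy("userId").count()      // wide: requires an Exchange (shuffle)
            counts.explain()                               // prints the physical execution plan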
How do stages work?
  • Let's walk through how the DAGScheduler breaks a simple word count into stages and tasks.
    • Spark Scala
                        val data = sc.textFile("data.txt")
                        val result = data.flatMap(line => line.split(" "))
                                         .map(word => (word, 1))
                                         .reduceByKey((a, b) => a + b)
                                         .collect()
    • PySpark
                        data = sc.textFile("data.txt")
                        result = data.flatMap(lambda line: line.split(" ")) \
                                     .map(lambda word: (word, 1)) \
                                     .reduceByKey(lambda a, b: a + b) \
                                     .collect()
  • For this program, Spark creates 2 stages: one for the narrow transformations (textFile, flatMap, map) and one for the reduceByKey transformation. The action collect() is what triggers the job; it does not add a stage of its own.
  • reduceByKey() requires a shuffle, because records with the same key may live in two or more different partitions and have to be brought together.
  • Spark therefore ends the first stage at this shuffle boundary and creates a second stage for the reduceByKey() transformation.
  • Internally, each of these stages is divided into tasks. In this example, each stage is divided into 2 tasks because the data has 2 partitions; each partition is processed by its own task.
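  • You can confirm this stage boundary yourself (a small sketch; it reuses the word-count pipeline above, stopping before collect()) by printing the RDD lineage. The ShuffledRDD entry and the indentation in the output show where Spark cuts the DAG into a new stage:
            val counts = sc.textFile("data.txt")
              .flatMap(line => line.split(" "))
              .map(word => (word, 1))
              .reduceByKey((a, b) => a + b)
            println(counts.toDebugString)   // the lineage shows a ShuffledRDD at the shuffle/stage boundary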
