Transformation & Action in RDD

Transformation

  • A Spark transformation is a function that produces a new RDD from existing RDDs.
  • It takes an RDD as input and produces one or more RDDs as output.
  • Every time we apply a transformation, a new RDD is created.
  • The input RDDs are never changed, because RDDs are immutable.
  • Transformations are lazy: they are not executed immediately, but only when we call an action.
  • After a transformation, the resulting RDD is always a different RDD from its parent. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()) or the same size (e.g. map()). Note that count() does not belong in this list: it is an action, not a transformation.
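
The laziness and immutability described above can be sketched in plain Python. This is only a toy model, not Spark itself: the class ToyRDD and its method flat_map are hypothetical names made up for illustration (in Spark the corresponding operations are map, filter, flatMap and collect). Each transformation returns a new object that stores a recipe, and nothing runs until collect() is called.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; not part of Spark's API."""

    def __init__(self, compute):
        # Store only the recipe for producing the data, not the data itself.
        self._compute = compute

    @staticmethod
    def from_list(items):
        data = tuple(items)  # immutable snapshot of the input
        return ToyRDD(lambda: iter(data))

    # Transformations: each returns a NEW ToyRDD and executes nothing yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def flat_map(self, f):
        return ToyRDD(lambda: (y for x in self._compute() for y in f(x)))

    # Action: runs the whole chain and returns a plain Python list.
    def collect(self):
        return list(self._compute())


calls = []  # records when the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


numbers = ToyRDD.from_list([1, 2, 3, 4])
doubled = numbers.map(double)                  # same size as the parent
evens = numbers.filter(lambda x: x % 2 == 0)   # smaller than the parent
repeated = numbers.flat_map(lambda x: [x, x])  # bigger than the parent

calls_before_action = list(calls)   # still empty: nothing has executed
doubled_result = doubled.collect()  # the action triggers the computation
```

Here calls_before_action stays empty because double() only runs inside collect(), and numbers still yields its original elements after all three transformations have been defined.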

Action

  • Transformations create RDDs from one another, but when we want to work with the actual dataset, we perform an action.
  • Unlike a transformation, an action does not produce a new RDD when it is triggered.
  • Thus, actions are Spark RDD operations that return non-RDD values.
  • The result of an action is returned to the driver or written to an external storage system. Calling an action is what sets the lazily defined transformations in motion.
  • An action is one of the ways of sending data from the executors to the driver. Executors are agents responsible for executing tasks, while the driver is a JVM process that coordinates the workers and the execution of tasks.
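
The executor-to-driver flow of an action can be pictured with a small plain-Python sketch. It is a simplified model, not Spark code: the partitions list and the run_task function are hypothetical, and everything runs in a single process. Each "executor" computes a small partial result for its partition, and the "driver" combines those partial results into an ordinary value rather than an RDD.

```python
from functools import reduce
from operator import add

# Hypothetical layout: each inner list is one partition held by an executor.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]


def run_task(partition):
    """Executor side: return (partial sum, element count) for one partition."""
    return sum(partition), len(partition)


# Driver side: gather the small per-partition results and combine them.
partials = [run_task(p) for p in partitions]
total = reduce(add, (s for s, _ in partials))  # plain int, not an RDD
count = reduce(add, (n for _, n in partials))  # plain int, not an RDD
```

Only the per-partition summaries travel to the driver in this model, which is why actions that aggregate are usually cheaper on the driver than ones that bring back every element.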
