Lazy Evaluation & DAG in Spark

Lazy Evaluation

  • Evaluation means executing the logic. Lazy means the execution is deferred: the code runs only when an action is performed.
  • Lazy evaluation means Spark will not load or transform data until an action is performed. For example:
    • Load a file into an RDD
    • Filter the RDD
    • Count the number of elements (only now do the loading and filtering actually happen)
  • This helps Spark optimize operations and resource usage internally, because it sees the whole chain of operations before running any of them.
  • We can chain operations together. Keep this in mind while troubleshooting: an error inside a transformation surfaces only when an action runs.
  • If you only use transformations (or lazy reads such as textFile), nothing executes immediately. Spark records these operations in a DAG instead. The DAG is a core internal structure of Spark, covered in the next section.
  • You can apply a bunch of transformations before invoking an action.
  • Until an action is performed, Spark only adds each transformation to the DAG; no data is processed.
  • The DAG is a graph, so it can contain multiple branches. When an action is called, Spark executes only the transformations on the path that leads to that action's result.
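
The load, filter, count steps above can be sketched in plain Python. This is a toy model and not Spark's API: the LazyDataset class, the load_lines function and the sample log lines are all hypothetical names made up for illustration. It only shows the idea that transformations are recorded and nothing is loaded until an action such as count() is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument callable that produces the records
        self._steps = tuple(steps)   # recorded (kind, function) pairs, in order

    def filter(self, predicate):
        # Transformation: record the step and return a new dataset; nothing runs yet.
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def map(self, fn):
        # Transformation: record the step and return a new dataset; nothing runs yet.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def _run(self):
        # Only here is the source read and are the recorded steps applied.
        records = self._source()
        for kind, fn in self._steps:
            if kind == "filter":
                records = [r for r in records if fn(r)]
            else:
                records = [fn(r) for r in records]
        return records

    def count(self):
        # Action: triggers the actual work.
        return len(self._run())

    def collect(self):
        # Action: triggers the actual work.
        return self._run()


load_calls = []  # lets us observe when loading really happens


def load_lines():
    load_calls.append("loaded")
    return ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]


lines = LazyDataset(load_lines)                                   # "load file": nothing read yet
errors = lines.filter(lambda line: line.startswith("ERROR"))      # "filter": still nothing read
calls_before_action = len(load_calls)

error_count = errors.count()                                      # action: load + filter run now
calls_after_action = len(load_calls)
```

In this sketch load_calls stays empty until count() is called, which mirrors the third step in the list above.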

DAG

  • In Apache Spark, a DAG (Directed Acyclic Graph) is a fundamental concept that represents the logical execution plan of a Spark job.
  • The DAG is "Directed" because it shows the flow of data from one transformation to another, and it's "Acyclic" because there are no cycles or loops in the graph. This means that each transformation depends only on its immediate predecessors and not on itself or any transformations further downstream.
  • The DAG is built when you create a Spark DataFrame or RDD and apply transformations to it.
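  • For DataFrames and Spark SQL, the Catalyst optimizer analyzes the logical plan and generates an optimized physical execution plan. When an action runs, Spark's DAG scheduler splits the DAG into stages of tasks. The aim is to minimize data movement and maximize parallelism.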
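
To make "directed" and "acyclic" concrete, here is a small sketch using Python's standard graphlib module. It is not how Spark builds or schedules its DAG: the dataset names (lines, errors, warnings, error_codes) and the plan_for helper are assumptions made up for illustration. It shows two things: an action only needs the transformations on the path leading to its result, and a graph with a cycle cannot be ordered for execution at all.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical lineage: each dataset maps to the set of datasets it is derived from.
lineage = {
    "lines": set(),
    "errors": {"lines"},        # lines.filter(...)
    "warnings": {"lines"},      # a second branch from the same source
    "error_codes": {"errors"},  # errors.map(...)
}


def plan_for(target, graph):
    """Return the datasets needed to compute `target`, parents before children."""
    needed = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed.add(node)
            stack.extend(graph[node])  # walk back along the directed edges
    subgraph = {node: graph[node] for node in needed}
    return list(TopologicalSorter(subgraph).static_order())


# An action on error_codes never touches the "warnings" branch.
plan = plan_for("error_codes", lineage)

# A graph with a cycle has no valid execution order.
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
    cycle_rejected = False
except CycleError:
    cycle_rejected = True
```

Here plan comes out as lines, errors, error_codes, and the cyclic graph is rejected, which is why the execution graph has to be acyclic.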
