RDD in Spark

What is RDD?

  • RDD - Resilient Distributed Datasets
  • Resilience: RDDs are resilient because they can recover automatically from failures. Spark achieves this by keeping track of the lineage of transformations applied to the base dataset, so it can recompute lost partitions due to node failures.
    • In Spark, lineage is built using the RDD (Resilient Distributed Dataset) abstraction, which keeps track of all transformations applied to it. This information is used to recompute lost data in the event of a failure, ensuring fault tolerance without the need for data replication.
  • Distributed: RDDs are distributed across the nodes of the cluster.
  • Resilient Distributed Datasets (RDDs) are the primary data structure in Spark. An RDD can hold any Python, Java, or Scala objects, including user-defined classes.
  • RDDs are partitioned across nodes in a cluster and can be operated on in parallel.
  • RDDs are fault-tolerant and efficient for parallel processing, since partitions can be computed concurrently and recovered independently.
  • By storing and processing data in RDDs, Spark speeds up MapReduce processes.

Difference between lineage and DAG

  • Lineage in Spark refers to the record of how an RDD is derived from other RDDs through a series of transformations. It represents the sequence of transformations applied to the base RDD to produce the current RDD.
  • Lineage is crucial for fault tolerance in Spark. If a partition of an RDD is lost due to node failure, Spark can recompute the lost partition by tracing back through the lineage graph and reapplying the transformations that led to the lost partition.
  • Spark maintains the lineage information internally for each RDD. When transformations are applied to an RDD, Spark records these transformations as dependencies, forming a lineage graph.
  • DAG, or Directed Acyclic Graph, represents the logical execution plan of a Spark job.
  • When Spark performs actions on RDDs, it builds up a DAG representing the sequence of transformations needed to compute the final result.
  • In summary, lineage is about maintaining fault tolerance by recording the history of RDD transformations, while DAG is about optimizing and scheduling computations by representing the logical execution plan of a Spark job.
  • RDD creation happens first, followed by the application of transformations, which in turn leads to the creation of the DAG and lineage information when actions are invoked on the RDD.

Why Do We Need RDDs in Spark?

  • RDDs address MapReduce's shortcomings in data sharing. When reusing/sharing data between computations, MapReduce requires writing to external storage (HDFS, Cassandra, HBase, etc.). The read and write steps between jobs/tasks add substantial disk and I/O overhead.
  • Furthermore, data sharing between tasks is slow due to replication, serialization, and increased disk usage.
  • RDDs reduce reliance on external storage systems by keeping intermediate results in memory. This makes data sharing between jobs/tasks roughly 10 times faster than MapReduce when disk is involved, and up to 100 times faster when the data stays entirely in memory.
  • Speed is critical when working with large data volumes. Spark RDDs make it easier to train machine learning algorithms and handle large amounts of data for analytics.

RDD Characteristics

  • In-Memory Computation
    • RDDs leverage in-memory computation to speed up data processing. They store data in memory whenever possible, reducing the need to access data from disk, which can be much slower.
  • Immutable
    • Once created, RDDs are immutable, meaning their content cannot be changed.
  • Type Inferred
    • Spark automatically infers the type of an RDD's elements from the data, so element types do not need to be declared explicitly.
  • Lazy Evaluation
    • Transformations are lazy: they are not executed immediately, but only when an action is called.
  • Type Tolerance
    • RDDs can contain any type of Python, Java, or Scala objects. This flexibility allows developers to work with complex data types and structures.
  • Transformation and Action Operations
    • RDDs support two types of operations: transformations and actions. Transformations create new RDDs from existing ones (e.g., map, filter, reduce), while actions perform computations and return results to the driver program (e.g., collect, count, reduce).
For PairRDDs (RDDs of key-value tuples), Spark additionally provides key-based operations such as reduceByKey, groupByKey, and join.

