Posts

Apache Spark Components

Apache Spark is an open-source framework for real-time data processing, and a powerful big data tool used to tackle a wide range of big data challenges. Hadoop MapReduce is a strong framework for processing data in batches, but Spark can also process data in real time and is up to 100 times faster than Hadoop MapReduce for batch processing of large data sets. Spark achieves this speed through controlled partitioning: partitioned data can be processed with minimal network traffic.

Features
Spark code can be written in Java, Scala, Python and R. Spark also supports and processes structured and semi-structured data through Spark SQL. Spark executes code/functions only when it is absolutely necessary; this is called "lazy execution". Spark processes data faster because of its in-memory processing. Apache Spark is a separate tool with its own cluster, and it can also run on Hadoop.

Spark Components
• In addition to Map and Reduce
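As a quick illustration of the Spark SQL point above, here is a minimal Scala sketch (the file name people.json and the field names are hypothetical) that reads semi-structured JSON into a DataFrame and queries it with SQL:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    // Entry point; local[*] runs Spark on all local cores.
    val spark = SparkSession.builder()
      .appName("spark-sql-sketch")
      .master("local[*]")
      .getOrCreate()

    // Spark SQL infers a schema from the semi-structured JSON input.
    val people = spark.read.json("people.json")

    // Register the DataFrame as a temporary view and query it with SQL.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```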

Transformation & Action in RDD

Transformation
A Spark transformation is a function that produces a new RDD from existing RDDs. It takes an RDD as input and produces one or more RDDs as output. Every time we apply a transformation, a new RDD is created; the input RDDs cannot be changed, since RDDs are immutable in nature. Transformations are lazy: they are not executed immediately, but only when we call an action. The resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()) or the same size (e.g. map()).

Action
Transformations create RDDs from each other, but when we want to work with the actual dataset, an action is performed. When an action is triggered, no new RDD is formed, unlike with a transformation. Thus, actions are Spark RDD operations that return non-RDD values. The results of an action are returned to the driver or written to an external storage system
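A minimal Scala sketch of the distinction (the input values are made up): transformations such as map and filter only describe new RDDs, while actions such as count and collect return plain values to the driver.

```scala
import org.apache.spark.sql.SparkSession

object TransformationVsAction {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("transformation-vs-action")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 10)          // base RDD

    // Transformations: each returns a new RDD; the input RDD is untouched.
    val doubled = numbers.map(_ * 2)               // same size
    val evens   = doubled.filter(_ % 4 == 0)       // smaller

    // Actions: return non-RDD values to the driver; only now does work run.
    println(evens.count())                         // 5
    println(evens.collect().mkString(", "))        // 4, 8, 12, 16, 20

    spark.stop()
  }
}
```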

Lazy Evaluation & DAG in Spark

Lazy Evaluation
Evaluation means executing the logic; lazy means not executing it immediately, but only once some action is performed. In other words, Spark will not load or transform data until an action is called. For example:
• Load a file into an RDD
• Filter the RDD
• Count the number of elements (only now do the loading and filtering actually happen)
Lazy evaluation helps Spark internally optimize operations and resource usage. We can chain operations together, and we can inspect that chain when troubleshooting. If you only use lazy operations such as textFile and other transformations, nothing executes immediately; Spark instead records them in a DAG (Directed Acyclic Graph). You can apply a whole series of transformations before invoking an action. When the action is finally called, Spark checks that the required transformations have been recorded in the DAG, picks an execution path through the graph, and runs the transformations.
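A small Scala sketch of that load, filter, count sequence (the file name data.txt is hypothetical); nothing is read from disk until count() runs:

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-evaluation-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Step 1: load the file into an RDD -- only records the source in the DAG.
    val lines = sc.textFile("data.txt")

    // Step 2: filter the RDD -- still nothing executed, just another DAG node.
    val errors = lines.filter(_.contains("ERROR"))

    // Step 3: count the elements -- an action; only now are the file read
    // and the filter actually executed.
    println(s"error lines: ${errors.count()}")

    spark.stop()
  }
}
```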

Job, Stage, Task in Spark

Job, Stage, Task

Job
A job in Spark represents a complete computation triggered by an action (e.g., collect(), count(), saveAsTextFile()). When you call an action on a Spark RDD (Resilient Distributed Dataset) or DataFrame, Spark starts to execute the transformations defined in your code. A job consists of one or more stages, which are formed based on the DAG (Directed Acyclic Graph) of transformations that need to be executed to fulfill the action's requirements. Spark may optimize the execution plan by breaking the job into multiple stages to minimize data shuffling and improve performance.

Stage
In Apache Spark, a stage is a logical division of a Spark job's execution plan. Stages are formed during the process of executing a Spark job, which involves transforming data through a series of RDD or DataFrame operations and actions. When an action is triggered, Spark analyzes the DAG of transformations that need to be executed
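A minimal Scala sketch of how one action becomes one job with two stages (the sample words are illustrative): reduceByKey introduces a shuffle, so Spark splits the work at that boundary, and each stage runs as one task per partition.

```scala
import org.apache.spark.sql.SparkSession

object JobStageTaskSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("job-stage-task-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // 2 partitions -> 2 tasks per stage.
    val words = sc.parallelize(Seq("spark", "hadoop", "spark", "rdd"), 2)

    val counts = words
      .map(w => (w, 1))          // narrow transformation: stays in the first stage
      .reduceByKey(_ + _)        // wide transformation: shuffle -> stage boundary

    // The action below triggers one job: stage 0 (map side) and
    // stage 1 (reduce side), each with one task per partition.
    counts.collect().foreach(println)

    spark.stop()
  }
}
```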

Spark Execution Modes

Mode of Execution
The mode of execution determines where your application's resources are physically located when you run your application/job. There are three modes:
• Cluster Mode
• Client Mode
• Local Mode

Cluster Mode
In cluster mode, the Spark driver program runs within the Spark cluster itself. The driver program is launched on one of the nodes in the cluster, typically on a master node managed by the cluster manager (e.g., YARN, Mesos, or Spark's standalone cluster manager). When a Spark application is submitted in cluster mode, the driver program is launched on a cluster node, and the application code is executed within the cluster's resources. Both the driver and executor processes run within the cluster. This mode is suitable for production deployments where resources are managed centrally by a cluster manager.

Client Mode
In client mode, the Spark driver program runs on the machine that submits the Spark application (often referred to as the client machine). The client machine typically resides outside of the Spark cluster
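A small sketch of how the mode is typically chosen (the jar name and cluster setup are placeholders): the master URL can be set in code for local runs, while cluster vs. client mode is normally selected with spark-submit's --deploy-mode flag.

```scala
import org.apache.spark.sql.SparkSession

object ExecutionModeSketch {
  def main(args: Array[String]): Unit = {
    // Local mode: driver and executors run in a single JVM on this machine,
    // using all available cores.
    val spark = SparkSession.builder()
      .appName("execution-mode-sketch")
      .master("local[*]")
      .getOrCreate()

    spark.range(5).show()
    spark.stop()

    // For cluster or client mode, the master and deploy mode are usually
    // given at submit time instead of in code, e.g.:
    //   spark-submit --master yarn --deploy-mode cluster app.jar  (driver inside the cluster)
    //   spark-submit --master yarn --deploy-mode client  app.jar  (driver on the submitting machine)
  }
}
```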

RDD in Spark

What is RDD?
RDD stands for Resilient Distributed Dataset.

Resilient: RDDs are resilient because they can recover automatically from failures. Spark achieves this by keeping track of the lineage of transformations applied to the base dataset, so it can recompute partitions lost to node failures. In Spark, lineage is built using the RDD abstraction, which keeps track of all transformations applied to it. This information is used to recompute lost data in the event of a failure, ensuring fault tolerance without the need for data replication.

Distributed: RDDs are distributed across the nodes of the cluster.

Resilient Distributed Datasets (RDDs) are the primary data structure in Spark. An RDD can hold any Python, Java, Scala, or user-created object. RDDs are partitioned across nodes in a cluster and can be operated on in parallel. RDDs are reliable and memory-efficient when it comes to parallel processing. By storing and processing data in RDDs,
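A minimal Scala sketch of these points (the data is made up): an RDD is created with an explicit number of partitions, and its lineage (what Spark replays to recompute lost partitions) can be inspected with toDebugString.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // An RDD split into 4 partitions; each partition is processed in parallel.
    val base = sc.parallelize(1 to 100, 4)

    // Transformations only extend the lineage; the RDDs themselves are immutable.
    val evenSquares = base.map(n => n * n).filter(_ % 2 == 0)

    println(evenSquares.getNumPartitions)   // 4
    println(evenSquares.toDebugString)      // lineage: parallelize -> map -> filter

    // If a partition is lost, Spark replays this lineage to rebuild it
    // instead of relying on data replication.
    println(evenSquares.count())            // 50

    spark.stop()
  }
}
```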

Spark Architecture

Apache Spark is an open-source distributed computing system that provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is designed to handle large-scale data processing and analytics tasks efficiently, and its architecture is structured to maximize performance and scalability.

Architecture

Driver Program
The driver program is the entry point of any Spark application. It is responsible for translating the user's code into tasks and distributing them to the executors. The driver program is a process that runs the main() function of the application and creates the SparkContext object, which represents the connection to the Spark cluster. When a user submits a Spark application, the driver program receives the application code and any associated configurations; the SparkContext then takes care of the rest of the lower-level work. The driver program interacts with the cluster manager (such as Apache Mesos or YARN or
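A minimal Scala sketch of a driver program under the description above: main() creates the SparkSession (and through it the SparkContext), defines the work, and triggers it with an action.

```scala
import org.apache.spark.sql.SparkSession

object DriverProgramSketch {
  // The driver program: runs main(), owns the SparkContext, and hands
  // tasks to the executors via the cluster manager.
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("driver-program-sketch")
      .master("local[*]")        // in a real deployment the master comes from spark-submit
      .getOrCreate()

    val sc = spark.sparkContext  // the connection to the Spark cluster

    // Work defined here is translated into tasks and scheduled on executors.
    val total = sc.parallelize(1 to 1000).sum()
    println(s"sum = $total")     // 500500

    spark.stop()
  }
}
```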