Apache Spark

What is Apache Spark?

  • Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics.
  • It was originally developed in 2009 in UC Berkeley's AMP Lab and open sourced in 2010; it later became a top-level Apache project.
  • Apache Spark is a computing platform that runs on a single node or on a cluster, on top of virtually any storage layer, for large-scale data processing.
  • It is an end-to-end analytics platform (data ingestion, ETL, analytics, and streaming).
  • It was developed to overcome the limitations of Hadoop MapReduce.
  • Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm.
  • First of all, Spark gives us a comprehensive, unified framework to manage big data processing requirements across data sets that are diverse in nature (text data, graph data, etc.) as well as in source (batch vs. real-time streaming data); a small sketch follows this list.
  • Major companies such as Amazon, eBay, and Yahoo use Spark.
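
The snippet below is a minimal Scala sketch of that unified framework, assuming a local Spark installation and hypothetical input sources: the same word-count logic is applied once to a static file (batch) and once to a socket stream (real-time), using the same DataFrame operations.

import org.apache.spark.sql.SparkSession

object UnifiedExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("unified-example")
      .master("local[*]")          // single node here; a cluster URL would be used in production
      .getOrCreate()
    import spark.implicits._

    // Batch: count words in a static text file (path is hypothetical).
    val batchCounts = spark.read.textFile("/data/input.txt")
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()
    batchCounts.show()

    // Streaming: the same word-count logic over lines arriving on a socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
    val streamCounts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()
    streamCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}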

Why do we prefer Spark over Hadoop?

  • Hadoop as a big data processing technology has been around for 15 years and has proven to be the solution of choice for processing large data sets. 
  • MapReduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass computations and algorithms. 
  • Each step in the data processing workflow has one Map phase and one Reduce phase, and you need to convert any use case into the MapReduce pattern to leverage this solution.
  • The job output data from each step has to be stored in the distributed file system before the next step can begin. Hence, this approach tends to be slow due to replication and disk I/O.
  • Also, Hadoop solutions typically include clusters that are hard to set up and manage. 
  • It also requires the integration of several tools for different big data use cases (such as Mahout for machine learning and Storm for stream processing).
  • If you wanted to do something complicated, you would have to string together a series of MapReduce jobs and execute them in sequence. Each of those jobs was high-latency, and none could start until the previous job had finished completely.
  • Spark allows programmers to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data (see the pipeline sketch after this list).
  • Spark runs on top of existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. It provides support for deploying Spark applications in an existing Hadoop v1 cluster (with SIMR – Spark-Inside-MapReduce) or Hadoop v2 YARN cluster or even Apache Mesos.
  • We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop. It's not intended to replace Hadoop but to provide a comprehensive and unified solution to manage different big data use cases and requirements.
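
As a rough illustration of the DAG and in-memory sharing points above, here is a minimal Scala sketch (file paths are hypothetical): several transformation steps form a single Spark job rather than a chain of MapReduce jobs, and cache() lets two separate actions reuse the same intermediate data without writing it to HDFS in between.

import org.apache.spark.sql.SparkSession

object PipelineExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dag-pipeline").getOrCreate()
    val sc = spark.sparkContext

    val cleaned = sc.textFile("hdfs:///logs/access.log")    // step 1: ingest
      .filter(_.nonEmpty)                                    // step 2: clean
      .map(line => (line.split(" ")(0), 1))                  // step 3: key by client IP
      .cache()                                               // keep in memory for reuse

    // Two separate actions (jobs) share the same in-memory data set.
    val requestsPerIp = cleaned.reduceByKey(_ + _)
    requestsPerIp.saveAsTextFile("hdfs:///out/requests_per_ip")

    val distinctIps = cleaned.keys.distinct().count()
    println(s"distinct client IPs: $distinctIps")

    spark.stop()
  }
}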

Spark Components

  • spark driver
  • cluster manager
  • worker nodes
  • executors
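
Below is a minimal Scala sketch of how these components interact (the cluster URL and resource settings are illustrative, not prescriptive): the program runs in the Spark driver, the master URL points at the cluster manager, the cluster manager allocates executors on the worker nodes, and those executors run the tasks triggered by the action.

import org.apache.spark.sql.SparkSession

object ComponentsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("components-example")
      .master("spark://master-host:7077")       // cluster manager endpoint (hypothetical host)
      .config("spark.executor.memory", "2g")    // memory per executor on the worker nodes
      .getOrCreate()

    // The driver builds the DAG; the executors compute the partitions in parallel.
    val squares = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
      .map(n => n * n)
      .collect()                                 // results are returned to the driver

    println(squares.take(5).mkString(", "))
    spark.stop()
  }
}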
