Apache Spark Components

Apache Spark is an open-source framework for real-time and batch data processing. It is also a powerful big data tool used to tackle a wide range of big data challenges.

Hadoop MapReduce is a strong framework for processing data in batches. Spark can also process data in real time, and it can be up to 100 times faster than Hadoop MapReduce when batch processing large data sets. Spark achieves this speed through controlled partitioning, which allows the partitioned data to be processed with minimal network traffic.
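
To make controlled partitioning concrete, here is a small Scala sketch meant for spark-shell (where sc is already defined); the data and the partition count are made up for illustration. It hash-partitions a key/value RDD so that records with the same key are co-located before they are aggregated, which keeps shuffle traffic low.

    import org.apache.spark.HashPartitioner

    // Key/value pairs that would normally be scattered across the cluster.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)))

    // Controlled partitioning: co-locate records with the same key up front,
    // so the reduceByKey below causes little or no cross-node shuffling.
    val partitioned = pairs.partitionBy(new HashPartitioner(4)).cache()
    partitioned.reduceByKey(_ + _).collect().foreach(println)

Partitioning and caching like this pays off mainly when the same partitioned data is reused by several downstream operations.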

Features
  • Spark code can be written in Java, Scala, Python and R.
  • Spark supports processing structured and semi-structured data through Spark SQL.
  • Spark executes a transformation only when its result is actually needed, i.e. when an action is called. This is known as lazy evaluation (see the sketch after this list).
  • Spark processes data faster because of its in-memory computation.
  • Apache Spark is an independent engine with its own standalone cluster manager, but it can also run on top of Hadoop, using YARN for resource management and HDFS for storage.
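
The sketch below illustrates lazy evaluation in spark-shell; the input file numbers.txt is hypothetical. The map and filter transformations only describe the computation, and nothing runs until the count() action is called.

    // Transformations are only recorded here; nothing runs yet.
    // "numbers.txt" is a made-up input file for illustration.
    val evens = sc.textFile("numbers.txt")
      .map(_.trim.toInt)
      .filter(_ % 2 == 0)

    // The action count() is what finally triggers execution of the whole chain.
    println(s"Even numbers: ${evens.count()}")
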
Spark Components

In addition to Map and Reduce operations, Spark supports SQL queries, streaming data, machine learning, and graph data processing through its built-in components.


Suppose we want to process hundreds of millions of records. A legacy system would read the files sequentially, and an RDBMS would access them with random reads. With HDFS, the files are processed in parallel across the cluster, and with Apache Spark we can process the same data even faster than with HDFS and MapReduce alone.

Spark Core

Spark Core is the underlying engine used to process large data sets. It is responsible for memory management, fault recovery, and for scheduling, distributing, and monitoring jobs on a cluster.
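
As a minimal illustration of Spark Core's RDD API (again assuming spark-shell, where sc already exists), the snippet below parallelizes a collection into eight partitions and runs a simple map/reduce over it; Spark Core takes care of scheduling the tasks, managing memory, and recovering from failures.

    // Build a partitioned RDD and run a simple map/reduce over it.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 8)
    val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")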

Spark Streaming

Spark Streaming is the component of Spark which is used to process real-time streaming data.

Spark Streaming is used to stream real-time data from various sources such as Twitter, stock market feeds, national security systems, telecom, banking, and healthcare, and to perform powerful analytics on that data to help the business.
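
Here is a minimal Spark Streaming sketch using the classic DStream word count, again assuming spark-shell; the socket source on localhost:9999 is just a stand-in for a real feed such as Kafka or Twitter.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Micro-batches every 5 seconds; the source is a plain text socket on
    // localhost:9999 (for example, started with: nc -lk 9999).
    val ssc = new StreamingContext(sc, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()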

Spark SQL

Spark SQL is a Spark module that integrates relational processing with Spark's functional programming API. It supports querying data either via SQL or via the Hive Query Language (HiveQL).

Spark SQL queries are fully integrated with Spark programs, so we can query structured data inside a Spark application using plain SQL.
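
A small Spark SQL sketch (assuming spark-shell, where spark is the predefined SparkSession; people.json is a hypothetical file with name and age fields) shows how plain SQL can be run against structured data registered as a temporary view. The same data could also be queried through the DataFrame API instead of SQL.

    // "people.json" is a made-up file with name/age fields.
    val people = spark.read.json("people.json")
    people.createOrReplaceTempView("people")

    // Plain SQL inside the Spark program.
    spark.sql("SELECT name, age FROM people WHERE age >= 18").show()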

GraphX

GraphX is the Spark API for graphs and graph-parallel computation.
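
As a small GraphX sketch (assuming spark-shell), the snippet below builds a tiny property graph from vertex and edge RDDs and runs PageRank on it, a typical graph-parallel computation; the vertices and edges are made up.

    import org.apache.spark.graphx.{Edge, Graph}

    // A tiny graph: vertices are (id, name), edges carry a relationship label.
    val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
    val graph = Graph(vertices, edges)

    // Graph-parallel computation: PageRank over the graph.
    graph.pageRank(0.001).vertices.collect().foreach(println)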

MLlib (Machine Learning)

MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.
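
For example, the following MLlib sketch (assuming spark-shell; the feature vectors are toy data) clusters a handful of points into two groups with k-means using the DataFrame-based API.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors

    // Toy feature vectors; a real job would load them from HDFS or a table.
    val df = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(0.0, 0.0)), Tuple1(Vectors.dense(0.1, 0.1)),
      Tuple1(Vectors.dense(9.0, 9.0)), Tuple1(Vectors.dense(9.2, 9.1))
    )).toDF("features")

    // Cluster the points into two groups with k-means.
    val model = new KMeans().setK(2).setSeed(1L).fit(df)
    model.clusterCenters.foreach(println)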

Typical flow of Spark

In a typical Spark application, the driver program creates a SparkContext (or a SparkSession), which connects to a cluster manager such as Spark's standalone manager, YARN, or Mesos. Transformations lazily build up a lineage of RDDs or DataFrames, and when an action is called the scheduler breaks the job into stages and tasks that run on executors across the cluster, with the results returned to the driver or written to storage.