Apache Spark Components
Apache Spark is an open-source framework for real-time data processing. It is also a powerful big data tool, used to tackle a wide range of big data challenges.
Hadoop MapReduce is a well-established framework for processing data in batches. Spark, by contrast, can also process data in real time, and it is reported to be up to 100 times faster than Hadoop MapReduce for batch processing of large data sets. Spark achieves this speed through controlled partitioning: data is split into partitions across the cluster so that it can be processed with minimal network traffic.
Features
- Spark code can be written in Java, Scala, Python, and R.
- Spark also supports processing structured and semi-structured data through Spark SQL.
- Spark executes code/functions only when it is absolutely necessary, that is, when an action demands a result. This is called "Lazy Execution".
- Spark processes data faster because of its in-memory processing.
- Apache Spark is a separate tool with its own standalone cluster manager, but it can also be deployed on top of Hadoop (using YARN and HDFS).
Spark Components
• In addition to Map and Reduce operations, Spark supports SQL queries, streaming data, machine learning, and graph data processing.
Let's consider processing hundreds of millions of records. A traditional system would read the files sequentially, and an RDBMS would access them with random I/O. With HDFS, the files are processed in parallel, and Apache Spark can do even better than plain HDFS-based processing by keeping intermediate data in memory.
Spark Core
Spark Core is the underlying engine used to process large data sets. It is responsible for memory management, fault recovery, scheduling, and distributing and monitoring jobs on a cluster.
Spark Streaming
Spark Streaming is the component of Spark used to process real-time streaming data. It can ingest live data from sources such as Twitter, stock-market feeds, national-security systems, telecom, banking, and healthcare, and perform powerful analytics on those streams to help the business.
Spark SQL
Spark SQL is the Spark module that integrates relational processing with Spark's functional programming API. It supports querying data either via SQL or via the Hive Query Language (HiveQL), and Spark SQL queries are integrated with Spark programs: it allows us to query structured data inside a Spark application using plain SQL.