Why Do We Need Apache Spark?

Impediments of Hadoop

  • Spark was developed to overcome the limitations of Hadoop MapReduce.
  • We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop as a whole, for the reasons below.
    • Intermediate Results gathering
    • Processing Techniques - Almost Realtime
    • Polyglot
    • Deployment and Storage(Scalable)
    • Powerful Cache and Good Speed (In-memory)
    • Parallelize
    • Lazy Evaluation

Advantage of Spark over Other Frameworks

  • In-Memory Processing - Spark is often many times faster than other processing engines such as MapReduce or Tez.
    • It keeps data in memory (RAM) and processes it there.
    • Iterative algorithms are faster because data is not written to disk between jobs.
    • Intermediate results are kept in memory, whereas Hadoop MapReduce writes intermediate results to disk.
  • Processing Techniques
    • Sequential processing is the FIFO technique used before RDBMS systems were introduced. Jobs are processed one after another, in arrival order.
    • Random processing is what an RDBMS does. It picks jobs or queries out of arrival order, based on cost or some other factor.
    • For batch processing, we used Hadoop MapReduce.
    • For parallel processing, we can use tools from the Hadoop ecosystem such as YARN, Hive, Sqoop and HBase.
      • After the Hadoop system receives a job, it first divides the job's input data into blocks of equal size, and each Map task is responsible for processing one block. The Map tasks run at the same time, which gives parallel processing of the data.
      • Spark uses Resilient Distributed Datasets (RDDs) to perform parallel processing across the nodes of a cluster. This allows developers to run tasks on hundreds of machines in parallel and independently. We can also create an RDD from a local collection using 'parallelize'.
    • For stream processing, we used Apache Storm or S4.
      • This is also called real-time processing. It involves continuously ingesting data so that it can be analyzed, filtered, transformed or enriched as it arrives.
    • For interactive processing, we used Apache Impala or Apache Tez.
      • Interactive processing means that the person provides the computer with instructions while it is doing the processing, i.e. the user 'interacts' with the computer to complete the processing.
      • For example, if you use an online system to book a hotel room, you will fill in a web form, submit it and it will come back to inform you of the room you have booked.
    • For graph processing, we used Neo4j or Apache Giraph.
      • Graph analytics, also called network analysis, is the analysis of relations among entities such as customers, products, operations, and devices. Organizations use graph models to gain insights that can be applied in marketing or, for example, in analyzing social networks. Many businesses work with graphs.
    • For iterative processing, Hadoop is not very efficient, because it does not support cyclic data flow (i.e. a chain of stages in which the output of each stage is the input to the next stage).
      • We can use Apache Spark to overcome this limitation of Hadoop, as it accesses data from RAM instead of disk, which dramatically improves the performance of iterative algorithms that access the same dataset repeatedly. Spark iterates over its data in batches. For iterative processing in Spark, we schedule and execute each iteration separately.
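
The benefit of keeping data in memory for iterative work can be sketched in plain Python. This is only an analogy, not Spark code: the file name, the `step` function and the counters are made up for illustration. One loop re-reads its input from disk on every iteration (roughly what chained MapReduce jobs do), while the other loads the data once and reuses it from RAM.

```python
import json
import os
import tempfile


def load_points(path, counter):
    """Read the dataset from disk and record that a disk read happened."""
    counter["disk_reads"] += 1
    with open(path) as f:
        return json.load(f)


def step(points, estimate, rate=0.5):
    """One iteration: move the estimate toward the mean of the points."""
    gradient = sum(estimate - p for p in points) / len(points)
    return estimate - rate * gradient


def iterate_from_disk(path, iterations, counter):
    """MapReduce-style: every iteration re-reads its input from disk."""
    estimate = 0.0
    for _ in range(iterations):
        points = load_points(path, counter)
        estimate = step(points, estimate)
    return estimate


def iterate_in_memory(path, iterations, counter):
    """Spark-style: load once, keep the data in RAM, reuse it each iteration."""
    points = load_points(path, counter)  # stands in for a cached dataset
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(points, estimate)
    return estimate


def run_demo(iterations=20):
    """Run both variants on the same small file and report the disk reads."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "points.json")  # hypothetical dataset
        with open(path, "w") as f:
            json.dump([1.0, 2.0, 3.0, 4.0], f)
        disk, mem = {"disk_reads": 0}, {"disk_reads": 0}
        from_disk = iterate_from_disk(path, iterations, disk)
        in_memory = iterate_in_memory(path, iterations, mem)
    return from_disk, in_memory, disk["disk_reads"], mem["disk_reads"]
```

Calling `run_demo(20)` gives the same estimate from both variants, but the first performs 20 disk reads and the second only one. On a real dataset that difference in I/O is where most of the time goes.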

Hence, there was no single powerful engine in the industry that could process data in both real-time and batch modes. There was also a need for one engine that could respond in sub-second time and perform in-memory processing.

This is where Apache Spark comes in. It is a powerful open source engine that offers real-time processing, stream processing, interactive processing, graph processing, in-memory processing and iterative processing as well as batch processing, along with high speed, ease of use and a standard interface. These features are what set Spark apart from Hadoop MapReduce.

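Spark's real-time processing is near real time rather than fully real time. Spark Streaming works in micro-batches: incoming data is collected for a short interval and processed as a small batch, and data that arrives after a batch has been cut is processed in the next batch. Structured Streaming also has an experimental continuous processing mode that aims for lower latency.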
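
The micro-batch idea can be sketched in a few lines of plain Python. This is an illustration of the concept only, not Spark's implementation: the event list, the one-second interval and the `process` function are assumptions chosen for the example.

```python
from collections import defaultdict


def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-length time windows.

    An event with timestamp t lands in batch number int(t // interval), so an
    event arriving just after a batch boundary waits for the next batch.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval)].append(value)
    return [batches[key] for key in sorted(batches)]


def process(batch):
    """Hypothetical per-batch job: here, just sum the values."""
    return sum(batch)


# Timestamps are in seconds; values are arbitrary sample readings.
events = [(0.2, 10), (0.7, 20), (1.1, 30), (1.9, 40), (2.5, 50)]
results = [process(batch) for batch in micro_batches(events, interval=1.0)]
print(results)  # [30, 70, 50]
```

The event at 1.1 seconds is not handled when it arrives. It waits until the second batch is processed, which is why this style of streaming is near real time rather than truly real time.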

  • Polyglot - Spark supports many programming languages, such as Scala, Python, SQL, Java and R. It also ships with MLlib for machine learning.
  • Deployment and Storage - Spark can run with its own standalone cluster manager, or on YARN or Mesos. For storage, Spark can use HDFS, S3 (AWS), the local file system (LFS) or other cloud storage.
  • Parallelize in Spark
    • This allows Spark to distribute the data across multiple nodes, instead of depending on a single node to process the data.
    • One important way to increase the parallelism of Spark processing is to increase the number of executors on the cluster, so that a job's tasks run on more nodes at the same time.
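
As a rough sketch of what distributing data across workers means, the plain-Python example below splits a list into partitions and hands each partition to a pool of workers. It is an analogy under stated assumptions: threads stand in for executors on separate nodes, and the helper names are invented here. In PySpark the corresponding entry point is `SparkContext.parallelize`, which takes the collection and an optional number of slices.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def sum_of_squares(partition):
    """The task each worker runs on its own partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(data, num_partitions=4, num_workers=4):
    """Partition the data, process the partitions concurrently, then combine."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(sum_of_squares, partitions))
    return sum(partial_results)


total = parallel_sum_of_squares(list(range(1, 11)))
print(total)  # 385
```

Raising `num_workers` here plays the role that adding executors plays in a cluster: more partitions can be processed at the same time, as long as there are at least as many partitions as workers.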

  • Lazy Evaluation - Spark evaluates big data queries lazily, which helps it optimize the overall data processing workflow.
    • In Spark, lazy evaluation means that you can apply as many transformations as you want, but Spark will not start executing them until an action is called. So transformations are lazy but actions are eager.
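
The split between lazy transformations and eager actions can be illustrated with a toy class in plain Python. This is a minimal sketch, not Spark's RDD: `LazyDataset` and its methods are hypothetical names that only mimic the behaviour described above.

```python
class LazyDataset:
    """Toy dataset that records transformations and runs them only on an action."""

    def __init__(self, source, plan=()):
        self._source = source
        self._plan = plan  # tuple of ("map" | "filter", function) steps

    # Transformations: return a new dataset with a longer plan; nothing runs.
    def map(self, fn):
        return LazyDataset(self._source, self._plan + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._plan + (("filter", fn),))

    # Actions: walk the source once, applying the recorded plan to each item.
    def collect(self):
        result = []
        for item in self._source:
            keep = True
            for kind, fn in self._plan:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result

    def count(self):
        return len(self.collect())


calls = []


def double(x):
    calls.append(x)  # record that work actually happened
    return x * 2


ds = LazyDataset([1, 2, 3, 4, 5]).map(double).filter(lambda x: x > 4)
print(len(calls))    # 0: no transformation has run yet
print(ds.collect())  # [6, 8, 10]
print(len(calls))    # 5: the action triggered the work
```

Because the whole plan is known before anything runs, an engine built this way can apply all steps in a single pass over the data, which is the kind of optimization lazy evaluation makes possible.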
