Posts

Showing posts from October, 2022

Why We Need Apache Spark?

Impediments of Hadoop

Spark was developed to overcome the limitations of Hadoop MapReduce. We should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for Hadoop as a whole, for the reasons below.

- Intermediate results gathering
- Processing techniques (almost real time)
- Polyglot
- Deployment and storage (scalable)
- Powerful cache and good speed (in-memory)
- Parallelism
- Lazy evaluation

Advantage of Spark over Other Frameworks

In-Memory Processing: Spark is often many times faster than disk-based processing engines such as MapReduce, because it keeps data in memory (RAM) and processes it there. Iterative algorithms run faster because data is not written to disk between jobs: intermediate results are kept in memory, whereas Hadoop MapReduce writes intermediate results to disk.

Processing Technique

Sequential processing is the FIFO (first in, first out) technique that was used before RDBMS systems were introduced: data and jobs are processed one after another, in order. Random processing is what an RDBMS does. It will pick the
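To make the lazy evaluation and in-memory points a bit more concrete, here is a minimal sketch in plain Python. It is not Spark's API: `LazyDataset`, `load`, and `reads` are hypothetical names made up for illustration. Transformations are only recorded, the work happens when `collect()` is called, and a dataset marked with `cache()` keeps its computed result in memory so a second use does not go back to the source.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source, transforms=()):
        self._source = source              # callable that returns the raw records
        self._transforms = tuple(transforms)
        self._keep = False                 # set by cache()
        self._cached = None                # in-memory copy of the computed result

    def map(self, fn):
        # Lazy: only records the transformation, no data is touched yet.
        return LazyDataset(self._source, self._transforms + (fn,))

    def cache(self):
        # Marks the dataset so the first computed result is kept in memory.
        self._keep = True
        return self

    def collect(self):
        # Action: triggers the computation unless a cached result exists.
        if self._cached is not None:
            return self._cached
        records = list(self._source())
        for fn in self._transforms:
            records = [fn(r) for r in records]
        if self._keep:
            self._cached = records
        return records


reads = []  # one entry is appended per read of the (pretend) disk source


def load():
    reads.append(1)
    return range(5)


squares = LazyDataset(load).map(lambda x: x * x).cache()
reads_before_action = len(reads)   # still 0: nothing has been computed yet
first = squares.collect()          # reads the source and caches the result
second = squares.collect()         # served from memory, no second read
```

Without the `cache()` call, every `collect()` would read the source again, which is roughly the cost MapReduce pays when it writes intermediate results to disk and reads them back between jobs.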

Why We Need Hadoop?

What is Hadoop?

Hadoop is an open source framework written primarily in Java. It allows us to implement Big Data solutions. Hadoop can be used for storing and processing huge datasets on a cluster of commodity hardware, because it stores and processes the data in a distributed manner. It is made up of several components, all of which are open source.

Doug Cutting is the creator of Hadoop. His child used to play with a toy elephant, so he named the new project after the toy and used the elephant as its symbol.

Why do we need to go for Hadoop?

1. High Throughput and Low Latency Queries

Real-time or near-real-time queries take less time to run. We can achieve high throughput in Hadoop using the techniques below.

- Parallel processing
- WORM (write once, read many)
- Data locality

2. Distributed Computing

A distributed computer system consists of multiple software components that are on multiple computers, but run as a single system. The computers that are in a distributed sys
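The parallel processing idea can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions, not Hadoop code: threads stand in for the machines of a cluster, and the names `count_words` and `parallel_word_count` are hypothetical. The input is split into blocks, each block is counted independently, and the partial counts are merged at the end.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(block):
    """Map step: count the words inside one block of lines."""
    counts = Counter()
    for line in block:
        counts.update(line.split())
    return counts


def parallel_word_count(lines, num_blocks=3):
    # Split the input into blocks, much like a large file split across nodes.
    blocks = [lines[i::num_blocks] for i in range(num_blocks)]

    # Each worker processes its own block independently of the others.
    with ThreadPoolExecutor(max_workers=num_blocks) as pool:
        partials = list(pool.map(count_words, blocks))

    # Reduce step: merge the partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["big data", "data locality", "big cluster", "data"]
totals = parallel_word_count(lines)
```

In a real cluster the map step would run on the machine that already stores the block (data locality), so only the small partial counts travel over the network.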