Why Do We Need Hadoop?

What is Hadoop?

  • It is an open source framework written mainly in Java.
  • It allows us to implement Big Data solutions.
  • Hadoop can be used for storing and processing huge datasets on a cluster of commodity hardware, because it stores and processes the data in a distributed manner.
  • It is made up of several components, all of which are open source.

Doug Cutting created Hadoop, together with Mike Cafarella. The name comes from a toy elephant that his child used to play with, which is also why the project uses an elephant as its symbol.

Why do we need to go for Hadoop?

1. High Throughput and Low latency queries

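  • Hadoop takes less time to process queries over large datasets because the work is spread across the cluster. It is tuned for high throughput first; real time or near real time queries usually rely on tools built on top of it.
  • We can achieve high throughput in Hadoop using the techniques below.
    • Parallel Processing
    • WORM (write once, read many)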
    • Data Locality

2. Distributed computing

  • A distributed computer system consists of multiple software components that are on multiple computers, but run as a single system. 
  • The computers that are in a distributed system can be physically close together and connected by a local network, or they can be geographically distant and connected by a wide area network.
  • Hadoop's distributed computing model processes big data fast. 
  • The more computing nodes you use, the more processing power you have.

3. Massive Parallel Processing 

  • Fast Data Processing
  • Massively Parallel Processing (MPP) is a processing paradigm where hundreds or thousands of processing nodes work on parts of a computational task in parallel. Each of these nodes runs its own instance of an operating system.
  • After the Hadoop system receives a job, it first divides all the input data of the job into several data blocks of equal size, and each Map task is responsible for processing one data block. The Map tasks run concurrently across the cluster, forming parallel processing of the data.
  • Because every block is replicated, multiple jobs can process the same data at the same time by reading different replicas.
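
The split, map and merge flow described above can be sketched in a few lines of Python. This is only an illustration running on one machine, not Hadoop's own API: the function names (split_into_blocks, map_task, reduce_task, run_job) are made up for this example, blocks are counted in lines rather than bytes, and a thread pool stands in for the cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(lines, block_size):
    """Divide the input into blocks of at most block_size lines each."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_task(block):
    """Map phase: count the words found in a single block."""
    counts = Counter()
    for line in block:
        counts.update(line.split())
    return counts


def reduce_task(partial_counts):
    """Reduce phase: merge the per-block counts into one result."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def run_job(lines, block_size=2, workers=4):
    """Run one word-count job: split, map each block concurrently, then reduce."""
    blocks = split_into_blocks(lines, block_size)
    # One map task per block; the pool runs the tasks concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_task, blocks))
    return reduce_task(partial_counts)


if __name__ == "__main__":
    data = ["big data big cluster", "data node", "name node", "big node"]
    print(dict(run_job(data)))
```

Each block is processed independently, so adding more workers (or, in a real cluster, more nodes) lets more blocks be handled at the same time.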

4. Horizontal Scalability

  • You can easily grow your system to handle more data simply by adding nodes. Little administration is required.
  • Horizontal scaling (aka scaling out) refers to adding additional nodes or machines to your infrastructure to cope with new demands. 
  • If you are hosting an application on a server and find that it no longer has the capacity or capabilities to handle traffic, adding a server may be your solution.

5. Fault tolerance

  • Data and application processing are protected against hardware failure. 
  • If a node goes down, jobs are automatically redirected to other nodes to make sure the distributed computing does not fail. 
  • Multiple copies of all data are stored automatically.
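
As a rough picture of how stored copies protect against a node going down, here is a small Python sketch. It is not how HDFS is implemented; the node names, the read_block function and the dictionaries are hypothetical, and "failure" is simulated by leaving a node out of the set of live nodes.

```python
def read_block(block_id, replica_nodes, live_nodes, storage):
    """Return (node, data) from the first replica node that is still alive."""
    for node in replica_nodes[block_id]:
        if node in live_nodes:
            return node, storage[node][block_id]
    raise IOError(f"all replicas of block {block_id} are unavailable")


# Hypothetical cluster: block "b1" is stored on three nodes.
replica_nodes = {"b1": ["node1", "node2", "node3"]}
storage = {node: {"b1": b"some data"} for node in ("node1", "node2", "node3")}

if __name__ == "__main__":
    live_nodes = {"node2", "node3"}  # node1 has gone down
    node, data = read_block("b1", replica_nodes, live_nodes, storage)
    print(f"read block b1 from {node}: {data!r}")
```

The read only fails when every node holding a copy is down at once, which is why keeping several copies on different machines matters.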

6. Data availability (Durability or Durable)

  • The high availability feature in Hadoop keeps the cluster available without any downtime, even in unfavorable conditions such as node failures (NameNode failure, DataNode failure, machine crash, etc.). It means that if a machine crashes, the data is still accessible through another path.

7. Hadoop is Flexible

  • Unlike traditional relational databases, you don’t have to preprocess data before storing it. You can store as much data as you want and decide how to use it later. That includes unstructured data like text, images and videos.

8. Cost Effective

  • Hadoop is a comparatively cheap and cost effective way of handling Big Data when compared with other frameworks (Spark, Elasticsearch, Presto, etc.).
  • Low cost: the open source framework is free and uses commodity hardware to store large quantities of data.

9. Data security

  • Data security is still a challenge in Hadoop because it is fragmented across components, though new tools and technologies are surfacing.
  • The Kerberos authentication protocol is a great step toward making Hadoop environments secure.

10. Replication Factor

  • By default, the replication factor in Hadoop is set to 3.
  • The maximum replication factor is 512 by default (the dfs.replication.max setting).
  • We can check the replication factor of a file using the command below,
    • hadoop fs -stat %r /path/to/file
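
To get a feel for what the replication factor costs in disk space, the arithmetic can be sketched as below. The storage_footprint function is made up for this example, and the 128 MB block size is an assumption (it is a common HDFS default, but it is not stated in this post and your cluster may differ).

```python
import math


def storage_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate blocks and raw disk usage for one file (block size is an assumption)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        # Every block is stored `replication` times across the cluster.
        "block_replicas": blocks * replication,
        # A partly filled last block only uses the space it needs,
        # so raw usage is simply file size times replication.
        "raw_storage_mb": file_size_mb * replication,
    }


if __name__ == "__main__":
    print(storage_footprint(1024))
```

With these assumptions, a 1024 MB file is split into 8 blocks, stored as 24 block replicas, and takes 3072 MB of raw disk.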

11. Streaming Access Pattern

  • Java has the slogan "write once, run anywhere", meaning the same program runs on any platform, whether Windows or Unix.
  • Hadoop has a similar slogan for data kept in HDFS: "write once, read any number of times, but don't try to change the content of the file". That is what the streaming access pattern is.
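
The write once, read many idea can be shown with a tiny in-memory store. This is a toy model and not the HDFS client API: the WriteOnceStore class and the path used are hypothetical, and it ignores operations such as append and delete that real HDFS also supports.

```python
class WriteOnceStore:
    """Toy file store: a path can be created once and then read many times."""

    def __init__(self):
        self._files = {}

    def create(self, path, data):
        # Refuse to change the content of a file that already exists.
        if path in self._files:
            raise FileExistsError(f"{path} already exists and cannot be overwritten")
        self._files[path] = bytes(data)

    def read(self, path):
        return self._files[path]


if __name__ == "__main__":
    store = WriteOnceStore()
    store.create("/logs/day1.txt", b"first version")
    print(store.read("/logs/day1.txt"))  # can be repeated any number of times
    try:
        store.create("/logs/day1.txt", b"second version")
    except FileExistsError as err:
        print("rejected:", err)
```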

12. We can't use Hadoop/HDFS for OLTP - OLTP systems depend on ACID transactions

  • Lack of OLTP Support.
    • Hadoop is based on sequential reads and doesn’t support in-place updates, so it’s more useful for Online Analytical Processing, or OLAP. Hive, an SQL-on-Hadoop tool based on MapReduce, does not support Online Transaction Processing, or OLTP, since that programming model doesn’t do single-row operations. Other tools are not based on MapReduce, but they still target analytical queries. HBase provides transactional functionality, but it’s not ACID compliant, so you can’t use it to guarantee reliable database transactions.

13. Sharding

  • Sharding distributes the data among multiple machines.
  • It is a method of splitting and storing a single logical dataset in multiple databases.
  • Like MongoDB, Hadoop's HBase database accomplishes horizontal scalability through database sharding.
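
One common way to shard is by key range, which is also the style HBase uses when it splits a table into regions. The sketch below is only an illustration: the RangeShardedTable class, its split keys and the plain dictionaries standing in for separate machines are all hypothetical.

```python
import bisect


class RangeShardedTable:
    """Toy table that routes each row key to a shard based on sorted split keys."""

    def __init__(self, split_keys):
        # N split keys define N + 1 shards; each dict stands in for one machine.
        self.split_keys = sorted(split_keys)
        self.shards = [dict() for _ in range(len(self.split_keys) + 1)]

    def shard_for(self, row_key):
        # Keys below the first split key go to shard 0; a key equal to a
        # split key belongs to the shard that starts at that key.
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key, value):
        self.shards[self.shard_for(row_key)][row_key] = value

    def get(self, row_key):
        return self.shards[self.shard_for(row_key)].get(row_key)


if __name__ == "__main__":
    table = RangeShardedTable(split_keys=["g", "n"])
    for key in ["apple", "hadoop", "zebra"]:
        table.put(key, key.upper())
        print(key, "-> shard", table.shard_for(key))
```

Each lookup only touches one shard, so adding shards spreads both the data and the request load over more machines.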

14. Shared-nothing Architecture

  • It consists of multiple nodes that do not share resources (e.g., memory, CPU, and NIC (Network Interface Card) buffer queues). Each request is serviced by a single node, which avoids contention (competition for the same resource) among nodes.

15. Schema Independent

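  • It means we can change the conceptual schema at one level without affecting the data at another level. It also means we can change the structure of a database without affecting the data required by users and programs. In Hadoop this is possible because files are stored as they are and a schema is applied only when the data is read (schema on read).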

16. Data locality

  • In Hadoop, Data locality is the process of moving the computation close to where the actual data resides on the node, instead of moving large data to computation. This minimizes network congestion and increases the overall throughput of the system.
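
A scheduler that prefers data locality can be sketched as follows. This is a simplified illustration and not YARN's real scheduling logic: the pick_node function, the node names and the idea of counting free "slots" per node are assumptions made for the example.

```python
def pick_node(block_id, block_locations, free_slots):
    """Prefer a node that already stores the block; otherwise use any free node."""
    # First choice: run the task where the data already lives (no network copy).
    for node in block_locations.get(block_id, []):
        if free_slots.get(node, 0) > 0:
            return node, "node-local"
    # Fallback: any node with capacity; the block must then travel over the network.
    for node, slots in free_slots.items():
        if slots > 0:
            return node, "remote"
    return None, "no free slot"


if __name__ == "__main__":
    block_locations = {"b1": ["node1", "node2"]}
    print(pick_node("b1", block_locations, {"node1": 0, "node2": 1, "node3": 4}))
    print(pick_node("b1", block_locations, {"node1": 0, "node2": 0, "node3": 4}))
```

In the first call the task lands on node2, which holds the block; in the second call both data-holding nodes are busy, so the task falls back to node3 and the block has to be read remotely.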

17. Data Integrity

  • Data integrity is the overall accuracy, completeness, and consistency of data. 
  • Data integrity in Hadoop is achieved by maintaining a checksum of the data written to each block. Whenever data is written to HDFS blocks, HDFS calculates a checksum for the data and verifies that checksum when the data is read back.
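
The write-then-verify idea can be sketched with Python's built-in CRC-32. This is only an illustration: write_block and read_verified are hypothetical names, the dictionary stands in for a stored block, and the exact checksum algorithm and chunking that HDFS uses are not covered here.

```python
import zlib


def write_block(data):
    """Store the data together with a checksum computed at write time."""
    return {"data": data, "checksum": zlib.crc32(data)}


def read_verified(stored):
    """Recompute the checksum on read and refuse to return corrupt data."""
    if zlib.crc32(stored["data"]) != stored["checksum"]:
        raise ValueError("checksum mismatch: block is corrupt")
    return stored["data"]


if __name__ == "__main__":
    block = write_block(b"hello hadoop")
    print(read_verified(block))

    block["data"] = b"hellp hadoop"  # simulate corruption on disk
    try:
        read_verified(block)
    except ValueError as err:
        print("detected:", err)
```

When a corrupt copy is detected like this, a system that keeps several replicas can read the block from another replica instead.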

18. Hadoop Setup

  • We can set up Hadoop for our own practice or as an enterprise version for business. The setups are classified as below,
    • Single node
    • Pseudo distributed
    • Multi node
  • Single node (standalone) means running Hadoop as a single Java process on one machine, using the local file system, with no separate daemons.
  • Pseudo distributed means running all the daemons (Name Node, Data Node, etc.) on a single machine, each as its own Java process.
  • Multi node means running the Name Node on one machine and each Data Node on a separate machine.
  • The scripts below are used to start the Hadoop daemons through the CLI,
    • start-dfs.sh      => It starts the DFS (HDFS) daemons only.
    • start-yarn.sh    => It starts the YARN daemons only.
    • start-all.sh       => It starts all the services (including DFS and YARN daemons).

Hadoop Distributions

  • Commercial : Cloudera, Hortonworks, MapR
  • Cloud : AWS (Elastic MapReduce), Azure HDInsight, GCP (Dataproc)

Organizations using Hadoop

  • Facebook 
  • Yahoo! 
  • Amazon 
  • eBay 
  • American Airlines 
  • The New York Times 
  • Federal Reserve Board 
  • IBM 
  • Orbitz 
  • Many more (around 100 companies).

Core Components of Hadoop

We will discuss the components below in detail.

  • HDFS
  • MapReduce
  • YARN
