HDFS Overview

HDFS
  • HDFS stands for Hadoop Distributed File System, a specially designed file system.
  • It is a data storage framework.
  • It is the storage layer of Hadoop and is designed only for storing data.
  • It stores huge datasets on a cluster of commodity hardware with a streaming access pattern.
Why do we say HDFS is specially designed?
  • A file system is a way of storing files and directories. Suppose a hard disk has 500 GB of space and the file system uses 4 KB blocks by default. If I store a 2 KB file, the remaining 2 KB of that block is wasted. This is how a normal local file system stores files.
  • In HDFS, each block is 64 MB by default (128 MB from Hadoop version 2 onwards). If I store 35 MB of data in a block, it occupies only 35 MB on disk, and the remaining space stays free for other files and directories. That is why we call HDFS a specially designed file system. If the remaining space were wasted, we would need more machines to store huge datasets. The block size is configurable, and the Hadoop administrator is responsible for changing the configuration.
  • It supports SRD.
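
The difference in wasted space can be sketched with a little arithmetic. This is a minimal Python sketch, not HDFS code: the helper name allocated_bytes is hypothetical, and it assumes a local file system that hands out space in whole 4 KB blocks, while an HDFS block takes up only as much disk as the data written to it.

```python
KB = 1024
MB = 1024 * KB

def allocated_bytes(file_size, block_size):
    """Space reserved when storage is handed out in whole blocks (hypothetical helper)."""
    blocks = -(-file_size // block_size)  # ceiling division on integers
    return blocks * block_size

# Local file system: a 2 KB file still occupies one full 4 KB block.
local_waste = allocated_bytes(2 * KB, 4 * KB) - 2 * KB

# HDFS (assumption stated above): 35 MB written to a 64 MB block uses only 35 MB.
hdfs_used = min(35 * MB, 64 * MB)

print(local_waste // KB, "KB wasted on the local file system")
print(hdfs_used // MB, "MB actually used in HDFS")
```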

SRD
  • If we copy a file from one drive to another in Windows, the file is simply copied as-is. If we copy the same file into HDFS, the file is split into blocks.
  • We already know that HDFS follows the slogan 'WORM' (Write Once, Read Many times). HDFS works based on SRD:
             • Sharding
             • Distributed
             • Replication Factor

Sharding
  • HDFS splits a file into blocks based on size. We can increase or decrease the block size depending on the file size. In Hadoop version 1 the default block size was 64 MB, and in version 2 it is 128 MB.
  • Suppose we copy test.txt (300 MB) into HDFS with a 128 MB block size. It is split into 3 blocks: Block 1 holds 128 MB, Block 2 holds 128 MB, and Block 3 holds the remaining 44 MB.
  • Having too many blocks for a single file also affects performance, because the name node has to track metadata for every block. If you have a 300 TB file, increase the block size, for example to 512 MB, to avoid performance issues. To change the block size, modify the dfs.blocksize property in hdfs-site.xml.
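
The block arithmetic above can be checked with a short sketch. This is plain Python for illustration, not an HDFS API: the function names split_into_blocks and block_count are hypothetical, and sizes are given in whole megabytes.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the size of each block (in MB) for a file of the given size."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes

def block_count(file_size_mb, block_size_mb):
    """Number of blocks needed, without building the full list."""
    return -(-file_size_mb // block_size_mb)  # ceiling division

# test.txt from the example: 300 MB with 128 MB blocks
print(split_into_blocks(300))

# A 300 TB file: compare the block count at 128 MB and at 512 MB
size_mb = 300 * 1024 * 1024
print(block_count(size_mb, 128), "blocks at 128 MB")
print(block_count(size_mb, 512), "blocks at 512 MB")
```

With the larger block size the same file needs a quarter as many blocks, which is the reason for raising the block size on very large files.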
Distributed
  • Each block is distributed across the cluster. This enables parallel processing: instead of running 2500 jobs for a single file on a single machine, we can split the work across 3 machines in parallel, so it finishes faster.
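
To see how the load shrinks per machine, here is a rough sketch that spreads 2500 jobs over 3 machines in round-robin order. The function distribute and the machine names are made up for illustration; a real Hadoop cluster schedules work based on where the blocks are stored, not by simple round-robin.

```python
def distribute(num_jobs, machines):
    """Assign jobs to machines in round-robin order and count jobs per machine."""
    load = {machine: 0 for machine in machines}
    for job in range(num_jobs):
        load[machines[job % len(machines)]] += 1
    return load

load = distribute(2500, ["node1", "node2", "node3"])
print(load)
# Each machine handles about a third of the jobs instead of one machine handling all 2500.
```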
Replication Factor
  • Each block is replicated 3 times by default.
            Min: 1
            Max: 512 (the default upper limit on the replication factor)
  • To change the replication count, modify the dfs.replication property in hdfs-site.xml. Parallel processing, data availability, and fault tolerance are all achieved through SRD.
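
The effect of the replication factor on fault tolerance can be sketched as below. This is a simplified model under stated assumptions: place_replicas and all_blocks_available are hypothetical helpers, the data node names are invented, and replicas are placed on consecutive nodes, which is not the rack-aware placement policy a real cluster uses.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Put each block on `replication` distinct nodes (simplified placement)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def all_blocks_available(placement, failed_nodes):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )

nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(3, nodes)  # the 3 blocks of the 300 MB file

print(all_blocks_available(placement, {"dn1", "dn2"}))         # two nodes down
print(all_blocks_available(placement, {"dn1", "dn2", "dn3"}))  # three nodes down

# Replication also multiplies raw storage: 300 MB of data needs 900 MB of disk.
raw_storage_mb = 300 * 3
```

With 3 replicas, any 2 data nodes can fail without losing a block; losing 3 nodes that hold all copies of one block makes that block unavailable.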
Why do we need HDFS?
  • It is mainly used for data analysis.
  • It is durable.
  • We can append data.
  • It is mainly used for data availability.
  • It is a data storage framework.
  • It stores both the input and the output of jobs.
Why can't we use HDFS for OLTP?
  • In HDFS, updating data in place is not possible, i.e., HDFS is "WORM".
  • Only in MapR's file system can we do read/write.
  • OLTP (Online Transaction Processing) always relies on ACID, and Hadoop does not support ACID.
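
The WORM behaviour can be pictured with a toy class that allows appends and reads but rejects in-place updates. WormFile is a hypothetical illustration in plain Python, not part of any Hadoop client library.

```python
import io

class WormFile:
    """Toy write-once, read-many file: append and read only."""

    def __init__(self):
        self._data = bytearray()

    def append(self, chunk):
        """Add new data to the end of the file."""
        self._data.extend(chunk)

    def read(self):
        """Return everything written so far."""
        return bytes(self._data)

    def update(self, offset, chunk):
        """In-place updates are what OLTP needs and what WORM forbids."""
        raise io.UnsupportedOperation("in-place update is not allowed")

f = WormFile()
f.append(b"order-1;")
f.append(b"order-2;")
try:
    f.update(0, b"order-9;")
except io.UnsupportedOperation as err:
    print("rejected:", err)
print(f.read())
```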
ACID

Atomicity: All or nothing. In other words, the transaction either completes fully or, if it fails, everything returns to the original state.

Consistency: Every transaction leaves the data in a valid state. To guarantee this, your account is locked while a transaction is in progress, and the lock is released when the transaction is over. But HDFS is mainly used for data availability. If a table is locked, users cannot view the data. That is the main reason OLTP was avoided for HDFS.

Isolation: Each transaction is separate from every other transaction.

Durability: Data is long lasting. Once a transaction is committed, it remains available.
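
Atomicity is the property that is easiest to show in a few lines. The sketch below is a toy in-memory example, assuming a plain dictionary of account balances; the function transfer is hypothetical and only illustrates the "all or nothing" rule that an OLTP database enforces and HDFS does not.

```python
def transfer(accounts, src, dst, amount):
    """Move money between accounts; on any failure restore the original state."""
    snapshot = dict(accounts)  # remember the state before the transaction
    try:
        accounts[src] -= amount
        if accounts[src] < 0:
            raise ValueError("insufficient funds")
        accounts[dst] += amount
    except Exception:
        accounts.clear()
        accounts.update(snapshot)  # roll back to the original state
        raise

accounts = {"alice": 100, "bob": 50}
transfer(accounts, "alice", "bob", 30)  # succeeds: both balances change

try:
    transfer(accounts, "alice", "bob", 500)  # fails: nothing changes
except ValueError as err:
    print("rolled back:", err)
print(accounts)
```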


In Hadoop version 1, both the programming model and cluster resource management (Job Tracker + Task Tracker) are handled by MapReduce.

Hadoop Setup
          
          1. Single node
          2. Pseudo distributed
          3. Multi node

Single node (standalone) means Hadoop runs as a single Java process on one machine, using the local file system, with no separate daemons. Pseudo distributed means all the daemons (name node, data node and so on) run on a single machine, each as its own process. Multi node means the name node runs on one machine and the data nodes run on separate machines.

The following scripts are used to start the Hadoop daemons through the CLI:

start-dfs.sh   -> starts the DFS daemons separately.
start-yarn.sh  -> starts the YARN daemons separately.
start-all.sh   -> starts all the services (including the DFS and YARN daemons).

Hadoop Daemons

HDFS daemons

          1. Name Node
          2. Data node
          3. Secondary Name Node

These 3 daemons deal with HDFS and are used to store data in HDFS.

MapReduce V1 Daemons            YARN Daemons (MapReduce V2)

          4. Job Tracker                  4. Resource Manager
          5. Task Tracker                 5. Node Manager

These daemons are responsible for running the MapReduce jobs.

The start-dfs.sh script starts the DFS daemons and start-yarn.sh starts the YARN daemons separately. The start-all.sh script starts all of the above daemons.
