HDFS Infographic

Already we know, the data will be getting stored in 3 data nodes (replication) for prevention of failures or corruption. It is useless when you storing the same data in same disk/rack as 3 times. Suppose that whole data replicated single disk/rack is corrupted then it won't be going to work out, It is useless to come and do replication process in HDFS. When we are storing all of these 3 replicated data in different data node but in same rack. It is also not helpful. Suppose if you lose whole rack, then won't be going to help. So we have to store it in different data node with different rack then it will be more helpful to protect the data. Here we are going to discuss about how to store the data in different data node with different rack and how to handle Fault Tolerance.

Replica Placement Strategy

Name node will follow Rack Awareness Policy algorithm for replicating data into 3 different data nodes. This Rack Awareness Policy follows Replica placement strategy. Here the replication will be based on placement strategy algorithm. This placement algorithm written by java. we can modify the algorithm with using our own customized code. Name Node is the only master who is responsible for allocating DN for writing data and also used to read the data from corresponding DN as per client request.



Here we have list of racks in single cluster. In each rack, we will have list of data nodes. First replica is simple as per Name node suggestion, our algorithm will choose any nearest DN where user/developer executing script for writing, then the remaining two will be taken into next or random rack. Placement Algorithm is based on 'Only one replica per DN and Max two replica per rack'. Replication is mainly used for to avoid data loss and multiple user can read data at a time.

Writing Data in HDFS Cluster

Let's consider, we are trying to write 300MB of data file in HDFS cluster. As a client, out script will approach name node first. Before that we have to give each block size (By default it will be either 64MB or 128MB or 512MB) and we have to intimate to make 3 copies (Replication) of each dataset what we are going to write. So this block size, replication factor will be decided and configured once we done with the infra setup of HDFS, no need to configure every time when we running each script. As we know already if file size is large then probably we have to increase the size of block. Even block size is smaller than file size then based on file size, number of blocks will be getting allocated for writing. Based on the file size we have to divide into that much of blocks. Then client will send request to Name node for writing these data’s into block.


Then name node will analyze all data nodes then finally it will give the 3 DN's for storing 300MB of data bcoz considering our block size is 128MB. Those 3 DN’s will be nearer to client because if it is nearer to client machine then it will be easy to write the data into HDFS. Client will started to store the data into first DN. Along with data it will send remaining 2 DN information/address to first DN. Because first DN done with the storing of data then it will forward that same data to second DN. Second DN will get the data along with third DN information. Then second DN will store the same data into third DN. Once all DN getting stored then it will send the acknowledgement to the NN. Data storage will be done by sequential method i.e, it will store the data one by one.


Every 30 secs or 10th heart beat of data node sends block report to the name node. When replication was done with those 3 DN's then those data nodes will send the heart beat to Name node. So the name node will store all blocks and data nodes information about stored data i.e, meta data. 

Reading Data from HDFS Cluster

When we need to read the file from HDFS, client have to send the request to name node. Then name node will reply all blocks information with data node information where the file getting stored. Now client will have all DN and block information about file where it is stored. Then they will download the file from DN which is nearest to client.


Suppose if that DN is dead or does't have the data or the data is corrupted then HDFS can handle this issues very elegantly. We will discuss about this issues in Fault tolerance session.

Fault tolerance

Part I: Types of Faults and their Detection

First failure is either name node or data node failure. Then second failure is communication failure, it is network issue. Then third failure is data corruption, generally when you are downloading some data from browser or data node to client machine, you may get something like checksum values. If checksum value is 0 then it is downloaded successfully without any error else it may some data missing/corruption in downloaded file. Checksum was designed by md5, pdp algorithm.


Fault 1

          • First failure is node failure; either it is name node or data node failure. If name node is failures then the entire cluster will be getting dead. So as we know already this is called Single Point of Failure (SPOF). Then we have to fix it manually or it will be handled by Stand by Name node/Secondary Name node.


          • Data nodes will send HEARTBEAT to name node for every 3 seconds. HEARTBEAT means it will have information about data storage, block details, CPU performance like that.

Fault 2

          • When client system have received list of DN and block information from name node for storing the data, client will send the data to DN's then the data node will be acknowledging to client as data received successfully without any error.

          • If acknowledgement is not received from data node to client then client will assumes as corresponding data node is dead. Here failure happened due to network slowness/issues.

          • Client will send data along with checksum to data node. If it downloaded successfully then the checksum is 1 else some missing/corruption data while downloading.


Fault 3

          • Even blocks are also sends BLOCK REPORT to name node for saying they are alive and which file data they stored. Those BLOCK REPORT's will be sends by data node. 

          • Before sending these block report, data node will check where all the blocks are having proper/working data or corrupted data, these are done by checksum value. If checksum is not 0 then it will consider those blocks data are corrupted. Then it won't send that particular block information (BLOCK REPORT) to Name node which is not having checksum as 0.

          • Name node will automatically came to know about corrupted blocks if they didn't shared the BLOCK REPORT.


Part II: Handling Reading Writing Failures

          • Client will write or copy the data into blocks and the blocks will sends the acknowledgement to the client as we know already. Suppose if block didn't send any acknowledgement to client then they will assume that corresponding DN is dead. So considering that DN as 'under replicated', Name node will take care of that under replicated data nodes later.

          • For reading, client will ask the information of storage of particular file, then name node sends the required blocks and DN information to the client. Client will check nearest DN. Suppose that DN is dead then it will move to next nearest node for collecting the data. Will discuss about how to handle data node failure in next part.


Part III: Handling Data node Failure

Here name node maintaining two table.

          1. List of blocks
                    Contains information about blocks where are all it is getting stored like DN informations.
       
          2. List of data nodes
                    It will have details about the all the blocks of all DN's.


          • If a block on a DN is corrupted, name node won't get any BLOCK REPORT, so name node will update to List of blocks table as corresponding block is getting corrupted.

          • If data node has died, then name node wont get HEARTBEATS from that DN, so name node will update to both tables as corresponding DN has died.

          • Suppose if there any under replicated block anyhow it wont send BLOCK REPORT to name node, so name node will ask some other DN to copy data from those failed under replicated blocks then it will share all the details. Then new DN will approach failed block/DN for copying the data. 

Comments

Popular posts from this blog

Hive File Formats

Why We Need Hadoop?

Hive Data Types