YARN

YARN Overview
  • YARN stands for Yet Another Resource Negotiator. 
  • YARN was introduced in Hadoop 2.x. 
  • It is the data processing layer of Hadoop and is referred to as the Data Processing Framework (DPF).
  • YARN allows different data processing engines, such as graph processing, interactive processing, stream processing and batch processing, to run and process data stored in HDFS. 
  • Apart from resource management, YARN is also responsible for job scheduling.

YARN Architecture
  • The Apache YARN framework consists of a master daemon known as the "Resource Manager" and a slave daemon called the "Node Manager" (one per slave node). 
  • The Resource Manager and the Node Manager are the two daemons of YARN. 
  • The Resource Manager runs on the master (name) node. 
  • A Node Manager runs on each data node; every data node has its own Node Manager.

Resource Manager
  • In a general view, the Resource Manager (RM) is responsible for tracking the resources in a cluster and for scheduling applications (e.g., MapReduce jobs). 
  • Before the High Availability feature was introduced, the Resource Manager was the single point of failure in a YARN cluster. 
  • The High Availability feature adds redundancy in the form of an Active/Standby Resource Manager pair to remove this single point of failure.
  • The Resource Manager has two main components:
    • Scheduler
    • Application Manager
  • The Resource Manager knows where the slaves are located (Rack Awareness) and how many resources they have. It runs several services, the most important of which is the Resource Scheduler, which decides how to assign the resources.
Scheduler
  • The scheduler is responsible for allocating resources to the running applications. 
  • The scheduler is used to schedule applications/jobs.
  • The scheduler is a pure scheduler: it performs no monitoring or tracking of the application/job, and it gives no guarantee about restarting failed jobs/tasks, whether the failure is caused by the application or by hardware.
Application Manager
  • It is responsible for starting Application Masters (which run on Node Managers) and for monitoring and restarting Application Masters on different nodes in case of failure.
  • It coordinates all created Application Masters.
Node Manager (NM)
  • It is the slave daemon of YARN. When it starts, it announces itself to the Resource Manager; thereafter it sends a heartbeat to the Resource Manager every 3 seconds.
  • The NM is responsible for containers: it monitors their resource usage and reports it to the Resource Manager. 
  • The YARN Node Manager also tracks the health of the node on which it is running. 
  • Each Node Manager offers some resources to the cluster. Its resource capacity is the amount of memory and the number of vcores.
  • The Node Manager has two main components:
    • Application Master
    • Container (Memory)
Application Master (AM)
  • One application master runs per application. 
  • It negotiates resources from the Resource Manager when they are insufficient and works with the Node Manager. It manages the application/job life cycle. 
  • It knows where the slaves are located (Rack Awareness) and how many resources they have. 
  • The Application Master is responsible for the execution of a single application. 
  • It asks the Resource Scheduler (in the Resource Manager) for containers and executes specific programs (e.g., the main of a Java class) in the obtained containers. 
  • The Application Master knows the application logic and thus it is framework-specific. The MapReduce framework provides its own implementation of an Application Master.
  • For each job, one container on a slave node hosts the Application Master, which monitors the job's resources and tasks. This Application Master manages that single MapReduce job and is terminated when the job completes. 
Containers
  • A container is the place where a YARN application task (not the entire job) runs. 
  • Containers are available on every node. 
  • Containers are launched by Node Manager. 
  • The Application Master negotiates containers with the scheduler (a component of the Resource Manager), as sketched below.
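
As an illustration, here is a minimal Java sketch of that negotiation using the YARN AMRMClient API. The memory size (1024 MB), vcore count (1) and priority are arbitrary example values, not values prescribed by YARN, and in a real cluster this code would run inside the Application Master's own container.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class ContainerRequestSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();

        // Register this Application Master with the Resource Manager.
        rmClient.registerApplicationMaster("", 0, "");

        // Ask the scheduler for one container of 1024 MB and 1 vcore (example values).
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);
        rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        // The allocate (heartbeat) call returns whatever containers the scheduler has granted so far.
        AllocateResponse response = rmClient.allocate(0.0f);
        for (Container c : response.getAllocatedContainers()) {
          System.out.println("Granted container: " + c.getId());
        }

        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rmClient.stop();
      }
    }
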
According to the size of the input data, multiple input splits are created. The MR job needs to process all of this data, so multiple tasks (map and reduce tasks) are created, and each input split is processed by one map task; for example, a 384 MB input file with a 128 MB split size yields 3 input splits and therefore 3 map tasks. Where each task runs is decided by the Resource Manager, which knows which Node Managers are free and which are busy, much like a college principal who knows which of the class teachers are free. It therefore asks a suitable Node Manager to run that task (a small fraction of the entire job) in a container, i.e., a reserved memory area in which a JVM is launched. The job itself is coordinated by an Application Master, which also runs inside a container.

The application startup process is as follows (a client-side sketch follows the list):

          • A client submits an application to the Resource Manager
          • The Resource Manager allocates a container
          • The Resource Manager contacts the related Node Manager
          • The Node Manager launches the container
          • The Container executes the Application Master
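
Here is a minimal, hedged sketch of the client side of these steps using the YARN YarnClient API. The application name, queue, resource sizes, and the placeholder "/bin/date" command are illustrative assumptions; a real Application Master launch command would be supplied instead.

    import java.util.Collections;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitAppSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Step 1: the client asks the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ApplicationId appId = ctx.getApplicationId();
        System.out.println("New application id: " + appId);

        // Describe the container that will run the Application Master.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("/bin/date")); // placeholder AM command
        ctx.setApplicationName("demo-app");             // hypothetical name
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore (example values)
        ctx.setQueue("default");

        // Steps 2-5: the Resource Manager allocates a container on a Node Manager,
        // the Node Manager launches it, and the Application Master runs inside it.
        yarnClient.submitApplication(ctx);
      }
    }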

MapReduce Job Execution on YARN


Whatever script or command you run in any Hadoop ecosystem tool is internally converted into a Java module/JAR. That JAR (for example, a word count job) is then executed from the CLI (Command Line Interface) to process a particular file/job. At that point the client JVM submits the job to the Resource Manager and receives a new, unique application id for it.
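
For example, a minimal MapReduce driver performs this submission through the Job API. This is only a sketch: the HDFS paths are hypothetical, and a real word count job would also configure its Mapper, Reducer and output key/value classes here.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobSubmitSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "submit-sketch");
        job.setJarByClass(JobSubmitSketch.class);                           // the JAR shipped to YARN
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // submit() hands the job to the Resource Manager, which assigns a unique id
        // (the MapReduce job id corresponds to a YARN application id).
        job.submit();
        System.out.println("Job id: " + job.getJobID());
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }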

Once the client JVM has obtained the new application id from the Resource Manager, it approaches HDFS to copy the resources required for that job and to validate them: is the input file available at the mentioned HDFS path, and is that location correct? And what about the mentioned output file path in HDFS?
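
A hedged sketch of those path checks with the HDFS FileSystem API (the paths themselves are assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PathCheckSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path input = new Path("/user/demo/input");    // hypothetical paths
        Path output = new Path("/user/demo/output");

        // The input path must exist, otherwise there is nothing to split and process.
        if (!fs.exists(input)) {
          throw new IllegalStateException("Input path does not exist: " + input);
        }
        // The output path must not already exist, otherwise the MapReduce job is rejected.
        if (fs.exists(output)) {
          throw new IllegalStateException("Output path already exists: " + output);
        }
        System.out.println("Input and output locations look valid.");
      }
    }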

Once the Resource Manager has the job id and resource details from the client JVM, it approaches a Node Manager (data node) that stores a block of the input file. If the CPU/memory of that data node is already full, it approaches another data node that stores the same data; as we already know, there are 3 replications of the same data. The Resource Manager asks that Node Manager to start the MRAppMaster and create its container. The MRAppMaster then checks whether the program has enough resources to process the data. For example, if the program has 4 tasks (3 map, 1 reduce) but only 3 resources are available, the MRAppMaster approaches the Resource Manager to negotiate the resources from 3 to 4, and the Resource Manager then provides enough resources.

The MRAppMaster then has the Node Manager create the containers; a container allocates the space in which the corresponding task of the job runs. In the meantime, the MRAppMaster checks with HDFS to count the exact number of input splits.
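
The split count can be inspected with the MapReduce InputFormat API, as in this hedged sketch (the input path is an assumption); one map task is launched per input split:

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitCountSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-count");
        FileInputFormat.addInputPath(job, new Path("/user/demo/input")); // hypothetical path

        // TextInputFormat computes splits from the HDFS blocks of the input (by default).
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("Input splits (= map tasks): " + splits.size());
      }
    }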

Then the Node Manager starts the containers; each container has one YarnChild that executes the corresponding map or reduce task. After each reduce task finishes, its YarnChild returns the output to HDFS as a part file, and HDFS stores that output.

Number of tasks = number of containers = number of YarnChild processes

There is only one MRAppMaster per job. A Node Manager can host more than one container, and each container has one YarnChild. The YarnChild executes the map and reduce tasks and stores the output (part file) in HDFS at the output (part) file path mentioned on the CLI.
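
For instance, the part files written by the reduce tasks can be listed from the output directory with the FileSystem API, as in this hedged sketch (the output path is an assumption):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListPartFilesSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Each reduce task writes one part file, e.g. part-r-00000, part-r-00001, ...
        FileStatus[] parts = fs.globStatus(new Path("/user/demo/output/part-*")); // hypothetical path
        if (parts != null) {
          for (FileStatus part : parts) {
            System.out.println(part.getPath() + " (" + part.getLen() + " bytes)");
          }
        }
      }
    }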
