
Hadoop | MapReduce's Engine: JobTracker and TaskTracker

In this article, we are going to learn about MapReduce's engine: the JobTracker and the TaskTracker in Hadoop.
Submitted by Akash Kumar, on October 14, 2018

Introduction

Above the file system sits the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs.

The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible.
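
For context, a client application hands a job to the JobTracker through the job-submission API. The following is a minimal sketch using the classic org.apache.hadoop.mapred (MRv1) API with the library TokenCountMapper and LongSumReducer classes to build a simple word count; the class name WordCountSubmit and the command-line input/output paths are illustrative assumptions.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.LongSumReducer;
    import org.apache.hadoop.mapred.lib.TokenCountMapper;

    public class WordCountSubmit {
        public static void main(String[] args) throws Exception {
            // Job configuration handed to the JobTracker at submission time.
            JobConf conf = new JobConf(WordCountSubmit.class);
            conf.setJobName("word-count");

            // Library mapper/reducer: TokenCountMapper emits (word, 1),
            // LongSumReducer sums the counts for each word.
            conf.setMapperClass(TokenCountMapper.class);
            conf.setReducerClass(LongSumReducer.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(LongWritable.class);

            // Input and output locations in HDFS, taken from the command line.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Submit the job to the JobTracker and block until it finishes.
            JobClient.runJob(conf);
        }
    }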

With a rack-aware file system, the JobTracker knows which node holds the data and which other machines are nearby. If the work cannot be hosted on the node where the data actually resides, priority is given to nodes in the same rack.

This helps to reduce traffic on the main backbone network.

If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a separate Java Virtual Machine (JVM) process for each task, so that the TaskTracker itself does not fail if a running job crashes its JVM.

A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.

Limitations of this approach are as follows:

  1. The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots, and every active map or reduce task takes up one slot. The JobTracker allocates work to the TaskTracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence of its actual availability.
  2. If one TaskTracker is very slow, it can delay the entire MapReduce job, especially towards the end, when everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes (see the sketch after this list).
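
As a small illustrative sketch (assuming the classic MRv1 JobConf API), speculative execution can be switched on or off per job; the setters below correspond to the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution properties.

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculativeExecutionConfig {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // Allow the JobTracker to launch backup attempts of slow-running
            // ("straggler") map and reduce tasks on other TaskTrackers;
            // whichever attempt finishes first is kept, the rest are killed.
            conf.setMapSpeculativeExecution(true);     // mapred.map.tasks.speculative.execution
            conf.setReduceSpeculativeExecution(true);  // mapred.reduce.tasks.speculative.execution
        }
    }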

Scheduling

By default, Hadoop uses FIFO (First In, First Out) scheduling, with five optional scheduling priorities, to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, adding the ability to use an alternate scheduler such as the Fair Scheduler or the Capacity Scheduler.
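
For reference, in classic (MRv1) Hadoop the pluggable scheduler is selected with the mapred.jobtracker.taskScheduler property. The sketch below only illustrates the keys and class names involved; it assumes the chosen scheduler's jar is on the JobTracker's classpath, and in a real cluster the property lives in conf/mapred-site.xml rather than being set in code.

    import org.apache.hadoop.conf.Configuration;

    public class SchedulerSelection {
        public static void main(String[] args) {
            // Set programmatically here only to show the keys involved;
            // normally configured in conf/mapred-site.xml on the JobTracker node.
            Configuration conf = new Configuration();

            // Default scheduler: FIFO (org.apache.hadoop.mapred.JobQueueTaskScheduler).
            // Swap in the Fair Scheduler from the fairscheduler contrib jar...
            conf.set("mapred.jobtracker.taskScheduler",
                     "org.apache.hadoop.mapred.FairScheduler");

            // ...or the Capacity Scheduler instead:
            // conf.set("mapred.jobtracker.taskScheduler",
            //          "org.apache.hadoop.mapred.CapacityTaskScheduler");

            System.out.println(conf.get("mapred.jobtracker.taskScheduler"));
        }
    }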

Let us discuss the Fair Scheduler and the Capacity Scheduler...

1) Fair Scheduler

This scheduler was developed by Facebook. Its basic goal is to provide fast response times for small jobs and quality of service (QoS) for production jobs.

Its basic concepts are as follows:

  1. Jobs are grouped into pools.
  2. Each pool is assigned a guaranteed minimum share.
  3. Excess capacity is split between the jobs.

Pools specify the minimum number of map slots, the minimum number of reduce slots, and a limit on the number of running jobs.
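
As an illustration (assuming the classic Fair Scheduler contrib), a job can be routed into a named pool through the mapred.fairscheduler.pool property. The pool name below is hypothetical, and the pool's minimum slots and running-job limit are declared cluster-side in the scheduler's allocation file rather than in the job itself.

    import org.apache.hadoop.mapred.JobConf;

    public class FairSchedulerPoolConfig {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setJobName("nightly-report");

            // Route this job into a "production" pool (example name only).
            // The pool's guaranteed minimum map/reduce slots and its limit on
            // concurrently running jobs are defined in the allocation file
            // referenced by mapred.fairscheduler.allocation.file.
            conf.set("mapred.fairscheduler.pool", "production");
        }
    }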

2) Capacity Scheduler

This scheduler was designed by Yahoo!. The Capacity Scheduler supports several features that are similar to those of the Fair Scheduler (a usage sketch follows the list):

  1. Queues are allocated a fraction of the total resource capacity.
  2. Free resources are allocated to queues beyond their total capacity.
  3. There is no preemption once a job is running.
  4. Within a queue, the job with the highest priority has access to the queue’s resources.
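
To make the queue model concrete, here is a hedged sketch (again assuming the classic MRv1 JobConf API) of submitting a job to a named Capacity Scheduler queue. The queue name, the capacity property shown in the comment, and the priority value are illustrative assumptions; per-queue capacities are configured on the cluster side, not by the submitting job.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobPriority;

    public class CapacityQueueConfig {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setJobName("ad-hoc-analysis");

            // Submit the job to a "research" queue (example name only). The
            // fraction of cluster capacity guaranteed to this queue is set
            // cluster-side, e.g. in capacity-scheduler.xml via
            // mapred.capacity-scheduler.queue.research.capacity.
            conf.setQueueName("research");  // same as mapred.job.queue.name

            // Within the queue, higher-priority jobs get first access to the
            // queue's free slots.
            conf.setJobPriority(JobPriority.HIGH);
        }
    }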


