Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. What is MapReduce in Hadoop?
2.1. MapReduce Phases
3. What is Hadoop InputFormat?
3.1. Types of InputFormat in MapReduce
3.1.1. FileInputFormat
3.1.2. TextInputFormat
3.1.3. KeyValueTextInputFormat
3.1.4. SequenceFileInputFormat
3.1.5. SequenceFileAsTextInputFormat
3.1.6. NLineInputFormat
3.1.7. DBInputFormat
3.2. Output Format in MapReduce
4. Frequently Asked Questions
4.1. What is the purpose of sorting and shuffling?
4.2. Why does HDFS have fault tolerance?
4.3. What is the distributed cache?
4.4. What is speculative execution in Hadoop?
5. Conclusion
Last Updated: Mar 27, 2024

MapReduce Types

Introduction

MapReduce is a programming model for processing massive data sets. It supports a range of formats for both input and output. Which input and output formats are available depends on the specific MapReduce implementation and on the tools or libraries used in your data processing environment.

This article introduces MapReduce types, specifically how a job reads and writes data in diverse formats, ranging from simple text to structured binary objects.

What is MapReduce in Hadoop?

MapReduce is a core component of Hadoop. It is a massively parallel data processing framework that processes distributed data in a fast, scalable, and fault-tolerant manner. By breaking work into independent sub-tasks, MapReduce can process a massive volume of data in parallel. Hadoop can run MapReduce programs written in languages such as Java, Ruby, Python, and C++.

The MapReduce algorithm divides the processing into two tasks: map and reduce. As the name MapReduce implies, the reduce task runs after the map task has finished. Each task takes key-value pairs as input and produces key-value pairs as output, and the programmer chooses their types. Map transforms one collection of data into another, in which individual items are broken down into key-value pairs. The reduce task then takes the map output as its input and merges it into a smaller collection of key-value pairs.
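
To make the two tasks concrete, here is a minimal plain-Java sketch of a word count. It does not use the Hadoop API. The class name MapReduceSketch and the methods map and reduce are hypothetical names chosen for illustration only.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Illustrative only: plain Java, not the Hadoop Mapper/Reducer classes.
class MapReduceSketch {

    // Map: break one input line down into (word, 1) key-value pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce: merge all values seen for one key into a single (key, total) pair.
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return new SimpleEntry<>(key, sum);
    }

    public static void main(String[] args) {
        System.out.println(map("to be or not to be"));       // [to=1, be=1, or=1, not=1, to=1, be=1]
        System.out.println(reduce("to", Arrays.asList(1, 1))); // to=2
    }
}
```

The map output has six pairs, while the reduce output for a key is a single pair, which is the "smaller collection" the text refers to.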

MapReduce Phases

The MapReduce program is executed in three main phases: the mapping phase, the shuffling and sorting phase, and the reducing phase.

  • Map phase: This is the program's first phase. It consists of two steps: splitting and mapping. For efficiency, the input file is divided into smaller portions known as input splits. Because Mappers only understand (key, value) pairs, the InputFormat (TextInputFormat by default) supplies a RecordReader that converts each input split into key-value pairs.
     
  • Shuffle and sort phase: Shuffle and sort are intermediate steps in MapReduce. The shuffle collects all Mapper output and groups it by key, appending every value emitted for the same key to one list. The shuffle output therefore has the form <key, List<value>>, with the keys sorted.
     
  • Reduce phase: The output of the shuffle and sort phase is the input to the Reducer, which processes the list of values for each key. All values for a given key go to the same Reducer, while different keys may be routed to different Reducers. Each Reducer emits its results, which together form the final output of the MapReduce job and are saved in HDFS.
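
The grouping and routing described in the phases above can be sketched in plain Java. This is an illustration under stated assumptions, not Hadoop's implementation: ShuffleSketch, shuffleAndSort, and partition are hypothetical names, and the hash-based routing is one common way to assign keys to reducers.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: simulates shuffle, sort, and key routing in memory.
class ShuffleSketch {

    // Group map output by key. TreeMap keeps the keys in sorted order.
    static TreeMap<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> mapOutput) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Hash-based routing (an assumption): the same key always maps to the same reducer index.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String word : "to be or not to be".split(" ")) {
            mapOutput.add(new SimpleEntry<>(word, 1));
        }
        System.out.println(shuffleAndSort(mapOutput)); // {be=[1, 1], not=[1], or=[1], to=[1, 1]}
        System.out.println(partition("to", 3) == partition("to", 3)); // true
    }
}
```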

What is Hadoop InputFormat?

Hadoop InputFormat provides the input specification for a MapReduce job. It specifies how input files are split and read, and it is the first step in MapReduce job execution: it generates the input splits and breaks them into records. InputFormat is one of the core classes in MapReduce and provides the following functionality:

  • InputFormat determines which files or other objects to accept as input.
     
  • It defines the input splits. Each split determines the data processed by one Map task and records which hosts hold that data, so the task can be scheduled close to it.
     
  • It supplies the RecordReader, which reads the actual records from the input files.
     

Example

An example job that uses TextInputFormat for input and TextOutputFormat for output is given below:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Text_input_output_example {

  // Input key/value: byte offset and line contents. Output key/value: Text and Text.
  public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      // Put the map logic here. This placeholder emits every line under one fixed key.
      context.write(new Text("OutputKey"), value);
    }
  }

  public static void main(String[] args) throws Exception {
    // Construct a Hadoop configuration.
    Configuration conf = new Configuration();

    Job job = Job.getInstance(conf, "Text_input_output_example");
    job.setJarByClass(Text_input_output_example.class);

    // TextInputFormat should be used as the input format class.
    job.setInputFormatClass(TextInputFormat.class);

    job.setMapperClass(MyMapper.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);

    // TextOutputFormat should be used as the output format class.
    job.setOutputFormatClass(TextOutputFormat.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // Input and output locations are set through the file-based base classes.
    FileInputFormat.addInputPath(job, new Path("input_dir"));
    FileOutputFormat.setOutputPath(job, new Path("output_dir"));

    // Wait for the job to finish before exiting.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

 

The code above uses the TextInputFormat to read input data from plain text files. You can specify your map logic in the MyMapper class, which is the mapper implementation. The result is saved as plain text files because the output format is set to TextOutputFormat.

Types of InputFormat in MapReduce

Hadoop provides several InputFormat implementations for different purposes. Let us now look at them:

FileInputFormat

It is the base class for all file-based InputFormats. FileInputFormat also specifies the input directory, which is where the data files are located. When we start a MapReduce job, FileInputFormat is given a path containing the files to read. It reads all of these files and divides them into one or more InputSplits.
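
As a rough illustration of how a file might be divided into InputSplits, the sketch below assumes the commonly described rule that the split size is the block size bounded by a configurable minimum and maximum. It is a simplification in plain Java: SplitSketch and its methods are hypothetical names, and details such as non-splittable files are ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a simplified model of cutting one file into splits.
class SplitSketch {

    // Assumed rule: block size, clamped between minSize and maxSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs that together cover fileLength bytes.
    static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> result = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            result.add(new long[] {offset, Math.min(splitSize, fileLength - offset)});
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1, Long.MAX_VALUE);
        // A 300 MB file with 128 MB splits yields three splits: 128 MB, 128 MB, 44 MB.
        for (long[] s : splits(300 * mb, splitSize)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```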

TextInputFormat

It is the default InputFormat. Each line of each input file is treated as a separate record by this InputFormat. It does not parse anything. TextInputFormat is suitable for raw data or line-based records, such as log files. Hence:

  • Key: It is the byte offset of the beginning of the line within the file (not within the split). As a result, it is unique when paired with the file name.
     
  • Value: It is the contents of the line, excluding any line terminators.
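
The key and value described above can be illustrated with a small plain-Java sketch that turns file contents into (byte offset, line) records. It assumes "\n" line terminators and UTF-8 text. TextRecordSketch and records are hypothetical names, not Hadoop classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: mimics the (offset, line) records of a line-based input format.
class TextRecordSketch {

    // Maps the byte offset of each line's start to the line's contents (no terminator).
    static Map<Long, String> records(String fileContents) {
        Map<Long, String> out = new LinkedHashMap<>();
        if (fileContents.isEmpty()) {
            return out;
        }
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            out.put(offset, line);
            // Advance past the line's bytes plus the one-byte "\n" terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("hello world\nfoo bar\nbaz\n")); // {0=hello world, 12=foo bar, 20=baz}
    }
}
```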
     

KeyValueTextInputFormat

It is similar to TextInputFormat: each line of input is also treated as a separate record. While TextInputFormat treats the entire line as the value, KeyValueTextInputFormat splits the line into a key and a value at the first tab character ('\t'). Hence:

  • Key: Everything up to, but not including, the tab character.
     
  • Value: It is the remaining part of the line after the tab character.
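
A minimal plain-Java sketch of this splitting rule follows. KeyValueLineSketch and split are hypothetical names. Treating a line without a separator as a key with an empty value is an assumption made here for the sketch.

```java
// Illustrative only: splits one line into key and value at the first separator.
class KeyValueLineSketch {

    // Returns {key, value}. The separator itself belongs to neither part.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            // Assumption: no separator means the whole line is the key.
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        String[] kv = split("line1\tOn the top of the tree", '\t');
        System.out.println("key=" + kv[0] + ", value=" + kv[1]); // key=line1, value=On the top of the tree
    }
}
```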
     

SequenceFileInputFormat

It is an input format for reading sequence files, which are binary files that store sequences of binary key-value pairs. Sequence files can be block-compressed and support direct serialization and deserialization of a variety of data types. Hence, the key and value are both user-defined.

SequenceFileAsTextInputFormat

It is a variant of SequenceFileInputFormat that converts the sequence file's keys and values to Text objects by calling toString() on them. This makes sequence files usable as text-based input for Streaming.

NLineInputFormat

It is a variant of TextInputFormat in which the keys are the byte offsets of the lines and the values are the lines' contents. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input, which depends on the size of the split and the length of the lines. So, if we want each mapper to receive a fixed number of lines of input, we use NLineInputFormat.

N is the number of lines of input received by each mapper.

Each mapper receives exactly one line of input by default (N=1).

Assuming N=2, each split contains two lines. As a result, the first two key-value pairs go to one mapper, and the next two go to another mapper.
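
The N=2 case can be sketched in plain Java as grouping lines into fixed-size chunks, one chunk per mapper. NLineSketch and splitByLines are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: groups input lines into splits of at most n lines each.
class NLineSketch {

    static List<List<String>> splitByLines(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            // The final split may hold fewer than n lines.
            splits.add(new ArrayList<>(lines.subList(i, Math.min(i + n, lines.size()))));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("l1", "l2", "l3", "l4", "l5");
        System.out.println(splitByLines(lines, 2)); // [[l1, l2], [l3, l4], [l5]]
    }
}
```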

DBInputFormat

Using JDBC, this InputFormat reads data from a relational database. It is best suited to loading small datasets, which can then be joined with large datasets from HDFS using multiple inputs. Hence:

  • Key: LongWritables
     
  • Value: DBWritables.

Output Format in MapReduce

The output format classes work in the opposite direction to their corresponding input format classes. TextOutputFormat, for example, is the default output format and writes records as lines of plain text. Keys and values can be of any type, because they are converted to strings by calling toString(). Each key is separated from its value by a tab character, but this can be changed through the mapreduce.output.textoutputformat.separator property.
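
The way a text output line is assembled can be sketched in a few lines of plain Java. TextOutputSketch and formatRecord are hypothetical names, and the sketch leaves out details such as null handling.

```java
// Illustrative only: builds one text output line from a key and a value of any type.
class TextOutputSketch {

    static String formatRecord(Object key, Object value, String separator) {
        // Both sides are converted with toString(), so any type works.
        return key.toString() + separator + value.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatRecord("be", 2, "\t")); // be<TAB>2
        System.out.println(formatRecord("be", 2, ","));  // be,2
    }
}
```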

SequenceFileOutputFormat writes its output as a sequence file, which is a binary format. Binary output is especially valuable when it is used as input to another MapReduce job.

DBOutputFormat handles output to relational databases: it writes the reduce output to a SQL table. For HBase, a separate output format, TableOutputFormat, is used instead.

Frequently Asked Questions

What is the purpose of sorting and shuffling?

Sorting and shuffling produce, for each unique key, the collection of values associated with it. Shuffling is the process of transferring the mappers' intermediate output to the reducers. Sorting orders that output by key, so that all values for the same key are brought together in one place.

Why does HDFS have fault tolerance?

HDFS is fault-tolerant because it replicates data across DataNodes. By default, each block of data is stored on three DataNodes, so if one node fails, the data can still be read from the others.

What is the distributed cache?

The distributed cache is a mechanism that copies files needed by a job to every worker node and caches them on local disk. When a MapReduce program runs, its tasks read this data from the local cached copy instead of fetching it again every time.

What is speculative execution in Hadoop?

If a node executes a task slowly, the master node can redundantly launch another instance of the same task on a different node. The output of whichever instance finishes first is accepted, and the other instance is killed. Speculative execution is therefore advantageous when some nodes in the cluster are running slowly, for example because of a heavy workload.

Conclusion

This article explained MapReduce types in Hadoop, covering the various input and output formats, along with some frequently asked questions on the topic. I hope this article on MapReduce types was beneficial and that you learned something new. For a better understanding of the topic, you can further refer to MapReduce Fundamentals, Hadoop MapReduce, and the Foundational behaviors of MapReduce.
