What is Hadoop InputFormat?
Hadoop InputFormat defines the input specification for a MapReduce job: it specifies how to divide and read input files. InputFormat is the first phase in MapReduce job execution, and it is in charge of generating input splits and dividing them into records. One of the core classes in MapReduce, InputFormat provides the following functionality:
- InputFormat determines which files or other objects to accept as input.
- It defines the data splits, which determine the size of each Map task and the potential servers that can execute it.
- Hadoop InputFormat defines the RecordReader, which is in charge of reading the actual records from the input files.
Example
An example using TextInputFormat and TextOutputFormat is given below:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Text_input_output_example {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Put the map logic here.
            context.write(new Text("OutputKey"), value);
        }
    }

    public static void main(String[] args) throws Exception {
        // Construct a Hadoop configuration.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Text_input_output_example");
        job.setJarByClass(Text_input_output_example.class);

        // Use TextInputFormat as the input format class.
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // Use TextOutputFormat as the output format class.
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        TextInputFormat.addInputPath(job, new Path("input_dir"));
        TextOutputFormat.setOutputPath(job, new Path("output_dir"));

        // Wait for the job to finish before exiting.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The code above uses TextInputFormat to read input data from plain text files. The map logic goes into the MyMapper class, which is the mapper implementation. Because the output format is set to TextOutputFormat, the result is saved as plain text files.
Types of InputFormat in MapReduce
Hadoop provides several types of InputFormat, each suited to a different purpose. Let us now look at these InputFormat types one by one:
FileInputFormat
It serves as the foundation for all file-based InputFormats. FileInputFormat also takes the input directory, which contains the location of the data files. When we start a MapReduce job, FileInputFormat resolves the paths of the files to read. It reads all of these files and then divides them into one or more InputSplits.
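As a brief sketch, FileInputFormat's static helpers register input directories and can bound the split size. The paths and the split-size value below are illustrative, and the fragment assumes a driver main() with a Configuration named conf, like the earlier example:
// Fragment for a driver's main(); assumes a Configuration named conf.
// Import org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
Job job = Job.getInstance(conf, "file_input_example");
// Register one or more input directories; every readable file inside is used.
FileInputFormat.addInputPath(job, new Path("/data/logs/2023"));
FileInputFormat.addInputPath(job, new Path("/data/logs/2024"));
// Cap each InputSplit at 128 MB (the value is illustrative).
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);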
TextInputFormat
It is the default InputFormat. This InputFormat treats each line of each input file as a separate record and performs no parsing. TextInputFormat is suitable for raw data or line-based records, such as log files. Hence (a worked illustration follows the list):
- Key: the byte offset of the beginning of the line within the file (not within the split). Combined with the file name, it is therefore unique.
- Value: the contents of the line, excluding any line terminators.
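For instance, given the two-line input file sketched below, a mapper declared as Mapper<LongWritable, Text, ...> receives these records (the offsets assume an 11-character first line plus a newline):
// Input file:
//   hello world
//   bye
// Records produced by TextInputFormat:
//   key = 0    value = "hello world"
//   key = 12   value = "bye"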
KeyValueTextInputFormat
It is comparable to TextInputFormat, in that it also treats each line of input as a separate record. However, while TextInputFormat treats the entire line as the value, KeyValueTextInputFormat splits the line into key and value at a tab character ('\t'). Hence (a driver sketch follows the list):
- Key: everything on the line up to the first tab character (the tab itself is discarded).
- Value: the remaining part of the line after the tab character.
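A minimal driver fragment follows. The separator property name is the one used by Hadoop 2.x, and the '=' separator is just an example:
// Fragment for a driver's main(); assumes a Configuration named conf.
// Import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.
// KeyValueTextInputFormat splits at the first tab by default; the
// separator can be overridden before the Job is created:
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "=");
Job job = Job.getInstance(conf, "kv_example");
job.setInputFormatClass(KeyValueTextInputFormat.class);
// The mapper then sees Text keys and Text values:
// public static class KVMapper extends Mapper<Text, Text, Text, Text> { ... }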
SequenceFileInputFormat
It is an input format for reading sequence files. Sequence files are binary files that store sequences of binary key-value pairs. They can be block-compressed and support direct serialization and deserialization of a variety of data types. Hence key and value are both user-defined: they must match the types stored in the file.
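As a sketch, assuming a sequence file whose records were written as (IntWritable, Text) pairs:
// Fragment for a driver's main(); import
// org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.
job.setInputFormatClass(SequenceFileInputFormat.class);
// The mapper's input types must mirror the types stored in the file:
// public static class SeqMapper extends Mapper<IntWritable, Text, Text, Text> { ... }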
SequenceFileAsTextInputFormat
It is a subtype of SequenceFileInputFormat. This format converts the sequence file's keys and values to Text objects by calling toString() on them. As a result, SequenceFileAsTextInputFormat makes sequence files suitable input for streaming.
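Reading the same sequence file through SequenceFileAsTextInputFormat only changes the declared types, as in this sketch:
// Fragment for a driver's main(); import
// org.apache.hadoop.mapreduce.lib.input.SequenceFileAsTextInputFormat.
// Keys and values now arrive as Text, produced by calling toString()
// on the stored objects.
job.setInputFormatClass(SequenceFileAsTextInputFormat.class);
// public static class AsTextMapper extends Mapper<Text, Text, Text, Text> { ... }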
NLineInputFormat
It is a variant of TextInputFormat: the keys are still the lines' byte offsets, and the values are the lines' contents. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input, determined by the size of the split and by the length of the lines. If we want our mappers to receive a fixed number of lines of input, we use NLineInputFormat.
Here, N is the number of lines of input received by each mapper.
By default, each mapper receives exactly one line of input (N=1).
If N=2, each split contains two lines. The first two key-value pairs go to one mapper, and the next two key-value pairs go to another mapper.
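A driver fragment for the N=2 case above (NLineInputFormat ships in Hadoop's org.apache.hadoop.mapreduce.lib.input package):
// Fragment for a driver's main(); give each mapper exactly two lines.
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 2);
// Equivalently, via the configuration property:
// conf.setInt("mapreduce.input.lineinputformat.linespermap", 2);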
DBInputFormat
Using JDBC, this InputFormat reads data from a relational database. It is suited to loading small datasets, for example to join them with large datasets from HDFS using multiple inputs. Hence (a configuration sketch follows the list):
- Key: a LongWritable.
- Value: a DBWritable.
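Below is a hedged sketch of the wiring. The database URL, credentials, table, and columns are all hypothetical, and MyRecord stands for a user-defined class implementing Writable and DBWritable that maps the listed columns (not shown here):
// Fragment for a driver's main(); imports from
// org.apache.hadoop.mapreduce.lib.db (DBConfiguration, DBInputFormat).
DBConfiguration.configureDB(conf,
    "com.mysql.jdbc.Driver",                   // JDBC driver class
    "jdbc:mysql://localhost/mydb", "user", "password");
Job job = Job.getInstance(conf, "db_input_example");
job.setInputFormatClass(DBInputFormat.class);
DBInputFormat.setInput(job, MyRecord.class,
    "employees",                               // table name (hypothetical)
    null,                                      // WHERE conditions
    "id",                                      // ORDER BY column
    "id", "name", "salary");                   // columns to read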
Output Format in MapReduce
The output format classes work in the opposite direction to their corresponding input format classes. TextOutputFormat, for example, is the default output format; it writes records as plain text files. The keys and values can be of any type, since they are converted to strings with the toString() method. A tab character separates the key from the value by default, but this can be changed via the separator property of the text output format.
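For example, the default tab can be replaced with a comma by setting the separator property (the property name below is the one used by Hadoop 2.x) before the Job is created:
// Fragment for a driver's main(); assumes a Configuration named conf.
conf.set("mapreduce.output.textoutputformat.separator", ",");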
For binary output, SequenceFileOutputFormat writes sequences of binary key-value pairs to a sequence file. Binary output is especially valuable when it is used as input to another MapReduce job.
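A sketch of the corresponding driver settings (block compression is optional):
// Fragment for a driver's main(); imports from org.apache.hadoop.io
// (SequenceFile) and org.apache.hadoop.mapreduce.lib.output
// (SequenceFileOutputFormat).
job.setOutputFormatClass(SequenceFileOutputFormat.class);
SequenceFileOutputFormat.setCompressOutput(job, true);
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);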
DBOutputFormat handles the output formats for relational databases and HBase. It writes the reduce output to a SQL table.
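A hedged sketch, with a hypothetical "results" table and made-up column names:
// Fragment for a driver's main(); imports from
// org.apache.hadoop.mapreduce.lib.db (DBConfiguration, DBOutputFormat).
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
    "jdbc:mysql://localhost/mydb", "user", "password");
job.setOutputFormatClass(DBOutputFormat.class);
DBOutputFormat.setOutput(job, "results", "word", "count"); // table, then columns
// The job's output key class must implement DBWritable.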
Frequently Asked Questions
What is the purpose of sorting and shuffling?
Sorting and shuffling are responsible for producing, for each unique key, the set of values associated with it. Sorting brings records with the same key together in one place. Shuffling is the process of transferring the mappers' intermediate output to the reducers.
Why does HDFS have fault tolerance?
HDFS is fault-tolerant because it replicates data among DataNodes. By default, a block of data is replicated on three DataNodes, so the blocks are stored on several nodes. If one node fails, the data can still be obtained from the other DataNodes.
What is the distributed cache?
A distributed cache is a mechanism that caches files from disk and makes them available to all worker nodes. When a MapReduce program runs, instead of reading the data from disk every time, it retrieves it from the distributed cache.
What is speculative execution in Hadoop?
If a DataNode executes a task slowly, the master node can redundantly run an identical copy of the task on another node. The output of whichever copy finishes first is accepted, and the other task is killed. Speculative execution is therefore advantageous in environments with a heavy workload.
Conclusion
This article explains the concepts of MapReduce types in Hadoop with various types of input and output formats, along with some frequently asked questions related to the topic. I hope this article on MapReduce types was beneficial and that you learned something new. To get a better understanding of the topic, you can further refer to MapReduce Fundamentals, Hadoop MapReduce, and the Foundational Behaviors of MapReduce.
For more information, refer to our Guided Path on Coding Ninjas Studio to upskill yourself in Python, Data Structures and Algorithms, Competitive Programming, System Design, and many more!
Head over to our practice platform, Coding Ninjas Studio, to practice top problems, attempt mock tests, read interview experiences and interview bundles, follow guided paths for placement preparations, and much more!
Happy Learning Ninja!