Introduction
MapReduce is a framework that helps us write applications to process large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. The MapReduce algorithm consists of two essential tasks: Map and Reduce.
The Map task takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as its input and merges those data tuples into a smaller set of tuples. As the name suggests, the Map job always runs before the Reduce task.
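The flow above can be sketched as a tiny word-count example. This is illustrative Python, not the Hadoop API; the function names are my own:

```python
# Map phase: break each input record into (key, value) tuples.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Reduce phase: merge the tuples into a smaller set, one entry per key.
def reduce_phase(pairs):
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

result = reduce_phase(map_phase(["big data big cluster", "big data"]))
print(result)  # {'big': 3, 'data': 2, 'cluster': 1}
```

Note how the Reduce phase consumes exactly what the Map phase emits, which is why Map must finish (for a given key) before Reduce can merge it.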
Now, let's go through some common MapReduce interview questions with answers. Let's get started!
Basic MapReduce Interview Questions
1. What is Hadoop MapReduce?
MapReduce is the programming model that enables massive scalability across thousands of servers in a Hadoop cluster. It is essentially the processing layer of Hadoop. A MapReduce job has two phases: Map and Reduce. The Mapper processes the input data, which comes from files or directories residing in HDFS, and the Reducer takes the intermediate key/value pairs produced by the Mapper and aggregates them into the final output.
2. What is Mapper in MapReduce?
Mapper is the user-defined function that processes an input split into key/value pairs according to the code design.
3. What is the purpose of Mapper in Hadoop?
Mapper converts the records of an input split into (key, value) pairs. One mapper runs per input split, and by default there is one split per HDFS data block.
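This can be sketched in plain Python (not the Hadoop Mapper API; the splits and mapper below are hypothetical):

```python
# One mapper runs per input split and turns raw records
# into (key, value) pairs.
def mapper(split_text):
    # Emit a (word, 1) pair for every word in the split.
    return [(word, 1) for word in split_text.split()]

splits = ["alpha beta", "beta gamma"]   # two input splits
outputs = [mapper(s) for s in splits]   # two independent mapper runs
print(outputs)  # [[('alpha', 1), ('beta', 1)], [('beta', 1), ('gamma', 1)]]
```

Because each mapper only sees its own split, mappers can run fully in parallel across the cluster.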
4. What is the difference between HDFS block and input split?
An HDFS block is the physical division of the data on disk; it is the minimum unit of data that can be read or written. An InputSplit, by contrast, is a logical division of the data, generated by the InputFormat specified in the MapReduce job configuration.
5. What is Combiner in Hadoop MapReduce?
Combiner, also known as a semi-reducer, is an optional class that merges map output records sharing the same key. The main role of the combiner is to accept output from the Map class, aggregate it locally, and pass the reduced set of (key, value) pairs on to the Reducer class, which cuts down the data transferred over the network.
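An illustrative sketch of the combiner idea (plain Python, not the Hadoop Combiner API):

```python
from collections import defaultdict

# A combiner merges map-output records that share a key on the mapper's
# node, shrinking the data shuffled to the reducers.
def combine(map_output):
    local = defaultdict(int)
    for key, value in map_output:
        local[key] += value          # local, per-mapper aggregation
    return list(local.items())

pairs = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
print(combine(pairs))  # [('a', 3), ('b', 1)] -- 4 records shrink to 2
```

This only works safely when the reduce operation is commutative and associative (like summing), which is why the combiner is optional.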
6. Comparison between MapReduce and Spark.

| Criteria | Spark | MapReduce |
| --- | --- | --- |
| Standalone mode | Can run independently | Requires Hadoop |
| Processing speed | Exceptional (in-memory) | Good (disk-based) |
| Ease of use | APIs for Java, Python, and Scala | Requires extensive Java programming |
| Versatility | Optimized for machine learning and real-time applications | Not optimized for machine learning or real-time applications |
7. What is Shuffling and Sorting in MapReduce?

Shuffling and sorting are two important processes that run between the Mapper and Reducer. Shuffling is the process of transferring data from the Mapper to the Reducer. Between the map and reduce phases, MapReduce automatically sorts the intermediate key/value pairs by key before they are handed to the Reducer, so that all values for the same key arrive together.
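The two steps can be sketched as follows (plain Python, not Hadoop internals; a simple deterministic partitioner stands in for Hadoop's hash partitioner):

```python
# Hypothetical partitioner: decide which reducer receives a given key.
def partition(key, num_reducers):
    return sum(ord(c) for c in key) % num_reducers

def shuffle_and_sort(map_outputs, num_reducers=2):
    # Shuffle: route each (key, value) pair to its target reducer.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in map_outputs:
        partitions[partition(key, num_reducers)].append((key, value))
    # Sort: order each reducer's input by key so equal keys arrive grouped.
    return [sorted(p) for p in partitions]

map_outputs = [("beta", 1), ("alpha", 1), ("alpha", 2)]
print(shuffle_and_sort(map_outputs))
# [[('alpha', 1), ('alpha', 2), ('beta', 1)], []]
```

The key property is that after shuffle and sort, every reducer sees all values for each of its keys as one contiguous, ordered group.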