Table of contents
1. Introduction
2. Getting to the Roots of MapReduce
3. What is the mechanism behind MapReduce?
3.1. Map() Procedure
3.2. Reduce() Procedure
4. Usage
5. Performance
6. MapReduce's Benefits
7. MapReduce's Constraints
8. MapReduce in action
9. Frequently Asked Questions
9.1. What are the main components of MapReduce?
9.2. What is the MapReduce concept?
9.3. What is the difference between Map and Reduce?
10. Conclusion
Last Updated: Mar 27, 2024

MapReduce Fundamentals

Author: Palak Mishra

Introduction

While big data has dominated the news in recent months, large-scale computing problems have existed since the dawn of the computer age. Each time a newer, faster, higher-capacity computer system was introduced, people discovered problems too large for it to handle. With the introduction of local-area networks, the industry began to focus on combining the compute and storage capabilities of networked systems to solve ever larger problems.

Solving big data challenges relies on distributing computing- and data-intensive applications, and new technological approaches were required to achieve reliable distribution at scale. One of these approaches is MapReduce: a programming framework that allows programmers to create programs that process large amounts of unstructured data in parallel across a distributed group of processors.

Getting to the Roots of MapReduce

MapReduce was created as a programming model that could be used in various situations. Some of the first implementations handled parallel execution, fault tolerance, load balancing, and data manipulation. The project's engineers named the initiative MapReduce because it combines two capabilities found in existing functional programming languages: map and reduce.

MapReduce was created by Google engineers to solve a specific practical problem.

As a result, it was created as both a programming model and an implementation of that model, in other words a reference implementation. The reference implementation was created to demonstrate the concept's practicality and effectiveness and to help ensure the model would be widely adopted by the computer industry. Other MapReduce implementations have been developed over time and are available as open-source and commercial products.

What is the mechanism behind MapReduce?


A typical MapReduce framework consists of petabytes of data and thousands of nodes. Here's a quick rundown of the MapReduce procedures, which harness the servers' massive combined resources.

Map() Procedure

In this infrastructure, there is always a master node that accepts the input. As soon as the master node receives the input, it divides it into smaller sub-inputs, or sub-problems, which are assigned to worker nodes. Each worker node processes its sub-problem and performs the necessary analysis. When a worker node has finished working on its sub-problem, it returns the result to the master node.

Reduce() Procedure

All worker nodes return to the master node the answers to the sub-problems they were given. The master node gathers the answers and reassembles them into the solution to the original big problem it was assigned.

The Map() and Reduce() procedures are performed in parallel and independently by this framework. All of the Map() procedures can run in parallel, and once each worker node has completed its task, it sends its result back to the master node, which compiles everything into a single answer. When used on a large amount of data (Big Data), this procedure can be highly effective.
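
To make the two procedures concrete, here is a minimal sketch in plain Python (not Hadoop; the input splits and helper names are invented for illustration). It simulates the master splitting the input, worker processes running Map() in parallel, and a Reduce() step combining the partial results into a single answer:

    # Plain-Python simulation of the master/worker flow described above
    # (illustrative only; not Hadoop's API).
    from collections import defaultdict
    from multiprocessing import Pool

    def map_task(split):
        """Worker-side Map(): emit (word, 1) pairs for one input split."""
        return [(word, 1) for line in split for word in line.split()]

    def reduce_task(item):
        """Worker-side Reduce(): fold all counts that share one key."""
        word, counts = item
        return (word, sum(counts))

    if __name__ == "__main__":
        # The "master" divides the input into sub-problems, one per worker.
        splits = [
            ["the quick brown fox", "the lazy dog"],
            ["the dog barks", "a fox runs"],
        ]
        with Pool() as pool:
            # Map phase: all splits are processed in parallel.
            mapped = pool.map(map_task, splits)
            # Shuffle: group intermediate values by key for the reducers.
            groups = defaultdict(list)
            for pairs in mapped:
                for key, value in pairs:
                    groups[key].append(value)
            # Reduce phase: each key's group is reduced independently.
            print(dict(pool.map(reduce_task, groups.items())))
            # -> {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'dog': 2, ...}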

There are five steps to this Framework:

  • Preparing Map() Input
  • Executing User-Provided Map() Code
  • Shuffle Map Output to Reduce Processor
  • Executing User-Provided Reduce Code
  • Producing the Final Output
     

The MapReduce Framework's Dataflow is shown below:

  • Input Reader
  • Map Function
  • Partition Function
  • Compare Function
  • Reduce Function
  • Output Writer  
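
As an illustrative sketch, each of these six stages can be written as a small single-process Python function (the helper names below are invented stand-ins, not any real framework's API):

    from itertools import groupby

    def input_reader(raw_text):
        # Input Reader: split raw input into records for the mappers.
        return raw_text.splitlines()

    def map_function(record):
        # Map Function: turn one record into (key, value) pairs.
        return [(word, 1) for word in record.split()]

    def partition_function(key, num_reducers):
        # Partition Function: decide which reducer owns this key.
        # (Real frameworks use a stable hash; Python's is per-run.)
        return hash(key) % num_reducers

    def compare_function(pair):
        # Compare Function: the sort key used to bring equal keys together.
        return pair[0]

    def reduce_function(key, values):
        # Reduce Function: fold all values for one key into one result.
        return (key, sum(values))

    def output_writer(results):
        # Output Writer: persist (here, print) the final results.
        for key, value in results:
            print(f"{key}\t{value}")

    # Wire the stages together over two reduce partitions.
    NUM_REDUCERS = 2
    partitions = [[] for _ in range(NUM_REDUCERS)]
    for record in input_reader("to be or not to be\nthat is the question"):
        for key, value in map_function(record):
            partitions[partition_function(key, NUM_REDUCERS)].append((key, value))
    for part in partitions:
        part.sort(key=compare_function)   # group equal keys together
        output_writer(reduce_function(key, [v for _, v in grp])
                      for key, grp in groupby(part, key=compare_function))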

Usage

Several major players in the eCommerce industry now use Hadoop to process large amounts of data (Borthakur, 2009). Amazon, Yahoo, and Zvents use Hadoop in search processing: with it, they can determine search intent from text input and use statistical analysis to improve future searches. Facebook, Yahoo, ContextWeb, Joost, and Last.FM use Hadoop to process logs and mine clickstream data.

Clickstream data records the activity and profitability of a website's end users. Facebook and AOL use Hadoop in their data warehouses to effectively store and mine the massive amounts of data they collect. The New York Times and Eyealike use Hadoop to store and analyze image and video data.

You don't need a significant investment in infrastructure to use MapReduce. Hadoop is also available as a pay-per-use service from several public cloud providers: Amazon offers an Elastic MapReduce service, and MapReduce is available in IBM's Blue Cloud, a joint initiative from Google and IBM.

 

Performance

According to one study, Hadoop is 3.1 to 6.1 times slower than two state-of-the-art parallel database systems at various analytical tasks (A. Pavlo, 2009). Where MapReduce shines is in its support for elastic scalability, i.e., allocating more compute nodes from the cloud to speed up computation. Dean et al. published preliminary performance results for MapReduce running on the Google File System (Google's proprietary distributed file system). Their cluster had about 1,800 machines, each with 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, and 320 GB of disk.

They ran two commands: a distributed grep that searched one terabyte of data for a rare string, and a sort of one terabyte of data. The grep command took 150 seconds to complete, while the sort took 1,000 seconds. RDBMS technologies would need structured data and massive parallel storage to match these results.
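
To give a flavor of the distributed grep job, here is a hedged sketch of its mapper as a Hadoop Streaming script (Hadoop Streaming runs any program that reads records on stdin and writes results to stdout; the script name and pattern below are invented for illustration):

    #!/usr/bin/env python3
    # grep_mapper.py -- sketch of a distributed grep mapper for
    # Hadoop Streaming: reads input records on stdin, emits matching
    # lines on stdout. The search pattern is hard-coded for illustration.
    import sys

    PATTERN = "rare-string"   # hypothetical search target

    for line in sys.stdin:
        if PATTERN in line:
            # Emit the matching line; with zero reducers configured,
            # the map output is the final output.
            sys.stdout.write(line)

Submitted with the hadoop-streaming jar and the number of reduce tasks set to zero, many such mappers scan their input splits in parallel, which is how a one-terabyte search can complete in seconds.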

 

MapReduce's Benefits

MapReduce is thought to have several advantages over DBMS parallelism. In the industry, many of these benefits are still being debated.

  • Scalability: The high scalability of MapReduce is probably its most important feature. Hadoop is said to be able to scale to tens of thousands of nodes (Anand, 2008). This degree of horizontal scalability can be attributed to a combination of distributed file systems and the philosophy of running worker processes close to the data rather than moving the data to the processing.

 

  • Flexibility: Hadoop makes accessing data from various sources and types easier.

 

  • Simple Coding Model: With MapReduce, the programmer doesn't have to worry about parallelism, distributed data passing, or other complexities that would otherwise be encountered. This makes coding much more accessible and reduces the time it takes to create analytical routines.

 

  • Supports Unstructured Data: Unstructured data is data that does not adhere to a predetermined format. If structured data accounts for 20% of all data available to businesses, the remaining 80% is unstructured, so the majority of the data you'll come across is unstructured. Until recently, however, technology didn't allow for much more than manual storage and analysis of it. Because MapReduce processes simple key-value pairs, it can handle any data that fits this model: images, metadata, large files, and other data types all fall into this category. Compared to a DBMS, MapReduce makes it much easier for programmers to work with irregular data.

 

  • Fault Tolerance: MapReduce is very fault-tolerant due to its highly distributed nature. MapReduce jobs typically survive hardware failures thanks to the distributed file systems that MapReduce supports and the controller process (Dean & Ghemawat, 2004).
     As a computer system grows in size, it contains more hardware and software, which means more things can go wrong. On the plus side, it means there's a better chance of automatically overcoming a single point of failure.

 

  • Speed: Thanks to parallel processing and minimal data movement, Hadoop allows massive amounts of data to be processed quickly.

 

MapReduce's Constraints

Rather than classifying certain situations as MapReduce disadvantages, we prefer to say that MapReduce may not be the best solution in some cases; this is true of any programming model. Some scenarios where MapReduce falls short are:

  • Real-time data processing – While the MapReduce model can handle large amounts of stored data, it can't handle streaming data.

 

  • Iterative processing – This model may not be ideal if you need to process your data repeatedly over many iterations.

 

  • Small workloads – It is unnecessary to install multiple servers or perform parallel processing if you can get the same results on a standalone system without even using multiple threads.

MapReduce in action

This is a fundamental MapReduce example. The key principles remain the same regardless of the amount of data you need to analyze.

Assume you have five files, each of which contains two columns (in Hadoop terms, a key and a value) representing a city and the temperature recorded in that city on some measurement day. The key is the city, and the value is the temperature; for example, (Toronto, 20). Using all of the gathered information, you want to find the maximum temperature for each city across the data files (note that each file might contain the same city multiple times).

You can break this down into five map tasks using the MapReduce framework, with each mapper working on one of the five files. Each mapper task examines its file and returns the highest temperature for each city.

For example, the output of one mapper task for the data above might look like this: (Whitby, 25) (Toronto, 20) (New York, 22) (Rome, 33)

Assume the following intermediate results were produced by the other four mapper tasks (working on the other four files not shown here):

(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37) (Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38) (Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31) (Toronto, 31) (Rome, 30)

All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city, producing the following final result set: (Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38).
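
As a sketch, the shuffle and reduce just described can be written in a few lines of plain Python (not Hadoop), using the five mapper outputs above as input:

    # Plain-Python sketch of the reduce step described above, fed with
    # the five mapper outputs listed in the article.
    from collections import defaultdict

    mapper_outputs = [
        [("Whitby", 25), ("Toronto", 20), ("New York", 22), ("Rome", 33)],
        [("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37)],
        [("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38)],
        [("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31)],
        [("Toronto", 31), ("Rome", 30)],
    ]

    # Shuffle: group every candidate maximum by city.
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for city, temp in output:
            grouped[city].append(temp)

    # Reduce: one final maximum per city.
    print({city: max(temps) for city, temps in grouped.items()})
    # -> {'Whitby': 27, 'Toronto': 32, 'New York': 33, 'Rome': 38}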

You can think of map and reduce tasks as being like a Roman census, in which the census bureau would send people to each city in the empire. Each census taker would be assigned the task of counting the population of their city and reporting their findings back to the capital.

To determine the empire's overall population, the results from each city would then be combined into a single count (the sum of all cities). This parallel mapping of people to cities, followed by the combining (reducing) of the results, is far more efficient than sending a single person to count every person in the empire.

Frequently Asked Questions

What are the main components of MapReduce?

The JobTracker and the TaskTracker are the two main components of a MapReduce job. The JobTracker is the master in MapReduce: it creates and runs the job, runs on the NameNode, and assigns TaskTrackers to the job; the TaskTrackers run on the worker nodes and execute the map and reduce tasks assigned to them.

What is the MapReduce concept?

MapReduce is a Hadoop programming model, or pattern, for accessing large amounts of data stored in the Hadoop Distributed File System (HDFS). It is an essential component of the Hadoop framework's operation.

What is the difference between Map and Reduce?

In general, "map" refers to converting a set of inputs into an equal number of outputs, whereas "reduce" refers to the transformation of a group of inputs into a smaller number of outputs.

Conclusion

In this blog, we walked through the fundamentals of how MapReduce works. MapReduce is currently used by legacy applications and by Hadoop-native tools such as Sqoop and Pig. We hope that this blog has helped you enhance your knowledge of the fundamentals of MapReduce and how it can be used to process large amounts of data.

The knowledge never stops; have a look at more related articles: Data Warehouse, MongoDB, AWS, and many more. To learn more, see Operating System, Unix File System, File System Routing, and File Input/Output.

Refer to our Guided Path on Coding Ninjas Studio to upskill yourself in Data Structures and Algorithms, Competitive Programming, JavaScript, System Design, and many more! If you want to test your competency in coding, you may check out the mock test series and participate in the contests hosted on Coding Ninjas Studio! But if you have just started your learning process and are looking for questions asked by tech giants like Amazon, Microsoft, Uber, etc., you must look at the problems, interview experiences, and interview bundle for placement preparations.

Do upvote our blogs if you find them helpful and engaging!

Happy Learning!
