Code360 powered by Coding Ninjas X Naukri.com
Last Updated: Mar 27, 2024
Difficulty: Easy


Introduction

A distributed system, also known as distributed computing or a distributed database, is a collection of independent components located on different machines that exchange messages in order to accomplish common goals.

As a result, the distributed system appears to the end-user as a single interface or computer. The objective is that, by working together, the machines can pool information and resources while tolerating failures, so that even if one machine fails, the service remains available.

In other words, distributed processing means that the analytics are not performed on a single server; you run them in parallel on several machines. A distributed configuration typically consists of a master node that manages the entire process and worker (slave) nodes that perform the actual work. To process data in parallel, the master loads the data, splits it into chunks, and sends the chunks to the workers for processing. Each worker then performs the statistics (calculations) you specified.
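The master/worker pattern described above can be sketched on a single machine with Python's `multiprocessing` module; the function names (`master`, `worker`) are illustrative, not part of any real framework, and in a real cluster the workers would live on separate machines.

```python
from multiprocessing import Pool

def worker(chunk):
    """Each worker performs the requested calculation on its chunk."""
    return sum(chunk)

def master(data, n_workers=4):
    """The master splits the data into chunks, farms them out, and combines results."""
    size = max(1, len(data) // n_workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(n_workers) as pool:
        partial_sums = pool.map(worker, chunks)  # chunks are processed in parallel
    return sum(partial_sums)                     # master combines the partial results

if __name__ == "__main__":
    print(master(list(range(1, 101))))  # 5050
```

The same split/compute/combine shape carries over to real distributed frameworks; only the transport (network messages instead of local processes) changes.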


Need for Distributed Computing

The reason is that there is a limit to how far a single system's configuration can be extended; clustering multiple systems, however, yields far more powerful configurations.


It is the polar opposite of what we refer to as "virtualization." In virtualization, many virtual systems run on a single real system, whereas in distributed computing, many real systems work together as a single virtual system.

  1. In parallel computing, all processors may have access to a shared memory that allows them to communicate with one another. In distributed computing, each processor has its own private memory (distributed memory), and information is exchanged by passing messages between the processors.
  2. Distributed computing allows information to be shared among multiple users or systems. It also allows one machine's application to tap into the processing power, memory, or storage of another.
  3. Although distributed computing may boost the effectiveness of a stand-alone application, this is rarely the reason for distributing an application. Some applications, such as word processing, may not benefit at all from distribution. In many cases, the specific problem demands distribution: it is a natural fit for a company that wants to collect information across multiple locations. In other cases, distribution can improve performance or availability. If software must run on a PC and perform lengthy computations, offloading those computations to faster machines can improve performance.
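The distributed-memory model from point 1 can be sketched with two processes that share nothing and communicate only by messages; this minimal single-machine example uses `multiprocessing.Pipe` as a stand-in for the network channel a real cluster would use.

```python
from multiprocessing import Process, Pipe

def processor(conn):
    """A processor with its own private memory; it only sees what is sent to it."""
    numbers = conn.recv()      # receive a message from the other processor
    conn.send(sum(numbers))    # reply with the result of its local computation
    conn.close()

def exchange():
    parent_end, child_end = Pipe()
    p = Process(target=processor, args=(child_end,))
    p.start()
    parent_end.send([1, 2, 3, 4])  # data travels as a message, not via shared memory
    result = parent_end.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(exchange())  # 10
```

Nothing is shared between the two processes except the messages themselves, which is exactly the constraint distributed systems operate under.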

How Does Distributed Computing Work?

The following are the most important characteristics of distributed computing:

  • Resource sharing - whether hardware, software, or data.
  • Openness - how openly the software is developed and shared with others.
  • Concurrency - the ability of multiple machines to perform the same function at the same time.
  • Scalability - how easily computing and processing capacity can grow or shrink as machines are added or removed.
  • Fault tolerance - how easily and quickly failures in system components can be detected and recovered from.
  • Transparency - the extent to which a node can locate and interact with other nodes in the cluster.

 

Modern distributed systems have evolved to include autonomous processes that may run on the same physical machine but interact with each other by exchanging messages.

Types of Distributed System Architectures

Distributed applications and processes are typically built using one of the four architecture types listed below:

  1. Client-server architecture:
    Originally, distributed systems architecture consisted of a server hosting a shared resource, such as a printer, database, or web server, and multiple clients (for example, computer users) that decided when to use the shared resource, how to use and display the data, and when to change it and send it back to the server. Code repositories such as Git are an example, where the intelligence sits with the developers committing changes to the code.

 

With the advent of web applications, distributed systems architecture has developed into:

  2. Three-tier architecture: In this architecture, clients are no longer required to be smart and can instead rely on a middle layer to handle processing and decision-making. The majority of the first web applications fall into this category. The middle tier could be called an agent: it receives requests from clients, which may be stateless, processes the data, and then forwards it to the servers.
  3. Multi-tier architecture: Corporate web services pioneered n-tier or multi-tier system architectures. These popularized application servers, which contain the business logic and interact with both the data and presentation tiers.
  4. Peer-to-peer architecture: There is no centralized or special machine doing the heavy lifting in this architecture. All decision-making and responsibilities are distributed among the machines involved, each of which can act as either a client or a server. Blockchain is an excellent example of this.
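The client-server pattern above can be sketched with Python's standard `socket` module: one thread plays the server holding the shared resource, and a client connects, sends a request, and receives the server's reply. The helper names (`serve_once`, `client`) are illustrative only, and a real deployment would run the two sides on different machines.

```python
import socket
import threading

def serve_once(host="127.0.0.1"):
    """A minimal server: a shared resource that answers one client request."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))          # port 0: let the OS pick a free port
    srv.listen(1)

    def handle():
        conn, _ = srv.accept()
        request = conn.recv(1024).decode()
        conn.sendall(f"echo: {request}".encode())  # the server decides the reply
        conn.close()
        srv.close()

    threading.Thread(target=handle).start()
    return srv.getsockname()[1]  # the port the server is listening on

def client(port, message):
    """A client that uses the shared resource and gets data back."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall(message.encode())
        return sock.recv(1024).decode()

if __name__ == "__main__":
    port = serve_once()
    print(client(port, "hello"))  # echo: hello
```

In a peer-to-peer system, by contrast, every node would run both the `serve_once` and `client` roles at once.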

What is Big Data?

In a nutshell, Big Data is data that is so large that it cannot be processed using standard tools; to process such data, we need a distributed architecture. This data may be structured or unstructured.

In general, we divide data-handling issues into three categories:

Volume: We call a problem a Volume problem when it concerns storing and managing such massive amounts of data. Facebook, for example, handles more than 500 TB of data per day and has a storage capacity of around 300 PB.

Velocity: When we try to handle a large number of requests per second, we refer to this as Velocity. As the number of requests received by Facebook or Google per second increases, so do the problems.

Variety: If the data we are handling is complex or comes in many different forms, we call it a Variety problem.


Distributed computing and Big Data

Because large amounts of data cannot be stored on a single system, numerous systems with specific memories are used in big data.

Big Data is defined as a massive dataset, or a set of massive datasets, that cannot be processed by traditional systems. Big Data has evolved into a subject of its own, encompassing the study of various tools, techniques, and frameworks rather than just data. MapReduce is a framework for developing applications that process large amounts of data on large clusters of commodity hardware.

Distributed computing with MapReduce

The MapReduce framework, along with the Hadoop file system HDFS, has been a key component of the Hadoop ecosystem since its beginnings.

Google used MapReduce to evaluate the stored HTML content of websites by counting all HTML tags, words, and word combinations (for instance, headlines). The results were used to compute the Google Search page ranking. Once that became known, everyone started optimizing their websites for Google searches, which was the birth of serious search engine optimization. That was back in 2004.

MapReduce processes data in two phases: the map phase and the reduce phase.

During the map phase, the framework reads data from HDFS; each dataset is called an input record. Then comes the reduce phase, in which the actual computation is completed and the results are saved. The storage target can be a database, HDFS, or something else. The magic of MapReduce lies in how the map and reduce phases are implemented and how they interact with one another.

The map and reduce phases run concurrently: many map tasks (mappers) and reduce tasks (reducers) can run simultaneously across the cluster's machines.
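The two phases can be illustrated with the classic word-count example in plain Python. This is a single-process sketch of the MapReduce idea, not the Hadoop API: the `shuffle` step stands in for the grouping the framework performs between mappers and reducers.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) record for every word in the input record."""
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    """Group intermediate records by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key into the final result."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big cluster", "big data"]
mapped = [record for doc in docs for record in map_phase(doc)]
print(reduce_phase(shuffle(mapped)))  # {'big': 3, 'data': 2, 'cluster': 1}
```

In a real cluster, each document would be mapped on a different machine and each key's values would be routed to a reducer, but the map/shuffle/reduce logic is the same.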

Why MapReduce?

Traditional data storage and retrieval systems rely on a centralized server, and standard database servers cannot handle such massive amounts of data. Furthermore, a centralized system becomes a bottleneck when processing multiple files at the same time.

Such bottlenecks are addressed by MapReduce. MapReduce divides the task into small chunks and processes each one separately by assigning them to different systems. After all of the parts have been processed and analyzed, the result from each worker is collected in a single location, and an output dataset for the specific problem is prepared.

Big data demand meets solutions

In the late 1990s, search engine and Internet companies such as Google, Yahoo!, and Amazon.com were ready to broaden their business models by leveraging low-cost computing and storage hardware. These businesses required a new generation of software technologies to help them monetize the massive amounts of data they were collecting from customers, and they needed to process and analyze this data in near real time. And so, time and again, demand met solutions.

Frequently Asked Questions

How are Hadoop and Big Data related?

When we talk about Big Data, we also talk about Hadoop. Hadoop is an open-source framework for storing, processing, and managing clusters of unstructured data to gain insights and knowledge. This is how Hadoop and Big Data are linked to one another.

Mention the core methods of Reducer.

A Reducer's primary methods are as follows: 
1. setup(): configures various parameters for the reducer before processing begins.
2. reduce(): the heart of the reducer; it defines the task to be performed for each distinct key and its associated set of values.
3. cleanup(): cleans up or deletes any temporary files or data left over once the reduce() task is complete.
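Hadoop's actual Reducer is a Java class, but its setup/reduce/cleanup lifecycle can be mimicked in a short Python sketch. The class and method bodies here are illustrative assumptions, showing only the order in which the framework invokes the three methods.

```python
class WordCountReducer:
    """A Python mimic of the Hadoop Reducer lifecycle (setup/reduce/cleanup)."""

    def setup(self):
        # Called once before any reduce() call: configure state for the task.
        self.results = {}

    def reduce(self, key, values):
        # Called once per distinct key, with all values sharing that key.
        self.results[key] = sum(values)

    def cleanup(self):
        # Called once after all keys: release temporary state, emit final output.
        final = self.results
        self.results = None
        return final

reducer = WordCountReducer()
reducer.setup()
for key, values in [("big", [1, 1, 1]), ("data", [1, 1])]:
    reducer.reduce(key, values)
print(reducer.cleanup())  # {'big': 3, 'data': 2}
```

The framework, not the user, drives these calls: setup() once, reduce() per key group, cleanup() once at the end.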

Explain the core components of Hadoop.

Hadoop's Core Components:
1. HDFS (Hadoop Distributed File System) – HDFS is Hadoop's primary storage system, used to store large amounts of data. It is designed for storing large datasets on commodity hardware.
2. Hadoop MapReduce – MapReduce is the Hadoop layer in charge of data processing. It processes structured and unstructured data already stored in HDFS and can handle a large volume of data in parallel by dividing it into separate tasks. Processing is divided into two stages: Map, in which data blocks are read and made available to executors (computers/nodes/containers) for processing, and Reduce, in which all processed data is gathered and combined.
3. YARN – YARN is the framework used to manage resources in Hadoop. It schedules resources and supports multiple data processing engines, such as real-time streaming, data science, and batch processing.

Conclusion 

To sum up this blog, we talked about distributed computing, its importance, how it works, and its various types. We also talked about Big Data and how it relates to distributed computing. We then learned how MapReduce is used in distributed computing and why it is needed. Finally, we discussed how demand meets solutions.

Refer to our guided paths on Coding Ninjas Studio to upskill yourself in Data Structures and Algorithms, Competitive Programming, JavaScript, System Design, and many more! If you want to test your competency in coding, you may check out the mock test series and participate in the contests hosted on Coding Ninjas Studio! But if you have just started your learning process and are looking for questions asked by tech giants like Amazon, Microsoft, Uber, etc., you must have a look at the problems, interview experiences, and interview bundle for placement preparations.

Nevertheless, you may consider our paid courses to give your career an edge over others!

Do upvote our blogs if you find them helpful and engaging!

Happy Learning!
