Code360 powered by Coding Ninjas X Naukri.com. Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1.
Introduction
2.
What is Parallelizing?
3.
Identifying Parallelism Possibilities in Pandas
3.1.
Applying functions
3.2.
Groupby operations
3.3.
Computation-intensive transformations
4.
Implementing Parallelization
4.1.
1. Using multiprocessing
4.2.
Python
4.3.
2. Using dask.dataframe with Custom Functions
4.4.
Python
4.5.
3. Using swifter
4.6.
Python
4.7.
4. Using concurrent.futures
4.8.
Python
5.
Frequently Asked Questions
5.1.
What is parallelization in Pandas? 
5.2.
How to parallelization to dask.dataframe? 
5.3.
Does Concurrent.futures combined with Pandas groupby? 
6.
Conclusion
Last Updated: Mar 27, 2024
Easy

Parallelizing Your Pandas Workflow

Author Dhruv Rawat
0 upvote

Introduction

Pandas is a popular Python library used for data manipulation and analysis. It has various powerful functions to clean, transform, and analyse data efficiently. However, as datasets grow bigger, the processing time to perform operations also increases. 
 

Parallelizing Your Pandas Workflow


To overcome this situation, parallelizing the Pandas workflow can be a saviour because it enhances performance and helps in reducing execution time.

In this article, we will explore the concept of parallelization and how to apply it to Pandas. So let us start.

What is Parallelizing?

Parallelizing is a simple technique that is used to speed up the computations of large data sets. Parallelising achieves this speed of execution because it divides large computations into smaller chunks that can be run simultaneously. It utilises the power of multiple cores of the CPU. The multiprocessing module is commonly used in Python to achieve parallelism.

There are two types of processing:

  1. Serial processing - performs operations one by one, waiting for one operation to complete before starting the next one
     
  2. Parallel processing - performs multiple operations concurrently, using multiple cores of the CPU

Identifying Parallelism Possibilities in Pandas

We need to know that not all operations performed in Pandas are meant or suitable for parallelization because some operations inherently involve dependencies or sequential computations that are not easy to parallelize. 

However, there are some situations where parallelism can be effectively used, such as:

Applying functions

The apply() function allows applying a custom function to each row or column of a DataFrame. This is a computationally expensive operation, so with the help of parallelizing, we can significantly speed up the process. 

The swifter library helps in the automation of the process of applying functions in a parallelized manner.

Groupby operations

The groupby() operation allows to group DataFrame rows together based on a common value. This is a computationally expensive operation, so parallelizing it is beneficial. 

The dask library provides a parallelized version of the groupby() operation. It allows us to work with larger-than-memory datasets by use of parallel processing.

Computation-intensive transformations

Let's suppose we have a DataFrame with complex calculations. So we split into independent tasks by using parallelism, then we can significantly improve the rate of computation because now we can utilise the CPU cores effectively. 

The concurrent.futures module helps in achieving this in Python.

Implementing Parallelization

Let's dive into the practical implementation of parallelization in Pandas.

1. Using multiprocessing

The multiprocessing module, which is an open-source module, provides a simple way to parallelize operations. It uses the pool object, which helps in distributing the input data and mapping it to the target function for Processing. 

Here is an example of applying a function in parallel using Pool:
 

  • Python

Python

import pandas as pd
from multiprocessing import Pool

# Function to process a row of data
def process_row(row):

# Perform some computation on the row
# computation
processed_row = row[1] # Extract the data from the row
processed_row['Discounted_Amount'] = processed_row['Transaction_Amount']*0.9

# Example computation
return processed_row

# Entry point of the script
if __name__ == '__main__':

# Read the large dataset from a CSV file
data = pd.read_csv('large_dataset.csv')

num_cores = 4 # Number of CPU cores to utilize

# Create a Pool of worker processes
with Pool(num_cores) as pool:

# Apply the process_row function to each row using parallel processing
processed_data = pool.map(process_row, data.iterrows())

# Convert the list of processed data back to a DataFrame
result_df = pd.DataFrame(processed_data)
You can also try this code with Online Python Compiler
Run Code


In the above code, the Discounted_Amount column is computed by applying a 10% discount to the Transaction_Amount column. 

2. Using dask.dataframe with Custom Functions

Dask is an open-source library that has built-in parallelism, and in addition to that, we can also use custom functions with dask.dataframe.map_partitions() to perform operations in parallel across partitions of the dataset.

Dask functions use a task graph which is designed so that the values that are dependent on previous values are computed at various levels.

The code below will read the language.csv file into a Dask DataFrame called df. After that, print the first few rows from the DataFrame.

Finally, a new DataFrame called df2 containing the names of all the customers with an ID of 1 is created and computation will happen in the DataFrame.
 

  • Python

Python

import dask.dataframe as pd

#reading
df = pd.read_csv('language.csv')
df.head()

print(" ---+---+----")

# computation and querying
df2 = df[df.id == 1].name
df2.compute()
You can also try this code with Online Python Compiler
Run Code

 

Output:

output of dask.dataframe


The output of the code displays the names of the language with an ID = 1.

3. Using swifter

The swifter library simplifies the process of applying functions using parallelization. It automatically determines the optimal method of parallelization for your system.
 

  • Python

Python

import pandas as pd
import swifter

# Sample function to apply
def process_row(row):
# Perform some computation on the row
return row

data = pd.read_csv('large_dataset.csv')

# Apply the function in parallel using swifter
result_df = data.swifter.apply(process_row, axis=1)
You can also try this code with Online Python Compiler
Run Code

4. Using concurrent.futures

The concurrent.futures module provides a high-level interface for asynchronously executing callables.

It can be used to run multiple tasks simultaneously, which can be useful for speeding up computation.
 

The concurrent.futures module has two main classes which are:
 

  • ThreadPoolExecutor: This class executes tasks using a pool of threads
     
  • ProcessPoolExecutor: This class executes tasks using a pool of processes
     

To use the concurrent.futures module, first, we need to create an executor object. After this task is assigned to the executor, which then executes the tasks asynchronously. 

Look at the below code example to understand it better:
 

  • Python

Python

import time
import concurrent.futures

def factorial(n):
"""Calculates the factorial of a number."""
if n == 0:
return 1
else:
return n * factorial(n - 1)

def main():
start_time = time.time()

executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

futures = []
for i in range(5):
futures.append(executor.submit(factorial, i))

results = []
for future in futures:
results.append(future.result())

end_time = time.time()

print('The results are:', results)
print('The total time is:', end_time - start_time)

if __name__ == '__main__':
main()
You can also try this code with Online Python Compiler
Run Code

 

Output:

output of concurrent.futures


The above code parallelize the calculation of the factorials of the numbers 0 to 4. It creates   a thread pool with 4 workers and submits the tasks to the thread pool. 

The results of the tasks are then collected and then displayed as output.

Frequently Asked Questions

What is parallelization in Pandas? 

Parallelization in Pandas is a process of executing tasks concurrently with the use of multiple CPU cores in order to improve the efficiency and speed of data processing.

How to parallelization to dask.dataframe? 

We can use the dask.dataframe.map_partitions() with custom functions to perform operations in parallel across dataset partitions. This helps in improving efficiency.

Does Concurrent.futures combined with Pandas groupby? 

Yes, the concurrent.futures module can be combined with the Pandas groupby() function. This is very useful for speeding up the computation of groupby operations when dealing with large datasets.

Conclusion

Congratulations, you did a fantastic job!!. This article has covered the concept of parallelization and how to apply it to Pandas. It has also discussed the various ways to implement parallelzation. At last, some frequently asked questions are discussed.

Here are some more related articles:
 

Check out The Interview Guide for Product Based Companies and some famous Interview Problems from Top Companies, like AmazonAdobeGoogle, etc., on Coding Ninjas Studio.

Also, check out some of the Guided Paths on topics such as Data Structure and AlgorithmsCompetitive ProgrammingOperating SystemsComputer Networks, DBMSSystem Design, etc., as well as some Contests, Test SeriesInterview Bundles, and some Interview Experiences curated by top Industry Experts only on Coding Ninjas Studio.

We hope you liked this article.
 

"Have fun coding!”

Live masterclass