Pandas is a popular Python library used for data manipulation and analysis. It has various powerful functions to clean, transform, and analyse data efficiently. However, as datasets grow bigger, the processing time to perform operations also increases.
To overcome this situation, parallelizing the Pandas workflow can be a saviour because it enhances performance and helps in reducing execution time.
In this article, we will explore the concept of parallelization and how to apply it to Pandas. So let us start.
What is Parallelizing?
Parallelizing is a simple technique that is used to speed up the computations of large data sets. Parallelising achieves this speed of execution because it divides large computations into smaller chunks that can be run simultaneously. It utilises the power of multiple cores of the CPU. The multiprocessing module is commonly used in Python to achieve parallelism.
There are two types of processing:
Serial processing - performs operations one by one, waiting for one operation to complete before starting the next one
Parallel processing - performs multiple operations concurrently, using multiple cores of the CPU
Identifying Parallelism Possibilities in Pandas
We need to know that not all operations performed in Pandas are meant or suitable for parallelization because some operations inherently involve dependencies or sequential computations that are not easy to parallelize.
However, there are some situations where parallelism can be effectively used, such as:
Applying functions
The apply() function allows applying a custom function to each row or column of a DataFrame. This is a computationally expensive operation, so with the help of parallelizing, we can significantly speed up the process.
The swifter library helps in the automation of the process of applying functions in a parallelized manner.
Groupby operations
The groupby() operation allows to group DataFrame rows together based on a common value. This is a computationally expensive operation, so parallelizing it is beneficial.
The dask library provides a parallelized version of the groupby() operation. It allows us to work with larger-than-memory datasets by use of parallel processing.
Computation-intensive transformations
Let's suppose we have a DataFrame with complex calculations. So we split into independent tasks by using parallelism, then we can significantly improve the rate of computation because now we can utilise the CPU cores effectively.
The concurrent.futures module helps in achieving this in Python.
Implementing Parallelization
Let's dive into the practical implementation of parallelization in Pandas.
1. Using multiprocessing
The multiprocessing module, which is an open-source module, provides a simple way to parallelize operations. It uses the pool object, which helps in distributing the input data and mapping it to the target function for Processing.
Here is an example of applying a function in parallel using Pool:
Python
Python
import pandas as pd from multiprocessing import Pool
# Function to process a row of data def process_row(row):
# Perform some computation on the row # computation processed_row = row[1] # Extract the data from the row processed_row['Discounted_Amount'] = processed_row['Transaction_Amount']*0.9
# Example computation return processed_row
# Entry point of the script if __name__ == '__main__':
# Read the large dataset from a CSV file data = pd.read_csv('large_dataset.csv')
num_cores = 4 # Number of CPU cores to utilize
# Create a Pool of worker processes with Pool(num_cores) as pool:
# Apply the process_row function to each row using parallel processing processed_data = pool.map(process_row, data.iterrows())
# Convert the list of processed data back to a DataFrame result_df = pd.DataFrame(processed_data)
You can also try this code with Online Python Compiler
In the above code, the Discounted_Amount column is computed by applying a 10% discount to the Transaction_Amount column.
2. Using dask.dataframe with Custom Functions
Dask is an open-source library that has built-in parallelism, and in addition to that, we can also use custom functions with dask.dataframe.map_partitions() to perform operations in parallel across partitions of the dataset.
Dask functions use a task graph which is designed so that the values that are dependent on previous values are computed at various levels.
The code below will read the language.csv file into a Dask DataFrame called df. After that, print the first few rows from the DataFrame.
Finally, a new DataFrame called df2 containing the names of all the customers with an ID of 1 is created and computation will happen in the DataFrame.
The output of the code displays the names of the language with an ID = 1.
3. Using swifter
The swifter library simplifies the process of applying functions using parallelization. It automatically determines the optimal method of parallelization for your system.
Python
Python
import pandas as pd import swifter
# Sample function to apply def process_row(row): # Perform some computation on the row return row
data = pd.read_csv('large_dataset.csv')
# Apply the function in parallel using swifter result_df = data.swifter.apply(process_row, axis=1)
You can also try this code with Online Python Compiler
The concurrent.futures module provides a high-level interface for asynchronously executing callables.
It can be used to run multiple tasks simultaneously, which can be useful for speeding up computation.
The concurrent.futures module has two main classes which are:
ThreadPoolExecutor: This class executes tasks using a pool of threads
ProcessPoolExecutor: This class executes tasks using a pool of processes
To use the concurrent.futures module, first, we need to create an executor object. After this task is assigned to the executor, which then executes the tasks asynchronously.
Look at the below code example to understand it better:
Python
Python
import time import concurrent.futures
def factorial(n): """Calculates the factorial of a number.""" if n == 0: return 1 else: return n * factorial(n - 1)
The above code parallelize the calculation of the factorials of the numbers 0 to 4. It creates a thread pool with 4 workers and submits the tasks to the thread pool.
The results of the tasks are then collected and then displayed as output.
Frequently Asked Questions
What is parallelization in Pandas?
Parallelization in Pandas is a process of executing tasks concurrently with the use of multiple CPU cores in order to improve the efficiency and speed of data processing.
How to parallelization to dask.dataframe?
We can use the dask.dataframe.map_partitions() with custom functions to perform operations in parallel across dataset partitions. This helps in improving efficiency.
Does Concurrent.futures combined with Pandas groupby?
Yes, the concurrent.futures module can be combined with the Pandas groupby() function. This is very useful for speeding up the computation of groupby operations when dealing with large datasets.
Conclusion
Congratulations, you did a fantastic job!!. This article has covered the concept of parallelization and how to apply it to Pandas. It has also discussed the various ways to implement parallelzation. At last, some frequently asked questions are discussed.