Table of contents
1. Introduction
2. What is a groupby function in pandas?
2.1. Syntax
3. Parameters of groupby
3.1. Return Value
4. Example of groupby
4.1. Pandas groupby() on Two or More Columns
5. Groupby Operations
5.1. Aggregation
5.2. Transformation
5.3. Filtering
6. Aggregation with Pandas groupby
7. Pandas groupby for Data Cleaning
7.1. Detecting Duplicates
7.2. Removing Duplicates
8. Frequently Asked Questions
8.1. What does groupby do in Pandas?
8.2. How to use groupby in DataFrame Pandas?
8.3. What is the groupby condition in Pandas?
8.4. How do I aggregate multiple functions in Pandas groupby?
8.5. Can I apply different functions to different columns in Pandas groupby?
8.6. Can I group multiple columns in Pandas groupby?
8.7. What kind of functions can I apply to a Pandas groupby object?
9. Conclusion
Last Updated: Mar 27, 2024

Pandas Dataframe .groupby Method

Author Abhay Rathi

Introduction

Hey Ninjas! One of the most commonly used functions in Pandas is 'groupby'. It groups data based on one or more columns so that aggregation functions can be applied to each group. It is widely used across industries, from finance to healthcare. This article dives into the details of the Pandas 'groupby' function.


What is a groupby function in pandas?

Pandas is a popular data manipulation library in Python. It provides various functions to manipulate and analyze data. The 'groupby' function in Pandas splits the rows of a DataFrame into groups based on the values in one or more columns.

The 'groupby' function is commonly used in data analysis to gain insights into the relationships between variables.

The groupby() function works in the following manner:

1. You specify the columns you want to group your data on. 

2. Then Pandas will group your data accordingly. 

3. Various aggregation functions can then be applied to the grouped data, such as calculating each group's sum, average, maximum, minimum, or count.
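The three steps above can be sketched as follows. The sales DataFrame and its column names (city, amount) are made up purely for illustration.

```python
import pandas as pd

# Hypothetical sample data, not taken from the article
sales = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Pune'],
    'amount': [100, 200, 150, 50, 300],
})

# Step 1: specify the column to group on
grouped = sales.groupby('city')

# Step 2: pandas has split the rows into one group per city
group_sizes = grouped.size()

# Step 3: apply aggregation functions to each group
totals = grouped['amount'].sum()
averages = grouped['amount'].mean()

print(group_sizes)
print(totals)
print(averages)
```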

Syntax

The syntax for 'groupby()' is as follows:

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)

 


Parameters of groupby

Now we will discuss each parameter one by one:

Parameter	Description
by	The column label(s), mapping, or function used to determine the groups.
axis	The axis to group along (0 for rows, 1 for columns).
level	If the axis is a MultiIndex, the level or levels to group by.
as_index	Whether to use the group keys as the index of the aggregated result.
sort	Whether to sort the resulting groups by the group keys.
group_keys	When calling apply(), whether to add the group keys to the index of the result to identify each piece.
squeeze	Whether to reduce the dimensionality of the return type where possible (for example, a Series instead of a one-column DataFrame). This parameter was removed in pandas 2.0.
observed	Applies only when grouping by categorical columns. If True, only observed category values are shown; if False, all category values are shown.
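To make a couple of these parameters concrete, here is a minimal sketch of how as_index and sort change the result. The DataFrame and its column names (team, points) are assumptions made up for this illustration.

```python
import pandas as pd

# Hypothetical sample data; 'b' deliberately appears before 'a'
df = pd.DataFrame({'team': ['b', 'a', 'b', 'a'], 'points': [1, 2, 3, 4]})

# Defaults: the group keys become the index and are sorted
default_result = df.groupby('team').sum()

# as_index=False keeps 'team' as a regular column
flat_result = df.groupby('team', as_index=False).sum()

# sort=False keeps the groups in order of first appearance
unsorted_result = df.groupby('team', sort=False).sum()

print(default_result)
print(flat_result)
print(unsorted_result)
```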

 

Return Value

The groupby() function returns a GroupBy object (a DataFrameGroupBy when called on a DataFrame). This object doesn't display the grouped data itself; it provides a way to apply aggregation functions or other operations to each group. You typically follow it with an aggregation function or operation to obtain meaningful results.
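As a rough illustration of the returned object, the sketch below inspects it before any aggregation is applied. The small DataFrame is made up for the example.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'key': ['A', 'B', 'A'], 'value': [1, 2, 3]})

grouped = df.groupby('key')

# The grouped object is not a DataFrame; it only describes the groups
type_name = type(grouped).__name__
print(type_name)
print(grouped.ngroups)  # number of distinct groups

# Iterating yields one (key, sub-DataFrame) pair per group
for key, group in grouped:
    print(key)
    print(group)

# An aggregation turns the grouped object into a regular DataFrame
result = grouped.sum()
print(result)
```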


Example of groupby

Now let's look at an example of using ‘groupby’ in pandas. Suppose we have a DataFrame containing students' data, with columns for date, name and marks.

Code

import pandas as pd

students = pd.DataFrame({
    'date': ['2002-11-01', '2002-11-02', '2002-11-01', '2002-11-04', '2002-11-01', '2002-11-06'],
    'name': ['Abhay', 'Abhay', 'Ninja', 'Coder', 'Ninja', 'Raj'],
    'marks': [78, 98, 78, 91, 88, 54]
})


To group this DataFrame by date and name, and then calculate the total marks for each group, we can use the ‘groupby’ function as follows:

grouped_df = students.groupby(['date', 'name'], as_index=False).sum()

print(grouped_df)


Output

	date		name	marks
0	2002-11-01	Abhay	78
1	2002-11-01	Ninja	166
2	2002-11-02	Abhay	98
3	2002-11-04	Coder	91
4	2002-11-06	Raj	54

 

In this example, we used 'groupby' to group the DataFrame by two columns (date and name) and then used sum() to calculate the total marks for each date and name combination. We also specified as_index=False to prevent Pandas from using the group keys as the index of the resulting DataFrame.

Pandas groupby() on Two or More Columns

You can group data on two or more columns by passing a list of column names to groupby(). Each distinct combination of values in those columns forms one group.

Syntax

The syntax for grouping by two or more columns is:

DataFrame.groupby(['column1', 'column2', 'column3', .....])
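A minimal sketch of grouping on two columns with the default as_index=True, in which case the result is indexed by both keys. The orders DataFrame and its columns (region, product, units) are hypothetical.

```python
import pandas as pd

# Hypothetical sample data
orders = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'North'],
    'product': ['pen', 'book', 'pen', 'pen', 'pen'],
    'units': [10, 5, 7, 3, 2],
})

# One group per distinct (region, product) combination
units = orders.groupby(['region', 'product'])['units'].sum()
print(units)

# The result has a two-level index, so a group is looked up with a tuple
north_pen = units.loc[('North', 'pen')]
print(north_pen)
```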

Groupby Operations

Groupby operations fall into three broad kinds: aggregation, transformation, and filtering. Each is applied within the groups created by the groupby function:

Aggregation

Aggregation computes a summary statistic (e.g., sum, min, max, median) for each group.

Syntax

df.groupby('Category')['Value'].sum()
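To try the aggregation line above, a runnable sketch could look like this. The Category and Value column names come from the syntax line; the data values are made up.

```python
import pandas as pd

# Hypothetical sample data for the Category/Value columns
df = pd.DataFrame({
    'Category': ['X', 'Y', 'X', 'Y', 'Z'],
    'Value': [10, 20, 30, 40, 5],
})

# One summed Value per Category
totals = df.groupby('Category')['Value'].sum()
print(totals)
```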

 

Transformation

The transform function is applied to each group, and the results are broadcast back to the shape of the original DataFrame, so the output has one value per original row.

Syntax

df['Mean_Value'] = df.groupby('Category')['Value'].transform('mean')
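A small sketch of the transform line above, again with made-up data. Note that the DataFrame keeps all of its rows; each row simply receives the mean of its own group.

```python
import pandas as pd

# Hypothetical sample data for the Category/Value columns
df = pd.DataFrame({
    'Category': ['X', 'Y', 'X', 'Y', 'Z'],
    'Value': [10, 20, 30, 40, 5],
})

# Each row gets the mean Value of the Category it belongs to
df['Mean_Value'] = df.groupby('Category')['Value'].transform('mean')
print(df)
```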

 

Filtering

Filtering keeps or drops entire groups based on a condition. In the syntax below, only rows belonging to groups whose total Value exceeds 50 are kept.

Syntax

df.groupby('Category').filter(lambda x: x['Value'].sum() > 50)
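Here is a runnable sketch of the filter line above with made-up data. Only the rows of groups that pass the condition survive, and they keep their original index labels.

```python
import pandas as pd

# Hypothetical sample data for the Category/Value columns
df = pd.DataFrame({
    'Category': ['X', 'Y', 'X', 'Y', 'Z'],
    'Value': [10, 20, 30, 40, 5],
})

# Group totals are X=40, Y=60, Z=5, so only the Y rows are kept
big_groups = df.groupby('Category').filter(lambda x: x['Value'].sum() > 50)
print(big_groups)
```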

Aggregation with Pandas groupby

Aggregation is the process of summarizing the grouped data. It is done by calculating the summary for each group. The summary could be sum, mean, median etc. Pandas 'groupby' function makes it easy to perform aggregation on a DataFrame. 

Here's an example of how we can perform aggregation with 'groupby()':

Suppose we have a DataFrame containing data for a company. It has only two columns, key and data.

Code

import pandas as pd

datas = pd.DataFrame({
    'key': ['A', 'B', 'A', 'C', 'D'],
    'data': [19, 20, 21, 22, 23]
})


To compute the sum for each key, we can group the DataFrame by 'key' using the 'groupby' function. Then we use the sum function to calculate the sum of 'data' within each group.

[Figure: visualization of the aggregation of data]

grouped_df = datas.groupby('key').sum()

print(grouped_df)


After grouping by key and summing data, we get:

Output

key		data
A		40
B		20
C		22
D		23

 

In the above code, we used the ‘groupby’ function to group the DataFrame by the ‘key’ column. Then we used the ‘sum’ to calculate the sum for each group.

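Note: In place of sum(), you can also use .mean() or .median() to find the mean or median respectively. GroupBy objects have no .mode() method, so the mode has to be computed through .agg() or .apply() with a function such as pd.Series.mode.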
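As a sketch of swapping in other statistics, the code below reuses the datas DataFrame from this section. The mean and median are available directly on the grouped object; the mode is computed here through agg() with a small lambda, which is one possible approach. When a group has several equally frequent values, this sketch simply takes the smallest one.

```python
import pandas as pd

datas = pd.DataFrame({
    'key': ['A', 'B', 'A', 'C', 'D'],
    'data': [19, 20, 21, 22, 23]
})

grouped = datas.groupby('key')['data']

means = grouped.mean()
medians = grouped.median()

# Series.mode() returns all most-frequent values in sorted order;
# taking the first one is an assumption made for this sketch
modes = grouped.agg(lambda s: s.mode().iloc[0])

print(means)
print(medians)
print(modes)
```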

Pandas groupby for Data Cleaning

Pandas ‘groupby’ function can be very useful for data-cleaning tasks. Let's look at two examples: detecting duplicates and removing duplicates.

Detecting Duplicates

The ‘groupby’ function allows us to group a DataFrame by columns that may contain duplicate values. We use the size() method to count the number of occurrences of each group. If the group count is greater than 1, then it means there are duplicates.

Here’s the code for detecting duplicates.

Code

import pandas as pd

Ninjas = pd.DataFrame({
    'name': ['ninja1', 'ninja21', 'ninja11', 'ninja21', 'ninja31', 'ninja1'],
    'age': [30, 30, 35, 40, 20, 30],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F'],
    'skills': [60, 60, 70, 80, 40, 60]
})


In the above code we can see that the first and last rows are duplicates.

In the code below, we group the DataFrame by all columns and count the occurrences of each group using size(). Then we use reset_index() to convert the resulting Series to a DataFrame with named columns. Finally, we select any group with a count greater than 1 using a boolean mask and print the results.

# Group the DataFrame by all columns
# And count the occurrences of each group
grouped = Ninjas.groupby(['name', 'age', 'gender', 'skills']).size().reset_index(name='count')

# Print any group that has a count greater than 1
duplicates = grouped[grouped['count'] > 1]
print(duplicates)


Now, we can see the duplicate row.

Output

	name	age	gender	skills	count
0	ninja1	30	F	60	2

Removing Duplicates

Another common data cleaning task is removing duplicates. Let's say we have a DataFrame with duplicate rows.

Code

Ninjas = pd.DataFrame({
    'name': ['ninja1', 'ninja21', 'ninja11', 'ninja21', 'ninja31', 'ninja1'],
    'age': [30, 30, 35, 40, 20, 30],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F'],
    'skills': [60, 60, 70, 80, 40, 60]
})


In the above example we can see that the first and last rows are duplicates.

We can use ‘groupby’ to identify the duplicate rows and then drop_duplicates() to remove them:

# Identify duplicate rows
duplicates = Ninjas.groupby(['name', 'age', 'gender', 'skills']).size().reset_index(name='count')
duplicates = duplicates[duplicates['count'] > 1]

# Remove duplicate rows
Ninjas.drop_duplicates(subset=['name', 'age', 'gender', 'skills'], inplace=True)
print(Ninjas)


Now the duplicate values are dropped.

Output

	name		age		gender	skills
0	ninja1		30		F		60
1	ninja21		30		M		60
2	ninja11		35		M		70
3	ninja21		40		M		80
4	ninja31		20		F		40
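The same clean-up could also be done with groupby alone. The sketch below is one possible alternative, not the method used above: cumcount() numbers the rows within each group starting from 0, so keeping only the rows numbered 0 keeps the first occurrence of every duplicated row. The names cols and deduped are introduced only for this illustration.

```python
import pandas as pd

Ninjas = pd.DataFrame({
    'name': ['ninja1', 'ninja21', 'ninja11', 'ninja21', 'ninja31', 'ninja1'],
    'age': [30, 30, 35, 40, 20, 30],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F'],
    'skills': [60, 60, 70, 80, 40, 60]
})

cols = ['name', 'age', 'gender', 'skills']

# Position of each row within its group: 0 for the first occurrence,
# 1 for the second, and so on
occurrence = Ninjas.groupby(cols).cumcount()

# Keep only the first occurrence of each distinct row
deduped = Ninjas[occurrence == 0]
print(deduped)
```

Unlike the drop_duplicates() call with inplace=True, this leaves the original Ninjas DataFrame untouched and returns a new one.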

 


Frequently Asked Questions

What does groupby do in Pandas?

In Pandas, groupby is used to group the rows of a DataFrame based on the values in one or more columns. It allows you to perform operations on these grouped subsets efficiently.

How to use groupby in DataFrame Pandas?

To use groupby in a Pandas DataFrame, you specify a column by which you want to group your data. Then, you can apply aggregation functions like sum or mean to analyze each group's data.

What is the groupby condition in Pandas?

The groupby condition in Pandas is the column or criterion by which you want to group and organize your data. It defines how the data will be grouped for analysis.

How do I aggregate multiple functions in Pandas groupby?

Call the .agg() method directly on the groupby object. Pass a list of function names to apply several aggregations at once, or a dictionary with column names as keys and the desired aggregations as values; each value can be a single function name or a list of names.
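A minimal sketch of applying several aggregations at once by passing a list of function names to agg(). The DataFrame and its Category and Value columns are made up for the example.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'Category': ['X', 'Y', 'X', 'Y', 'Z'],
    'Value': [10, 20, 30, 40, 5],
})

# One output column per aggregation
summary = df.groupby('Category')['Value'].agg(['sum', 'mean', 'max'])
print(summary)
```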

Can I apply different functions to different columns in Pandas groupby?

Yes. Pass a dictionary to agg() that maps each column name to the function (or list of functions) to apply to it. For more complex logic, apply() can run a custom function on each group.
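As an illustration, the sketch below reuses the Ninjas data from the data-cleaning section and applies a different function to each column. Grouping by gender and the choice of mean and max are assumptions made for this example.

```python
import pandas as pd

Ninjas = pd.DataFrame({
    'name': ['ninja1', 'ninja21', 'ninja11', 'ninja21', 'ninja31', 'ninja1'],
    'age': [30, 30, 35, 40, 20, 30],
    'gender': ['F', 'M', 'M', 'M', 'F', 'F'],
    'skills': [60, 60, 70, 80, 40, 60]
})

# Average age and highest skills score for each gender
per_gender = Ninjas.groupby('gender').agg({'age': 'mean', 'skills': 'max'})
print(per_gender)
```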

Can I group multiple columns in Pandas groupby?

You can group data by more than one column and find statistics for each group by giving a list of column names to the groupby function.

What kind of functions can I apply to a Pandas groupby object?

It depends on your data analysis requirements. You can apply a wide range of functions to a Pandas ‘groupby’ object. Sum, mean, min, max, count, median, std, var, apply, and transform are some of the most commonly used functions. You can also define your own custom functions and use the apply method to apply them to the ‘groupby’ object.
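For instance, a custom function could be applied like this. The sketch reuses the students data from the earlier example; marks_range is a hypothetical helper written for this illustration.

```python
import pandas as pd

students = pd.DataFrame({
    'date': ['2002-11-01', '2002-11-02', '2002-11-01', '2002-11-04', '2002-11-01', '2002-11-06'],
    'name': ['Abhay', 'Abhay', 'Ninja', 'Coder', 'Ninja', 'Raj'],
    'marks': [78, 98, 78, 91, 88, 54]
})

def marks_range(marks):
    """Hypothetical helper: gap between a student's best and worst marks."""
    return marks.max() - marks.min()

# Run the custom function once per student
ranges = students.groupby('name')['marks'].apply(marks_range)
print(ranges)
```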

Conclusion

In summary, Pandas ‘groupby’ is a powerful data analysis and manipulation tool. It allows you to easily group data by one or more columns. You can apply functions to each group to perform various data aggregation, transformation, and filtering tasks. Pandas ‘groupby’ helps you gain meaningful insights from your data and make informed decisions whether your dataset is small or large. 


Happy Coding!
