Introduction
Pandas is a fast, flexible, powerful, and easy-to-use open-source library that provides data structures, such as DataFrame and Series, for storing structured data, along with methods for analyzing and manipulating it.
In this article, you will learn about the groupby() and count() functions in Pandas with the help of examples. The two functions can be chained to count the valid (non-null) data points in each group of a grouped dataframe. This is useful whenever you need counts based on a grouping criterion such as category labels or timestamps.
Before getting started, let’s look at a brief introduction to Pandas.
A brief about Pandas
Pandas is an open-source Python library that provides data manipulation and analysis tools for working with structured data. It is built on top of the NumPy library and offers data structures for efficiently working with tabular data.
To follow along with the examples, you should install the latest version of Pandas.
Now that you know about Pandas, we will move on to the groupby() function.
What is groupby() in Pandas?
The groupby() function is used for grouping a dataframe or series based on the values of one or more columns.
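Depending on the data structure it is called on, this function returns a DataFrameGroupBy object (for a dataframe) or a SeriesGroupBy object (for a series). No computation happens until you apply an operation such as mean() or count() to that object.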
For example, consider a dataframe with 3 columns - Name, Department, and Salary. Calling groupby() on the Department column and printing the .groups attribute shows which rows belong to each group. Applying the mean operation to the grouped Salary values then gives the average salary of each department, which can be stored in a new dataframe.
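A minimal sketch of the example described above. The names and salary figures are made up for illustration, and the variable names (grouped, mean_salary) are arbitrary:

```python
import pandas as pd

# Hypothetical sample data, invented for illustration
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dia"],
    "Department": ["HR", "IT", "IT", "HR"],
    "Salary": [50000, 70000, 80000, 60000],
})

# Group the rows by the values of the Department column
grouped = df.groupby("Department")

# .groups maps each group key to the row labels that belong to it
print(grouped.groups)

# Average salary per department, stored in a new dataframe
mean_salary = grouped[["Salary"]].mean()
print(mean_salary)
```

Selecting the Salary column before calling mean() keeps the text columns (Name) out of the calculation.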
Let’s look at the different parameters the groupby() function can accept:-
by: This parameter specifies the columns or keys by which the dataframe should be grouped. It accepts a single column name or a list of column names, and it can also take a function, a mapping, or a series of labels.
level: If the dataframe has a multi-level (hierarchical) index, this parameter specifies the level or levels of that index to group by.
axis: This parameter specifies the axis along which the grouping is performed. It is 0 by default, meaning rows are grouped, and 1 groups columns. It has been deprecated since Pandas 2.1. To group columns in recent versions, transpose the dataframe and group its rows instead.
as_index: This parameter controls whether the group keys become the index of the aggregated result. It accepts boolean values, and the default is True. With False, the group keys are kept as regular columns.
sort: Specifies whether to sort the resulting groups by the group keys. It accepts boolean values, and the default is True. With False, groups appear in the order in which their keys first occur.
squeeze: This parameter reduced the dimensionality of the returned object where possible, for example returning a series instead of a single-column dataframe. It was deprecated in Pandas 1.1 and removed in Pandas 2.0, so it is not available in current versions.
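The effect of as_index, sort, and level can be seen in a short sketch. The dataframe below, including the Team column, is hypothetical and exists only to illustrate the parameters:

```python
import pandas as pd

# Hypothetical data; IT appears before HR on purpose
df = pd.DataFrame({
    "Department": ["IT", "HR", "IT", "HR"],
    "Team": ["Ops", "Hiring", "Dev", "Payroll"],
    "Salary": [70000, 50000, 80000, 60000],
})

# Defaults: group keys become the index and are sorted (HR, IT)
default_result = df.groupby("Department")["Salary"].mean()
print(default_result)

# as_index=False keeps Department as a regular column;
# sort=False keeps groups in order of first appearance (IT, HR)
flat_result = df.groupby("Department", as_index=False, sort=False)["Salary"].mean()
print(flat_result)

# level groups by one level of a multi-level row index
indexed = df.set_index(["Department", "Team"])
by_level = indexed.groupby(level="Department")["Salary"].sum()
print(by_level)
```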
In the following section, you will learn about the count() function in Pandas.
What is count() in Pandas?
The count() function is used for counting the number of non-null values in a dataframe or series. It is a quick way to find how many valid entries exist in each column. When called on a dataframe, it returns a series in which each element is the count of valid values in the corresponding column. When called on a series, it returns a single integer.
For example, consider a dataframe with 3 columns - Name, Age, and Salary, where the Age and Salary columns have null values, which are represented with NaN in Pandas. The count() function returns a series containing the count of valid entries in each column, so columns with missing values report a count lower than the number of rows.
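A minimal sketch of this example, using made-up values and assuming np.nan is used to mark the missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing Age and Salary entries
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dia"],
    "Age": [28, np.nan, 35, 41],
    "Salary": [50000, 70000, np.nan, np.nan],
})

# Non-null count for every column, returned as a series
counts = df.count()
print(counts)

# On a single column (a series), count() returns one integer
age_count = df["Age"].count()
print(age_count)
```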
When called on a dataframe, the count() function accepts the following 2 parameters:-
axis: This parameter specifies the axis along which the count should be calculated. It accepts 0 (default) to produce one count for each column and 1 to produce one count for each row.
numeric_only: If this parameter is set to True, the count includes only numeric (integer, float, and boolean) columns. The default is False.
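Both parameters can be tried on a small dataframe. The data below is hypothetical and the variable names are arbitrary:

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Chen", "Dia"],
    "Age": [28, np.nan, 35, 41],
    "Salary": [50000, 70000, np.nan, np.nan],
})

# axis=1: one count per row (non-null cells in that row)
row_counts = df.count(axis=1)
print(row_counts)

# numeric_only=True: the text column Name is left out
numeric_counts = df.count(numeric_only=True)
print(numeric_counts)
```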
In the next section, we will look at the examples illustrating use cases of chaining groupby() and count() functions.
Chaining groupby() and count() in Pandas
The following are some examples where we have used the groupby() and count() functions together:-
Categorical Data Analysis
You can count the number of occurrences of each category in a column.
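import pandas as pd

# Sample data: a category label and a numeric value for each row
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'A'],
    'Value': [10, 20, 15, 25, 30, 35]
}
df = pd.DataFrame(data)
print(df)
print("\n")

# Group rows by Category and count the non-null values in each remaining column
category_counts = df.groupby('Category').count()
print(category_counts)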
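Here we grouped the data based on the values of the Category column and then used the count() function on the grouped data. The result has one row per unique category and holds the number of non-null entries in the Value column for that category. Because Value has no missing entries, this equals the number of occurrences of each category: 4 for A and 2 for B.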
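Since count() skips null values, its result can differ from the number of rows in a group. The sketch below reuses the same categories but replaces one Value with NaN (a change made here for illustration only) and compares count() with size(), which counts rows regardless of missing values:

```python
import numpy as np
import pandas as pd

# Same categories as above, with one Value replaced by NaN for illustration
df = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A", "A"],
    "Value": [10, 20, np.nan, 25, 30, 35],
})

# count(): non-null Value entries per category
non_null_counts = df.groupby("Category")["Value"].count()
print(non_null_counts)

# size(): number of rows per category, NaN included
row_counts = df.groupby("Category").size()
print(row_counts)
```

If the goal is the number of rows per category rather than the number of valid entries, size() is usually the safer choice.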
Product Inventory Analysis
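If you have data about product inventory levels over time, you can use groupby() to group the data by ProductID and then use count() to count the non-null stock records of each product. With one row per product per day, and out-of-stock days recorded as missing values, this gives the number of days each product was in stock.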
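A minimal sketch of this use case. The Date and Stock column names, the product IDs, and the convention that NaN marks an out-of-stock day are all assumptions made for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical daily inventory log; NaN marks a day with no stock
inventory = pd.DataFrame({
    "Date": pd.to_datetime([
        "2024-01-01", "2024-01-01",
        "2024-01-02", "2024-01-02",
        "2024-01-03", "2024-01-03",
    ]),
    "ProductID": ["P1", "P2", "P1", "P2", "P1", "P2"],
    "Stock": [5, np.nan, 3, 8, 4, np.nan],
})

# Non-null Stock entries per product = days in stock
days_in_stock = inventory.groupby("ProductID")["Stock"].count()
print(days_in_stock)
```

If out-of-stock days were stored as 0 instead of NaN, the rows would need to be filtered (for example, keeping only rows where Stock is greater than 0) before counting.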
If you have a log of events, you can use the groupby() function to group the data by Category and then use count() to calculate the number of times each category of event occurred.
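The event-log case might look like the following sketch. The Timestamp column, the category names, and the per-day grouping are illustrative assumptions:

```python
import pandas as pd

# Hypothetical event log
events = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 09:05", "2024-01-01 17:00",
        "2024-01-02 09:00", "2024-01-02 09:30",
    ]),
    "Category": ["login", "error", "logout", "login", "login"],
})

# Number of events recorded in each category
event_counts = events.groupby("Category")["Timestamp"].count()
print(event_counts)

# Grouping by calendar day instead: number of events per day
per_day = events.groupby(events["Timestamp"].dt.date)["Category"].count()
print(per_day)
```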
What is NaN in Pandas?
In Python, NaN stands for "Not a Number", a special floating-point value used to represent undefined or unrepresentable numerical values. In Pandas, NaN is the default marker for missing values. These can come from incomplete source data or from operations such as concatenating or merging dataframes whose rows or columns do not line up.
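A short sketch of both points, using two small made-up dataframes whose columns do not overlap:

```python
import math
import pandas as pd

# NaN is a float value that is not equal to anything, including itself
nan = float("nan")
print(nan == nan)
print(math.isnan(nan))

# Concatenating dataframes with different columns fills the gaps with NaN
left = pd.DataFrame({"A": [1, 2]})
right = pd.DataFrame({"B": [3, 4]})
combined = pd.concat([left, right], ignore_index=True)
print(combined)

# count() ignores the NaN cells introduced by the concatenation
print(combined.count())
```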
What is an index in a Pandas series?
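An index holds the labels used to identify and look up the elements of a series. By default it is a range of integers starting at 0, or you can specify informative labels for each element. Labels are usually unique, although Pandas does not require this. Index objects are immutable, which means an individual label cannot be modified in place, but the whole index can be replaced with a new one.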
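A brief sketch of default and custom indexes, with arbitrary values and labels chosen for illustration:

```python
import pandas as pd

# Default index: integers 0, 1, 2
default_series = pd.Series([10, 20, 30])
print(default_series.index)

# Custom labels
labelled = pd.Series([10, 20, 30], index=["a", "b", "c"])
value_b = labelled["b"]
print(value_b)

# Changing a single label in place is not allowed
in_place_failed = False
try:
    labelled.index[0] = "z"
except TypeError:
    in_place_failed = True
print(in_place_failed)

# Replacing the whole index is allowed
labelled.index = ["x", "y", "z"]
print(labelled)
```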
What is a multi-level column index in Pandas?
It is a way to represent data in a dataframe using multiple levels of column labels, meaning each column has sub-columns. Using Pandas, you can perform various operations on dataframes containing multi-level column indices, such as slicing, stacking, aggregation, etc.
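As a rough illustration, the sketch below builds a dataframe with two column levels and applies slicing and aggregation to it. The level names (Metric, Quarter) and all figures are hypothetical:

```python
import pandas as pd

# Two column levels: a metric and, beneath it, a quarter
columns = pd.MultiIndex.from_tuples(
    [("Sales", "Q1"), ("Sales", "Q2"), ("Costs", "Q1"), ("Costs", "Q2")],
    names=["Metric", "Quarter"],
)
df = pd.DataFrame(
    [[100, 120, 60, 70], [90, 110, 50, 65]],
    index=["North", "South"],
    columns=columns,
)

# Slicing by the top level returns the sub-columns Q1 and Q2
sales = df["Sales"]
print(sales)

# Selecting one label from the lower level across all metrics
q1 = df.xs("Q1", axis=1, level="Quarter")
print(q1)

# Aggregating over quarters: transpose, group rows by Metric, transpose back
per_metric = df.T.groupby(level="Metric").sum().T
print(per_metric)
```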
Conclusion
In this article, you learned about the open-source Python library - Pandas. We discussed its features and two important functions it provides - groupby() and count().