Do you think IIT Guwahati certified course can help you in your career?
No
Introduction
Hey, coders. Are you interested in data analysis and want to know more about pandas pivot table? Understanding the pandas pivot table is necessary when working with data in Python. This article will make learning easier about creating the pandas pivot tables, their properties, and how you can use them in different cases.
The pandas pivot table is crucial to data analysis. Using pivot tables will help you to organize your data within a DataFrame. By the end, you'll know the pandas pivot table well. So let's explore this topic more.
Understanding Pandas Pivot Table
The pandas pivot table is used forrearranging data in Python. It will allow you to group your data by one or more columns. The pandas pivot tables will make it simple for you to identify different patterns. You can customize the rows and columns in a pandas pivot table according to your needs. You can apply aggregation functions such as count, sum, mean etc, to your data. The pandas pivot tables will help you easily handle large datasets with many columns. You can filter and sort your data in a pandas pivot table.
The DataFrame you want to use for the pandas pivot table.
values
The columns are to be aggregated.
index
The columns to be used for the index (rows).
columns
The columns to be used for the columns in the pandas pivot table.
fill_value
The value to replace missing values with.
margins
A boolean indicates whether to include subtotals or grand totals.
dropna
A boolean indicates whether to exclude rows or columns that contain missing values.
There are also other optional parameters that you can use to customize your pandas pivot table.
aggfunc
The function for aggregation (e.g. sum, count, mean, max etc.).
margins_name
A string to use as the label for the column or row having a total.
Return Value
It is used to create a DataFrame that resembles an Excel pivot table, where the pivot table's hierarchical levels are stored as MultiIndex structures in both the DataFrame's index and columns.
Creating Pandas Pivot Table
Here is an example of pivot tables in pandas:
We are creating this table using the pandas library in Python. The table in this example has a DataFrame based on the values 'coder' and 'language' columns. The coders are shown in the rows. The languages in which they code are shown in the columns. The output table shows the number of questions each coder solves in each language. The values in the table are the sums of the 'questions_solved.'
This table has three columns:'coder', 'language', and 'questions_solved'. The pivot table will use the 'coder' column as the rows. The pivot table will use the 'language' column in the columns. The pivot table will use the 'questions_solved' column as the value. The 'aggfunc' parameter specifies the values in the table you add.
You can try many operations to manipulate and filter the pandas pivot table we created in the above example. Here are a few code lines to add:
Filter Pandas Pivot Table
You can filter the pivot table to only show rows where the total number of questions solved is greater than or equal to 100.
CODE
import numpy as np
import pandas as pd
# First, create a sample DataFrame
data = {
'coder': ['Ninja1', 'Ninja2', 'Ninja3', 'Ninja1', 'Ninja2', 'Ninja3'],
'language': ['Python', 'Python', 'C++', 'Java', 'Python', 'C++'],
'questions_solved': [50, 75, 100, 25, 80, 60]
}
df = pd.DataFrame(data)
# Now create a pivot table using the pandas `pivot_table` function
pivot_table = pd.pivot_table(df, values='questions_solved', index='coder', columns='language', aggfunc=np.sum)
#Finally, creating the filtered table
filtered_table = pivot_table[pivot_table.sum(axis=1) >= 100]
print(filtered_table)
You can also try this code with Online Python Compiler
The filtering condition is that the sum of values for each row (specified by axis=1) in the pivot_table DataFrame must be greater than or equal to 100. The sum() function calculates the sum of values across each row in the pivot_table.
Sort Pandas Pivot Table
You can sort the pivot table by the total number of questions solved for each coder.
CODE
import numpy as np
import pandas as pd
# First, create a sample DataFrame
data = {
'coder': ['Ninja1', 'Ninja2', 'Ninja3', 'Ninja1', 'Ninja2', 'Ninja3'],
'language': ['Python', 'Python', 'C++', 'Java', 'Python', 'C++'],
'questions_solved': [50, 75, 100, 25, 80, 60]
}
df = pd.DataFrame(data)
# Now create a pivot table using the pandas `pivot_table` function
pivot_table = pd.pivot_table(df, values='questions_solved', index='coder', columns='language', aggfunc=np.sum)
#Finally, creating the sorted table
sortedone_table = pivot_table.sort_values(by=pivot_table.sum(axis=1), ascending=False)
print(sortedone_table)
You can also try this code with Online Python Compiler
Here pivot_table.sum(axis=1) calculates the sum of values for each coder. Then sort_values() will sort a DataFrame by a column you will specify. by=pivot_table.sum(axis=1) shows that we are sorting the rows according to the sum of values for each row. ascending=False shows that we are sorting in descending order.
Add New Column
You can also add a new column to the pivot table that shows the percentage of questions solved for each language by each coder.
CODE
import numpy as np
import pandas as pd
# First, create a sample dataframe
data = {
'coder': ['Ninja1', 'Ninja2', 'Ninja3', 'Ninja1', 'Ninja2', 'Ninja3'],
'language': ['Python', 'Python', 'C++', 'Java', 'Python', 'C++'],
'questions_solved': [50, 75, 100, 25, 80, 60]
}
df = pd.DataFrame(data)
# Now creating a pivot table using the pandas `pivot_table` function
pivot_table = pd.pivot_table(df, values='questions_solved', index='coder', columns='language', aggfunc=np.sum)
total_questions_solved = pivot_table.sum().sum()
pivot_table['Python %'] = pivot_table['Python'] / total_questions_solved
pivot_table['Java %'] = pivot_table['Java'] / total_questions_solved
pivot_table['C++ %'] = pivot_table['C++'] / total_questions_solved
print(pivot_table)
You can also try this code with Online Python Compiler
To calculate the percentage of questions solved for each language, we divide the number of questions solved by the total number of questions for all languages. We add new columns to the pivot_table DataFrame: Python %, Java %, and C++ %. Then finally, we print the table.
Aggregating Functions
There are many aggregation functions that you can use with the aggfunc parameter.
aggfunc = ‘?’
Function
Description
mean()
It calculates the mean value of the group.
count()
It counts the number of non-null values in the group.
min()
It calculates the minimum value in the group.
max()
It calculates the maximum value in the group.
median()
It calculates the median value in the group.
std()
It calculates the standard deviation of the values in the group.
var()
It calculates the variance of the values in the group.
prod()
It calculates the product of the values in the group.
sum()
It calculates the sum of the values in the group.
You can also set multiple aggregation functions for a column in a pandas pivot table by passing a list of functions to the aggfunc parameter.
CODE
import numpy as np
import pandas as ppt
# First, create a sample data frame
data = {
'coder': ['Ninja1', 'Ninja2', 'Ninja3', 'Ninja1', 'Ninja2', 'Ninja3'],
'language': ['Python', 'Python', 'C++', 'Java', 'Python', 'C++'],
'questions_solved': [50, 75, 100, 25, 80, 60]
}
df = ppt.DataFrame(data)
# First of all, define the aggregation functions you want
aggregation_functions = {'questions_solved': ['sum', 'mean']}
# Now create the pivot table
pivot_table = ppt.pivot_table(df, values='questions_solved', index='coder', columns='language', aggfunc=aggregation_functions)
# Finally, print your pivot table
print(pivot_table)
You can also try this code with Online Python Compiler
Margins can give you an easy way to get a data summary. It will be quicker for you to compare the values in different rows and columns. You can use the margins parameter in the pivot_table function to add margins to a pandas pivot table. To add margins, you can set the margins parameter to True in the pivot_table() function.
CODE
import numpy as np
import pandas as ppt
# First, create a sample data frame
data = {
'coder': ['Ninja1', 'Ninja2', 'Ninja3', 'Ninja1', 'Ninja2', 'Ninja3'],
'language': ['Python', 'Python', 'C++', 'Java', 'Python', 'C++'],
'questions_solved': [50, 75, 100, 25, 80, 60]
}
df = ppt.DataFrame(data)
# Now create the pivot table
pivot_table = ppt.pivot_table(df, values='questions_solved', index='coder', columns='language', aggfunc='sum', margins=True)
# Finally print the pivot table
print(pivot_table)
You can also try this code with Online Python Compiler
The output table will give you a more detailed view. You can quickly analyse and compare each 'coder' performance at different levels. The index level 'coder' groups the data by the unique values in the 'coder' column. While the index level 'level' further divides the data within each 'coder' group by the unique values in the 'level' column.
Fill Missing Values
You can fill the missing values in a pivot table in python using the fillna() method.
CODE
import pandas as pd
import numpy as np
# Sample data
data = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
'Value': [10, 20, 15, np.nan, 30, np.nan]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Create a pivot table
pivot_table = df.pivot_table(values='Value', index='Category', aggfunc=np.sum)
# Fill missing values with a specific value (e.g., 0)
pivot_table_filled = pivot_table.fillna(0)
print(pivot_table_filled)
You can also try this code with Online Python Compiler
In this example, we first created a pivot table using pivot_table(), and then we used fillna(0) to fill in missing values with zeros. You can replace 0 with any other value you prefer to fill in the missing data.
Create a Custom Aggregation Function
You can define your custom aggregation function. When creating a pandas pivot table, you can pass your custom function as the aggfunc parameter.
CODE
import numpy as np
import pandas as ppt
# First, create a DataFrame
sample = {
'coder': ['Ninja1', 'Ninja2', 'Ninja3', 'Ninja1', 'Ninja2', 'Ninja3'],
'language': ['Python', 'Python', 'C++', 'Java', 'Python', 'C++'],
'questions_solved': [50, 75, 100, 25, 80, 60]
}
df = ppt.DataFrame(sample)
# Define a custom aggregation function whatever your want
def sum_of_squares(x):
return np.sum(x**2)
# Now create the pivot table
pivot_table = ppt.pivot_table(df, values='questions_solved', index='coder', columns='language', aggfunc=sum_of_squares)
# Finally, print the pivot table
print(pivot_table)
You can also try this code with Online Python Compiler
We defined a custom aggregation function sum_of_squares. This function takes an input array x and will return the sum of squares. The output pivot table will have the sum of squares of the questions_solved values for each coder and language combo.
Difference between groupby() V/S pandas pivot_table()
Here are some differences between groupby() and pandas pivot_table():
groupby()
pandas pivot_table()
It groups data by one or more columns and applies a function to each group.
It aggregates data in a pivot table way with indexes and columns.
It gives a DataFrame or Series output.
It gives a DataFrame output.
It only works with a single index.
It can have multiple levels of indexes.
It can only apply one function to each group.
It can apply multiple aggregation functions.
It is faster for simple grouping tasks.
It is better for more complex tasks.
Pandas pivot tables have various real-life applications. It includes sales data analysis, HR data analysis, marketing campaign analysis, financial reporting, and supply chain analysis
Pros of Pandas Pivot Table
Here are some pros of using pandas pivot table:
It gives you a user-friendly way to summarize and analyze large datasets.
It is very customizable. It will allow you to control aggregation functions, columns, and indexes.
You can easily group data with multiple columns or levels.
It has valuable features such as adding subtotal rows/columns and sorting functions.
You can also manage missing or incomplete data.
It will provide you with an excellent visual presentation of data. It will be quicker to interpret than raw data.
Cons of Pandas Pivot Table
Here are some cons of using pandas pivot table:
Sometimes it can be slower with massive datasets and complex aggregations.
It requires sound technical knowledge of Pandas syntax and data manipulation.
Sometimes it can result in a loss of precision for tiny datasets.
It is not so helpful with non-numeric data.
It may be less efficient with real-time analysis and continuous data processing.
Frequently Asked Questions
Are there any limitations of the pandas pivot table?
It is challenging to handle vast datasets with millions of rows. It has limitations in handling complex multi-level indices. There are some restrictions on performing custom calculations on subsets of the data.
Can you use the pandas pivot table with non-numerical data?
Nonetheless, should you employ a pivot_table with non-numeric values, there is a possibility of encountering a DataError: No numeric types available for aggregation.
What data science tools or libraries work well with pandas pivot table?
NumPy, Matplotlib, Seaborn, and Jupyter Notebook are some data science tools that work well. You can use these tools to manipulate, plot, and visualise data.
Can you group data by multiple columns in the pandas pivot table?
You can make a Pandas pivot table using several columns, and it gives you a new table. You can change how it looks by adjusting the columns.
Conclusion
This article discusses the pandas pivot table's usage, definition, pros, and cons. It explains how to add margins, create custom functions, and more. The article also includes examples to help understand how to use the pandas pivot table in different cases. Refer to other pandas and Python articles to improve your understanding of the pandas pivot table.