Introduction
Pandas, the popular data manipulation library in Python, offers a powerful feature known as MultiIndex. This feature allows you to work with complex hierarchical and multi-dimensional data more efficiently and intuitively.
In this article, we'll discuss Pandas MultiIndex and explore how it can enhance your data analysis capabilities.
Before we start, let us first look at how indexes work in Pandas.
Understanding Indexes in Pandas
In simple words, an index in Pandas is a labeled array that lets you identify and access rows or elements within a DataFrame or Series. You can compare an index with the primary key of a table, although Pandas does not require index labels to be unique.
An index serves as a reference, similar to the row numbers in a spreadsheet, but with more flexibility and functionality. It can be thought of as a guide that helps you navigate and retrieve data efficiently.
You can also refer to the image below to understand indexing:
Here, the indexes represent the addresses of the houses. So, by knowing the address of any house you can easily go there. Similarly, to access data you need an index.
What is a MultiIndex?
A MultiIndex is an advanced indexing method in Pandas that enables you to assign multiple index levels to a DataFrame or Series. This is particularly useful when you're dealing with data that possesses more than one dimension or categorical hierarchy.
Think of it as organizing your data into layers, similar to a spreadsheet with rows and columns.
Let's start by explaining MultiIndex using examples for both row and column indices.
MultiIndex for Rows
Suppose you have a dataset containing information about sales transactions for different products in different regions, with the following columns: Product, Region, Date, and Sales.
To create a MultiIndex for rows, we use the set_index() method. The set_index() method in pandas sets one or more columns as the row index of a DataFrame.
Let us go through each parameter of set_index():
keys: This is a list of column names or an array of values that will be used as the new index
drop: This is a boolean value that determines whether the columns used as the new index will be dropped from the DataFrame. By default, this is set to True
append: This is a boolean value that determines whether the columns used as the new index will be appended to the existing index. By default, this is set to False
inplace: This is a boolean value that determines whether the operation will be performed in place. By default, this is set to False, which means that a new DataFrame will be created
verify_integrity: This is a boolean value that determines whether the new index will be checked for duplicates. By default, this is set to False
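As a rough sketch of how these parameters fit together, the call below writes out each one with its default value. The small DataFrame is made up for illustration, and the variable name indexed is an assumption.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration
df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South'],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
    'Sales': [100, 50, 75, 100],
})

# Every parameter is written out explicitly with its default value
indexed = df.set_index(
    keys=['Region', 'Date'],   # columns that become the index levels
    drop=True,                 # remove those columns from the data
    append=False,              # replace the existing index instead of extending it
    inplace=False,             # return a new DataFrame
    verify_integrity=False,    # do not check the new index for duplicates
)

print(indexed.index.names)     # names of the index levels
print(list(indexed.columns))   # columns that remain as data
```

Because inplace is False here, df itself is left unchanged and the result is returned as a new DataFrame.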
Let us now use the above method to create a MultiIndex for rows using the 'Region' and 'Date' columns.
Python
# Create a MultiIndex using 'Region' and 'Date' columns
df.set_index(['Region', 'Date'], inplace=True)
You can access data using these indices. To do so, we can use the loc[] accessor, which selects rows and columns by label(s).
Suppose we want to access the sales for the 'North' region on the date '2023-01-01'.
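A minimal sketch of that lookup, using the sample sales data that appears later in this article; the variable name north_sales is an assumption.

```python
import pandas as pd

# Sample sales data, indexed by 'Region' and 'Date'
df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Region': ['North', 'North', 'South', 'South', 'East'],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02', '2023-01-01'],
    'Sales': [100, 50, 75, 100, 90],
}).set_index(['Region', 'Date'])

# A tuple with one label per index level selects a single row
north_sales = df.loc[('North', '2023-01-01'), 'Sales']
print(north_sales)

# A label for the first level alone selects every row in that region
print(df.loc['North'])
```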
MultiIndexing can be applied to both rows and columns simultaneously, providing a flexible way to represent and analyze multi-dimensional data in pandas. It allows you to perform operations like grouping, pivoting, and slicing data efficiently across multiple levels of the index.
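For the column side, a small sketch might look like the following. The level names 'Metric' and 'Year' and all the numbers are invented for illustration; the column labels are built here with MultiIndex.from_tuples().

```python
import pandas as pd

# Hypothetical two-level column labels: metric on top, year below
columns = pd.MultiIndex.from_tuples(
    [('Sales', 2022), ('Sales', 2023), ('Profit', 2022), ('Profit', 2023)],
    names=['Metric', 'Year'],
)

# Made-up numbers for two regions
df_cols = pd.DataFrame(
    [[100, 120, 20, 25], [80, 95, 15, 18]],
    index=['North', 'South'],
    columns=columns,
)

# A top-level label returns all the columns beneath it
print(df_cols['Sales'])

# A tuple selects one specific column
print(df_cols[('Profit', 2023)])
```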
Ways to Create the MultiIndex
There are many ways to create a MultiIndex in pandas. We have already seen one of them, the set_index() method; another is to pass a MultiIndex through the index parameter of the DataFrame constructor.
Here are some other powerful and commonly used ways to create a MultiIndex:
Using the MultiIndex.from_tuples() method
The MultiIndex.from_tuples() method is a way to create a MultiIndex from a list of tuples.
tuples: This is a list of tuples, where each tuple represents a single row or column in the MultiIndex
names: This is a list of names for the levels of the MultiIndex. If not specified, the levels are left unnamed (their names are None)
sortorder: This is an optional integer giving the level of sortedness; the tuples must already be lexicographically sorted by that level. By default, this is set to None
Note that levels and verify_integrity are not parameters of from_tuples(): they belong to the MultiIndex constructor, and from_tuples() always infers the levels from the tuples argument
Example : Let us create a tuple having some data regarding the Country and its respective city.
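A minimal sketch of such an example; the country and city pairs, the level names, and the placeholder values in the Series are all chosen for illustration.

```python
import pandas as pd

# Each tuple is (country, city); the pairs are illustrative
tuples = [
    ('India', 'Delhi'),
    ('India', 'Mumbai'),
    ('USA', 'New York'),
    ('USA', 'Chicago'),
]

# Build the MultiIndex and name its two levels
index = pd.MultiIndex.from_tuples(tuples, names=['Country', 'City'])

# Attach the index to a Series of placeholder values
s = pd.Series([1, 2, 3, 4], index=index)

print(index.names)
print(s.loc['India'])   # all cities listed under 'India'
```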
Using the MultiIndex.from_product() method
The MultiIndex.from_product() method creates a MultiIndex by taking the Cartesian product of multiple iterables (e.g., lists or arrays), one iterable per index level.
Each combination of values from these iterables forms one index label.
Confused? Don’t worry, we will do this together.
Let us look at one example to understand how it works:
Example
Suppose we have a list of colors and a list of sizes, and every color is sold in every size. We will create a DataFrame that holds a price for each combination, and use the MultiIndex.from_product() method to build the index from the Cartesian product of the two lists.
Python
import pandas as pd
# Create lists representing levels of the MultiIndex
colors = ['Red', 'Green', 'Blue']
sizes = ['Small', 'Medium', 'Large']
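# Create a MultiIndex from the Cartesian product of these lists
# (the level names 'Color' and 'Size' are illustrative)
index = pd.MultiIndex.from_product([colors, sizes], names=['Color', 'Size'])
print(index)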
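To connect this with the price table described above, the following sketch attaches one price to each (color, size) pair. The prices are invented purely for illustration, as are the level names 'Color' and 'Size'.

```python
import pandas as pd

colors = ['Red', 'Green', 'Blue']
sizes = ['Small', 'Medium', 'Large']

# 3 colors x 3 sizes gives 9 index entries
index = pd.MultiIndex.from_product([colors, sizes], names=['Color', 'Size'])

# Hypothetical prices, one per (color, size) combination,
# listed in the same order as the index entries
prices = pd.DataFrame(
    {'Price': [10, 12, 15, 11, 13, 16, 9, 11, 14]},
    index=index,
)

print(len(index))
print(prices.loc[('Green', 'Medium'), 'Price'])
```

The first iterable varies slowest, so the index runs through every size for 'Red' before moving on to 'Green'.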
Sorting a MultiIndex DataFrame
We can also sort the data in a MultiIndex DataFrame, which keeps it organized and makes it easier to access, analyze, and visualize. The data can be sorted by a single level or by multiple levels.
To sort a MultiIndex DataFrame, we can use the .sort_index() method.
Let us look at some examples that demonstrate this.
Consider a dataset containing information about sales transactions for different products in different regions.
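A minimal sketch of such a dataset, indexed by 'Region' and 'Sales' and then sorted by the 'Sales' level, might look like this; the variable name sorted_df_by_sales is an assumption.

```python
import pandas as pd

# Sample sales data
data = {
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Region': ['North', 'North', 'South', 'South', 'East'],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02', '2023-01-01'],
    'Sales': [100, 50, 75, 100, 90],
}
df = pd.DataFrame(data)

# Create a MultiIndex using the 'Region' and 'Sales' columns
df = df.set_index(['Region', 'Sales'])

# Sort by the 'Sales' level in ascending order
sorted_df_by_sales = df.sort_index(level='Sales', ascending=True)
print(sorted_df_by_sales)
```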
In this example, we define a MultiIndex on the levels 'Region' and 'Sales', and then call the .sort_index() method on the DataFrame to sort the data by 'Sales' in ascending order.
Sorting by Multiple levels
import pandas as pd
# Sample sales data
data = {
'Product': ['A', 'B', 'A', 'B', 'A'],
'Region': ['North', 'North', 'South', 'South', 'East'],
'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02', '2023-01-01'],
'Sales': [100, 50, 75, 100, 90]
}
df = pd.DataFrame(data)
# Create a MultiIndex using 'Region' and 'Sales' columns
df.set_index(['Region', 'Sales'], inplace=True)
# Sort the data frame based on 'Sales' in ascending order and then based on 'Region' in descending order
sorted_df_by_sales_and_region = df.sort_index(level=['Sales', 'Region'], ascending=[True, False])
print(sorted_df_by_sales_and_region)
In this example, we call the .sort_index() method on the DataFrame to sort the data by multiple levels: first by 'Sales' in ascending order, and then by 'Region' in descending order.
Advantages of Pandas MultiIndex
There are several advantages of Pandas MultiIndex, some of which are discussed below:
Indexing and Slicing with MultiIndex
Once you have a MultiIndex DataFrame, indexing and slicing become more versatile. You can now access data at various levels using .loc[] and .iloc[].
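The sketch below shows a few of these access patterns on the sample sales data. The index is sorted first, since slicing across levels generally works best on a sorted MultiIndex; the variable names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Region': ['North', 'North', 'South', 'South', 'East'],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02', '2023-01-01'],
    'Sales': [100, 50, 75, 100, 90],
}).set_index(['Region', 'Date']).sort_index()

# Label-based: every row for one region
print(df.loc['South'])

# Label-based: one date across all regions, using IndexSlice
idx = pd.IndexSlice
jan_first = df.loc[idx[:, '2023-01-01'], :]
print(jan_first)

# Cross-section on a named level, an alternative to IndexSlice
jan_first_xs = df.xs('2023-01-01', level='Date')
print(jan_first_xs)

# Position-based: the first two rows, regardless of their labels
first_two = df.iloc[:2]
print(first_two)
```

Note that xs() drops the selected level from the result, while the IndexSlice form keeps both levels.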
Aggregation and Grouping
MultiIndex can significantly simplify aggregation and grouping operations. You can group data by one or more levels and compute summary statistics or perform custom aggregation functions.
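As a small illustration, groupby() accepts a level argument, so index levels can be grouped directly without resetting the index. The variable names below are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Region': ['North', 'North', 'South', 'South', 'East'],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02', '2023-01-01'],
    'Sales': [100, 50, 75, 100, 90],
}).set_index(['Region', 'Date'])

# Total sales per region, grouping on the 'Region' index level
total_by_region = df.groupby(level='Region')['Sales'].sum()
print(total_by_region)

# Average sales per date, grouping on the 'Date' index level
mean_by_date = df.groupby(level='Date')['Sales'].mean()
print(mean_by_date)
```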
Reshaping and Stacking
MultiIndex makes reshaping and stacking operations more intuitive. You can pivot your data between the index levels and columns using methods like .unstack() and .stack().
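A brief sketch of the round trip between long and wide layouts. It uses only the 'North' and 'South' rows of the sample data so that every region has a value for every date; the names wide and long_again are assumptions.

```python
import pandas as pd

sales = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'Date': ['2023-01-01', '2023-01-02', '2023-01-01', '2023-01-02'],
    'Sales': [100, 50, 75, 100],
}).set_index(['Region', 'Date'])['Sales']

# unstack() moves the 'Date' level from the row index into the columns
wide = sales.unstack(level='Date')
print(wide)

# stack() moves the column labels back into the row index
long_again = wide.stack()
print(long_again)
```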
Disadvantages of Pandas MultiIndex
MultiIndexing in Pandas offers many advantages, but it also comes with drawbacks and challenges that need to be weighed against those advantages before deciding whether it fits your use case.
Some of these drawbacks and challenges are discussed below:
Increased memory usage
MultiIndexing can increase memory usage, because each additional index level requires memory to store its index data. As a result, it might not scale well when you work with large datasets on systems with limited memory.
Increased complexity
In some cases, MultiIndexing might overcomplicate your data structure, since handling multiple index levels adds complexity to your code. It is important to assess whether that added complexity is justified by the advantages MultiIndexing brings to your use case.
Decreased performance
MultiIndexing supports aggregation and grouping operations. However, operations such as grouping and reshaping can carry additional overhead, because Pandas needs to process multiple index levels.
Error-prone and less readable
As the complexity of a MultiIndex DataFrame grows, so does the chance of errors in the analysis. A complex index structure can also make the code harder to read as the code base evolves, which can eventually lead to mistakes when selecting, indexing, or manipulating the data.
Frequently Asked Questions
How do I convert MultiIndex columns to Pandas single index columns?
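To convert a row MultiIndex back to a single index, you can use the Pandas built-in method reset_index(), which moves the index levels back into ordinary columns. For MultiIndex columns, you can drop a level with df.columns.droplevel() or join the level labels of each column into a single string.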
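Both cases can be sketched briefly: reset_index() for a row MultiIndex, and joining the level labels for MultiIndex columns. The data and the underscore separator are illustrative choices.

```python
import pandas as pd

# Row MultiIndex: reset_index() turns the levels back into columns
rows = pd.DataFrame(
    {'Sales': [100, 50]},
    index=pd.MultiIndex.from_tuples(
        [('North', '2023-01-01'), ('North', '2023-01-02')],
        names=['Region', 'Date'],
    ),
)
flat_rows = rows.reset_index()
print(flat_rows)

# Column MultiIndex: join the labels of each column into one string
cols = pd.DataFrame(
    [[50, 100]],
    columns=pd.MultiIndex.from_tuples([('Sales', 'min'), ('Sales', 'max')]),
)
flat_cols = cols.copy()
flat_cols.columns = ['_'.join(label) for label in flat_cols.columns]
print(flat_cols)
```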
How do you set an index on multiple columns in pandas?
In pandas, you can set an index on multiple columns by passing a list of column names to the set_index() method of a DataFrame. This will create a MultiIndex, also known as a hierarchical index, with the specified columns as its levels.
How do you sort MultiIndex columns by level in Pandas?
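To sort MultiIndex columns by level, you can call sort_index() on the DataFrame with axis=1 and the level argument, for example df.sort_index(axis=1, level=0). The older DataFrame.sortlevel() method was deprecated and has been removed from recent versions of Pandas; MultiIndex.sortlevel() still exists on the index object itself.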
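One way to sort MultiIndex columns by a level is DataFrame.sort_index() with axis=1, sketched below. The level names 'Metric' and 'Year' and the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical two-level columns in an unsorted order
columns = pd.MultiIndex.from_tuples(
    [('Sales', 2023), ('Profit', 2022), ('Sales', 2022), ('Profit', 2023)],
    names=['Metric', 'Year'],
)
df = pd.DataFrame([[120, 20, 100, 25]], columns=columns)

# Sort the columns by the 'Year' level; remaining levels break ties
by_year = df.sort_index(axis=1, level='Year')
print(by_year)

# Sort the columns by all levels, starting from the outermost one
fully_sorted = df.sort_index(axis=1)
print(fully_sorted)
```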
Conclusion
Pandas MultiIndex is a powerful tool that unlocks the potential of working with multi-dimensional and hierarchical data in Python. By leveraging MultiIndex, you can create more structured, organized, and meaningful DataFrames, simplifying tasks such as indexing, slicing, aggregation, and reshaping.