Table of contents
1. Introduction
2. Understanding Imputation in Pandas
3. Prerequisite: Import Pandas
4. Sample Dataframe
5. How to fill the missing Values?
   5.1. Using fillna() method
   5.2. Mode Imputation
6. Sample Dataframe for Numerical Data
   6.1. Using Mode Imputation
   6.2. Using Mean Imputation
   6.3. Using Median Imputation
7. When to use which method?
8. Frequently Asked Questions
   8.1. When should I remove missing data instead of imputing it?
   8.2. What factors should I consider when choosing an imputation method?
   8.3. Can I impute missing data for categorical variables?
   8.4. How do I impute missing data for time series data in Pandas?
9. Conclusion
Last Updated: Mar 27, 2024

Imputation in Pandas


Introduction

In data analysis and machine learning, missing data is a common issue that needs to be addressed.

Why worry about missing data at all?

Missing data can lead to biased results, reduced model accuracy, and poorer decision-making.

To tackle this problem, Pandas, a popular data manipulation library in Python, provides various techniques for data imputation.


In this article, we'll explore what data imputation is and dive into different methods for handling missing data using Pandas.

Understanding Imputation in Pandas

Sometimes a dataset has no value at all in certain cells. When you load such data into memory, pandas marks those cells as null by default (shown as NaN or None). A null value, then, is a placeholder for a cell that holds no data. Many machine learning models cannot accept null values, so we have to make sure none remain in the dataset before feeding the data to a model.

To overcome this, we use imputation in Pandas.

Data imputation is the process of replacing or filling in missing or incomplete data with estimated or generated values.

  • It's an essential step in data preprocessing, as many machine learning algorithms cannot handle missing data directly.
  • Imputing data allows us to keep the dataset complete and enables us to draw meaningful insights from it.
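
Before imputing anything, it helps to see where the gaps are. The snippet below is a minimal sketch of that check, assuming a small hypothetical DataFrame (the name toy, its columns, and its values are illustrative and not part of this article's dataset).

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "city": ["Delhi", None, "Pune", "Goa"],
    "temp_c": [31.0, 29.5, None, None],
})

# isna() marks each missing cell as True; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True if the DataFrame contains at least one missing value
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

A per-column count like this is usually enough to decide which columns need imputation at all.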


Let us now see how it works in practice:

Prerequisite: Import Pandas

Before running this step, make sure pandas is installed on your machine. If it is not, install it first (for example, with pip install pandas).

To import Pandas, you have to write the following code:

import pandas as pd

Sample Dataframe 

We will be using the tennisPlay dataset, which contains a few weather characteristics and one column that records whether someone should play tennis or not.

You can run the code below to create the dataframe.

import pandas as pd

data = {
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', None, 'Mild', None, 'Mild', None, 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', None, 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', None, 'Strong', 'Strong', 'Weak', 'Strong'],
    'Play Tennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}

df = pd.DataFrame(data)

def highlight_none(val):
    # pd.isna() is True for both None and NaN
    if pd.isna(val):
        return 'background-color: yellow'
    else:
        return ''

# Apply the styling function to every cell
# (on pandas 2.1+, Styler.map is the newer name for applymap)
styled_df = df.style.applymap(highlight_none)

# Display the styled DataFrame (renders in a Jupyter notebook)
styled_df


Result 

Sample dataframe

Our dataframe has some missing values that we need to fix.

So, how can we fill them with values of our choice?

How to fill the missing Values?

There are a couple of methods to do so: 

Using fillna() method 

The fillna() method replaces NaN or None values with the value you pass as an argument. By default it returns a new DataFrame; with inplace=True it modifies the original instead.

Example

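# Keep the result in a new variable so that df still has its missing values for the later examples
filled_df = df.fillna('I am now Filled')
filled_df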


Output 

Sample dataframe after fillna


As you can see, the None values are now filled with 'I am now Filled'.

This is the beauty of the fillna() method.

But is it worth putting an arbitrary value in place of the missing ones? Not really. The replacement value should have some relevance to the data.
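
One way to keep the replacement relevant is to choose a separate fill value for each column. The snippet below is a minimal sketch of that idea, assuming a small hypothetical table (the name toy and the fill values 'Unknown' and 'Weak' are illustrative choices, not recommendations from this article).

```python
import pandas as pd

# Hypothetical slice of a weather-style table, for illustration only
toy = pd.DataFrame({
    "Temperature": ["Hot", None, "Mild"],
    "Wind": ["Weak", "Strong", None],
})

# A dict maps each column name to its own replacement value
filled = toy.fillna({"Temperature": "Unknown", "Wind": "Weak"})
print(filled)
```

Because fillna() returns a new DataFrame here, toy itself keeps its missing values.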

Having said that, let us perform some meaningful calculations: 

Mode Imputation 

You impute missing categorical values with the mode, which is the most frequently occurring category in the column.

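# Mode imputation on a single column
mode_value = df['Temperature'].mode()[0]  # most frequent value in the column
# Assign the result back; calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas versions
df['Temperature'] = df['Temperature'].fillna(mode_value)
df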


Output 

Mode Imputation on a single column

So far we have imputed values only for the Temperature column. Let us now do the same for every column that has None values.

# Fill missing values in all columns with the mode of each respective column
df = df.fillna(df.mode().iloc[0])
df


Output 

Mode Imputation on all columns

Every missing value is now filled with the most frequent value of its column.

For numerical data we have several more choices. So, let us explore some methods we can use while dealing with numerical data.

Sample Dataframe for Numerical Data

This DataFrame contains four numerical columns ('Age', 'Income', 'Education_years', and 'Savings') with missing values entered as None. The numbers below are illustrative values chosen for this example.

import pandas as pd

# Illustrative values only
data = {
    'Age': [25, 32, None, 41, 32, 29, None, 52],
    'Income': [40000, 52000, 61000, None, 52000, 45000, 75000, None],
    'Education_years': [12, 16, 16, None, 14, 16, 12, 18],
    'Savings': [5000, None, 12000, 20000, 8000, None, 12000, 30000]
}

df = pd.DataFrame(data)

def highlight_none(val):
    # pd.isna() is True for both None and NaN
    if pd.isna(val):
        return 'background-color: yellow'
    else:
        return ''

# Apply the styling function to every cell
# (on pandas 2.1+, Styler.map is the newer name for applymap)
styled_df = df.style.applymap(highlight_none)

# Display the styled DataFrame (renders in a Jupyter notebook)
styled_df


Output 

Sample Dataframe for Numerical Data

Here, NaN stands for Not a Number. Pandas stores a None entered in a numeric column as NaN.
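
The conversion from None to NaN can be checked directly. The snippet below is a minimal sketch, assuming a short hypothetical Series (the name ages and its values are illustrative).

```python
import pandas as pd

# Hypothetical values for illustration only
ages = pd.Series([25, None, 32])

# The integers were upcast to float64 so the column can hold NaN
print(ages.dtype)

# isna() treats the stored NaN as missing
print(ages.isna().sum())
```

This is why a column of whole numbers often shows up as floats once it contains a missing value.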

Let us now perform some meaningful calculations to fill the missing values. 

Using Mode Imputation 

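This works the same way as it did for the string data, so let us apply it again. The result is stored in a new variable so that df keeps its missing values for the next methods.

# Fill missing values in all columns with the mode of each respective column
df_mode = df.fillna(df.mode().iloc[0])
df_mode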


Output

Using Mode Imputation

Using Mean Imputation

You can use the mean (average) of a column to fill in its missing values.

Here’s how you can achieve it:

# Impute missing values with the mean of each respective column
# numeric_only=True skips any non-numeric columns
df_mean = df.fillna(df.mean(numeric_only=True))
df_mean


Output 

Using Mean Imputation

Using Median Imputation 

The median can also be a good option for replacing missing values.

Here’s how you can achieve it:

# Impute missing values with the median of each respective column
# numeric_only=True skips any non-numeric columns
df_median = df.fillna(df.median(numeric_only=True))
df_median


Output 

Using Median Imputation

When to use which method? 

The choice between using the median, mode, or mean for imputation in pandas (or any data analysis tool) depends on the nature of your data and the problem you're trying to solve. 

Each of these measures has its own use cases:

  • Use the mean when dealing with numerical data that is roughly symmetric (for example, close to normally distributed) and not heavily influenced by outliers.
  • Use the median when dealing with numerical data that is skewed, contains outliers, or when you want a measure of central tendency that is robust to extreme values.
  • Use the mode when dealing with categorical data to impute missing values or when you want to find the most common category.
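
The difference between the mean and the median is easiest to see on data with an outlier. The snippet below is a minimal sketch, assuming a hypothetical income column with one extreme value (the name income and its numbers are illustrative).

```python
import pandas as pd

# Hypothetical incomes with one extreme value, for illustration only
income = pd.Series([30_000, 32_000, 35_000, None, 1_000_000], name="Income")

mean_value = income.mean()      # pulled upward by the outlier
median_value = income.median()  # stays near the typical values

mean_filled = income.fillna(mean_value)
median_filled = income.fillna(median_value)

print(mean_value)    # 274250.0
print(median_value)  # 33500.0
```

Here the mean would impute a value far above three of the four known incomes, while the median stays close to them.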

Frequently Asked Questions

When should I remove missing data instead of imputing it?

You should consider removing missing data when:

  • It represents a small proportion of the dataset, so dropping it will not significantly impact your analysis.
  • The data is missing at random, so dropping it does not introduce bias.
  • Imputation would not be meaningful or appropriate for your analysis.
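
Removal can be sketched with dropna(). The example below is a minimal illustration, assuming a small hypothetical DataFrame (the name toy and its values are made up for this sketch).

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "Age": [25.0, None, 32.0, 41.0],
    "Income": [40_000.0, 52_000.0, None, 61_000.0],
})

# Share of missing values per column, to judge how much would be lost
missing_share = toy.isna().mean()

# Drop every row that has at least one missing value
complete_rows = toy.dropna()

# Drop rows only when a specific column is missing
age_known = toy.dropna(subset=["Age"])

print(missing_share)
print(len(toy), len(complete_rows), len(age_known))
```

Checking the missing share first shows how many rows a drop would cost before you commit to it.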

What factors should I consider when choosing an imputation method?

When choosing an imputation method, consider factors such as the nature of your data, the extent of missingness, the distribution of the data, and the requirements of your analysis or machine learning task. Different methods may be more suitable for different scenarios.
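
As one possible way to act on the nature of the data, the sketch below picks a method per column type: the median for numeric columns and the mode for everything else. The helper impute_by_dtype is hypothetical (it is not a pandas function), and the data is illustrative.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical mixed-type data for illustration only
toy = pd.DataFrame({
    "Age": [25.0, None, 32.0, 41.0],
    "Outlook": ["Sunny", "Rain", None, "Sunny"],
})

def impute_by_dtype(frame):
    """Hypothetical helper: median for numeric columns, mode for the rest."""
    result = frame.copy()
    for column in result.columns:
        if is_numeric_dtype(result[column]):
            fill_value = result[column].median()
        else:
            # Assumes the column has at least one non-missing value
            fill_value = result[column].mode().iloc[0]
        result[column] = result[column].fillna(fill_value)
    return result

imputed = impute_by_dtype(toy)
print(imputed)
```

This is only a starting point; the extent of missingness and the needs of the downstream task may call for a different choice per column.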

Can I impute missing data for categorical variables?

Yes, you can impute missing data for categorical variables. The methods may vary, such as imputing with the most frequent category (mode) or using custom logic based on the specific categorical data.
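
Both options can be sketched on a column with the pandas category dtype. The example below assumes a hypothetical Series named wind and uses 'Missing' as an illustrative placeholder label for the custom-logic case.

```python
import pandas as pd

# Hypothetical categorical column for illustration only
wind = pd.Series(["Weak", None, "Strong", "Weak"], dtype="category")

# Option 1: fill with the most frequent category
mode_filled = wind.fillna(wind.mode().iloc[0])

# Option 2: fill with an explicit placeholder label;
# a new label must be added to the categories before it can be used
with_label = wind.cat.add_categories("Missing").fillna("Missing")

print(mode_filled.tolist())
print(with_label.tolist())
```

The placeholder approach keeps the fact that a value was missing visible to later analysis, which the mode does not.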

How do I impute missing data for time series data in Pandas?

For time series data, you can use methods like forward fill (ffill), backward fill (bfill), or interpolation to impute missing values while considering the time order of the data.
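
The three approaches can be compared side by side. The snippet below is a minimal sketch, assuming hypothetical daily readings (the dates and values are illustrative).

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings for illustration only
dates = pd.date_range("2024-01-01", periods=5, freq="D")
readings = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan], index=dates)

forward = readings.ffill()       # carry the last known value forward
backward = readings.bfill()      # pull the next known value backward
linear = readings.interpolate()  # straight line between known points

print(pd.DataFrame({"ffill": forward, "bfill": backward, "linear": linear}))
```

Note that a backward fill cannot fill the final reading here, because no later value exists to pull from.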

Conclusion

Data imputation is a critical step in the data preprocessing pipeline when dealing with missing data. Remember that the choice of imputation method should depend on the nature of your data, the extent of missingness, and the specific requirements of your analysis or machine learning task. 

By mastering the art of data imputation in Pandas, you can ensure that your data remains robust and reliable, leading to more accurate and insightful analyses.


We hope you liked this article.

"Have fun coding!"
