Introduction
In the world of data analysis and machine learning, missing data is a common issue that needs to be addressed.
Now, why even worry about missing data?
Because it has real costs: missing data can lead to biased results, reduced model accuracy, and hindered decision-making.
To tackle this problem, Pandas, a popular data manipulation library in Python, provides various techniques for data imputation.
In this article, we'll explore what data imputation is and dive into different methods for handling missing data using Pandas.
Understanding Imputation in Pandas
Sometimes a dataframe has cells with no value at all. When you load that kind of data into memory, pandas marks those cells as null by default (displayed as NaN or None). Null values are simply placeholders for data that is absent. So, before we feed the data to a machine learning model, we have to make sure that no null values remain in the dataset.
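Before filling anything, it helps to see where the gaps are. The snippet below is a minimal sketch on made-up data (the `raw` dataframe and its column names are assumptions for illustration, not part of the article's dataset) that counts the null values in each column.

```python
import pandas as pd

# Hypothetical data: None marks an entry that was never recorded
raw = pd.DataFrame({
    "city": ["Pune", None, "Delhi"],
    "temp_c": [31.0, 28.5, None],
})

# isna() flags every missing cell; sum() counts the flags per column
missing_per_column = raw.isna().sum()
print(missing_per_column)
```

A column whose count is zero needs no imputation at all.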
To overcome this, we have Imputation in Pandas.
Data imputation is the process of replacing or filling in missing or incomplete data with estimated or generated values.
It's an essential step in data preprocessing, as many machine learning algorithms cannot handle missing data.
Imputing data lets us keep the dataset intact, instead of discarding incomplete rows, and still draw meaningful insights from it.
Let us now see the working of it:
Prerequisite: Import Pandas
Before running this step, make sure you have pandas installed on your machine. If not, install it first (for example, with pip install pandas).
To import Pandas, you have to write the following code:
import pandas as pd
Sample Dataframe
We will be using the tennisPlay dataset, which contains some characteristics and one column that records whether someone should play tennis or not.
We have some missing values in our dataframe which we need to fix.
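If you want to follow along without the original file, a small stand-in dataframe can be built by hand. This is only a sketch: 'Temperature' is the one column name taken from the article, while the other column names and all the values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the tennisPlay data. Only 'Temperature' is a
# column name from the article; the other names and all values are assumed.
df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", None, "Rain"],
    "Temperature": ["Hot", "Hot", None, "Mild", "Cool", "Hot"],
    "Humidity": ["High", None, "High", "High", "Normal", "Normal"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong"],
    "PlayTennis": ["No", "No", "Yes", "Yes", "Yes", None],
})

# Number of missing values in each column
print(df.isna().sum())
```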
Now, how can we fill them with the customised values?
How to fill the missing values?
There are a couple of methods to do so:
Using fillna() method
The fillna() method replaces the NaN or null values with the value passed to it as an argument.
Example
df.fillna('I am now Filled', inplace=True)
df
Output
As you can see, the None values are now filled with 'I am now Filled'.
This is the beauty of the fillna() method.
What do you think about this approach? Is it worth putting an arbitrary value in place of the missing ones? No, right? The replacement value should have some relevance to the data.
Having said that, let us start again from the original dataframe and fill the gaps with more meaningful values:
Mode Imputation
You impute missing categorical values with the mode, which is the most frequently occurring category in the column.
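# Mode imputation
mode_value = df['Temperature'].mode()[0]  # Get the mode of the column
# Assign the result back; calling fillna(..., inplace=True) on a selected column may not update df
df['Temperature'] = df['Temperature'].fillna(mode_value)  # Replace missing values with the mode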
df
Output
So far, we have imputed values only for the Temperature column. Let us now do the same for every column that has None values.
# Fill missing values in all columns with the mode of each respective column
df = df.fillna(df.mode().iloc[0])
df
Output
Isn’t it amazing?
You’ll be more amazed to know that we have several choices in the case of numerical data. So, let us explore some methods which we can use while dealing with the numerical data.
Sample Dataframe for Numerical Data
This DataFrame contains four numerical columns ('Age', 'Income', 'Education_years', and 'Savings') with missing values represented as None.
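A dataframe of this shape can be sketched as follows. The column names follow the article, but the values are invented for illustration and are not the article's original data.

```python
import pandas as pd

# Illustrative values only; the column names follow the article
df = pd.DataFrame({
    "Age": [25, 32, None, 41, 32],
    "Income": [40000, None, 52000, 61000, 48000],
    "Education_years": [12, 16, 14, None, 16],
    "Savings": [5000, 12000, None, 20000, 9000],
})

print(df.dtypes)        # None is stored as NaN, so the columns are floating point
print(df.isna().sum())  # one missing value per column in this sketch
```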
Let us now perform some meaningful calculations to fill the missing values.
Using Mode Imputation
This works the same way as it did for the string data. So, let us apply it again.
# Fill missing values in all columns with the mode of each respective column
df = df.fillna(df.mode().iloc[0])
df
Output
Using Mean Imputation
You can use the mean (average) of each column to fill in the missing values.
Here’s how you can achieve it:
# Impute missing values with the mean of each respective column
df.fillna(df.mean(), inplace=True)
Output
Using Median Imputation
Median can also be a good option to replace the missing values.
Here’s how you can achieve it:
# Impute missing values with the median of each respective column
df.fillna(df.median(), inplace=True)
Output
When to use which method?
The choice between using the median, mode, or mean for imputation in pandas (or any data analysis tool) depends on the nature of your data and the problem you're trying to solve.
Each of these measures has its own use cases:
Use the mean when dealing with numerical data that is normally distributed and not heavily influenced by outliers
Use the median when dealing with numerical data that is skewed, contains outliers, or when you want a measure of central tendency that is robust to extreme values
Use the mode when dealing with categorical data to impute missing values or when you want to find the most common category
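To make the difference between mean and median concrete, here is a minimal sketch on a made-up income column with one extreme value. The numbers and variable names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical incomes with one extreme value and one missing entry
income = pd.Series([30000, 32000, 35000, 38000, 1000000, None])

mean_value = income.mean()      # pulled upward by the single outlier
median_value = income.median()  # stays close to the typical values

filled_with_mean = income.fillna(mean_value)
filled_with_median = income.fillna(median_value)

print(mean_value, median_value)  # 227000.0 35000.0
```

Here the mean would fill the gap with a value far above what most rows contain, while the median stays in the typical range, which is why the median is usually the safer choice for skewed columns.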
Frequently Asked Questions
When should I remove missing data instead of imputing it?
You should consider removing missing data when:
It represents a small proportion of the dataset and will not significantly impact your analysis
The missing data is random and does not introduce bias
Imputation would not be meaningful or appropriate for your analysis
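If removal is the better option, pandas offers dropna() for it. The snippet below is a small sketch on hypothetical data (the dataframe and its values are assumed for illustration) showing how to drop rows with any missing value, or only rows where a chosen column is missing.

```python
import pandas as pd

# Hypothetical data with a few missing entries
df = pd.DataFrame({
    "Age": [25, None, 41, 37],
    "Income": [40000, 52000, None, 61000],
})

rows_complete = df.dropna()                # drop rows with any missing value
rows_with_age = df.dropna(subset=["Age"])  # drop rows only when 'Age' is missing

print(len(df), len(rows_complete), len(rows_with_age))  # 4 2 3
```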
What factors should I consider when choosing an imputation method?
When choosing an imputation method, consider factors such as the nature of your data, the extent of missingness, the distribution of the data, and the requirements of your analysis or machine learning task. Different methods may be more suitable for different scenarios.
Can I impute missing data for categorical variables?
Yes, you can impute missing data for categorical variables. The methods may vary, such as imputing with the most frequent category (mode) or using custom logic based on the specific categorical data.
How do I impute missing data for time series data in Pandas?
For time series data, you can use methods like forward fill (ffill), backward fill (bfill), or interpolation to impute missing values while considering the time order of the data.
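The three approaches can be compared side by side. This is a minimal sketch on invented daily readings; the dates, values, and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical daily readings with gaps
idx = pd.date_range("2024-01-01", periods=5, freq="D")
readings = pd.Series([10.0, None, None, 16.0, None], index=idx)

forward = readings.ffill()       # carry the last known value forward
backward = readings.bfill()      # pull the next known value backward
linear = readings.interpolate()  # straight line between known points

print(pd.DataFrame({"ffill": forward, "bfill": backward, "interpolate": linear}))
```

Note that backward fill cannot fill the last row in this sketch, because no later value exists to copy from.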
Conclusion
Data imputation is a critical step in the data preprocessing pipeline when dealing with missing data. Remember that the choice of imputation method should depend on the nature of your data, the extent of missingness, and the specific requirements of your analysis or machine learning task.
By mastering the art of data imputation in Pandas, you can ensure that your data remains robust and reliable, leading to more accurate and insightful analyses.