Example of Label Encoding
Here is a basic example of label encoding in Python:
from sklearn.preprocessing import LabelEncoder
# Sample data
categories = ['dog', 'cat', 'rabbit', 'cat', 'dog', 'rabbit']
# Initialize LabelEncoder
label_encoder = LabelEncoder()
# Fit and transform the data
encoded_labels = label_encoder.fit_transform(categories)
print("Original Categories:", categories)
print("Encoded Labels:", encoded_labels)

Output:
Original Categories: ['dog', 'cat', 'rabbit', 'cat', 'dog', 'rabbit']
Encoded Labels: [1 0 2 0 1 2]
Explanation:
- The LabelEncoder sorts the unique categories alphabetically and assigns each one an integer, starting from 0:
  - "cat" = 0
  - "dog" = 1
  - "rabbit" = 2
- Most machine learning algorithms require numeric input, so this encoding lets them work with the categorical values.
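The mapping described above can be inspected and reversed on the fitted encoder. The following is a minimal sketch; the variable names learned_classes and decoded_labels are introduced here only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

categories = ['dog', 'cat', 'rabbit', 'cat', 'dog', 'rabbit']

label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(categories)

# classes_ holds the sorted unique categories; the category at position i is encoded as i
learned_classes = label_encoder.classes_.tolist()
print("Classes:", learned_classes)

# inverse_transform maps the integers back to the original strings
decoded_labels = label_encoder.inverse_transform(encoded_labels).tolist()
print("Decoded:", decoded_labels)
```

Keeping the fitted encoder around is what makes it possible to turn model predictions back into readable category names later.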
Example of Label Encoding (Iris Dataset)
Let’s see how label encoding is applied to a real dataset, the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Load Iris dataset
data = load_iris()
iris_df = pd.DataFrame(data.data, columns=data.feature_names)
iris_df['species'] = data.target_names[data.target]
# Initialize LabelEncoder
label_encoder = LabelEncoder()
iris_df['encoded_species'] = label_encoder.fit_transform(iris_df['species'])
print(iris_df[['species', 'encoded_species']].head())

Output:
   species  encoded_species
0   setosa                0
1   setosa                0
2   setosa                0
3   setosa                0
4   setosa                0
Explanation:
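- The species column is encoded as integers: setosa = 0, versicolor = 1, and virginica = 2. Only setosa appears in the output because the first five rows of the dataset all belong to that species.
- Machine learning algorithms can now process this encoded column.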
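Because head() shows only the first five rows, it can help to list one row per species to see every code. The sketch below repeats the setup and uses drop_duplicates for that purpose; the name species_codes is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Rebuild the DataFrame with a readable species column
data = load_iris()
iris_df = pd.DataFrame(data.data, columns=data.feature_names)
iris_df['species'] = data.target_names[data.target]

label_encoder = LabelEncoder()
iris_df['encoded_species'] = label_encoder.fit_transform(iris_df['species'])

# Keep one row per distinct (species, code) pair so every code is visible
species_codes = (
    iris_df[['species', 'encoded_species']]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(species_codes)
```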
Creating the Dataset
To demonstrate label encoding in action, let's create a sample dataset. We'll use the popular Pandas library to create a DataFrame containing categorical data.
Example
import pandas as pd
# Create a sample dataset
data = {
    'color': ['red', 'green', 'blue', 'green', 'red', 'blue'],
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large'],
    'material': ['wood', 'metal', 'plastic', 'wood', 'metal', 'plastic']
}
# Create a DataFrame
df = pd.DataFrame(data)
print(df)
Output:
   color    size  material
0    red   small      wood
1  green  medium     metal
2   blue   large   plastic
3  green  medium      wood
4    red   small     metal
5   blue   large   plastic
In this example, we create a dictionary called `data` that contains three categorical variables: color, size, and material. We then pass the dictionary to the pd.DataFrame() constructor to create the DataFrame `df`.
The resulting DataFrame has three columns, one per categorical variable, and each row represents one observation. With the dataset ready, we can apply label encoding to transform the categorical variables into numerical form.
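One way to apply label encoding to this DataFrame is to fit a separate LabelEncoder for each column. This is a minimal sketch, not the only approach; the names encoded_df and encoders are introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'red', 'blue'],
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large'],
    'material': ['wood', 'metal', 'plastic', 'wood', 'metal', 'plastic'],
})

encoders = {}            # one fitted encoder per column, kept for decoding later
encoded_df = df.copy()   # leave the original DataFrame untouched

for column in df.columns:
    encoder = LabelEncoder()
    encoded_df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder

print(encoded_df)
```

Each column gets its own alphabetical mapping, so in the size column "large" becomes 0, "medium" 1, and "small" 2. The codes therefore do not reflect the natural small-to-large order.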
Limitations of Label Encoding
Although label encoding is simple, it has limitations, especially when applied to nominal data, where the categories have no natural order.
- Ordinal Misinterpretation: Algorithms may treat the encoded integers as having a meaningful order or magnitude, which can bias the results.
- Inconsistent Results: Fitting a separate encoder on each dataset can assign different integers to the same category, so the encodings are not comparable across datasets.
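The inconsistency described above can be reproduced with two small lists. In this sketch, train_colors and new_colors are made-up sample data; the second list simply lacks one of the categories.

```python
from sklearn.preprocessing import LabelEncoder

train_colors = ['red', 'green', 'blue']
new_colors = ['red', 'green']  # 'blue' does not appear in this dataset

# Problem: a fresh encoder per dataset learns a different mapping
train_codes = LabelEncoder().fit_transform(train_colors).tolist()
refit_codes = LabelEncoder().fit_transform(new_colors).tolist()
print("Fitted on train:", train_codes)   # 'red' is encoded as 2
print("Refitted on new:", refit_codes)   # 'red' is encoded as 1

# Remedy: fit once, then reuse the same encoder with transform()
encoder = LabelEncoder().fit(train_colors)
consistent_codes = encoder.transform(new_colors).tolist()
print("Reused encoder:", consistent_codes)  # 'red' is encoded as 2 again
```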
Example for Limitation of Label Encoding
Example
from sklearn.preprocessing import LabelEncoder
# Categories with no inherent order
categories = ['red', 'green', 'blue']
# Initialize LabelEncoder
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(categories)
print("Original Categories:", categories)
print("Encoded Labels:", encoded_labels)

Output:
Original Categories: ['red', 'green', 'blue']
Encoded Labels: [2 1 0]
Here, "blue" is encoded as 0, "green" as 1, and "red" as 2. However, the colors have no inherent order, so a model that treats these codes as magnitudes might wrongly infer that red > green > blue.
Nominal Scale
Nominal data represents categories without any order or ranking. For example:
- Categories: "Apple," "Banana," "Cherry"
- Encoded Labels: [0, 1, 2]
In such cases, label encoding is often unsuitable as it can imply a non-existent order among the categories.
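A common alternative for nominal data is one-hot encoding, which is a different technique from label encoding: each category gets its own 0/1 column, so no order is implied. The sketch below uses pandas' get_dummies; the DataFrame and column names are made up for illustration.

```python
import pandas as pd

fruits = pd.DataFrame({'fruit': ['Apple', 'Banana', 'Cherry', 'Apple']})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(fruits['fruit'], prefix='fruit', dtype=int)
print(one_hot)
```

The trade-off is width: one-hot encoding adds one column per category, which can become unwieldy for columns with many distinct values.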
Ordinal Scale
Ordinal data represents categories with a meaningful order. For example:
- Categories: "Low," "Medium," "High"
- Encoded Labels: [0, 1, 2]
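In this case, label encoding works well, provided the integers are assigned in the same order as the categories ("Low" = 0, "Medium" = 1, "High" = 2). Note that scikit-learn's LabelEncoder sorts labels alphabetically ("High" = 0, "Low" = 1, "Medium" = 2), so the intended order usually has to be specified explicitly.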
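To keep the codes aligned with the real order, the order has to be stated somewhere. The sketch below contrasts LabelEncoder's alphabetical codes with a pandas ordered categorical; the sample list levels is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

levels = ['Low', 'High', 'Medium', 'Low']

# LabelEncoder sorts alphabetically: High = 0, Low = 1, Medium = 2
alphabetical_codes = LabelEncoder().fit_transform(levels).tolist()
print("Alphabetical:", alphabetical_codes)

# An ordered categorical uses the order given in `categories`: Low = 0, Medium = 1, High = 2
ordered = pd.Categorical(levels, categories=['Low', 'Medium', 'High'], ordered=True)
ordered_codes = ordered.codes.tolist()
print("Ordered:", ordered_codes)
```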
Label Encoding Using the Scikit-learn Library
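The scikit-learn library provides a straightforward way to perform label encoding through its LabelEncoder class. Once fitted, the same encoder can be reused so that every dataset is encoded with the same mapping. Note that the scikit-learn documentation describes LabelEncoder as intended for target labels (y) rather than input features.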
Example
from sklearn.preprocessing import LabelEncoder
# Sample data
animals = ['dog', 'cat', 'rabbit']
# Initialize LabelEncoder
label_encoder = LabelEncoder()
# Fit and transform the data
encoded_animals = label_encoder.fit_transform(animals)
print("Encoded Values:", encoded_animals)
# .tolist() converts the NumPy strings to plain Python strings so the keys print cleanly
print("Mapping:", dict(zip(label_encoder.classes_.tolist(), range(len(label_encoder.classes_)))))

Output:
Encoded Values: [1 0 2]
Mapping: {'cat': 0, 'dog': 1, 'rabbit': 2}
Explanation:
LabelEncoder is a convenient class for converting categories to integers. Its classes_ attribute stores the sorted categories, and the mapping built from it shows which integer was assigned to each category.
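One behavior worth knowing is that a fitted LabelEncoder only knows the categories it saw during fitting. The sketch below shows that transform() raises a ValueError for an unseen label; 'hamster' is a made-up example value.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(['dog', 'cat', 'rabbit'])

# 'hamster' was not present when the encoder was fitted
try:
    label_encoder.transform(['hamster'])
    unseen_error = None
except ValueError as error:
    unseen_error = error
    print("Could not encode:", error)
```

In practice this means the encoder should be fitted on data that contains every category expected later, or unseen values need to be handled before calling transform().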
Label Encoding Using Category Codes
For pandas users, the category data type provides a simple method to encode labels using cat.codes.
Example
import pandas as pd
# Sample data
data = {'fruits': ['apple', 'banana', 'cherry', 'apple', 'banana']}
fruits_df = pd.DataFrame(data)
# Convert to categorical and use category codes
fruits_df['encoded'] = fruits_df['fruits'].astype('category').cat.codes
print(fruits_df)

Output:
   fruits  encoded
0   apple        0
1  banana        1
2  cherry        2
3   apple        0
4  banana        1
Explanation:
- The astype('category') method converts the column to the pandas categorical type.
- The cat.codes attribute returns the integer code of each value; by default the codes follow the sorted order of the categories.
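The code-to-category mapping can be recovered from cat.categories, and missing values receive the code -1. The following sketch illustrates both points on a made-up Series that contains one missing value.

```python
import pandas as pd

fruits = pd.Series(['apple', 'banana', None, 'cherry', 'apple'])
as_category = fruits.astype('category')

# Missing values are not a category, so they get the code -1
codes = as_category.cat.codes.tolist()
print("Codes:", codes)

# cat.categories lists the categories in code order
code_to_category = dict(enumerate(as_category.cat.categories))
print("Mapping:", code_to_category)
```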
Frequently Asked Questions
What is label encoding in Python?
Label encoding converts categorical data into numerical values so that machine learning algorithms can process it. It assigns a unique integer to each category.
What are the limitations of label encoding?
Label encoding can misrepresent relationships in nominal data and create misleading patterns when categories lack a meaningful order.
How can I perform label encoding in Python?
You can use libraries like scikit-learn’s LabelEncoder or pandas’ category type with cat.codes.
Conclusion
Label encoding is a fundamental technique in data preprocessing for converting categorical data into numerical format. While it is effective, it is essential to understand its limitations, particularly when working with nominal data. Using tools like scikit-learn or pandas, you can implement label encoding easily and consistently.