1. Introduction
2. Difference between categorical and numerical data
3. Types of categorical data
4. Operations on Categorical data
4.1. Data Creation and Basic Functions
4.2. Working with categories
4.3. Sorting and Ordering
4.4. Comparisons
4.5. Missing data
5. FAQs
6. Key Takeaways
Last Updated: Mar 27, 2024

# Categorical Data  - Intro & Hands-On

Soham Medewar

### Introduction

Data that can be divided into groups is called categorical data. Examples include age group, sex, and race. Categorical data is generally analyzed using data tables.

A two-way table presents categorical data by counting the number of observations that fall into each combination of groups for two variables, with one variable laid out in rows and the other in columns.
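As a minimal sketch of a two-way table, `pd.crosstab` can count the observations for each pair of groups. The survey data and the column names `sex` and `age_group` are made up for illustration.

```python
import pandas as pd

# Hypothetical survey data: two categorical variables per respondent.
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "age_group": ["18-30", "18-30", "31-50", "31-50", "31-50", "18-30"],
})

# Rows are the groups of "sex", columns are the groups of "age_group";
# each cell counts the observations that fall into that combination.
table = pd.crosstab(df["sex"], df["age_group"])
print(table)
```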

### Difference between categorical and numerical data

Categorical data is stored and identified by the name or label given to each value, whereas numerical data is stored in the form of numbers. Categorical data is qualitative, and numerical data is quantitative.

Numerical data comes in two types: discrete and continuous. Categorical data is divided into ordinal and nominal data.

Categorical data is typically visualized with bar graphs and pie charts, whereas numerical data can also be shown with scatter plots, histograms, and other chart types.


### Types of categorical data

Categorical data is divided into two types:

• Ordinal Data: The ordering of the data matters. Classes have an inherent order.
• Nominal Data: The ordering of the data doesn't matter. Classes do not have an inherent order.
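A small sketch of how the two types can be expressed in pandas, assuming made-up size and colour values: an ordinal variable gets `ordered=True` with an explicit category order, while a nominal variable keeps the default unordered category type.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordinal: sizes have an inherent order, small < medium < large.
size_dtype = CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["medium", "small", "large"], dtype=size_dtype)

# Nominal: colours have no inherent order, so the default (unordered) is kept.
colours = pd.Series(["red", "green", "blue"], dtype="category")

print(sizes.cat.ordered)    # True
print(colours.cat.ordered)  # False
```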

### Operations on Categorical data

I will be using the pandas library for performing operations on categorical data.

#### Data Creation and Basic Functions

● Let us create categorical data using pd.Series() by specifying dtype="category".
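A minimal sketch of this step; the sample values are made up for illustration.

```python
import pandas as pd

# Passing dtype="category" makes pandas store the values as a categorical.
s = pd.Series(["one", "two", "three", "two", "one", "one"], dtype="category")

print(s.dtype)           # category
print(s.cat.categories)  # the distinct categories, in lexical order for strings
```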

● Getting the unique values from the dataset.
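One way to do this, sketched on a made-up Series, is the `unique()` method, which returns the observed values in order of first appearance.

```python
import pandas as pd

s = pd.Series(["one", "two", "three", "two", "one", "one"], dtype="category")

# unique() returns a Categorical holding each observed value once.
unique_values = s.unique()
print(list(unique_values))  # ['one', 'two', 'three']
```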

● Getting the count of every category from the dataset.
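A sketch using `value_counts()` on a made-up Series; the result is indexed by category and holds the number of occurrences of each.

```python
import pandas as pd

s = pd.Series(["one", "two", "three", "two", "one", "one"], dtype="category")

# value_counts() tallies how many times each category occurs.
counts = s.value_counts()
print(counts)
```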

● Creating categorical data using a pandas DataFrame where each column is of category type.
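A minimal sketch, assuming two made-up columns `A` and `B`: passing `dtype="category"` to the DataFrame constructor converts every column.

```python
import pandas as pd

# dtype="category" applies to every column of the DataFrame.
df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")
print(df)
```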

● Using the describe() function to get details of each column in the dataset.
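Sketched on a made-up DataFrame: for categorical columns, `describe()` reports the count, the number of unique values, the most frequent value (`top`), and its frequency (`freq`).

```python
import pandas as pd

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

# For categorical columns the summary rows are count, unique, top and freq.
summary = df.describe()
print(summary)
```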

● Getting the data type of each column in the dataset.
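One possible way, sketched on a made-up DataFrame, is the `dtypes` attribute; `df.info()` prints similar information along with non-null counts.

```python
import pandas as pd

df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")

# dtypes is a Series mapping each column name to its data type.
column_types = df.dtypes
print(column_types)
```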

#### Working with categories

● Renaming the categories in the dataset.
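A sketch using `Series.cat.rename_categories()`; the starting labels "a" to "d" and the new names are assumptions chosen for illustration. The new names are matched to the existing categories by position.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="category")

# The list must have one new name per existing category, in the same order.
renamed = s.cat.rename_categories(["one", "two", "three", "four"])
print(renamed.tolist())  # ['one', 'two', 'three', 'four']
```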

● Adding and removing categories in the dataset.

Adding the "five" category to the dataset.

Removing the "three" category from the dataset.
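Both steps can be sketched with `Series.cat.add_categories()` and `Series.cat.remove_categories()`. The starting values are made up; note that values belonging to a removed category become NaN.

```python
import pandas as pd

s = pd.Series(["one", "two", "three", "four"], dtype="category")

# "five" becomes a valid category even though no value uses it yet.
added = s.cat.add_categories(["five"])
print(list(added.cat.categories))

# Removing "three" turns every value that was "three" into NaN.
removed = added.cat.remove_categories(["three"])
print(removed)
```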

#### Sorting and Ordering

If the categorical data is ordered, the order of the categories carries meaning and operations such as .min() and .max() can be performed. If the data is unordered, .min() and .max() cannot be performed and raise a TypeError.

● Unordered data

Sorting the unordered data.
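A sketch with made-up values: an unordered categorical can still be sorted (values follow the order of its categories), but calling `.min()` on it raises a TypeError.

```python
import pandas as pd

s = pd.Series(["b", "a", "c", "a"], dtype="category")  # unordered by default

# Sorting follows the order of s.cat.categories (lexical for these strings).
sorted_s = s.sort_values()
print(sorted_s.tolist())  # ['a', 'a', 'b', 'c']

# min()/max() need an ordered categorical, so this raises a TypeError.
try:
    s.min()
    min_failed = False
except TypeError as err:
    min_failed = True
    print("TypeError:", err)
```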

● Ordered data

Sorting the ordered data.

Performing the .min() and .max() operation on ordered data.
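A sketch of both steps, assuming a made-up ordered scale low < medium < high defined through `CategoricalDtype`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# ordered=True makes the listed order meaningful: low < medium < high.
level_dtype = CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
s = pd.Series(["medium", "low", "high", "low"], dtype=level_dtype)

# Sorting follows the category order, not alphabetical order.
sorted_s = s.sort_values()
print(sorted_s.tolist())  # ['low', 'low', 'medium', 'high']

# min() and max() are defined because the categorical is ordered.
print(s.min(), s.max())   # low high
```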

● Reordered data

Reordering the categories is done by using the Categorical.reorder_categories() and the Categorical.set_categories() methods.

The Categorical.reorder_categories() method does not allow new categories, and every old category must be included in the new list. After reordering, the sort order is the same as the order of the categories.

For example, the categories can be reordered so that the data sorts in the "B" < "C" < "A" < "D" order.
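A sketch of reordering with `Series.cat.reorder_categories()`; the sample values are made up, and the new list contains exactly the existing categories in the desired order.

```python
import pandas as pd

s = pd.Series(list("ABCDAB"), dtype="category")

# Same four categories, new order: B < C < A < D.
reordered = s.cat.reorder_categories(["B", "C", "A", "D"], ordered=True)

# Sorting now follows the category order instead of alphabetical order.
sorted_s = reordered.sort_values()
print(sorted_s.tolist())  # ['B', 'B', 'C', 'A', 'A', 'D']
```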

● Multi-Column Sorting

Consider a dataset with two categorical columns, where the data is sorted according to both columns.

Creating the dataframe.

First, sorting the data with respect to the "B" column and then to the "A" column.
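A sketch of both steps. The values and the category orders (e < a < b for "A", y < x for "B") are assumptions made for illustration.

```python
import pandas as pd

# Two ordered categorical columns with custom category orders.
df = pd.DataFrame({
    "A": pd.Categorical(list("bbeebbaa"), categories=["e", "a", "b"], ordered=True),
    "B": pd.Categorical(list("xyxyyxyx"), categories=["y", "x"], ordered=True),
})

# Sort by "B" first; rows that tie on "B" are then ordered by "A".
result = df.sort_values(by=["B", "A"])
print(result)
```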

#### Comparisons

Comparing categorical data with other objects is possible in three cases:

1. Comparing equality (== or !=) of the categorical data to a list-like object of the same length.
2. All types of comparisons (==, !=, >=, <=, >, <) of categorical data with other categorical data, provided both are ordered and have the same categories.
3. Comparing categorical data with a scalar.

We will create three Series to illustrate comparisons.

● Comparing categorical data with other categorical data and with a scalar, using the same categories and ordering.
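A sketch with made-up Series named `cat`, `cat_base`, and `cat_other`. The first two share the ordered categories 3 < 2 < 1, so order comparisons between them work; `cat_other` has different categories, so comparing against it raises a TypeError.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered categories where 3 < 2 < 1.
dtype = CategoricalDtype(categories=[3, 2, 1], ordered=True)
cat = pd.Series([1, 2, 3]).astype(dtype)
cat_base = pd.Series([2, 2, 2]).astype(dtype)
cat_other = pd.Series([2, 2, 2]).astype(CategoricalDtype(ordered=True))

# Element-wise comparison follows the category order, not numeric order.
greater = cat > cat_base
print(greater.tolist())         # [True, False, False]

# A scalar works as long as it is one of the categories.
greater_than_two = cat > 2
print(greater_than_two.tolist())  # [True, False, False]

# Different categories cannot be compared.
try:
    cat > cat_other
    mismatch_failed = False
except TypeError:
    mismatch_failed = True
```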

● Equality comparisons
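A sketch of equality checks against another categorical, a list-like object of the same length, and a scalar. The Series names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

dtype = CategoricalDtype(categories=[3, 2, 1], ordered=True)
cat = pd.Series([1, 2, 3]).astype(dtype)
cat_base = pd.Series([2, 2, 2]).astype(dtype)

# Categorical vs. categorical with the same categories.
eq_cat = cat == cat_base
print(eq_cat.tolist())     # [False, True, False]

# Categorical vs. a list-like object of the same length.
eq_array = cat == np.array([1, 2, 3])
print(eq_array.tolist())   # [True, True, True]

# Categorical vs. a scalar.
eq_scalar = cat == 2
print(eq_scalar.tolist())  # [False, True, False]
```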

#### Missing data

In pandas, we use np.nan to represent missing values.

When working with categorical codes, missing values have the code -1.

NaN values are not counted among the categories of a categorical.
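A small sketch on a made-up Series showing both points: the missing entry gets the code -1, and NaN does not appear in the list of categories.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

# Each value is stored as an integer code; the missing value gets -1.
codes = s.cat.codes.tolist()
print(codes)       # [0, 1, -1, 0]

# NaN is not one of the categories.
categories = list(s.cat.categories)
print(categories)  # ['a', 'b']
```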

Methods for working with missing values in categorical data:

• isna(): Returns False where the data is not null and True where the data is null.
• fillna("a"): Fills all the NaN values in the categorical data with "a" (the fill value must be one of the existing categories).
• dropna(): Drops all the NaN values from the dataset.
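The three methods above can be sketched on a made-up Series as follows. The fill value "a" works here because it is already one of the categories.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan, "a"], dtype="category")

# isna() flags the missing entries.
missing = s.isna()
print(missing.tolist())  # [False, False, True, False]

# fillna() replaces NaN with an existing category.
filled = s.fillna("a")
print(filled.tolist())   # ['a', 'b', 'a', 'a']

# dropna() removes the rows that hold NaN.
dropped = s.dropna()
print(dropped.tolist())  # ['a', 'b', 'a']
```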

### FAQs

1. How do you identify categorical data?
One rough heuristic: calculate the difference between the total number of values in the data set and the number of unique values, then express that difference as a percentage of the total number of values. If the percentage is 90% or more, meaning that most values are repeats, the data set is likely composed of categorical values.
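The heuristic can be sketched as a small helper. The function name `looks_categorical` and its `threshold` parameter are hypothetical, not part of pandas, and the result is only a rough signal.

```python
import pandas as pd

def looks_categorical(values, threshold=0.9):
    """Hypothetical helper: True if the share of repeated values meets the threshold."""
    s = pd.Series(values)
    total = len(s)
    if total == 0:
        return False
    unique = s.nunique()
    # Share of values that are repeats of an already-seen value.
    return (total - unique) / total >= threshold

few_labels = ["a", "b"] * 50      # 100 values, 2 unique -> 98% repeats
all_distinct = list(range(100))   # 100 values, 100 unique -> 0% repeats

print(looks_categorical(few_labels))    # True
print(looks_categorical(all_distinct))  # False
```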

2. What are "categoricals" in pandas?
Categoricals are a pandas data type corresponding to categorical variables in statistics.

3. What is one-hot encoding in Python?
A one-hot encoding is a representation of categorical variables as binary vectors.
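As a minimal sketch, `pd.get_dummies` produces one indicator column per category; the colour values are made up for illustration. Each row has exactly one column set.

```python
import pandas as pd

s = pd.Series(["red", "green", "blue", "green"], dtype="category")

# One column per category; each row marks the category it belongs to.
encoded = pd.get_dummies(s)
print(encoded)
```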

4. What type of data is categorical?
Categorical data is a type of data that can be sorted into groups or categories with the aid of names or labels. This grouping is usually made according to the characteristics of the data and the similarities between those characteristics, through a method known as matching.

5. What are the two types of categorical data?
There are two types of categorical variables, nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, gender is a categorical variable having two categories (male and female) with no intrinsic ordering to the categories. An ordinal variable has a clear ordering.