Code360 powered by Coding Ninjas X Naukri.com
Table of contents
1. Introduction
2. Categorical Data
2.1. Mean
2.2. Median
2.3. Mode
3. Measures of central tendency of Categorical data
4. Implementation
5. Frequently Asked Question
6. Key Takeaways
Last Updated: Mar 27, 2024

Categorical Data - Measure of Central Tendency

Author Mayank Goyal

Introduction

A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. Such measures are also known as measures of central location. We can think of central tendency as the tendency of data to cluster around a central value. In statistics, the three most commonly used measures of central tendency are the mean, the median, and the mode.

 

Choosing the best measure of central tendency is not a piece of cake; it depends on the type of data. In this post, we will explore these measures of central tendency and show how to calculate central tendency for categorical data. Before diving into the calculation, let us first look at categorical data.

 

Categorical Data

The accuracy of a machine learning model depends not only on the algorithm and hyperparameters we choose but also on how we process the different types of features and feed them to the model. Because most machine learning models accept only numerical variables, preprocessing the categorical variables becomes necessary. We need to transform these categorical variables into numbers so that the model can understand them and extract valuable information.

 

Categorical variables are usually represented as 'categories' or 'strings' and take a finite number of values. Some examples are:

  • The city where people live: Delhi, Mumbai, Ahmedabad, Bangalore, etc.
  • The highest degree of a person: High school, Diploma, Bachelor's, Master's, Ph.D.
  • A student's grades: A+, A, B+, B, B-.

 

In the above examples, each variable can take only a fixed set of possible values. There are two kinds of categorical data:

  • Ordinal Data: The classes have an inherent order, i.e., the order matters.
  • Nominal Data: The classes do not have an inherent order, i.e., the order does not matter.

 

When encoding ordinal data, we should retain the order of the categories. For example, the highest degree a person holds, from the examples above, provides vital information about their qualification.

 

When encoding nominal data, only the presence or absence of a category matters, because no order is present. Take, for example, the city where a person resides. It is essential to retain which city that is, but we do not have to give importance to order or sequence: living in Delhi does not rank above or below living in Mumbai.

 

For encoding categorical data, we can use the Python package category_encoders.
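To make the difference between the two encoding styles concrete, here is a minimal sketch in plain Python. It does not use category_encoders; the category order and the sample values are assumptions chosen only for illustration.

```python
# Ordinal feature: the categories have a natural order, so map them to ranks.
# The order below is an assumption made for this illustration.
degree_order = ["High school", "Diploma", "Bachelor's", "Master's", "Ph.D."]
degree_rank = {degree: rank for rank, degree in enumerate(degree_order)}

degrees = ["Diploma", "Ph.D.", "High school", "Bachelor's"]
encoded_degrees = [degree_rank[d] for d in degrees]
print(encoded_degrees)  # [1, 4, 0, 2]

# Nominal feature: no order exists, so use one indicator column per category
# (one-hot encoding) instead of ranks.
cities = ["Delhi", "Mumbai", "Delhi", "Bangalore"]
city_levels = sorted(set(cities))  # ['Bangalore', 'Delhi', 'Mumbai']
one_hot = [[int(city == level) for level in city_levels] for city in cities]
print(one_hot)  # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

The rank encoding preserves "Diploma comes before Ph.D.", while the one-hot encoding records only which city is present and implies no ranking between cities.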

 

Now that we know what categorical data is, let us look into the different measures of central tendency.

 

Mean

The mean is one of the most common and well-known measures of central tendency. It can be applied to both continuous and discrete data. Calculating the mean is simple: it is the sum of all observed values divided by the number of observations.

 

In a symmetric distribution such as the normal distribution, the mean lies at the center of the data. In a skewed distribution, however, the mean can miss the mark. The problem occurs because outliers have a significant impact on the mean: the extreme values in an extended tail pull the mean away from the center, and the more skewed the distribution, the further the mean is drawn away. Consequently, it is best to use the mean to measure central tendency when you have a symmetric distribution.
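The effect of an outlier on the mean can be seen in a small sketch using Python's built-in statistics module. The salary figures are hypothetical and were picked so that the arithmetic is easy to follow.

```python
from statistics import mean

# Hypothetical monthly salaries (in thousands); values chosen for illustration.
salaries = [30, 32, 35, 38, 40]
print(mean(salaries))  # 35

# Adding one extreme value drags the mean toward the long tail.
salaries_with_outlier = salaries + [425]
print(mean(salaries_with_outlier))  # 100
```

A single extreme observation moves the mean from 35 to 100, even though five of the six values lie between 30 and 40.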

 

Median

The median represents the middle value: it splits the dataset into two halves. To find the median, sort the data in ascending order and find the data point with an equal number of values above and below it; when the number of observations is even, take the average of the two middle values. Outliers and skewed data have less impact on the median. When we have a skewed distribution, the median is a better measure of central tendency than the mean.

 

In the case of a symmetric distribution, the mean and median are approximately equal and lie around the center. In the case of a skewed distribution, outliers in the tail pull the mean away from the center towards the long tail.

 

Note: Statisticians say that the median is robust, while the mean is sensitive to outliers and skewed distributions.
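The median calculation and its robustness can be sketched with the statistics module. The sample values below are made up for illustration.

```python
from statistics import mean, median

# Odd number of observations: the median is the single middle value.
odd_sample = [7, 1, 5, 3, 9]
print(median(odd_sample))  # 5

# Even number of observations: the median averages the two middle values.
even_sample = [7, 1, 5, 3]
print(median(even_sample))  # 4.0

# A right-skewed sample: one large value moves the mean far more than the median.
skewed = [1, 2, 2, 3, 3, 4, 50]
print(median(skewed))          # 3
print(round(mean(skewed), 2))  # 9.29
```

Note that median sorts the data internally, so the input does not need to be ordered beforehand.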

 

Mode

The mode represents the most frequent value in the data set. On a vertical bar chart, the mode is the tallest bar. If the data have multiple values that appear the most often, we have a multimodal distribution. If no value repeats or each value has the same frequency, then the data do not have a mode.

 

The problem with the mode is that it is not always unique. As mentioned above, a distribution can be multimodal, which leaves us with two or more values that share the highest frequency. Secondly, the mode is not an accurate measure of central tendency when the most common value is far from the rest of the data set.
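The unique and the multimodal case can both be sketched with the statistics module (multimode requires Python 3.8 or later). The sample lists are hypothetical.

```python
from statistics import mode, multimode

# A single most frequent value: "B" occurs three times.
grades = ["A", "B", "B", "C", "B", "A"]
print(mode(grades))  # B

# Two values tie for the highest frequency, so the data are bimodal.
# multimode returns every value that shares the highest count.
ports = ["S", "C", "S", "C", "Q"]
print(multimode(ports))  # ['S', 'C']

# When every value occurs once, multimode returns all of them,
# which signals that no single value stands out as the mode.
unique_values = [1, 2, 3]
print(multimode(unique_values))  # [1, 2, 3]
```

Both functions work on strings as well as numbers, which is why the mode remains usable for categorical data.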

 


Measures of central tendency of Categorical data

We can use the mode with nominal, ordinal, and discrete data. For nominal data, the mode is the only available measure of central tendency. Strictly speaking, nominal data have no central value at all, because the groups cannot be ordered. With ordinal and discrete data, the mode can be a value that is not in the center. In every case, the mode simply represents the most common value.

 

Calculating a mean for categorical variables would be inappropriate because the categories are labels rather than quantities, and even for ordered categories the spacing between them may be uneven. Since standard deviation and variance depend on the mean, they should not be used to summarize categorical features either.

 

For ordinal categorical data, both the median and the mode can be calculated as measures of central tendency. For nominal categorical data, only the mode is calculable and interpretable.
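As a rough sketch of this rule, the example below computes a median for an ordinal feature and a mode for a nominal one using only the standard library. The category order and the sample values are assumptions made for illustration.

```python
from statistics import median_low, mode

# Ordinal data: rank the categories (order assumed for this illustration),
# take the median rank, then translate it back to a label.
degree_order = ["High school", "Diploma", "Bachelor's", "Master's", "Ph.D."]
degrees = ["Diploma", "Bachelor's", "Bachelor's", "Ph.D.", "High school", "Master's"]

ranks = [degree_order.index(d) for d in degrees]
# median_low always returns an actual data point, so it maps to a real category.
median_degree = degree_order[median_low(ranks)]
print(median_degree)  # Bachelor's

# Nominal data: there is no order to rank by, so only the mode is meaningful.
cities = ["Delhi", "Mumbai", "Delhi", "Bangalore"]
print(mode(cities))  # Delhi
```

median_low is used instead of median because averaging two middle ranks could produce a value such as 2.5, which does not correspond to any category.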

 

Let us move on to the most exciting part, i.e., the coding part. Visualizing categorical data is pretty straightforward. We will use the Titanic dataset for visualization.

 

Implementation

Importing libraries

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

 

Reading Dataset

df=pd.read_csv(r"C:\Users\goyal\Desktop\ml\jupyter\train.csv")

 

Encoding

# Encode Sex as numbers: male -> 0, female -> 1
df.loc[df['Sex']=='male','Sex']=0
df.loc[df['Sex']=='female','Sex']=1
# Replace NaN values in Embarked with 'S'
df['Embarked']=df['Embarked'].fillna('S')
# Convert the categorical column to numeric: S -> 0, C -> 1, Q -> 2
df.loc[df['Embarked']=='S','Embarked']=0
df.loc[df['Embarked']=='C','Embarked']=1
df.loc[df['Embarked']=='Q','Embarked']=2

 

Plotting Of Different Categorical Features

sns.catplot(x="SibSp",kind="count",palette="ch:.25",data=df)

 

df['SibSp'].mode()

 

Output

0    0

dtype: int64

 

 

sns.catplot(x="Embarked",kind="count",palette="ch:.95",data=df)

 

df['Embarked'].mode()

 

Output

0    0

dtype: object

 

 

sns.catplot(x="Sex", y="Survived", hue="Pclass", kind="bar", data=df)

 

 

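cols=['Pclass','Sex','SibSp','Parch','Embarked']

n_r=2
n_c=3
fig,axs = plt.subplots(n_r,n_c,figsize=(n_c*3.2,n_r*3.2))

for r in range(0,n_r):
  for c in range(0,n_c):

    i=r*n_c + c
    ax=axs[r][c]
    # Five features but six subplots: hide the unused last axis
    if i>=len(cols):
      ax.axis('off')
      continue
    # Count plot of each feature, split by survival
    sns.countplot(x=df[cols[i]],hue=df['Survived'],ax=ax)
    ax.set_title(cols[i])
    ax.legend(title='Survived',loc='upper right')

plt.tight_layout()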

 

 

Frequently Asked Question

  1. What is the best measure of central tendency for categorical data?
    The mode is the only measure of central tendency for nominal categorical data, while the median works best with ordinal data.
     
  2. Can categorical data be normally distributed?
    Categorical data cannot be normally distributed. The normal distribution is continuous and defined on the whole real line, so it only makes sense for numerical data measured on at least an interval scale.
     
  3. What type of data is categorical?
    Categorical data is a type of data that can be sorted into groups or categories with the aid of names or labels.

Key Takeaways

Let us recap the article.

First, we saw the basics of categorical data. Moving on, we looked into the different measures of central tendency in detail. Lastly, we saw which measure of central tendency is most appropriate for categorical data.

I hope you find this article helpful. Stay updated for more exciting articles.

Happy Learning Ninjas!
