Table of contents
1. Introduction
2. What Is Data Wrangling?
3. Why do we use data wrangling?
4. Steps of Data Wrangling
4.1. Understand the Dataset
4.2. Re-structuring
4.3. Data Cleaning
4.4. Data Augmentation
4.5. Validating Data
4.6. Exporting the Dataset
5. Implementation of data wrangling
5.1. Understanding the Data
6. FAQs
7. Key Takeaways
Last Updated: Mar 27, 2024

Steps of Data Wrangling

Author Tushar Tangri

Introduction

Enormous volumes of data, user-generated and otherwise, are produced every day. Most of this data is raw and contains bad records, i.e., values that would lead our model to make poor decisions. To remove data that is not useful to us and produces erroneous values, we use techniques such as data wrangling, which help us extract useful information from raw data.

In this article, we will learn about data wrangling from scratch, with deep insights into the steps involved in it.

What Is Data Wrangling?

The technique of filtering, reorganizing, and enhancing raw data is known as data wrangling. It is complicated because raw data has not been processed or integrated into a system. These records are translated into a standard format via data wrangling, which helps to emphasize significant insights. 

Any data analyst will tell you that this is where you get your hands "dirty" before moving on to actual analytics, with its models and visualizations.

Data wrangling refers to all of the processing done on your data prior to analysis. It entails assessing data quality and context before translating the data into the appropriate format.

The procedure consolidates data into a single location and corrects missing or incorrect information. Data wrangling, also known as data munging, is the most time-consuming and work-intensive aspect of data processing; data scientists report that it consumes around 75% of their working time. By devoting time to the process, teams can focus on ensuring data accuracy.

Why do we use data wrangling?

This procedure is crucial for organizations, since it is the only way to make raw data usable. Effective data wrangling helps organizational executives gain a bird's-eye view of their whole firm. It also reduces data duplication, which would otherwise distort decision-making.

Furthermore, because it centralizes various data sources, different divisions within a company can collaborate and develop creative ideas. To build a sustainable model, an optimized dataset is necessary for analysis and for creating high-accuracy models. Let’s learn the essential steps of data wrangling.

Steps of Data Wrangling

Several essential steps help data wrangling work best. We have categorized them under the following labels.

Understand the Dataset

Raw data comes in various formats and layouts, making it necessary to understand our dataset and classify it accordingly. This is the first step towards building an efficient and accurate model.

Understanding the dataset gives us prior insight into how we should proceed with the data and wrangle it accordingly.
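As a quick first pass, pandas can summarize a dataset's shape and contents before any wrangling begins. Below is a minimal sketch, assuming a hypothetical file named raw_data.csv:

import pandas as pd

# "raw_data.csv" is a hypothetical file name used for illustration
df = pd.read_csv("raw_data.csv")

df.info()        # column names, non-null counts, and data types
df.describe()    # summary statistics for the numeric columns
df.sample(5)     # a few random rows to get a feel for the data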

Re-structuring

Datasets from various sources and systems will arrive in multiple formats, sizes, and kinds. As a result, the data must be reorganized into a standard format at this point. The data should also be classified and divided based on its intended use.

This includes setting the headers, finding the number of rows and columns, and describing the data types under observation.
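For example, with pandas a headerless file can be given explicit column names, after which its shape and dtypes describe the structure. A minimal sketch, with a hypothetical file and column names invented for illustration:

import pandas as pd

# hypothetical headerless file and column names, for illustration
cols = ["id", "category", "value"]
df = pd.read_csv("raw_data.csv", names=cols, header=None)

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column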

Data Cleaning

As the name describes, data cleaning deals with errors, null values, duplications, and outliers. The data is brought into a consistent type, which makes it easier to work with.

Most data frame manipulation tools are intended to work with clean data, so cleaning is easiest once the dataset's structure has been neatly organized. We therefore recommend executing the restructuring phase first, followed by cleaning.
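As an illustration, the sketch below applies three common cleaning operations in pandas, using toy data invented for the example: dropping duplicates, imputing nulls with the mean, and filtering outliers with the common 1.5 × IQR rule of thumb:

import pandas as pd
import numpy as np

# toy data invented for illustration
df = pd.DataFrame({"price": [100.0, 102.0, np.nan, 102.0, 9999.0]})

df = df.drop_duplicates()                             # remove duplicate rows
df["price"] = df["price"].fillna(df["price"].mean())  # impute nulls with the mean

# keep only values within 1.5 * IQR of the quartiles, a common outlier rule
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]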

Data Augmentation

Augmenting data takes existing data and supplements it with additional information to produce new insights. For example, if a shop wants to understand how effective its social media advertising is, it might enrich its sales data with marketing data.
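In pandas terms, that example amounts to joining the two datasets on a shared key. A minimal sketch, with toy frames invented for illustration:

import pandas as pd

# toy sales and marketing data invented for illustration
sales = pd.DataFrame({"week": [1, 2, 3], "units_sold": [120, 150, 90]})
ads = pd.DataFrame({"week": [1, 2, 3], "ad_spend": [500, 800, 200]})

# enrich the sales data with marketing data on the shared "week" key
combined = sales.merge(ads, on="week", how="left")
print(combined.corr())  # e.g., how ad spend tracks units sold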

Validating Data

Validating data signifies the authentication of the data under observation. We verify the reliability, credibility, and readiness of the data to be processed. This process helps us confirm that the data is accurate and not corrupted by bad or null values.
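One simple way to express such checks is as assertions that fail loudly if the data is not ready. A minimal sketch, with a toy DataFrame and rules invented for illustration:

import pandas as pd

# toy cleaned data invented for illustration
df = pd.DataFrame({"price": [13495.0, 16500.0], "num_of_doors": ["two", "four"]})

# rule-based checks: each assertion fails loudly if the data is not ready
assert df["price"].notna().all(), "price contains nulls"
assert (df["price"] > 0).all(), "price must be positive"
assert df["num_of_doors"].isin(["two", "four"]).all(), "unexpected door count"
print("All validation checks passed.")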

Exporting the Dataset

Now that we have covered most of the crucial steps, the last is to publish, share, or export the dataset to train our model. The dataset is stored on a sharing platform where all the users in an organization can make use of it.

Now that we are well versed with the steps involved in data wrangling, let’s look into the implementation of each step. 

Implementation of data wrangling

Let’s implement the data wrangling steps on a raw dataset of used-car prices taken from Kaggle.

Understanding the Data

Step 1: Import the required libraries: NumPy, pandas, and Matplotlib.

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

 

Step 2: Specify the headers and print the table. 

cols = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
data = pd.read_csv('imports-85.data.txt', names=cols)
print(data.shape)
data.head()

 

Output:

Step 3: Identify and handle missing values 

data = data.replace("?", np.nan)   # the dataset marks missing values with "?"
data.head()

 

Output:

 

Step 4: Search for null data in the given dataset. 

data.isnull().any().any()   # True if any null value exists anywhere
data.isnull().sum()         # count of nulls per column

 

Output: 

symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-ratio     0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

 

Step 5: We can replace the null values in a numeric column by computing the column's mean and assigning it in place of the missing entries. We apply that here, starting with the column containing the most null values.

# cast the object column to float before taking the mean
avg_norm_loss = data['normalized-losses'].astype("float").mean()
avg_norm_loss

 

Output:

122.0

 

 

data["normalized-losses"].replace(np.NaN, avg_norm_loss, inplace = True)
data["normalized-losses"]

 

Output:

0      122
1      122
2      122
3      164
4      164
      ... 
200     95
201     95
202     95
203     95
204     95
Name: normalized-losses, Length: 205, dtype: object

 

 

avg_bore = data["bore"].astype("float").mean()
data["bore"] = data["bore"].replace(np.nan, avg_bore)
data['bore']

 

Output:

0      3.47
1      3.47
2      2.68
3      3.19
4      3.19
       ... 
200    3.78
201    3.78
202    3.58
203    3.01
204    3.78
Name: bore, Length: 205, dtype: object

 

 

avg_stroke = data["stroke"].astype("float").mean(axis=0)
print("Average of stroke:", avg_stroke)
data["stroke"] = data["stroke"].replace(np.nan, avg_stroke)

 

avg_horsepower = data['horsepower'].astype('float').mean(axis=0)
print("Average horsepower:", avg_horsepower)
data['horsepower'] = data['horsepower'].replace(np.nan, avg_horsepower)

 

avg_peakrpm = data['peak-rpm'].astype('float').mean(axis=0)
print("Average peak rpm:", avg_peakrpm)
data['peak-rpm'] = data['peak-rpm'].replace(np.nan, avg_peakrpm)

 

# count how many cars have each door configuration
data['num-of-doors'].value_counts()

 

Output: 

four    114
two      89
Name: num-of-doors, dtype: int64

 

# the most frequent value, used to fill the missing entries
data['num-of-doors'].value_counts().idxmax()

 

data["num-of-doors"].replace(np.nan, "four", inplace=True)
data.head()

 

Output:

Step 6: Drop the rows with NaN values in the price column and find the shape.

before_rows = data.shape[0]
data.dropna(subset=["price"], axis=0, inplace=True)
after_rows = data.shape[0]
print("Number of dropped rows {}".format(before_rows - after_rows))
# reset the index, because we dropped rows
data.reset_index(drop=True, inplace=True)

 

Output: 

Number of dropped rows 4

 

 

data.shape

 

Output:

(201, 26)

 

 

data.dtypes

 

Output:

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                 object
dtype: object

 

Step 7: Convert the columns to their proper data types.

data[["bore", "stroke"]] = data[["bore", "stroke"]].astype("float")
data[["normalized-losses"]] = data[["normalized-losses"]].astype("int")
data[["price"]] = data[["price"]].astype("float")
data[["peak-rpm"]] = data[["peak-rpm"]].astype("float")
data.head()

 

Output:

 

Step 8: Standardize units and normalize the data.

# transform mpg to L/100km by mathematical operation (235 divided by mpg)
data["highway-mpg"] = 235/data["highway-mpg"]

# rename column name from "highway-mpg" to "highway-L/100km"
data.rename(columns = {'highway-mpg':'highway-L/100km'}, inplace=True)

 

# replace (original value) by (original value)/(maximum value)
data['length'] = data['length']/data['length'].max()
data['width'] = data['width']/data['width'].max()

 

data["height"] = data["height"]/data["height"].max()
data["horsepower"] = data["horsepower"].astype(int, copy = True)

 

data.head()

 

Output:

 

Step 9: Let’s visualize the horsepower column as an example.

%matplotlib inline
import matplotlib.pyplot as plt

# plot a histogram of the horsepower values
plt.hist(data["horsepower"])

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
plt.show()

 

Output: 

Step 10: Now, describe and export the data. 

data.describe()

 

Output:

data.to_csv('wrangled_data.csv')
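By default, to_csv also writes the DataFrame's integer index as a leading column; if that is not wanted in the shared file, it can be switched off:

data.to_csv('wrangled_data.csv', index=False)  # omit the row index column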

 

FAQs

  1. Is it necessary to always proceed with data wrangling first?
    Raw (real-world) data is always imperfect and cannot be processed by a model as-is; doing so would inevitably lead to mistakes. That is why data must be preprocessed before being fed to a model.
     
  2. How does data wrangling improve data usability?
    Data wrangling enhances data usability by transforming data into a format suitable for the end system. It facilitates the quick and easy construction of data flows through an intuitive user interface, where the data flow process can be simply planned and automated.
     
  3. How should we handle noise in our raw data? 
    The most straightforward strategy for dealing with noisy data is to acquire more of it. The more data you gather, the better you can determine the underlying phenomenon that generated it, which eventually helps reduce the impact of noise.

Key Takeaways

The technique of cleaning, reorganizing, and enriching raw, complicated data into a readable format is known as data wrangling. Effective data wrangling enables firms to access correct data, allowing them to understand their operations better and formulate informed plans. Understanding, restructuring, cleaning, augmenting, validating, and exporting data are the six main steps in data wrangling. Many business operations, such as fraud detection and consumer behavior analysis, can benefit from data wrangling.

To read more such interesting real-world implementations of concepts, read our blogs on the Coding Ninjas website.
