Last Updated: Mar 27, 2024

Resampling, Rolling Calculations, and Differencing in Pandas

Author Dhruv Rawat
Introduction

Time series data consists of observations taken at specific time intervals and is used in domains such as finance, economics, weather forecasting, and more. Analysing and understanding such data often requires specialised techniques that take its temporal structure into account.

In this article, we will look at three important techniques for time series analysis in Pandas: resampling, rolling calculations, and differencing, each with a suitable example.

So, let us get started.

A brief overview of Time Series Data with an example

A time series is a sequence of data points recorded in chronological order over a given period of time. The data points in the series could be numeric, categorical, or mixed. Time series data helps us find patterns or trends over a period of time.

One example of time series data is the daily closing prices of a stock over a period of time. This data can be used to track the stock's performance over time and to make predictions about future prices.

What is Resampling?

Resampling is the process of changing the frequency of time series data. 

For example, we can resample a daily time series to weekly or monthly frequencies. It is used when we need to compare data at different time frequencies or when the data contains irregular time intervals. 

Resampling is done through the resample() method.

Upsampling and Downsampling

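The two main types of resampling are:

  • Upsampling involves increasing the frequency of the data, for example from daily to hourly. The newly created time points have no values, so they must be filled or interpolated
     
  • Downsampling involves decreasing the frequency of the data, for example from daily to monthly. Several data points fall into each new period, so they must be aggregated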
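
To make the two directions concrete, here is a minimal sketch on a small made-up daily series (the values and dates are assumptions chosen only for illustration). It downsamples to 2-day bins with sum() and upsamples to 12-hour steps with a forward fill:

```python
import pandas as pd

# Hypothetical data: six daily readings starting on 2024-01-01
idx = pd.date_range("2024-01-01", periods=6, freq="D")
daily = pd.Series([10, 20, 30, 40, 50, 60], index=idx)

# Downsampling: daily -> 2-day bins, each bin aggregated with sum()
downsampled = daily.resample("2D").sum()

# Upsampling: daily -> every 12 hours; new rows copy the last known value
upsampled = daily.resample("12h").ffill()

print(downsampled)
print(upsampled.head(4))
```

Downsampling leaves three rows (30, 70, 110), while upsampling grows the series from 6 to 11 rows. Forward fill is only one possible choice; interpolation or leaving the gaps as NaN may suit other data better.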

Syntax of resample()

Below is a simplified form of the syntax, showing the commonly used parameters:

DataFrame.resample(rule, closed=None, label=None, on=None, level=None,
                   origin='start_day', offset=None)

Let's look at the main parameters:

  • rule: This is the frequency that we want to resample the data to. For example, we can use 'D' for daily frequency, 'W' for weekly frequency and so on
     
  • on: This is the column that contains the date or time information. If it is not provided, Pandas uses the index of the DataFrame, which must then be datetime-like
     
  • aggregation: Older versions of Pandas accepted a how argument for the aggregation method. In current versions, the aggregation is applied by calling a method such as mean(), sum(), min(), max(), or median() on the result of resample()
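
As a sketch of how rule and on fit together, assume a hypothetical sales table whose timestamps sit in an ordinary column named date rather than in the index (the table, column names, and numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical sales table: the dates live in a regular column, not the index
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "units": [3, 5, 2, 8, 6, 4, 7, 1, 9, 2, 5, 3, 6, 4],
})

# rule="W" groups the rows into weeks ending on Sunday;
# on="date" tells resample() which column holds the timestamps
weekly = sales.resample("W", on="date")["units"].sum()

print(weekly)
```

Since 2024-01-01 is a Monday, the 14 days fall into exactly two weekly bins, labelled with the Sundays 2024-01-07 and 2024-01-14. The aggregation is chosen by the method chained after resample(), here sum().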

Example

Let us see an example where we will resample the time series of a public dataset to a monthly frequency and calculate the monthly average using Pandas.

The code below uses the AirPassengers dataset, saved locally as AirPassengers.csv:


import pandas as pd

# Load the AirPassengers dataset

df = pd.read_csv('AirPassengers.csv', index_col='Month', parse_dates=True)

print(df)

Output:

The above code reads the dataset, which contains monthly totals of international airline passengers from 1949 to 1960, and prints it as it is, indexed by the Month column.

original dataset output


Now, let's resample the data to a monthly frequency and take the mean of each month:
 


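# Resample to monthly frequency and take the mean of each month.
# Note: pandas 2.2 and later prefer the alias 'ME' (month end) over 'M'.
df_monthly = df.resample('M').mean()

# Print the output
print(df_monthly)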

Output:

Resample to monthly frequency and mean output


To conclude, we first load the AirPassengers dataset and then call resample() to resample the time series to a monthly frequency, followed by mean() to calculate the monthly average. Because the data is already monthly, each month contains a single observation, so the averages equal the original values. The visible change is in the index: each row is now labelled with the last day of its month.

The output of the code is a DataFrame with 144 observations, one for each month from 1949 to 1960.
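
Since the AirPassengers data is already monthly, the row count only changes when the target frequency is coarser than monthly. As a sketch, assuming made-up monthly values in a hypothetical passengers column (not the real dataset), downsampling to quarterly averages looks like this:

```python
import pandas as pd

# Hypothetical monthly data for one year; "MS" means month start
idx = pd.date_range("2023-01-01", periods=12, freq="MS")
monthly = pd.DataFrame({"passengers": list(range(1, 13))}, index=idx)

# Downsample to quarterly frequency ("QS" = quarter start) and average
quarterly = monthly.resample("QS").mean()

print(quarterly)
```

Twelve monthly rows collapse into four quarterly rows, each holding the mean of its three months (2.0, 5.0, 8.0, and 11.0 here).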

What are Rolling Calculations?

Rolling calculations are aggregations computed over a moving window of data. They help in identifying trends, smoothing the data, and detecting outliers within the time series.
Common rolling calculations include rolling means, rolling standard deviations, etc.

The rolling() method is used to perform such calculations over a rolling window.
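
To see what a moving window does, here is a minimal sketch on a five-value series (the numbers are made up for illustration). Each result is the mean of the current value and the two values before it:

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8, 10])

# Window of 3: each output is the mean of the current and two previous values
rolling_mean = s.rolling(window=3).mean()

print(rolling_mean.tolist())  # [nan, nan, 4.0, 6.0, 8.0]
```

The first two results are NaN because those positions do not yet have three values to average.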

Syntax of rolling() 

Below is its syntax:

DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)

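Let's look at the main parameters:

  • window represents the size of the rolling window, given as a number of observations or as a time offset such as '30D'
     
  • min_periods states the minimum number of observations a window must contain to produce a value; otherwise, the result is NaN. For an integer window, it defaults to the window size
     
  • axis is the axis of the DataFrame along which the windows are formed (0 for rows, which is the default)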
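
The effect of min_periods is easiest to see next to the default behaviour. The sketch below reuses a small made-up series and also shows center, which appears in the signature above and aligns each result with the middle of its window instead of the end:

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8, 10])

# min_periods=1: emit a value as soon as one observation is available
partial = s.rolling(window=3, min_periods=1).mean()

# center=True: label each window by its middle element
centered = s.rolling(window=3, center=True).mean()

print(partial.tolist())   # [2.0, 3.0, 4.0, 6.0, 8.0]
print(centered.tolist())  # [nan, 4.0, 6.0, 8.0, nan]
```

With min_periods=1 the leading NaN values disappear, at the cost of the first results being averages over fewer points. With center=True the NaN values move to both ends of the series.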

Example

Let's see rolling calculations with an example:
 


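import pandas as pd
import matplotlib.pyplot as plt

# Load the AirPassengers dataset, using the Month column as a datetime index
df = pd.read_csv('AirPassengers.csv', index_col='Month', parse_dates=True)

# Calculate the rolling mean with a window size of 12 (one year of monthly data)
df_rolling = df.rolling(12).mean()

# Plot the original data and the rolling mean as two separate figures
df.plot(figsize=(12, 6))
df_rolling.plot(figsize=(12, 6))

# Display the figures when running as a script
plt.show()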


The above code first loads the AirPassengers dataset into a DataFrame, with the index_col argument specifying that the Month column is used as the index. Setting parse_dates=True tells Pandas to parse that index as dates.

After that, we calculate the rolling mean with a window size of 12. The two plot() calls then draw the original data and the rolling mean, respectively.
 

Output:

The code produces two separate plots, shown below for comparison.

Original data:

 

Original data output

Data showing the rolling mean:

Data showing the rolling mean

We can see that the 12-month rolling mean smooths out the seasonal fluctuations in the data, which makes the underlying trend easier to identify.

What is Differencing?

Differencing is a statistical technique used to remove trend and seasonality from a time series. This makes the data closer to stationary, meaning its statistical properties (such as the mean and variance) stay constant over time. Many forecasting models assume stationarity, so differencing can improve their accuracy.

Pandas provides the diff() method to perform differencing. 

Syntax of diff() 

Below is its syntax:

df.diff(periods=1, axis=0)


Below are its parameters:

  • periods: The number of periods to lag the data. The default value is one
     
  • axis: Take the difference over rows (0) or columns (1)
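
As a quick sketch of the periods parameter, the example below applies diff() to a small made-up series with the default lag of 1 and then with a lag of 2:

```python
import pandas as pd

s = pd.Series([100, 110, 125, 120, 140, 155])

# periods=1 (default): each value minus the one immediately before it
first = s.diff()

# periods=2: each value minus the value two positions earlier
lag2 = s.diff(periods=2)

print(first.tolist())  # [nan, 10.0, 15.0, -5.0, 20.0, 15.0]
print(lag2.tolist())   # [nan, nan, 25.0, 10.0, 15.0, 35.0]
```

The number of leading NaN values equals periods, because those rows have no earlier value to subtract.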

Example

Let us see an example of differencing a time series:
 


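import pandas as pd
import matplotlib.pyplot as plt

# Load the AirPassengers dataset, using the Month column as a datetime index
df = pd.read_csv('AirPassengers.csv', index_col='Month', parse_dates=True)

# Calculate the first difference (each value minus the previous one)
df_diff = df.diff(periods=1, axis=0)

# Plot the original data and the differenced data as two separate figures
df.plot(figsize=(12, 6))
df_diff.plot(figsize=(12, 6))

# Display the figures when running as a script
plt.show()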

As before, the code loads the AirPassengers dataset into a DataFrame, with the Month column used as the index and parsed as dates.

Because periods is 1, diff() calculates the first difference: each value minus the value immediately before it. The first row has no previous value, so its result is NaN.
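
The subtraction that diff() performs can be reproduced by hand with shift(). The sketch below uses a small illustrative series (the values are not read from the CSV file) and checks that both approaches agree:

```python
import pandas as pd

s = pd.Series([112, 118, 132, 129, 121])

# First difference using diff()
by_diff = s.diff(periods=1)

# The same calculation written out: current value minus previous value
by_hand = s - s.shift(1)

print(by_diff.tolist())         # [nan, 6.0, 14.0, -3.0, -8.0]
print(by_diff.equals(by_hand))  # True
```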
 

Output:
The code produces two separate plots, shown below for comparison.

Original data:

Original data output


Data showing the first difference:

Data showing the first difference

We can conclude that the first difference removes the trend from the data, leaving mainly the seasonal and random components.

However, the graph after the first difference looks much noisier than the original. With the trend removed, the month-to-month changes dominate the plot, and subtracting consecutive values makes small fluctuations in the data stand out.
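
If the remaining seasonal pattern also needs to be removed, one common option is to difference at the seasonal lag by setting periods to the length of the season. The sketch below is an illustration on a synthetic monthly series (a made-up linear trend plus a made-up pattern that repeats every 12 months), not on the AirPassengers data:

```python
import pandas as pd

# Hypothetical monthly series: linear trend plus a pattern repeating every 12 months
pattern = [0, 5, 10, 15, 20, 25, 30, 25, 20, 15, 10, 5]
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
values = [100 + 2 * t + pattern[t % 12] for t in range(36)]
series = pd.Series(values, index=idx)

# First difference: the trend becomes a constant, the 12-month pattern remains
first_diff = series.diff()

# Seasonal difference: each month minus the same month one year earlier
seasonal_diff = series.diff(12)

print(first_diff.dropna().nunique())     # more than one distinct value
print(seasonal_diff.dropna().unique())   # a single constant value
```

In this idealised series the seasonal difference is constant (12 months of trend at 2 per month, so 24), because the repeating pattern cancels exactly. Real data such as AirPassengers would still show random variation after the same step.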

Frequently Asked Questions

How does Pandas support time series analysis?

Pandas provides many tools for time series analysis, such as resample(), rolling(), and diff() for resampling, rolling calculations, and differencing, respectively. These make time series analysis efficient and insightful.

What is the difference between upsampling and downsampling?

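Upsampling increases the frequency of the data, and downsampling reduces it. In upsampling, the gaps created between the original data points are filled with new values, for example by forward filling or interpolation. In downsampling, several data points are combined through an aggregation such as the average.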

Can rolling calculations be applied to non-numeric data?

Rolling calculations are usually applied to numeric data in order to calculate statistics such as the mean or standard deviation. They are not directly applicable to non-numeric data such as text or categorical variables.

Conclusion

Congratulations on making it to the end! This article has covered three important time series analysis techniques in Pandas: resampling, rolling calculations, and differencing, each with a suitable example. Finally, some frequently asked questions were discussed.
 


We hope you liked this article.

"Have fun coding!"
