Introduction
Time series data, which consists of observations taken at specific time intervals, is used in many domains, such as finance, economics, and weather forecasting. Analysing and understanding such data often requires specialised techniques that account for its temporal structure.
In this article, we will look at three important techniques for time series analysis in Pandas: resampling, rolling calculations, and differencing, each with a suitable example.
So, let us get started.
A Brief Introduction to Time Series Data, with an Example
A time series is a sequence of data points recorded in chronological order over a given period of time. The data points in the series could be numeric, categorical, or mixed. Time series data helps us find patterns or trends over a period of time.
One example of time series data is the daily closing prices of a stock over a period of time. This data can be used to track the stock's performance over time and to make predictions about future prices.
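To make the stock example concrete, here is a minimal sketch of how such a series might look in Pandas. The prices and dates are made up for illustration and do not come from any real stock.

```python
import pandas as pd

# Hypothetical closing prices for five consecutive business days (made-up values).
dates = pd.date_range(start="2023-01-02", periods=5, freq="B")
close = pd.Series([101.5, 102.0, 100.8, 103.2, 104.1], index=dates, name="close")

print(close)

# Because the index is a DatetimeIndex, rows can be selected by date.
single_day = close.loc[pd.Timestamp("2023-01-04")]
date_range = close.loc["2023-01-03":"2023-01-05"]  # both end dates are included

print(single_day)
print(date_range)
```

Keeping the dates in the index is what lets the later techniques (resampling, rolling windows, differencing) work directly on the series.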
What is Resampling?
Resampling is the process of changing the frequency of time series data.
For example, we can resample a daily time series to weekly or monthly frequencies. It is used when we need to compare data at different time frequencies or when the data contains irregular time intervals.
Resampling is done through the resample() method.
Upsampling and Downsampling
Two main methods of Resampling are:
Upsampling involves increasing the frequency of the data
Downsampling involves decreasing the frequency of the data
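The two directions can be sketched as follows. This is a small illustration on made-up daily values, not the article's dataset; the variable names are chosen only for this example.

```python
import pandas as pd

# Hypothetical daily readings for two weeks, Monday 2023-01-02 to Sunday 2023-01-15.
daily = pd.Series(range(1, 15), index=pd.date_range("2023-01-02", periods=14, freq="D"))

# Downsampling: daily -> weekly. Each weekly bin aggregates seven daily values.
weekly_sum = daily.resample("W").sum()

# Upsampling: weekly -> daily. The new daily timestamps have no data of their own,
# so asfreq() leaves them as NaN ...
upsampled_gaps = weekly_sum.resample("D").asfreq()

# ... and ffill() fills each gap with the last known value instead.
upsampled_filled = weekly_sum.resample("D").ffill()

print(weekly_sum)
print(upsampled_gaps)
print(upsampled_filled)
```

Downsampling always needs an aggregation (here a sum), while upsampling needs a rule for filling the gaps it creates; forward filling is only one possible choice.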
The main parameters of resample() are:
rule: The frequency to resample the data to. For example, we can use 'D' for daily frequency, 'W' for weekly frequency, and so on
on: The column that contains the date or time information. If it is not provided, Pandas uses the index of the DataFrame
how: The aggregation method for the data, such as 'mean', 'sum', 'min', 'max', or 'median'. Note that how was accepted only by older Pandas versions; in current versions it has been removed, and the aggregation is chained as a method call instead, for example df.resample('W').mean()
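The following sketch shows the rule and on parameters together with an aggregation. It uses a small made-up sales table (the column names date and amount are assumptions for this example), and it chains the aggregation as a method call, which is how current Pandas versions expect it rather than through a how argument.

```python
import pandas as pd

# Hypothetical sales records; the dates live in a regular column, not in the index.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-09", "2023-01-10"]),
    "amount": [10.0, 20.0, 30.0, 50.0],
})

# rule="W" creates weekly bins; on="date" tells Pandas to use the 'date' column
# instead of the index.
weekly = sales.resample("W", on="date")

# The aggregation is chained after resample().
weekly_mean = weekly["amount"].mean()

# Several aggregations can be computed at once with agg().
weekly_stats = weekly["amount"].agg(["sum", "min", "max", "median"])

print(weekly_mean)
print(weekly_stats)
```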
Example
Let us see an example where we will resample the time series of a public dataset to a monthly frequency and calculate the monthly average using Pandas.
The example first reads the AirPassengers dataset, which contains monthly totals of international airline passengers from 1949 to 1960, and prints it as it is, indexed by the Month column.
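A minimal sketch of this loading step is shown below. To keep it self-contained, it reads a few sample rows from an in-memory string instead of a file; the column names Month and Passengers are assumptions about the file layout, and with a real file you would pass its path to read_csv() instead.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file: a few sample rows in the assumed layout.
csv_text = """Month,Passengers
1949-01,112
1949-02,118
1949-03,132
1949-04,129
1949-05,121
1949-06,135
"""

# index_col makes the Month column the index; parse_dates=True parses it as dates.
df = pd.read_csv(io.StringIO(csv_text), index_col="Month", parse_dates=True)

print(df)
```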
Now, let's see how to resample it and compute the mean for each month:
Python
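# Resample to monthly frequency and average the values in each monthly bin.
# 'M' labels each bin with its month-end date (pandas 2.2+ prefers the alias 'ME').
df_monthly = df.resample('M').mean()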
# Print the output
print(df_monthly)
To summarise, we first load the AirPassengers dataset. The resample() method then groups the time series into monthly bins, and each bin is labelled with the last day of its month (for example, 31 January or 28 February). After this, we calculate the monthly average.
The output of the code is a DataFrame with one row per month. Because the dataset is already monthly, each bin holds a single observation, so the monthly mean equals the original value; for the full 1949 to 1960 range, that is 144 rows.
What are Rolling Calculations?
Rolling calculations are a type of aggregation calculated over a moving window of data. This helps in identifying trends, smoothing the data, and detecting outliers within the time series. Common rolling calculations involve calculating rolling means, rolling standard deviations, etc.
The rolling() method is used to perform such calculations over a moving window.
The rolling-mean example first loads the air passengers dataset into a DataFrame, with the index_col argument specifying that the Month column is used as the index. The parse_dates argument tells Pandas to parse the Month column as dates.
After that, we calculate the rolling mean with a window size of 12, which covers one full year of monthly data. The code then plots the original data and the rolling mean.
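The rolling step can be sketched as follows. Instead of the real passenger file, this sketch builds a synthetic monthly series (a made-up upward trend plus a repeating 12-month pattern), and the column name Passengers is an assumption. Plotting is left out so the sketch runs with Pandas alone.

```python
import pandas as pd

# Synthetic monthly series standing in for the passenger data:
# a linear trend (+2 per month) plus a made-up pattern that repeats every 12 months.
index = pd.date_range("2000-01-01", periods=36, freq="MS")
seasonal = [0, 2, 5, 3, 6, 10, 14, 14, 8, 3, -2, 1]
values = [100 + 2 * i + seasonal[i % 12] for i in range(36)]
df = pd.DataFrame({"Passengers": values}, index=index)

# Rolling mean and standard deviation over a 12-month window.
# The first 11 results are NaN because the window is not yet full.
rolling_mean = df["Passengers"].rolling(window=12).mean()
rolling_std = df["Passengers"].rolling(window=12).std()

print(rolling_mean.head(13))
print(rolling_std.tail())
```

Because the window spans exactly one seasonal cycle, the seasonal pattern averages out and the rolling mean of this synthetic series rises steadily with the trend. With matplotlib installed, calling df["Passengers"].plot() and rolling_mean.plot() would draw the two lines described above.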
Output:
The code plots two lines, the original data and the rolling mean; the two plots are listed below.
[Plot: original data]
[Plot: data showing the rolling mean]
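We can see that the rolling mean smooths out short-term fluctuations in the data. With a 12-month window, it averages over the yearly seasonal cycle, which makes the underlying trend easier to identify.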
What is Differencing?
Differencing is a statistical technique used to remove trend and seasonality from a time series. This makes the data closer to stationary, meaning that its statistical properties (such as the mean and variance) stay constant over time. Many forecasting models assume stationary data, so differencing can improve their accuracy.
Pandas provides the diff() method to perform differencing.
Syntax of diff()
Below is its syntax:
df.diff(periods=1, axis=0)
Below are its parameters:
periods: The number of periods to shift when calculating the difference. The default value is one
axis: Whether to take the difference over rows (0) or columns (1). The default value is 0
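A small sketch of how these two parameters behave, using made-up numbers so that the results are easy to check by hand:

```python
import pandas as pd

s = pd.Series([10, 13, 19, 28, 40])

# periods=1 (the default): each value minus the previous one.
first = s.diff()            # NaN, 3, 6, 9, 12

# periods=2: each value minus the value two positions earlier.
lag2 = s.diff(periods=2)    # NaN, NaN, 9, 15, 21

# Differencing twice gives the second-order difference.
second = s.diff().diff()    # NaN, NaN, 3, 3, 3

# axis=1 takes the difference across columns instead of down the rows.
table = pd.DataFrame({"a": [1, 2], "b": [4, 8]})
across_columns = table.diff(axis=1)   # column a is NaN, column b is b - a

print(first, lag2, second, across_columns, sep="\n\n")
```

The leading NaN values appear because the first rows (or the first column) have no earlier value to subtract.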
Example
Let us see an example of differencing a time series:
The differencing example loads the air passengers dataset into a DataFrame, with the index_col argument specifying that the Month column is used as the index. The parse_dates argument tells Pandas to parse the Month column as dates.
The diff() method calculates the first difference because periods is 1 by default: each value has the previous value subtracted from it.
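The effect on a trending series can be sketched as below. The series is synthetic (a made-up linear trend plus a repeating 12-month pattern), not the real passenger data. The lag-12 difference at the end is an addition for illustration: it is one common way to remove a yearly seasonal pattern from monthly data.

```python
import pandas as pd

# Synthetic monthly series: linear trend (+2 per month) plus a made-up
# pattern that repeats every 12 months.
index = pd.date_range("2000-01-01", periods=36, freq="MS")
seasonal = [0, 2, 5, 3, 6, 10, 14, 14, 8, 3, -2, 1]
series = pd.Series([100 + 2 * i + seasonal[i % 12] for i in range(36)], index=index)

# First difference: current value minus previous value.
# The steady climb disappears; only month-to-month changes remain.
first_diff = series.diff().dropna()

# Lag-12 difference: current value minus the value 12 months earlier.
# The repeating yearly pattern cancels out, leaving the yearly trend increase.
seasonal_diff = series.diff(periods=12).dropna()

print(first_diff.head())
print(seasonal_diff.head())
```

For this synthetic series, the lag-12 difference is the same in every month, because the only thing that changes from one year to the next is the trend.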
Output:
The code plots two lines, the original data and the first difference; the two plots are listed below.
[Plot: original data]
[Plot: data showing the first difference]
We can conclude that the first difference removes the trend from the data, leaving only the seasonal and random components.
However, we notice that the graph after the first difference looks much noisier. With the trend removed, the differenced series fluctuates around a roughly constant level, and subtracting consecutive values emphasises the short-term changes in the data.
Frequently Asked Questions
How does Pandas support time series analysis?
Pandas provides many tools to perform time series analysis, such as resample(), rolling(), and diff() for resampling, rolling calculations, and differencing, respectively. This makes time series analysis efficient and insightful.
What is the difference between upsampling and downsampling?
Upsampling increases the frequency of the data, and downsampling reduces it. In upsampling, the gaps are filled with new data points (for example, by forward filling or interpolation), while in downsampling the data is aggregated, for example by averaging.
Can rolling calculations be applied to non-numeric data?
Rolling calculations are usually applied to numeric data in order to calculate statistics. They are not directly applicable to non-numeric data such as text or categorical variables.
Conclusion
Congratulations, you did a fantastic job! This article has covered three important time series analysis techniques in Pandas: resampling, rolling calculations, and differencing, each with a suitable example. It closed with some frequently asked questions.