Visualizing and Predicting Analysis of Cricket Match - Part 2
This blog is a continuation of the previous blog, Visualizing and Predicting Analysis of Cricket Match - Part 1.
IPL Score Prediction using Deep Learning
Since its inception in 2008, the IPL has drawn spectators from all over the world. The high level of uncertainty and last-minute tension keeps fans glued to the match. In a short period, the IPL has become the highest-grossing cricket league. In cricket matches, we often see projected scores that indicate the probability of a team winning based on the current match situation. These predictions are usually made with the help of data analysis. Before advances in machine learning, predictions were based mostly on intuition or basic algorithms, and using run rate as the only factor to predict the final score in a limited-overs match is clearly inadequate.
As a cricket fan, it is fascinating to visualize cricket statistics. I've researched various blogs to find patterns that can be used to predict IPL match scores in advance.
Why Deep Learning?
We humans find it difficult to recognize patterns in vast amounts of data, and this is where machine learning and deep learning come into play. The model learns how players and teams have previously performed against opposing teams, and we train it on that history. Since machine learning algorithms alone achieve only moderate accuracy, we used attribute-aware deep learning, which can perform much better than previous models and provide accurate results.
Tools used:
- Jupyter Notebook / Google Colab
- Visual Studio
Technology used:
- Deep Learning
- Machine Learning
- Flask (front-end integration)
For the smooth running of the project, we used libraries such as Pandas, NumPy, TensorFlow, Scikit-learn, and Matplotlib.
The architecture of the model
Step-by-step implementation:
First, let's import all the necessary libraries:
Python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
Step 1: Understanding the dataset!
We got the data from https://cricsheet.org/downloads/ipl.zip, as Cricsheet is considered a good platform for data collection when working with cricket data. It includes ball-by-ball data from 2008 to 2021. To improve the accuracy of the model, we also used IPL player statistics; this second dataset contains details for each IPL player from 2016 to 2019. I made some changes to the data, for example adding a new column named 'y' containing the runs scored in the first 6 overs of that particular innings.
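If you want to rebuild that 'y' column yourself, here is a minimal sketch, assuming the Cricsheet ball-by-ball file is named all_matches.csv and uses Cricsheet's standard columns (the file name is an assumption; 'ball' encodes over.delivery, e.g. 5.4 is the 4th delivery of the 6th over):
Python3
import pandas as pd

# Sum runs_off_bat + extras over the first 6 overs (ball values 0.1-5.6)
# for every innings, then attach that total as 'y' to each delivery row.
balls = pd.read_csv('all_matches.csv')  # assumed Cricsheet file name
powerplay = balls[balls['ball'] < 6.0]
y = (powerplay.groupby(['match_id', 'innings'])[['runs_off_bat', 'extras']]
     .sum().sum(axis=1).rename('y').reset_index())
balls = balls.merge(y, on=['match_id', 'innings'], how='left')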
Step 2: Data cleaning and formatting
We imported both datasets into dataframes with pandas' .read_csv() method and displayed the first 5 rows of each. We made some changes to the dataset, such as adding the new column named 'y' containing the runs scored in the first six overs of that particular innings.
Python3
ipl2 = pd.read_csv('ipl2_dataset.csv')
ipl2.head()
Python3
data = pd.read_csv('IPL Player Stats - 2016 till 2019.csv')
data.head()
Now, we will merge both datasets.
Python3
ipl2 = ipl2.drop(['Unnamed: 0', 'extras', 'match_id', 'runs_off_bat'], axis=1)
new_ipl2 = pd.merge(ipl2, data, left_on='striker', right_on='Player', how='left')
new_ipl2.drop(['wicket_type', 'player_dismissed'], axis=1, inplace=True)
new_ipl2.columns
After merging the datasets and dropping the unwanted columns, we are left with the columns below. Here's the modified dataset.
There are various ways to fill null values in a dataset. Here I simply replace the categorical values that are NaN with '.'
Python3
str_cols = new_ipl2.columns[new_ipl2.dtypes == object]
new_ipl2[str_cols] = new_ipl2[str_cols].fillna('.')
Step 3: Encoding the categorical data to numerical values.
For a column to help the model make predictions, its values must be meaningful to the computer. Models can't (yet) understand text and draw conclusions from it, so we need to encode the strings into numeric categorical values. You could do this manually, but the scikit-learn library gives you the option of using LabelEncoder.
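As a quick, self-contained illustration of what LabelEncoder does (the team abbreviations below are toy values, not taken from the dataset):
Python3
from sklearn import preprocessing

# LabelEncoder assigns integer codes to the sorted unique labels.
le = preprocessing.LabelEncoder()
le.fit(['CSK', 'MI', 'RCB'])
print(le.transform(['MI', 'RCB', 'CSK']))  # [1 2 0]
print(le.inverse_transform([0, 1, 2]))     # ['CSK' 'MI' 'RCB']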
Python3
listf = []
for c in new_ipl2.columns:
    if new_ipl2[c].dtype == object:
        print(c, "->", new_ipl2[c].dtype)
        listf.append(c)
Python3
v1 = new_ipl2['venue'].unique()
v2 = new_ipl2['batting_team'].unique()
v3 = new_ipl2['bowling_team'].unique()
v4 = new_ipl2['striker'].unique()
v5 = new_ipl2['bowler'].unique()
def lbEncode(data):
    dataset = pd.DataFrame(data)
    feature_dict = {}
    for feature in dataset:
        if dataset[feature].dtype == object:
            le = preprocessing.LabelEncoder()
            fs = dataset[feature].unique()
            le.fit(fs)
            dataset[feature] = le.transform(dataset[feature])
            feature_dict[feature] = le
    return dataset

new_ipl2 = lbEncode(new_ipl2)
Python3
ipl2_dataset = new_ipl2[['venue', 'innings', 'batting_team',
                         'bowling_team', 'striker', 'non_striker',
                         'bowler']]
a1 = ipl2_dataset['venue'].unique()
a2 = ipl2_dataset['batting_team'].unique()
a3 = ipl2_dataset['bowling_team'].unique()
a4 = ipl2_dataset['striker'].unique()
a5 = ipl2_dataset['bowler'].unique()
new_ipl2.fillna(0,inplace=True)
features = {}
for i in range(len(v1)):
    features[v1[i]] = a1[i]
for i in range(len(v2)):
    features[v2[i]] = a2[i]
for i in range(len(v3)):
    features[v3[i]] = a3[i]
for i in range(len(v4)):
    features[v4[i]] = a4[i]
for i in range(len(v5)):
    features[v5[i]] = a5[i]
features
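One caveat: codes from different columns can collide (venue code 3 and bowler code 3 are unrelated), so any reverse lookup should be built per column. A small sketch for venues:
Python3
# Reverse only the venue part of the mapping so an encoded venue can be
# translated back to its stadium name.
venue_decode = {features[name]: name for name in v1}
sample_venue = v1[0]              # first venue seen in the raw data
code = features[sample_venue]
print(sample_venue, '->', code, '->', venue_decode[code])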
Step 4: Feature Engineering and Selection
The dataset has many columns, but we can't ask the user for that much input, so I took a set of selected features as input and split the data into X and y. Next, we split the data into a training set and a test set before applying the learning algorithm.
Python3
X = new_ipl2[['venue', 'innings', 'batting_team',
              'bowling_team', 'striker', 'bowler']].values
y = new_ipl2['y'].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=42)
The raw feature values span very different ranges, which makes training harder, so it's a good idea to scale the data before processing. Here I use MinMaxScaler from sklearn, which rescales each feature to [0, 1] via x' = (x - min) / (max - min); this kind of preprocessing is recommended when working with deep learning.
Python3
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Step 5: Building, Training & Testing the Model
Now comes the most exciting part of our project, model building! First, we import Sequential from tensorflow.keras.models. Also, import Dense & Dropout from tensorflow.keras.layers as we will be using multiple layers.
Python3
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
EarlyStopping is used to avoid overfitting. It monitors the validation loss during training and stops once 'val_loss' stops improving: past that point, 'loss' keeps falling while 'val_loss' flattens or rises, which means the model is memorizing the training data rather than generalizing.
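The training call below passes an early_stop callback that is never defined in the original snippets. A minimal definition consistent with the description above (the patience value and restore_best_weights are assumptions):
Python3
# Stop once val_loss has not improved for 20 consecutive epochs (an
# assumed patience) and roll back to the best weights seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=20,
                           restore_best_weights=True)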
Python3
model = Sequential()
model.add(Dense(43, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(22, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(11, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
Here we want a single number as the final output, so the output layer has just one neuron, and the hidden layers before it shrink progressively in size. We compile the model with the Adam optimizer and mean squared error as the loss. Now let's train the model for 400 epochs.
Python3
model.fit(x=X_train, y=y_train, epochs=400,
          validation_data=(X_test, y_test),
          callbacks=[early_stop])
It will take some time because of the large number of samples and epochs, and it will output the 'loss' and 'val_loss' of each epoch as below.
After the training is complete, let us visualize our model's losses.
Python3
model_losses = pd.DataFrame(model.history.history)
model_losses.plot()
As we can see, the training and validation losses decrease together and then level off, which suggests the model fits well without overfitting.
Step 6: Predictions!
Here we come to the final part of our project, where we will be predicting our X_test. Then we will create a dataframe that will show us the actual values and the predicted values.
Python3
predictions = model.predict(X_test)
sample = pd.DataFrame(predictions,columns=['Predict'])
sample['Actual']=y_test
sample.head(10)
As you can see, the model predicts reasonably well; the predicted values are close to the actual ones. To quantify the difference between actual and predicted results, we compute error metrics with mean_absolute_error and mean_squared_error from sklearn.metrics (taking the square root of the latter gives the RMSE, which is in the same units as the runs).
Performance Metrics!
Python3
from sklearn.metrics import mean_absolute_error,mean_squared_error
mean_absolute_error(y_test,predictions)
Python3
np.sqrt(mean_squared_error(y_test,predictions))
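Finally, here is a minimal sketch of how the trained pieces could be combined to project the 6-over score for a new match situation. The venue, team, and player names below are placeholders, not values confirmed by the dataset; each must be a key present in the features mapping built earlier.
Python3
import numpy as np

# Encode a hypothetical match situation with the label mapping, scale it
# exactly like the training data, and let the model project the score.
row = np.array([[features['M Chinnaswamy Stadium'],        # venue (placeholder)
                 1,                                        # innings
                 features['Royal Challengers Bangalore'],  # batting_team
                 features['Mumbai Indians'],               # bowling_team
                 features['V Kohli'],                      # striker
                 features['JJ Bumrah']]])                  # bowler
projected = model.predict(scaler.transform(row))
print(round(float(projected[0][0])), 'runs expected in the first 6 overs')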