Last Updated: Mar 27, 2024

Visualizing and Predicting Analysis of Cricket Match - Part 2

Author Shivam Sinha


This blog is a continuation of the previous blog, Visualizing and Predicting Analysis of Cricket Match - Part 1.

IPL Score Prediction using Deep Learning

Since its inception in 2008, the IPL has drawn spectators from all over the world. The high level of uncertainty and last-minute tension keeps fans watching. In a short period, the IPL has become the highest-grossing cricket league. In cricket matches, we often see on-screen projections that indicate the likely final score or the probability of a team winning based on the current match situation. These predictions are usually made with the help of data analysis. Before advances in machine learning, predictions were usually based on intuition or some basic algorithms. The screenshot below shows how unreliable it is to use run rate as the only factor when predicting the final score in a limited-overs match.

IPL game screenshot with projected score
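To make the run-rate projection concrete, here is a minimal sketch of how such a projected score is typically computed. The function name and the match situation are made up for illustration; this is not how any particular broadcaster calculates it.

```python
def projected_score(current_runs: int, balls_bowled: int, total_overs: int = 20) -> float:
    """Naive projection: assume the current run rate holds for the whole innings."""
    if balls_bowled <= 0:
        raise ValueError("at least one ball must have been bowled")
    run_rate = current_runs / (balls_bowled / 6)  # runs per over so far
    return run_rate * total_overs


# Hypothetical situation: 48 runs after 6 overs (36 balls) of a 20-over innings.
print(projected_score(48, 36))
```

The projection ignores wickets in hand, the venue, and who is batting or bowling, which is exactly the information the model in this blog tries to use.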

As a cricket fan, it is fascinating to visualize cricket statistics. I've researched various blogs to find patterns that can be used to predict IPL match scores in advance.

Why Deep Learning?

We humans find it difficult to recognize patterns in vast amounts of data. This is where machine learning and deep learning come into play. A model learns how players and teams have previously performed against opposing teams and is trained accordingly. Classical machine learning algorithms alone gave us only moderate accuracy, so we used attribute-aware deep learning, which can perform much better than the previous models and give more accurate results.

Tools used:

  • Jupyter Notebook / Google Colab
  • Visual Studio

Technology used:

  • Deep Learning
  • Machine Learning
  • Flask (Front-end integration).
  • For the smooth running of the project, we have used some libraries like Pandas, NumPy, TensorFlow, Scikit-learn, and Matplotlib.

The architecture of the model

System architecture

 

Step-by-step implementation:

First, let's import all the necessary libraries:

Python3

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing

 

Step 1: Understanding the dataset!

We got the data from https://cricsheet.org/downloads/ipl.zip, as Cricsheet is considered a good platform for data collection when working with cricket data. It includes data from 2007 to 2021. To improve the accuracy of the model, we also used a dataset of IPL player statistics, which contains details for each IPL player from 2016 to 2019. I made some changes to the dataset, for example adding a new column named 'y' containing the runs scored in the first 6 overs of that particular innings.

Step 2: Data cleaning and formatting

We imported both datasets into pandas dataframes using the .read_csv() method and displayed the first 5 rows of each. The first dataset includes the added column named "y", which contains the runs scored in the first six overs of that particular innings.

Python3

ipl2 = pd.read_csv('ipl2_dataset.csv')
ipl2.head()
IPL 2 dataset

Python3

data = pd.read_csv('IPL Player Stats - 2016 till 2019.csv')
data.head()
Data of players till 2019

Now, we will merge both datasets.

Python3

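# Drop columns that are not needed for the prediction
ipl2 = ipl2.drop(['Unnamed: 0', 'extras', 'match_id', 'runs_off_bat'], axis=1)
# Left merge: attach each striker's player statistics to every ball-by-ball row
new_ipl2 = pd.merge(ipl2, data, left_on='striker', right_on='Player', how='left')
new_ipl2.drop(['wicket_type', 'player_dismissed'], axis=1, inplace=True)
new_ipl2.columns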

 

After merging the columns and removing new unwanted columns, we have the following columns left. Here's the modified dataset.

Dataset
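To see what a left merge on 'striker' and 'Player' does, here is a pure-Python sketch of the idea. The helper left_merge and the toy rows are hypothetical, and the sketch assumes each player appears only once in the stats table (pandas itself would repeat rows for duplicate keys).

```python
def left_merge(left, right, left_on, right_on):
    """Keep every row of `left`; add the matching `right` columns, or None if no match."""
    lookup = {row[right_on]: row for row in right}
    right_cols = list(right[0].keys())
    merged = []
    for row in left:
        match = lookup.get(row[left_on])
        combined = dict(row)
        for col in right_cols:
            combined[col] = match[col] if match is not None else None
        merged.append(combined)
    return merged


# Toy ball-by-ball rows and player stats; all names and numbers are made up.
balls = [
    {"striker": "Player A", "runs": 4},
    {"striker": "Player B", "runs": 1},
    {"striker": "Player C", "runs": 0},
]
player_stats = [
    {"Player": "Player A", "Avg": 35.2},
    {"Player": "Player B", "Avg": 28.9},
]

merged = left_merge(balls, player_stats, left_on="striker", right_on="Player")
for row in merged:
    print(row)
```

Player C has no entry in the stats table, so the merged row carries missing values for the stats columns. This is presumably one reason the null values have to be filled in afterwards.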

 

There are various ways to fill null values in our dataset. Here I am simply replacing the NaN values in the categorical (string) columns with '.'.

Python3

str_cols = new_ipl2.columns[new_ipl2.dtypes == object]
new_ipl2[str_cols] = new_ipl2[str_cols].fillna('.')

Step 3: Encoding the categorical data to numerical values.

For a column to help the model make predictions, its values must be meaningful to the computer. A model can't (yet) understand text and draw conclusions from it, so we need to encode the strings as numeric categorical values. You can do this manually, but the scikit-learn library provides LabelEncoder for the job.

Python3

listf = []
for c in new_ipl2.columns:
    # Collect the names of the string (object) columns
    if new_ipl2[c].dtype == object:
        print(c, "->", new_ipl2[c].dtype)
        listf.append(c)
list of attributes

 

Python3

v1 = new_ipl2['venue'].unique()
v2 = new_ipl2['batting_team'].unique()
v3 = new_ipl2['bowling_team'].unique()
v4 = new_ipl2['striker'].unique()
v5 = new_ipl2['bowler'].unique()
  
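def lbEncode(data):
    # Label-encode every string (object) column of the given dataframe
    dataset = pd.DataFrame(data)
    feature_dict = {}

    for feature in dataset:
        if dataset[feature].dtype == object:
            le = preprocessing.LabelEncoder()
            fs = dataset[feature].unique()
            le.fit(fs)
            dataset[feature] = le.transform(dataset[feature])
            feature_dict[feature] = le  # keep the fitted encoder per column

    return dataset

# Assign the result back so the encoded values are used in the next steps
new_ipl2 = lbEncode(new_ipl2)
new_ipl2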
table that shows attributes of different matches
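For intuition, label encoding can be sketched in a few lines of plain Python. This is an illustration of the idea only, not scikit-learn's implementation; the helper name fit_label_codes is made up. Like LabelEncoder, it numbers the distinct labels in sorted order.

```python
def fit_label_codes(values):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


venues = [
    "Wankhede Stadium",
    "Eden Gardens",
    "Wankhede Stadium",
    "M Chinnaswamy Stadium",
]

codes = fit_label_codes(venues)          # name -> integer code
encoded = [codes[v] for v in venues]     # the encoded column
decode = {code: label for label, code in codes.items()}  # integer code -> name

print(codes)
print(encoded)
```

The name-to-code dictionary is the same kind of lookup that is needed later to turn user input such as a venue name into the number the model expects.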

 

Python3

ipl2_dataset = new_ipl2[['venue','innings', 'batting_team', 
                      'bowling_team', 'striker', 'non_striker',
                      'bowler']]
  
a1 = ipl2_dataset['venue'].unique()
a2 = ipl2_dataset['batting_team'].unique()
a3 = ipl2_dataset['bowling_team'].unique()
a4 = ipl2_dataset['striker'].unique()
a5 = ipl2_dataset['bowler'].unique()
new_ipl2.fillna(0,inplace=True)
  
features={}
  
for i in range(len(v1)):
    features[v1[i]]=a1[i]
for i in range(len(v2)):
    features[v2[i]]=a2[i]
for i in range(len(v3)):
    features[v3[i]]=a3[i]
for i in range(len(v4)):
    features[v4[i]]=a4[i]
for i in range(len(v5)):
    features[v5[i]]=a5[i]
      
features
stadium name and score comparison

Step 4: Feature Engineering and Selection

The dataset has many columns, but we can't ask the user for that much input, so I took a selected set of features as the input X and the column 'y' as the target. Next, we split the data into a training set and a test set before applying the learning algorithm.

Python3

X = new_ipl2[['venue', 'innings','batting_team',
             'bowling_team', 'striker','bowler']].values
y = new_ipl2['y'].values
  
from sklearn.model_selection import train_test_split
  
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.34, random_state=42)
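
The idea behind a seeded train/test split can be sketched with the standard library alone. This is a simplified illustration, not scikit-learn's implementation: the helper name is made up and the shuffle order will differ from train_test_split, but the split sizes follow the same rule (the test set gets the ceiling of test_size times the number of rows).

```python
import math
import random


def simple_train_test_split(rows, test_size=0.34, seed=42):
    """Shuffle row indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(100))  # stand-in for 100 samples
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, so repeated runs train and evaluate on the same rows.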

Features with large numeric ranges are difficult for the model to learn from, so it's a good idea to scale the data before processing. Here I use MinMaxScaler from sklearn, which rescales each feature to the range 0 to 1. Scaling is a recommended preprocessing step when working with deep learning.

Python3

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
  
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
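
The fit-on-train, transform-on-test pattern above can be illustrated with a small pure-Python sketch of min-max scaling. The helper names and numbers are hypothetical; the formula is the standard (x - min) / (max - min) per column.

```python
def fit_min_max(train_rows):
    """Learn the per-column minimum and maximum from the training rows only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale each column to (x - min) / (max - min); constant columns map to 0."""
    return [
        [(x - lo) / (hi - lo) if hi != lo else 0.0 for x, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


train_rows = [[0, 10], [5, 20], [10, 30]]
test_rows = [[12, 15]]

mins, maxs = fit_min_max(train_rows)
scaled_train = apply_min_max(train_rows, mins, maxs)
scaled_test = apply_min_max(test_rows, mins, maxs)
print(scaled_train)
print(scaled_test)
```

Because the minimum and maximum come from the training data only, a test value outside the training range (12 in the first column here) scales to a value above 1. That is expected and avoids leaking test-set information into training.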

 

Step 5: Building, Training & Testing the Model

Now comes the most exciting part of our project, model building! First, we import Sequential from tensorflow.keras.models. Also, import Dense & Dropout from tensorflow.keras.layers as we will be using multiple layers.

Python3

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

EarlyStopping is used to avoid overfitting. It monitors a metric during training (val_loss by default) and stops training once that metric has not improved for a set number of epochs, called the patience. A validation loss that stops falling, or starts rising, while the training loss keeps decreasing is the usual sign of overfitting.
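The patience rule can be sketched in plain Python. This is a simplified illustration of the logic, not Keras code; the function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses: improvement stalls after epoch 3.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.66, 0.64]
print(early_stopping_epoch(val_losses, patience=3))
print(early_stopping_epoch(val_losses, patience=5))
```

With a patience of 3, training stops at epoch 6 and never sees the later improvement at epoch 7. A larger patience tolerates such plateaus at the cost of longer training.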

Python3

model = Sequential() 
model.add(Dense(43, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(22, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(11, activation='relu'))
model.add(Dropout(0.5)) 
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

 

Here, we want a single output value, so the network ends in a Dense layer with one neuron. Before it come three hidden layers with progressively fewer neurons (43, 22, and 11), each followed by a dropout layer. We compile the model with the Adam optimizer and mean squared error as the loss. Now let's train the model for up to 400 epochs.
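As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the six input features selected in Step 4; each Dense layer has (inputs + 1 bias) × units parameters, and Dropout layers add none. The helper name is made up.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Trainable parameters per Dense layer: (inputs + 1 bias) * units."""
    counts = []
    for units in layer_sizes:
        counts.append((n_inputs + 1) * units)
        n_inputs = units  # this layer's outputs feed the next layer
    return counts


# 6 input features, then Dense(43), Dense(22), Dense(11), Dense(1)
counts = dense_param_count(6, [43, 22, 11, 1])
print(counts, sum(counts))
```

The total comes to 1,534 parameters, which is a small network by deep learning standards.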

Python3

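# Note: the original listing does not show how early_stop is defined;
# the patience value below is an assumption.
early_stop = EarlyStopping(monitor='val_loss', patience=25)
model.fit(x=X_train, y=y_train, epochs=400,
          validation_data=(X_test, y_test),
          callbacks=[early_stop])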

Training takes some time because of the large number of samples and epochs. It prints the 'loss' and 'val_loss' for each epoch, as shown below.

sample loss and val_loss

After the training is complete, let us visualize our model's losses.

Python3

model_losses = pd.DataFrame(model.history.history)
model_losses.plot()
Matplotlib graph of the code

As we can see from the loss curves, our model trains well!

Step 6: Predictions!

Here we come to the final part of our project, where we will be predicting our X_test. Then we will create a dataframe that will show us the actual values and the predicted values.

Python3

predictions = model.predict(X_test)
sample = pd.DataFrame(predictions,columns=['Predict'])
sample['Actual']=y_test
sample.head(10)
Output of above code

 

As you can see, our model predicts very well: the predicted values are almost the same as the actual ones. To examine the difference between actual and predicted results more closely, we compute the error using mean_absolute_error and mean_squared_error from sklearn.metrics.

Performance Metrics! 

Python3

from sklearn.metrics import mean_absolute_error,mean_squared_error
  
mean_absolute_error(y_test,predictions)
output

Python3

np.sqrt(mean_squared_error(y_test,predictions))
output
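
To show what these two numbers mean, here is a small pure-Python sketch that computes the mean absolute error and the root mean squared error by hand. The scores are made up for illustration and are not results from the model above.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors, in runs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical actual and predicted scores.
y_true = [40, 55, 38, 62]
y_pred = [42, 50, 38, 65]

print(mae(y_true, y_pred))
print(rmse(y_true, y_pred))
```

Both metrics are in the same unit as the target (runs). RMSE penalizes large misses more heavily, so it is never smaller than MAE.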

Frequently Asked Questions

Briefly explain the models used in the project. 

  • Linear Regression - A linear relationship between input and output is assumed, and the best fit line is obtained. 
  • K-nearest neighbor - KNN assumes data points closer to each other are similar to each other. So a new data point is assigned a category based on its nearest neighbors, where K is the number of nearest neighbors. 
  • Random Forest - It's an ensemble learning method where several decision trees each generate an output. For regression tasks, it takes the mean of the results from the decision trees. 
  • Recurrent Neural Network - It is a deep learning method. Unlike feed-forward networks, RNNs have recurrent connections that act as a memory, which lets them process sequential inputs.  
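
As one concrete example from the list above, K-nearest neighbors can be sketched in a few lines. The description above is for classification; for a numeric target such as a score, the usual regression variant averages the targets of the K nearest points instead. The function name and data below are made up for illustration.

```python
def knn_predict(train_X, train_y, query, k=3):
    """Average the targets of the k training points closest to `query`."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)  # squared Euclidean distance
        for x, y in zip(train_X, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Toy one-feature dataset.
train_X = [[1.0], [2.0], [3.0], [10.0]]
train_y = [10.0, 20.0, 30.0, 100.0]

print(knn_predict(train_X, train_y, [2.5], k=2))
```

The two closest points to 2.5 are 2.0 and 3.0, so the prediction is the mean of their targets.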

What is Data Visualization?

Data visualization translates information into a visual context, such as a map or graph, to make data more straightforward for the human brain to understand and pull insights from it.

What is Mean Squared Error?

The MSE (Mean Squared Error) is the average squared difference between the observed and predicted values. It is also known as the Mean Squared Deviation.
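
Written as a formula, with n samples, observed values y_i, and predicted values ŷ_i (symbols chosen here for illustration):

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
```

Taking the square root of this quantity gives the RMSE, which is in the same unit as the target.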

Conclusion

The IPL score prediction model is very similar to a real-life machine learning task and extensively uses previously acquired knowledge of various supervised machine learning algorithms. We strongly advise readers to code along, which always gives a better picture of the nitty-gritty of the models.


Happy Coding!
