Optimal model to predict defender’s tackle points of PKL

Season-8

MPBA_G505 - Statistics and Basic Econometrics

Submitted to Dr. Achint Nigam

Group - 6

Kannaiahgari Sahith - 2023H1540832P

Vudayagiri Sai Shiva Kumar - 2023H1540859P

Bodempudi Siri - 2023H1540871P

Aigal Chetas Manjunath - 2023H1540802P

M V S Sri Sathvika - 2023H1540864P

Naveen Kumar - 2023H1540845P

Sunil Kumar Behera - 2023H1540843P

Through this project, our focus is to build a multiple linear regression model that can predict the total
tackle points a defender is likely to score, based on his previous performances and the factors that affect
his performance.
Problem Statement:
In the Pro Kabaddi League, defenders play a pivotal role in preventing the opposing
team's raiders from scoring points. Analysing and estimating a defender's ability to
accumulate tackle points can provide valuable insights for strategic decision-making by
team management. To address this problem, a predictive model can be developed that
accurately forecasts the tackle points a defender is likely to achieve in kabaddi matches.

Objective:
The objective of this project is to predict the total tackle points scored by a defender,
taking into consideration parameters such as total tackles, height, weight, age, super tackles,
high 5s, matches played, auction price, position, and average time on mat.

Data File Description:


• Independent Variables:

1. Matches:

Total number of matches played by a player in Pro Kabaddi League season 8, which
can reflect their experience.

2. Total Tackles:

Total number of tackles attempted by a defender in Pro Kabaddi League season 8.

3. Height:

Height of the player, which may influence their playing style and tackling ability.

4. Weight:

Weight of the player, which can affect their performance and physical strength.

5. Age:

Age of the player as of Pro Kabaddi League Season 8.

6. Super Tackles:

A Super Tackle is awarded to the defending team in Pro Kabaddi League (PKL) when
they successfully tackle the raider while having three or fewer players on the mat.
On such an occasion, the defending team scores two points instead of the usual one.

7. High 5s:

High 5 is achieved when a defender scores five or more tackle points in a single
match.
8. Average time on mat:

This metric gives the average amount of time that a defender spends on the kabaddi
mat (playing field) across the matches he played. Time on mat includes moments when
the defender is actively involved in defensive actions, such as attempting tackles,
defending against raiders, and assisting teammates in defensive maneuvers.

9. Position:

Shows the position of the defender on the mat. It can be Right Cover, Right Corner, Left
Cover, or Left Corner.

(Figure: Defender Positions)
10. Auction Price:

Indicates the price at which the player was bought or retained by that particular team,
or the price offered to him for acting as a replacement for a player who left the season
for some reason.

• Dependent Variable:

1. Total Tackle Points:

Indicates the total number of points earned by the defender from successful tackles in
Pro Kabaddi League Season 8.
Methodology:
1. We gathered team-wise data about the defenders who played in Pro Kabaddi League
season 8
2. Data was collected on the following parameters:
– Matches played

– Total Tackles

– Height

– Weight

– Age

– Position

– Super tackles

– high 5s

– average time on mat

– auction price

– Total Tackle points

3. Data Processing:

• Null or otherwise unavailable values were replaced with the mean of the respective
column

• Dummy variables were created for the categorical variable in our data set, Position,
which records where the defender defends from (see the sketch after this list)

4. Data Analysis:

• Performed a linearity check between each independent variable and the dependent
variable in our data set

• Plotted the correlation matrix

• Proceeded with model building and programmed three different models

• For each model, performed model evaluation to check whether the model is fit for the
project
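Below is a minimal sketch of the data-processing step in R, assuming the column names shown in the data import further below; the imputation helper is our illustration, not the exact code used:
# Sketch: mean-impute missing numeric values and prepare the categorical variable
data_q <- read.csv('kabaddi_Group_6.csv')
num_cols <- sapply(data_q, is.numeric)
data_q[num_cols] <- lapply(data_q[num_cols], function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)  # replace NA with the column mean
  x
})
data_q$Position <- factor(data_q$Position)  # lm() builds dummy variables from this factor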
Importing Libraries
The following libraries are imported and used in this project:
library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.3.2

Warning: package 'ggplot2' was built under R version 4.3.2

── Attaching core tidyverse packages ──────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2
── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all
  conflicts to become errors

library(ggplot2)
library(coefplot)

Warning: package 'coefplot' was built under R version 4.3.2

library(car)

Warning: package 'car' was built under R version 4.3.2

Loading required package: carData

Warning: package 'carData' was built under R version 4.3.2

Attaching package: 'car'

The following object is masked from 'package:dplyr':

recode

The following object is masked from 'package:purrr':

some

library(corrplot)

Warning: package 'corrplot' was built under R version 4.3.2


corrplot 0.92 loaded

Importing Data
Data was imported from a CSV file:
data_q <- read.csv('kabaddi_Group_6.csv')

print(colnames(data_q))

[1] "Name" "Team" "Matches"


[4] "Total_Tackles" "Height.cm." "Weight.kg."
[7] "Age" "Super_Tackles" "High_5s"
[10] "Avg.time_on_mat..." "Position" "Auction_Price.Lakhs."
[13] "Total_Tackle_Points"

head(data_q)

                   Name            Team Matches Total_Tackles Height.cm.
1          Amit Sheoran Bengaluru Bulls       9            11     179.83
2        Saurabh Nandal Bengaluru Bulls      24           112     179.83
3        Mahender Singh Bengaluru Bulls      20            70     176.78
4 Mayur Jagannath Kadam Bengaluru Bulls      22            48     170.68
5            Aman Antil Bengaluru Bulls      23           109     176.78
6      PO Surjeet Singh Tamil Thalaivas      20           100     174.00
  Weight.kg. Age Super_Tackles High_5s Avg.time_on_mat...     Position
1         73  24             2       0              28.23  Left_Corner
2         70  22             6       2              86.34 Right_Corner
3         70  25             8       2              80.04   Left_Cover
4         70  25             0       0              65.41  Right_Cover
5         60  22             2       2              70.67  Left_Corner
6         83  31             4       2              83.51  Right_Cover
  Auction_Price.Lakhs. Total_Tackle_Points
1                 27.5                   5
2                 25.0                  69
3                 50.0                  39
4                 15.0                  14
5                 30.0                  53
6                 75.0                  53

Linearity Check:
We plot the data points as scatter plots to check whether each predictor variable in the
regression has a straight-line relationship with the outcome variable.
var1 <- lm(Total_Tackle_Points ~ Matches, data = data_q)
plot(data_q$Matches, data_q$Total_Tackle_Points,
     main = "Total Tackle Points vs Matches",
     xlab = "Matches", ylab = "Total Tackle Points")
abline(var1, col = "red")

var2 <- lm(Total_Tackle_Points ~ Total_Tackles, data = data_q)
plot(data_q$Total_Tackles, data_q$Total_Tackle_Points,
     main = "Total Tackle Points vs Total Tackles",
     xlab = "Total_Tackles", ylab = "Total Tackle Points")
abline(var2, col = "red")

var3 <- lm(Total_Tackle_Points ~ Avg.time_on_mat..., data = data_q)
plot(data_q$Avg.time_on_mat..., data_q$Total_Tackle_Points,
     main = "Total Tackle Points vs Avg. time on mat",
     xlab = "Avg.time_on_mat...", ylab = "Total Tackle Points")
abline(var3, col = "red")

var4 <- lm(Total_Tackle_Points ~ Height.cm., data = data_q)
plot(data_q$Height.cm., data_q$Total_Tackle_Points,
     main = "Total Tackle Points vs Height.cm",
     xlab = "Height.cm.", ylab = "Total Tackle Points")
abline(var4, col = "red")

var5 <- lm(Total_Tackle_Points ~ Weight.kg., data = data_q)
plot(data_q$Weight.kg., data_q$Total_Tackle_Points,
     main = "Total Tackle Points vs Weight.kg",
     xlab = "Weight.kg.", ylab = "Total Tackle Points")
abline(var5, col = "red")

var6 <- lm(Total_Tackle_Points ~ Super_Tackles, data = data_q)
plot(data_q$Super_Tackles, data_q$Total_Tackle_Points,
     main = "Total Tackle Points vs Super Tackles",
     xlab = "Super_Tackles", ylab = "Total Tackle Points")
abline(var6, col = "red")

var7 <- lm(Total_Tackle_Points ~ Age, data = data_q)
plot(data_q$Age, data_q$Total_Tackle_Points,
     main = "Total Tackle Points vs Age",
     xlab = "Age", ylab = "Total Tackle Points")
abline(var7, col = "red")

var8 <- lm(Total_Tackle_Points ~ High_5s, data = data_q)
plot(data_q$High_5s, data_q$Total_Tackle_Points,
     main = "Total Tackle Points vs High_5s",
     xlab = "High_5s", ylab = "Total Tackle Points")
abline(var8, col = "red")

var9 <- lm(Total_Tackle_Points ~ Auction_Price.Lakhs., data = data_q)
plot(data_q$Auction_Price.Lakhs., data_q$Total_Tackle_Points,
     main = "Total Tackle Points vs Auction_Price.Lakhs",
     xlab = "Auction_Price.Lakhs.", ylab = "Total Tackle Points")
abline(var9, col = "red")
From the above plots we observe an approximately linear relationship between the
independent variables and the dependent variable.
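The nine near-identical blocks above could also be generated in a single loop; a compact sketch under the same column names:
# Sketch: same linearity plots via a loop over predictor names
predictors <- c("Matches", "Total_Tackles", "Avg.time_on_mat...", "Height.cm.",
                "Weight.kg.", "Super_Tackles", "Age", "High_5s",
                "Auction_Price.Lakhs.")
for (p in predictors) {
  fit <- lm(reformulate(p, response = "Total_Tackle_Points"), data = data_q)
  plot(data_q[[p]], data_q$Total_Tackle_Points,
       main = paste("Total Tackle Points vs", p),
       xlab = p, ylab = "Total Tackle Points")
  abline(fit, col = "red")
}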

Correlation Matrix Plot:

A correlation matrix is a statistical tool for evaluating the pairwise relationships between
the variables in a data set.
Interpretation - From the correlation matrix below we observe that total tackles, high 5s,
matches, and average time on mat have correlation coefficients greater than 0.7 with the
dependent variable. Among these, total tackles has the highest correlation with the
dependent variable, i.e., total tackle points.
column_types <- sapply(data_q, class)

data_q$Total_Tackle_Points <- as.numeric(data_q$Total_Tackle_Points)

numeric_vars <- data_q[sapply(data_q, is.numeric)]

correlation_matrix <- cor(numeric_vars)

corrplot(correlation_matrix, method = 'square', addCoef.col = 'black',
         number.cex = 0.5, tl.cex = 0.5, tl.srt = 45)
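To read the key correlations off numerically rather than from the plot, the column of the correlation matrix for the dependent variable can be sorted; a short sketch:
# Sketch: correlations with Total_Tackle_Points, largest first
sort(cor(numeric_vars)[, "Total_Tackle_Points"], decreasing = TRUE)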


Multiple Linear Regression:
Multiple linear regression is the study of how a dependent variable is related to two or more
independent variables. In our project we have ten independent variables and one
dependent variable. To perform multiple linear regression we built three different models:
• Correlation model

• Backward elimination model

• P-value model
Model 1 : Correlation Model
• We drop matches, high 5s, and average time on mat because they are strongly correlated
with total tackles, which itself has the highest correlation with the dependent variable

• We then build the correlation model on the remaining variables

#----- dropping matches, high 5s and avg time on mat: they are strongly
#      correlated with total tackles, which best tracks the dependent variable -----
cor_model <- lm(Total_Tackle_Points ~ Total_Tackles + Height.cm. +
                  Weight.kg. + Super_Tackles + Age + Auction_Price.Lakhs. +
                  Position, data = data_q)

Below is the summary of the correlation model


summary(cor_model)

Call:
lm(formula = Total_Tackle_Points ~ Total_Tackles + Height.cm. +
Weight.kg. + Super_Tackles + Age + Auction_Price.Lakhs. +
Position, data = data_q)

Residuals:
Min 1Q Median 3Q Max
-10.5436 -3.0054 -0.3907 2.6888 20.9114

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -63.12533 27.58561 -2.288 0.0261 *
Total_Tackles 0.43074 0.02037 21.144 < 2e-16 ***
Height.cm. 0.28609 0.15254 1.876 0.0661 .
Weight.kg. 0.17360 0.16250 1.068 0.2901
Super_Tackles 2.02372 0.37345 5.419 1.43e-06 ***
Age -0.11754 0.16179 -0.726 0.4707
Auction_Price.Lakhs. 0.02148 0.02984 0.720 0.4747
PositionLeft_Cover -4.80335 2.22309 -2.161 0.0352 *
PositionRight_Corner 2.04536 1.94121 1.054 0.2967
PositionRight_Cover 0.92843 1.89199 0.491 0.6256
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.248 on 54 degrees of freedom
Multiple R-squared: 0.9467, Adjusted R-squared: 0.9378
F-statistic: 106.6 on 9 and 54 DF, p-value: < 2.2e-16
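Since Model 1 is motivated by multicollinearity among the predictors, the car package (already loaded) can quantify it through variance inflation factors; a quick, optional check (output not shown):
# Sketch: (G)VIFs for Model 1; values above roughly 5 suggest multicollinearity
vif(cor_model)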
Added-Variable Plots:
These plots check the partial relationship between each independent variable and the
dependent variable in a multiple linear regression, holding the other predictors fixed.
avPlots(cor_model)

Coefficient Plot:
Plots the fitted model's coefficient estimates along with their confidence intervals.
coefplot(cor_model)
Residual vs Fitted Values Plot:
This plot shows how the residuals are distributed. If they are randomly scattered, the model
is a good fit.
In the output graph below, the data points are randomly distributed without following any
pattern, which indicates that the model is efficient enough to predict.
ggplot(aes(x = .fitted, y = .resid), data = cor_model) + geom_point() +
  geom_hline(yintercept = 0) + geom_smooth(se = FALSE) +
  labs(x = "Fitted Values", y = "Residuals")

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'


Standardized Residual Plot:
We plot a density plot to visually check whether the residuals are normally distributed.
If the plot is roughly bell-shaped, the residuals likely follow a normal distribution; the
density plot in the output roughly follows a bell shape, suggesting that the residuals are
approximately normally distributed.
# Plot standardized residuals
resid_std <- rstandard(cor_model)
plot(density(resid_std), main = "Standardized Residuals Plot",
     xlab = "Standardized Residuals", ylab = "Density")
abline(h = 0, v = 0) # Add reference lines
abline(v = c(-2, 2), col = "red")
abline(h = 0, col = "blue")

As we can observe, the plot is a bell-shaped curve, indicating that the residuals are normally
distributed.

Evaluation of Model 1:
A model must be evaluated to establish that it is suitable for prediction. The evaluation is
based on several diagnostics.
1. Residuals vs Fitted Values:

This plot shows how the residuals are distributed against the fitted values. If they are
randomly scattered, the model is a good fit.
#Extract residuals
residuals <- residuals(cor_model)
# Plot residuals vs. fitted values
plot(cor_model$fitted.values, residuals,
     main = "Residuals vs. Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 4) # Add a horizontal line at y = 0

We observe that the residuals are randomly distributed without following a fixed pattern,
which is an indication that the model can predict.
2. Histogram of Residuals
# Plot a histogram of residuals
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")

The histogram shows a bell-shaped curve, which indicates that the residuals are normally
distributed.
3. Check for the Normality of Residuals Using a Normal Q-Q Plot
In a Normal Q-Q plot, Q stands for quantile. The points on the Q-Q plot give an indication
of the univariate normality of the data set. If the data is normally distributed, the points
fall on the 45-degree reference line; if not, they deviate from it.
# Plot a Q-Q plot to check normality of residuals
qqnorm(residuals, main = "Normal Q-Q Plot")
qqline(residuals, col = 4)

In the diagram above, the quantile values of the standard normal distribution are plotted
on the x-axis, and the corresponding quantile values of the data set on the y-axis. Most of
the points fall close to the 45-degree reference line.
4. Check for Heteroskedasticity

Heteroskedasticity refers to systematic changes in the spread of the residuals (the error
term) of the model. Its presence shows that the scatter of the residuals depends on at
least one independent variable. This adds bias to the model and makes it deviate from
effective and accurate results; hence a check for heteroskedasticity is important.
# Check for heteroscedasticity by plotting residuals vs. predictor variables
par(mfrow = c(2, 2))
plot(cor_model, which = c(1, 2, 3, 5)) # Plots 1, 2, 3, and 5 in the 2x2 layout
abline(h = c(-2, 2), col = "red", lty = 2) # Horizontal lines at -2 and 2
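As a formal complement to these visual checks, the car package provides a score test for non-constant error variance; a sketch (output not shown):
# Sketch: Breusch-Pagan-type score test; a small p-value indicates heteroskedasticity
ncvTest(cor_model)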
Residual vs Leverage:
A residuals vs. leverage plot is a diagnostic plot that allows us to identify influential
observations in a regression model (panel 5 above).
# Q-Q plot of standardized residuals (sample vs. theoretical quantiles)
ggplot(cor_model, aes(sample = .stdresid)) + stat_qq() + geom_abline()
Model 2: Backward Elimination:
• Stepwise regression is a method that iteratively examines the contribution of each
independent variable in a linear regression model

• The backward elimination process begins by fitting a multiple linear regression model
with all the independent variables. The least useful variable is removed, and a new model
is fitted; this is repeated until removing any further variable no longer improves the model.
In the classic p-value version, the variable with the highest p-value above a threshold
(typically 0.05) is removed at each step; R's step() used below instead removes, at each
step, the variable whose elimination most reduces the AIC (see the drop1() sketch after
the step() output).
#----- Backward elimination model --------
full_model <- lm(Total_Tackle_Points ~ Matches + Total_Tackles + Height.cm. +
                   Weight.kg. + Avg.time_on_mat... + Super_Tackles + Age +
                   High_5s + Auction_Price.Lakhs. + Position, data = data_q)
back_model <- step(full_model, direction = "backward")

Start: AIC=213.03
Total_Tackle_Points ~ Matches + Total_Tackles + Height.cm. +
Weight.kg. + Avg.time_on_mat... + Super_Tackles + Age + High_5s +
Auction_Price.Lakhs. + Position

Df Sum of Sq RSS AIC
- Auction_Price.Lakhs. 1 7.16 1196.6 211.41
- Age 1 7.89 1197.3 211.45
- Avg.time_on_mat... 1 22.53 1212.0 212.23
- Matches 1 31.30 1220.7 212.69
<none> 1189.4 213.03
- Weight.kg. 1 58.75 1248.2 214.12
- Height.cm. 1 70.49 1259.9 214.72
- Position 3 189.95 1379.4 216.51
- High_5s 1 284.11 1473.5 224.74
- Super_Tackles 1 523.63 1713.1 234.38
- Total_Tackles 1 1521.02 2710.4 263.74

Step: AIC=211.41
Total_Tackle_Points ~ Matches + Total_Tackles + Height.cm. +
Weight.kg. + Avg.time_on_mat... + Super_Tackles + Age + High_5s +
Position

Df Sum of Sq RSS AIC
- Age 1 8.58 1205.2 209.87
- Matches 1 30.16 1226.8 211.01
- Avg.time_on_mat... 1 34.04 1230.6 211.21
<none> 1196.6 211.41
- Weight.kg. 1 74.69 1271.3 213.29
- Height.cm. 1 79.72 1276.3 213.54
- Position 3 189.39 1386.0 214.82
- High_5s 1 284.07 1480.7 223.05
- Super_Tackles 1 533.16 1729.8 233.00
- Total_Tackles 1 1523.32 2719.9 261.97
Step: AIC=209.87
Total_Tackle_Points ~ Matches + Total_Tackles + Height.cm. +
Weight.kg. + Avg.time_on_mat... + Super_Tackles + High_5s +
Position

Df Sum of Sq RSS AIC
- Avg.time_on_mat... 1 26.36 1231.5 209.26
<none> 1205.2 209.87
- Matches 1 44.45 1249.6 210.19
- Weight.kg. 1 67.46 1272.6 211.36
- Height.cm. 1 78.22 1283.4 211.90
- Position 3 194.39 1399.6 213.44
- High_5s 1 291.34 1496.5 221.73
- Super_Tackles 1 537.87 1743.0 231.49
- Total_Tackles 1 1518.34 2723.5 260.05

Step: AIC=209.26
Total_Tackle_Points ~ Matches + Total_Tackles + Height.cm. +
Weight.kg. + Super_Tackles + High_5s + Position

Df Sum of Sq RSS AIC
<none> 1231.5 209.26
- Matches 1 61.41 1293.0 210.37
- Height.cm. 1 75.31 1306.9 211.06
- Weight.kg. 1 76.67 1308.2 211.12
- Position 3 208.10 1439.6 213.25
- High_5s 1 269.61 1501.2 219.93
- Super_Tackles 1 559.66 1791.2 231.23
- Total_Tackles 1 2010.05 3241.6 269.20
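Note that step() selects by AIC rather than by raw p-values. To see the significance of each term at a single elimination step directly, base R's drop1() reports an F test for removing each term; a sketch (output not shown):
# Sketch: F test for dropping each term from the full model, one step of
# a p-value-based backward elimination
drop1(full_model, test = "F")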

If we observe the steps above, three independent variables were dropped because removing
each of them lowered the model's AIC. The dropped variables are:
1. Auction Price

2. Age

3. Average time on mat


Below is the detailed summary provided for the Backward Elimination Model:
summary(back_model)

Call:
lm(formula = Total_Tackle_Points ~ Matches + Total_Tackles +
Height.cm. + Weight.kg. + Super_Tackles + High_5s + Position,
data = data_q)

Residuals:
Min 1Q Median 3Q Max
-11.8446 -2.3352 -0.9767 2.1291 16.0434

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -65.20823 24.49777 -2.662 0.01022 *
Matches 0.24771 0.15095 1.641 0.10662
Total_Tackles 0.33539 0.03573 9.388 6.09e-13 ***
Height.cm. 0.25126 0.13827 1.817 0.07474 .
Weight.kg. 0.23570 0.12855 1.834 0.07224 .
Super_Tackles 1.73615 0.35047 4.954 7.55e-06 ***
High_5s 2.72547 0.79268 3.438 0.00113 **
PositionLeft_Cover -2.83246 2.05119 -1.381 0.17300
PositionRight_Corner 2.41419 1.68470 1.433 0.15762
PositionRight_Cover 1.83309 1.72166 1.065 0.29174
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.776 on 54 degrees of freedom
Multiple R-squared: 0.9559, Adjusted R-squared: 0.9485
F-statistic: 129.9 on 9 and 54 DF, p-value: < 2.2e-16
Added-Variable Plots:
These plots check the partial relationship between each independent variable and the
dependent variable in a multiple linear regression, holding the other predictors fixed.
# AV plots give information about linearity in the model
avPlots(back_model)

Coefficient Plot
Plots the fitted model's coefficient estimates along with their confidence intervals.
coefplot(back_model)
Residual vs Fitted Values Plot:
This plot shows how the residuals are distributed. If they are randomly scattered, the model
is a good fit.
ggplot(aes(x = .fitted, y = .resid), data = back_model) + geom_point() +
  geom_hline(yintercept = 0) + geom_smooth(se = FALSE) +
  labs(x = "Fitted Values", y = "Residuals")

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'


Here, the data points in our graph are randomly distributed without following any pattern,
which indicates that the model is efficient enough to predict.
Standardized Residual Plot:
We plot a density plot to visually check whether the residuals are normally distributed.
If the plot is roughly bell-shaped, the residuals likely follow a normal distribution; the
density plot in the output roughly follows a bell shape, suggesting that the residuals are
approximately normally distributed.
resid_std <- rstandard(back_model)
plot(density(resid_std), main = "Standardized Residuals Plot",
     xlab = "Standardized Residuals", ylab = "Density")
abline(h = 0, v = 0) # Add reference lines
abline(v = c(-2, 2), col = "red")
abline(h = 0, col = "blue")

As we can observe, the plot is a bell-shaped curve, indicating that the residuals are normally
distributed.
Evaluation of Model 2:
A model must be evaluated to establish that it is suitable for prediction. The evaluation is
based on several diagnostics.
1. Residuals vs Fitted Values:

This plot shows how the residuals are distributed against the fitted values. If they are
randomly scattered, the model is a good fit.
#Extract residuals
residuals <- residuals(back_model)
# Plot residuals vs. fitted values
plot(back_model$fitted.values, residuals,
     main = "Residuals vs. Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 4) # Add a horizontal line at y = 0
We observe that the residuals are randomly distributed without following a fixed pattern,
which is an indication that the model can predict.
2. Histogram of Residuals
# Plot a histogram of residuals
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")

The histogram shows a bell-shaped curve, which indicates that the residuals are normally
distributed.
3. Check for the Normality of Residuals Using a Normal Q-Q Plot
In a Normal Q-Q plot, Q stands for quantile. The points on the Q-Q plot give an indication
of the univariate normality of the data set. If the data is normally distributed, the points
fall on the 45-degree reference line; if not, they deviate from it.

# Plot a Q-Q plot to check normality of residuals
qqnorm(residuals, main = "Normal Q-Q Plot")
qqline(residuals, col = 4)

In the diagram above, the quantile values of the standard normal distribution are plotted
on the x-axis, and the corresponding quantile values of the data set on the y-axis. Most of
the points fall close to the 45-degree reference line.
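A numerical complement to the Q-Q plot is the Shapiro-Wilk test from base R; a sketch (output not shown):
# Sketch: Shapiro-Wilk normality test; a large p-value is consistent with
# normally distributed residuals
shapiro.test(residuals(back_model))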
4. Check for Heteroskedasticity

Heteroskedasticity refers to systematic changes in the spread of the residuals (the error
term) of the model. Its presence shows that the scatter of the residuals depends on at
least one independent variable. This adds bias to the model and makes it deviate from
effective and accurate results; hence a check for heteroskedasticity is important.
# Sample vs. theoretical quantiles plot of standardized residuals
ggplot(back_model, aes(sample = .stdresid)) + stat_qq() + geom_abline()
# Check for heteroscedasticity by plotting residuals vs. predictor variables
par(mfrow = c(2, 2))
plot(back_model, which = c(1, 2, 3, 5)) # Plots 1, 2, 3, and 5 in the 2x2 layout
abline(h = c(-2, 2), col = "red", lty = 2) # Horizontal lines at -2 and 2

Residual vs Leverage:
A residuals vs. leverage plot is a diagnostic plot that allows us to identify influential
observations in a regression model (panel 5 above).
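This panel can also be drawn on its own with base R's plot method for lm objects; a sketch:
# Sketch: residuals vs. leverage only (panel 5 of plot.lm), with Cook's
# distance contours marked
plot(back_model, which = 5)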
Model 3: P-Value Model
• This model is built by retaining only the predictors that remain individually significant
(p-value below 0.05)
#---- Model 3: checking for the significance of p-values ----------------
pval_model <- lm(Total_Tackle_Points ~ Total_Tackles + Super_Tackles +
                   High_5s, data = data_q)
summary(pval_model)

Call:
lm(formula = Total_Tackle_Points ~ Total_Tackles + Super_Tackles +
High_5s, data = data_q)

Residuals:
Min 1Q Median 3Q Max
-11.2347 -2.2619 -0.3919 2.2107 22.5962

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.91587 1.31554 -0.696 0.488996
Total_Tackles 0.35613 0.02659 13.395 < 2e-16 ***
Super_Tackles 1.77184 0.34667 5.111 3.52e-06 ***
High_5s 2.82733 0.75436 3.748 0.000403 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.27 on 60 degrees of freedom
Multiple R-squared: 0.9403, Adjusted R-squared: 0.9373
F-statistic: 314.9 on 3 and 60 DF, p-value: < 2.2e-16

Relative to the full model, seven independent variables had p-values greater than 0.05 and
hence were dropped. The dropped variables are:
• Average time on mat

• Position

• Height

• Weight

• Age

• Auction Price

• Matches
Added-Variable Plots:
These plots check the partial relationship between each independent variable and the
dependent variable in a multiple linear regression, holding the other predictors fixed.
avPlots(pval_model)
Coefficient Plot
Plots the fitted model's coefficient estimates along with their confidence intervals.
coefplot(pval_model)
Residual vs Fitted Values Plot:
This plot shows how the residuals are distributed. If they are randomly scattered, the model
is a good fit.
ggplot(aes(x = .fitted, y = .resid), data = pval_model) + geom_point() +
  geom_hline(yintercept = 0) + geom_smooth(se = FALSE) +
  labs(x = "Fitted Values", y = "Residuals")

`geom_smooth()` using method = 'loess' and formula = 'y ~ x'


Here, the data points in our graph are randomly distributed without following any pattern,
which indicates that the model is efficient enough to predict.
Standardized Residual Plot:
We plot a density plot to visually check whether the residuals are normally distributed.
If the plot is roughly bell-shaped, the residuals likely follow a normal distribution; the
density plot in the output roughly follows a bell shape, suggesting that the residuals are
approximately normally distributed.
# Plot standardized residuals
resid_std <- rstandard(pval_model)
plot(density(resid_std), main = "Standardized Residuals Plot",
     xlab = "Standardized Residuals", ylab = "Density")
abline(h = 0, v = 0) # Add reference lines
abline(v = c(-2, 2), col = "red")
abline(h = 0, col = "blue")
As we can observe, the plot is a bell-shaped curve, indicating that the residuals are normally
distributed.
Evaluation of Model 3:
A model must be evaluated to establish that it is suitable for prediction. The evaluation is
based on several diagnostics.
1. Residuals vs Fitted Values:

This plot shows how the residuals are distributed against the fitted values. If they are
randomly scattered, the model is a good fit.
#Extract residuals
residuals <- residuals(pval_model)
# Plot residuals vs. fitted values
plot(pval_model$fitted.values, residuals,
     main = "Residuals vs. Fitted Values",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lty = 4) # Add a horizontal line at y = 0

We observe that the residuals are randomly distributed without following a fixed pattern,
which is an indication that the model can predict.
2. Histogram of Residuals
# Plot a histogram of residuals
hist(residuals, main = "Histogram of Residuals", xlab = "Residuals")

The histogram shows a bell-shaped curve, which indicates that the residuals are normally
distributed.
3. Check for the Normality of Residuals Using a Normal Q-Q Plot

In a Normal Q-Q plot, Q stands for quantile. The points on the Q-Q plot give an indication
of the univariate normality of the data set. If the data is normally distributed, the points
fall on the 45-degree reference line; if not, they deviate from it.
# Plot a Q-Q plot to check normality of residuals
qqnorm(residuals, main = "Normal Q-Q Plot")
qqline(residuals, col = 4)

In the diagram above, the quantile values of the standard normal distribution are plotted
on the x-axis, and the corresponding quantile values of the data set on the y-axis. Most of
the points fall close to the 45-degree reference line.
4. Check for Heteroskedasticity

Heteroskedasticity refers to systematic changes in the spread of the residuals (the error
term) of the model. Its presence shows that the scatter of the residuals depends on at
least one independent variable. This adds bias to the model and makes it deviate from
effective and accurate results; hence a check for heteroskedasticity is important.
# Check for heteroscedasticity by plotting residuals vs. predictor variables
par(mfrow = c(2, 2))
plot(pval_model, which = c(1, 2, 3, 5)) # Plots 1, 2, 3, and 5 in the 2x2 layout
abline(h = c(-2, 2), col = "red", lty = 2) # Horizontal lines at -2 and 2
Residual vs Leverage:
A residuals vs. leverage plot is a diagnostic plot that allows us to identify influential
observations in a regression model (panel 5 above).
# Q-Q plot of standardized residuals (sample vs. theoretical quantiles)
ggplot(pval_model, aes(sample = .stdresid)) + stat_qq() + geom_abline()
Comparison between the Models
Multi Plot:
coefplot's multiplot() overlays the coefficient plots of the three models for side-by-side
comparison.
multiplot(cor_model, back_model, pval_model)
Comparison Using ANOVA:
anova(cor_model, back_model, pval_model)

Analysis of Variance Table

Model 1: Total_Tackle_Points ~ Total_Tackles + Height.cm. + Weight.kg. +
    Super_Tackles + Age + Auction_Price.Lakhs. + Position
Model 2: Total_Tackle_Points ~ Matches + Total_Tackles + Height.cm. +
    Weight.kg. + Super_Tackles + High_5s + Position
Model 3: Total_Tackle_Points ~ Total_Tackles + Super_Tackles + High_5s
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     54 1487.0
2     54 1231.5  0    255.47
3     60 1666.1 -6   -434.57 2.6302 0.02606 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparison Using AIC:
The Akaike information criterion (AIC) is a mathematical method for evaluating how well a
model fits the data it was generated from. In statistics, AIC is used to compare candidate
models and determine which one best fits the data. AIC is calculated from:
• the number of independent variables used to build the model;

• the maximum likelihood estimate of the model (how well the model reproduces the
data).

The best-fit model according to AIC is the one that explains the greatest amount of
variation using the fewest possible independent variables (see the formula below).
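For reference, AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood of the model; lower values indicate a better trade-off between fit and complexity.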
AIC(cor_model,back_model,pval_model)

df AIC
cor_model 11 404.9449
back_model 11 392.8809
pval_model 5 400.2233

We conclude that the Backward Elimination model is the best model according to AIC, as it
has the lowest value (392.88).
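As a closing illustration, the chosen model can serve the project's stated goal of prediction; a sketch with purely hypothetical defender values (not taken from the data set):
# Sketch: predicted tackle points, with a prediction interval, for a
# hypothetical defender
new_defender <- data.frame(
  Matches = 20, Total_Tackles = 90, Height.cm. = 178, Weight.kg. = 75,
  Super_Tackles = 4, High_5s = 2, Position = "Right_Corner"
)
predict(back_model, newdata = new_defender, interval = "prediction")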

Tech Resources:
• R language and RStudio

References:
1. https://www.prokabaddi.com/
2. https://www.kabaddiadda.com/
3. https://www.sportzcraazy.com/
4. https://kabaddian.com/
5. https://www.wikiwiki.in/
6. https://www.sportskeeda.com/
7. https://wikisportsbio.com/kabaddi/
8. https://www.news18.com/
9. https://prokabaddiarena.com/
10. https://www.kabaddiadda.com/tournament/87-pro-kabaddi-league-season-8/auction-summary
