Sports Video Analysis

RGB CAMERA BASED CRICKET SHOT
CLASSIFICATION AND PLAYER EXPERTISE

PREDICTION SYSTEM
Sampath Kumar Ramayanapu
Master of Technology Thesis

June 2017
International Institute of Information Technology, Bangalore

RGB CAMERA BASED CRICKET SHOT
CLASSIFICATION AND PLAYER EXPERTISE
PREDICTION SYSTEM
Submitted to International Institute of Information Technology, Bangalore
in Partial Fulfillment of
the Requirements for the Award of
Master of Technology
by
Sampath Kumar Ramayanapu

SMT2014019
International Institute of Information Technology, Bangalore

June 2017
To my parents, wife and son: for all the love, affection and guidance
To Dr. Dinesh Babu Jayagopi: for being the best advisor

Thesis Certificate
This is to certify that the thesis titled RGB Camera based Cricket shot classification
and player expertise prediction system submitted to the International Institute of
Information Technology, Bangalore, for the award of the degree of Master of Technology
is a bona fide record of the research work done by Sampath Kumar Ramayanapu,
SMT2014019, under my supervision. The contents of this thesis, in full or in parts, have
not been submitted to any other Institute or University for the award of any degree or
diploma.
Dr. Dinesh Babu Jayagopi
Bengaluru,
The 13th of June, 2017.
Abstract
Video analysis has many applications in sports, and a lot of research has been done on
the analysis of sports videos over the years. New technological innovations have made
vision-based research more interesting and efficient than ever before. Coaches and
athletes use video extensively to measure and correct technique, and to analyze team
and individual performances. We live in a world that uses wearables, sensors and other
tools to measure how an individual player is performing. Adding video-based
methodologies allows us to see exactly what is happening in real time. Learning through
visual methods has been shown to help individuals perform better when compared to other
methods. The advantage of video-based analysis is that it enables players, coaches and
trainers to re-evaluate a performance at any time by replaying the videos. This kind of
video analysis is applicable to any sport. In this thesis, we propose techniques for,
and present results on, the automatic analysis of cricket videos, which facilitates
shot classification and feedback about the way shots are played. We use human joint
locations as training features; shot classification is modeled using a Hidden Markov
Model (HMM) and expertise prediction is modeled as support vector regression (SVR). We
evaluate the proposed method on cricket videos collected from YouTube and on manually
captured data. Results show a shot classification accuracy of 91% and an expertise
prediction performance of 0.40, measured as the mean rank correlation.
Acknowledgements
I am truly honored and privileged to have worked in the Multimodal Perception Lab (MPL)
at IIIT-B under the supervision of Dr. Dinesh Babu Jayagopi. Working in MPL was a great
experience. First and foremost, I would like to thank my advisor, Dr. Dinesh Babu
Jayagopi, for providing an opportunity to work under his guidance. He gave me ample
motivation and freedom to take up this project and at the same time guided me in the
right direction. I am also thankful for his promptness in helping to collect the dataset
in a short time with the help of IIIT-B students. I have enjoyed interacting with him
during the entire course of the Project Elective and Thesis work.
I am also grateful to Samsung Research India, Bengaluru for sponsoring me and providing
an opportunity to pursue my M.Tech. I would also like to thank my Reporting Manager,
Team Members and Group Heads for helping and guiding me.
Most of all, I want to express my sincere gratitude to my beloved parents, wife and son
for standing by me at all times and encouraging me. Once again, my heartfelt thanks to
all.
Contents
Abstract
Acknowledgements
List of Figures
List of Tables
List of Abbreviations
1 Introduction
1.1 Overview about Cricket shots
1.2 Scope and Objectives
1.3 Outline of the work
1.4 Organization of this report
1.5 Applications
2 Background and Related Work
2.1 Background and preliminaries
2.1.1 Supervised and Unsupervised learning
2.1.2 Classification
2.1.3 SVM
2.1.4 HMM
2.1.5 Metrics
2.1.6 Representing Human Pose
2.2 Related work and state of the art
2.2.1 Human activity and action detection
2.2.2 Works related to Cricket
2.3 Motivation
3 System Overview
3.1 Player detection from video frames
3.2 Joint locations detection
3.3 Feature extraction
3.3.1 Angle between joints
3.3.2 Slope and Distance based features
3.4 Training a shot detector
3.5 Additional Features
4 Data Sets and Experimental Evaluation
4.1 YouTube Dataset
4.2 IIIT-B Cricket Dataset
4.3 Experiments and Results
4.3.1 Classification using SVM
4.3.2 Classification using HMM
4.3.3 Results with Youtube dataset
4.3.4 Results with IIIT-B dataset
4.3.5 Experiments with deep learning
5 Feedback and Quality assessment
5.1 Biomechanics of cricket shots
5.2 Player expertise assessment
5.2.1 Features
5.2.2 Expertise prediction using regression
5.2.3 Player expertise level classification
5.3 Summary of expertise assessment
6 CONCLUSION AND FUTURE WORK
Bibliography
A Individual Player expertise ratings

List of Figures
FC1.1 Lengths of delivery according to the Pitch Map
FC1.2 Different kinds of Cricket shots
FC1.3 Visualization of shots-Cover drive, Straight drive, Square cut, Pull shot
FC1.4 Overview of the proposed system
FC2.1 Components and flow of classification task
FC2.2 Illustration of Binary and Multi-class classification
FC2.3 Illustration of SVM optimal hyperplane
FC2.4 Representation of Hidden Markov Model
FC2.5 Confusion matrix representation
FC2.6 Deformable Parts Model representation
FC2.7 Pictorial Structures Model representation
FC2.8 Representation of HMM Model
FC3.1 System overview for shot classification training
FC3.2 System overview for shot classification inference
FC3.3 Block diagram for Player detection
FC3.4 Stick image illustration with detected joint locations
FC3.5 Illustration of Angle between joints
FC3.6 Slope between Shoulder and Thigh, Distance between Elbow and Thigh
FC3.7 Distance between Elbow and Shoulder
FC3.8 Distance between Hands and Head
FC3.9 Distance between Hands and Body
FC3.10 Distance between Legs
FC3.11 Spatio Temporal Interest Points
FC4.1 Screen shots of IIIT-B Cricket dataset
FC4.2 Confusion matrix : with Youtube dataset and SVM
FC4.3 Confusion matrix : with Youtube dataset and HMM
FC4.4 Confusion matrix : with IIIT-B dataset and HMM
FC4.5 Confusion matrix : with IIIT-B dataset and SVM
FC5.1 Principles for better shot accuracy
FC5.2 Body segments peak sequence to verify shot accuracy
FC5.3 Expertise prediction results
FC5.4 Confusion matrix for expertise classification

List of Tables
TC4.1 Youtube Dataset details
TC4.2 IIIT-B Dataset details
TC4.3 Training and Test data split with Youtube dataset
TC4.4 Accuracy comparison with Youtube dataset
TC4.5 Training and Test data split with IIIT-B dataset
TC4.6 Accuracy comparison with IIIT-B dataset
TC4.7 Transfer learning experiment
TC5.1 Training and Test data split for player expertise prediction
TC5.2 Player expertise prediction results using Regression
TC5.3 Training and Test data split for player expertise classification
TC5.4 Expertise classification results
TA1.1 Players expertise ratings (Players from 1 to 30)
TA1.2 Players expertise ratings (Players from 31 to 50)

List of Abbreviations
CDHMM: Continuous Density Hidden Markov Model
CNN: Convolutional Neural Networks
DHMM: Discrete Probability Hidden Markov Model
DNN: Deep Neural Networks
HMM: Hidden Markov Model
HOG: Histogram of Oriented Gradients
IIIT-B: International Institute of Information Technology Bangalore
LSP: Leeds Sports Pose dataset
MEMM: Maximum Entropy Markov Model
MPL: Multimodal Perception Lab
PMTK: Probabilistic Modeling Toolkit
RBF: Radial Basis Function
STIP: Space-Time Interest Points
SVM: Support Vector Machine
SVR: Support Vector Regression

CHAPTER 1
INTRODUCTION
Video analysis is a commonly used tool in modern-day sports that can provide a training
boost for individuals and teams. Coaches and trainers analyze video from live action
and training exercises, and the results of their analyses provide helpful feedback for
the players. Thanks to video analysis, players can gain a competitive edge, correct
faults and maximize their strengths. Sports academies and coaching centers provide
training to many players across the world, but not all of them are equipped with
real-time video analysis tools and technologies that could make learning more
efficient. A few licensed video analysis tools are available, but a royalty has to be
paid to use their services, and because of their high cost an individual player cannot
own them personally. Today, however, with the lower cost of cameras and the prevalence
of smartphones and tablets, video can be captured easily anytime, anywhere. If simple
video analysis methods that can be deployed on laptops and smartphones are implemented,
they will be of great benefit to many people.
The focus of this thesis is on the application of video analysis to Cricket videos.
Many different kinds of shots are played in Cricket, and a lot of learning and practice
is needed to play a shot perfectly. A player has to select an appropriate shot based on
the line, length and speed of a delivery. Even expert batsmen lose their wicket or
score fewer runs because of poor shot selection. Feedback about the quality of a shot
depends on the type of shot played, the timing of the shot and a few other parameters
such as body posture. Detecting the type of shot that was played is therefore a very
important part of video analysis. Once the shot is detected, appropriate visual
feedback can be given based on the type of shot.
1.1 Overview about Cricket shots
In Cricket, around 12 different kinds of shots are available, which can be played on
either side of the wicket. The choice of shot depends on the line, length and speed of
the ball and on other match conditions. The pitch map is shown in figure FC1.1. There
are three different lines of delivery of the ball.
• Off stump and outside
• Middle stump
• Leg stump and outside
There are five main lengths of delivery of the ball.
• Bouncer
• Short length
• Good length
• Full length
• Full toss
Figure FC1.1: Lengths of delivery according to the Pitch Map

A batsman needs proper footwork to get into the best position to play a shot. The
different types of shots a batsman can play are named below and shown in figure FC1.2.
• Hook shot
• Pull shot
• Square cut
• Back defense
• Off drive
• Straight drive
• Cover drive
• Sweep shot
• Forward defense
Figure FC1.2: Different kinds of Cricket shots
Drives are shots generally played off the front foot in front of the wicket (on drive,
straight drive, cover drive). Cuts are back-foot shots (square cut, late cut) in which
the batsman cuts the ball. A pull is a shot in which a short ball is pulled to the leg
side.
1.2 Scope and Objectives
The current work is primarily focused on the classification of different kinds of
Cricket shots from video sequences and on providing feedback about the way the shots
are played. Due to dataset limitations, our proposed model is trained to classify four
kinds of shots: Cover drive, Straight drive, Square cut and Pull shot. These shots are
illustrated in figure FC1.3. The model can easily be extended to classify more shots if
a corresponding dataset is available.
Figure FC1.3: Visualization of shots-Cover drive, Straight drive, Square cut, Pull shot
A player selects a shot based on factors such as the line and length of the ball. The
quality of the shot depends on many parameters, such as the timing of the shot,
appropriate body posture and the swing of the bat. In this thesis we have not
considered all of these parameters in determining the quality of the shots. We limit
the scope of feedback to predicting the player's expertise level based on the different
shots the player plays.
1.3 Outline of the work
A simple overview of the proposed system is shown in figure FC1.4. Each frame is
extracted from the videos and processed. The overall system functionality is explained
in the steps below.
1. Frames are extracted from the videos and the player (location as a bounding box) is
detected in each frame.
2. Pose estimation is performed for each frame based on the identified player location,
and joint locations are captured. Training features are computed from these joint
location values.
3. A Support Vector Machine (SVM) or Hidden Markov Model (HMM) is trained for each type
of shot using these training features.
4. Using these trained models, different types of shots are classified.
5. Suggestions/feedback about the quality of the shot are generated based on the type
of shot.
Figure FC1.4: Overview of the proposed system
1.4 Organization of this report
• Chapter 2 reviews related work and motivations, and briefly discusses human pose
estimation systems and action recognition methods.
• Chapter 3 describes the proposed solution for shot classification.
• Chapter 4 provides details about the datasets used in this work, the experiments
performed and a summary of the results.
• Chapter 5 provides details about the proposed methods for shot quality assessment.
• Chapter 6 summarizes the entire thesis with concluding remarks and directions for
future work.
1.5 Applications
Video analysis has many applications in sports, and these can also be extended to
scientific domains. Video analysis software can also be used for biomechanics research
and in injury rehabilitation. A few use cases are listed here.
1. Modeling from the best: Analyzing video of the best player at your position or in
your sport showcases the habits that player uses on a regular basis to succeed. Once
you have identified some techniques of the best players, you can work them into your
own game to improve it.
2. Injury prevention and recovery: Using video analysis you can study different
techniques and identify areas that must be changed to avoid injuring yourself in the
future.
3. Technique analysis: Video analysis is very useful for identifying and correcting
problems with a playing technique. Using video analysis we can measure many parameters,
such as body posture, the angle at which the ball is thrown, the trajectory of the ball
and the swing of the bat. All these measures can help a player enhance his technique.
4. Enhanced game plan: Video analysis can also be used to prepare for upcoming matches.
Watching videos and analyzing the techniques of upcoming opponents teaches you their
strengths and weaknesses, and enables you to formulate a game plan to deal with them
better.
CHAPTER 2
BACKGROUND AND RELATED WORK
In this chapter, we present a brief background on RGB-based human pose estimation
methods and on action recognition using Hidden Markov Models (HMM) and Support Vector
Machines (SVM). We then discuss how these action recognition methods can be extended to
shot classification in Cricket.
2.1 Background and preliminaries
In this section we present an overview of machine learning concepts such as the
classification problem, supervised learning models such as the SVM, statistical Markov
models such as the HMM, and human pose estimation methods.
2.1.1 Supervised and Unsupervised learning
In supervised learning, each training example is a pair consisting of an input X and a
desired output variable Y. We use an algorithm to learn the mapping function from the
input to the output.
$Y = f(X)$
The goal of this learning is to optimize a cost function so that, given new input data
(X'), we can predict the output variables (Y') accurately. Supervised learning problems
can be further grouped into regression and classification problems.
In unsupervised learning we have only input data (X) and no corresponding desired
output variables. The goal of unsupervised learning algorithms is to model the
underlying structure or distribution in the data in order to learn more about the data.
Unsupervised learning problems can be further grouped into clustering and association
problems.
2.1.2 Classification
Classification is a supervised learning task in which the goal is to predict the
categorical class labels of new instances based on past observations. E-mail spam
detection is an example of a binary classification task, in which the machine learning
algorithm learns a set of rules to distinguish between two possible classes, spam and
non-spam e-mail. However, the set of class labels does not have to be binary. The
predictive model learned by a supervised learning algorithm can assign any class label
that was present in the training dataset to a new, unlabeled instance. A typical
example of a multi-class classification task is handwritten character recognition.
Figure FC2.1 represents the flow of a classification task.
Figure FC2.1: Components and flow of classification task

Let the training data be represented as
$$[(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)], \quad x_i \in \mathbb{R}^m, \; y_i \in \{-1, +1\}$$
Consider the observations as points in $\mathbb{R}^m$, each with an associated sign
(+ or −, one for each of the two classes). A hypothesis for this problem is defined as
$f_{w,b}(x) = \operatorname{sign}(w^T x + b)$. It is illustrated in figure FC2.2a.
(a) Binary classification (b) Multiclass classification
Figure FC2.2: Illustration of Binary and Multi-class classification
This hypothesis divides the given data into two distinct sets,
$$w^T x + b > 0 \; (+) \quad \text{and} \quad w^T x + b < 0 \; (-).$$
The loss function is defined based on the number of misclassifications:
$$\text{loss} = \sum_i \left| y_i - \operatorname{sign}(f_{w,b}(x_i)) \right|$$
We need to minimize this loss function in order to increase the classification
accuracy. Binary classification can also be extended to multiple classes, which is
known as multi-class classification. It is illustrated in figure FC2.2b.
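As an illustration of the hypothesis and loss above, the following is a minimal sketch in plain Python. The toy data, the weight vector and the helper names (`predict`, `misclassification_count`) are assumptions made for this example and are not taken from our implementation; the loss is expressed directly as a count of misclassified examples.

```python
def sign(value):
    """Return +1 for positive scores and -1 otherwise (a zero score maps to -1)."""
    return 1 if value > 0 else -1


def predict(w, b, x):
    """Hypothesis sign(w^T x + b) for a single feature vector x."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sign(score)


def misclassification_count(w, b, samples):
    """Count the training pairs (x_i, y_i) whose predicted sign differs from y_i."""
    return sum(1 for x, y in samples if predict(w, b, x) != y)


# Hypothetical 2-D toy data: (feature vector, label in {-1, +1}).
toy_samples = [
    ((2.0, 1.0), 1),
    ((1.5, 2.5), 1),
    ((-1.0, -0.5), -1),
    ((-2.0, 1.0), -1),
]
toy_w, toy_b = (1.0, 0.0), 0.0

print(misclassification_count(toy_w, toy_b, toy_samples))
```

Flipping the sign of `toy_w` misclassifies every toy example, which shows how the loss separates a good hyperplane from a bad one.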
2.1.3 SVM
A Support Vector Machine (SVM) is a discriminative classifier that constructs a
separating hyperplane (in 2-dimensional space, a straight line) between data examples
belonging to two classes, such that the minimal distance between the points and the
separating hyperplane is maximized. The algorithm outputs an optimal hyperplane which
can categorize new examples.
In general there exist multiple hyperplanes that can separate the given data, as shown
in figure FC2.3a. We need to define a criterion for choosing the best one.
(a) Multiple Hyperplanes (b) Optimal Hyperplane obtained using SVM
Figure FC2.3: Illustration of SVM optimal hyperplane
The SVM algorithm is based on finding the hyperplane that gives the largest minimum
distance to the training examples. This distance is called the margin. The optimal
separating hyperplane therefore maximizes the margin of the training data. It is
illustrated in figure FC2.3b.
The training data is defined as $\{(\vec{x}_i, y_i), \; i = 1, 2, \ldots, N\}$.
The hyperplane is given by the equation
$$\vec{w} \cdot \vec{x} + b = 0$$
The training examples closest to the hyperplane are called support vectors. They lie on
two hyperplanes parallel to the optimal hyperplane, given by the equations
$$\vec{w} \cdot \vec{x} + b = -1$$
$$\vec{w} \cdot \vec{x} + b = +1$$
Maximizing the margin between these two support vector hyperplanes leads to the
following constrained optimization problem.
$$\min_{\vec{w}, b} \; \frac{1}{2} \lVert \vec{w} \rVert^2 \quad \text{subject to} \quad y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1, \quad i = 1, \ldots, N.$$
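The constraints and the margin can be checked numerically for a candidate hyperplane. The sketch below is an illustration under assumed toy data and does not solve the optimization problem itself; the function names are hypothetical. It uses the standard fact that the distance between the two hyperplanes $\vec{w} \cdot \vec{x} + b = \pm 1$ is $2 / \lVert \vec{w} \rVert$.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def satisfies_margin_constraints(w, b, samples):
    """True if y_i * (w . x_i + b) >= 1 holds for every training pair."""
    return all(y * (dot(w, x) + b) >= 1.0 for x, y in samples)


def margin_width(w):
    """Distance between the hyperplanes w . x + b = -1 and w . x + b = +1."""
    return 2.0 / math.sqrt(dot(w, w))


def support_vectors(w, b, samples, tol=1e-9):
    """Training points that lie on one of the two margin hyperplanes."""
    return [x for x, y in samples if abs(y * (dot(w, x) + b) - 1.0) <= tol]


# Hypothetical linearly separable toy data: (feature vector, label).
svm_samples = [
    ((1.0, 0.0), 1),
    ((3.0, 1.0), 1),
    ((-1.0, 0.0), -1),
    ((-2.0, 2.0), -1),
]
svm_w, svm_b = (1.0, 0.0), 0.0

print(satisfies_margin_constraints(svm_w, svm_b, svm_samples))
print(margin_width(svm_w))
print(support_vectors(svm_w, svm_b, svm_samples))
```

Scaling `svm_w` down shrinks the objective $\frac{1}{2}\lVert \vec{w} \rVert^2$ but violates the constraints, which is why the minimum is reached when the closest points sit exactly on the margin hyperplanes.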
2.1.4 HMM
The class of generative models encompasses a very powerful set of algorithms known as
Markov random processes and Hidden Markov Models. Hidden Markov Models have proved
effective for sequential data, which is why they have worked so well for speech-related
tasks. A discrete Markov process can be explained as follows: a system is assumed to be
in one of N possible states at any given time t, with the state at time t denoted by
$q_t$. The probabilistic description of the Markov chain is thus
$$P[q_t = S_j \mid q_{t-1} = S_i, q_{t-2} = S_k, \ldots]$$
A discrete first-order Markov chain depends only on the previous state, which leads to
the transition probabilities
$$a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i], \quad 1 \le i, j \le N$$
In the case of Hidden Markov Models, observations are probabilistic functions of the
state, and the underlying stochastic process is not observable, as shown in figure
FC2.4. An HMM has the following elements:
• Number of states (N): $S = \{S_1, S_2, \ldots, S_N\}$
• Alphabet size (M), the number of distinct observation symbols per state: $V = \{v_1, v_2, \ldots, v_M\}$
• State transition probability distribution (A): $A = \{a_{ij}\}$, $a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i]$
• Observation symbol probability distribution (B): $B = \{b_j(k)\}$, $b_j(k) = P[v_k \text{ at } t \mid q_t = S_j]$
• Initial state distribution: $\pi = \{\pi_i\}$, $\pi_i = P[q_1 = S_i]$
The parameter set of an HMM is given by $\lambda = (A, B, \pi)$. Based on the above
specification, the following problems can be solved for an observation sequence O:
• Testing: efficient computation of $P(O \mid \lambda)$, i.e. matching a model with an
observation sequence.
• Optimal state sequence computation: given O and $\lambda$, choose the corresponding
state sequence.
• Training: maximize $P(O \mid \lambda)$ by adjusting the model parameters.
The observation symbols can be either discrete symbols or continuous densities.
Accordingly, an HMM is either a Discrete Probability Hidden Markov Model (DHMM) or a
Continuous Density Hidden Markov Model (CDHMM).
Figure FC2.4: Representation of Hidden Markov Model
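The testing problem, computing $P(O \mid \lambda)$, is commonly solved with the forward algorithm. The following is a minimal sketch for a discrete HMM in plain Python; the two-state model and its probabilities are hypothetical values chosen only to exercise the function, and the sketch does not reproduce the toolkit used in our experiments.

```python
def forward_likelihood(A, B, pi, observations):
    """Compute P(O | lambda) for a discrete HMM with the forward algorithm.

    A[i][j] : P(q_{t+1} = S_j | q_t = S_i)
    B[j][k] : P(v_k at t | q_t = S_j)
    pi[i]   : P(q_1 = S_i)
    observations : list of observation symbol indices k
    """
    n_states = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(O_{t+1})
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
            for j in range(n_states)
        ]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)


# Hypothetical two-state, two-symbol model (each row sums to 1).
hmm_A = [[0.7, 0.3], [0.4, 0.6]]
hmm_B = [[0.9, 0.1], [0.2, 0.8]]
hmm_pi = [0.6, 0.4]

print(forward_likelihood(hmm_A, hmm_B, hmm_pi, [0, 1]))
```

For long sequences the products become very small, so practical implementations usually scale the forward variables or work with log-probabilities; that refinement is omitted here for brevity.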

2.1.5 Metrics
The performance of a classifier can be measured using the confusion matrix; see figure
FC2.5 for an example of the confusion matrix for a two-class problem. This matrix
summarizes the assignment of examples from each class to the predicted classes, using
results from all experiments in the cross-validation process. Based on the confusion
matrix, the following performance measures can be computed, where TP, FP and FN denote
the numbers of true positives, false positives and false negatives:
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F\text{-value} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
Figure FC2.5: Confusion matrix representation
For regression tasks we used Pearson's correlation coefficient [1], which measures the
linear relationship between predicted and actual values. The higher the correlation
value, the better the accuracy.
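The measures in this section are straightforward to compute. The sketch below is a minimal plain-Python illustration; the counts and value lists in the example are made up for demonstration and are not results from our experiments.

```python
import math


def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-value from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_value = 2 * recall * precision / (recall + precision)
    return precision, recall, f_value


def pearson_correlation(predicted, actual):
    """Pearson's correlation coefficient between two equal-length value lists."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    covariance = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return covariance / math.sqrt(var_p * var_a)


# Hypothetical counts and ratings, for illustration only.
print(precision_recall_f(tp=8, fp=2, fn=2))
print(pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Both functions assume non-degenerate input: at least one positive prediction and one positive example for the first, and non-constant value lists for the second.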
2.1.6 Representing Human Pose
Human pose is generally represented using Deformable Parts Models, as shown in figure
FC2.6, and Pictorial Structure models, as shown in figure FC2.7. Each body part is
represented with features calculated using the Histogram of Oriented Gradients (HOG). A
few methods based on Deep Neural Networks (DNN) are also available to estimate human
pose.
Figure FC2.6: Deformable Parts Model representation
Figure FC2.7: Pictorial Structures Model representation

2.2 Related work and state of the art
In this section we review related work in the fields of video analysis and Cricket shot
classification.
2.2.1 Human activity and action detection
The ability to recognize human activities and actions enhances the capabilities of a
robot that interacts with humans. However, automatic detection of human activities can
be challenging due to the individual nature of the activities. A lot of research has
been done on action and activity recognition with different features and learning
methods. We discuss a few methods here.
Human action and activity recognition has been previously studied by a number
of different authors. One common approach is to use space time features to model
points of interest in video [2]. Several authors have supplemented these techniques by
adding more information to these features [3–5]. Other, less common approaches for
activity recognition include filtering techniques [6]. Hierarchical techniques for activity
recognition have been used as well, but these typically focus on neurologically inspired
visual cortex-type models [7]. Often these authors adhere faithfully to the models of
the visual cortex, using motion-direction sensitive cells such as Gabor filters in the first
layer [3]. Another class of techniques used for activity recognition is that of the hidden
Markov model (HMM). Early work by Brand, Oliver, and Pentland (1997) [8] utilized
coupled HMMs to recognize two-handed activities. Weinland et al. [9] utilize an HMM
together with a three-dimensional occupancy grid to model three dimensional humans.
Martinez-Contreras et al. (2009) [10] utilize motion templates together with HMMs to
recognize human activities. Sminchisescu et al. (2005) [11] utilized conditional random
fields and maximum-entropy Markov models for activity recognition, arguing that these
models overcome some of the limitations of hidden Markov models. Jaeyong Sung and Colin
Ponce [12] used an RGBD sensor (Microsoft Kinect) as the input sensor to capture 3D
human pose, and a hierarchical Maximum Entropy Markov Model (MEMM) to model the
activities. In recent times the use of deep learning and neural networks has also been
increasing. Shuiwang Ji and Wei Xu [13] discussed the use of a 3D Convolutional Neural
Network (CNN) model for action recognition.
2.2.2 Works related to Cricket
A lot of research is going on in cricket analysis. Hawk-Eye [14] is an advanced
coaching system for cricket. Rahish Tandon and Dr. Amitabha proposed semantic analysis
of broadcast video [15], which involves the use of auxiliary cues to detect events.
Another approach performs shot boundary detection and shot classification based on
multi-scale spatio-temporal analysis of color and optical flow features [16]. The
research of David Lowe and Timor [17] on the SIFT feature descriptor [18, 19] helps to
identify the direction of objects and, integrated with optical flow, can help to detect
the direction of movement of the bat, ball and body parts. Bangpeng Yao and Li Fei-Fei
studied modeling the mutual context of object and human pose in human-object
interaction activities [20]. A machine also has to recognize the human pose through an
effective learning approach [21]. Ashwani Aggarwal and Susmit Biswas proposed a
technique for object detection and motion estimation from MPEG video using background
subtraction [22]. Various filter-based tracking approaches such as Kalman [23] and KLT
[24] have been proposed with several modifications. These have played an important role
in the fields of object detection and tracking as well as in action recognition.
Mubarak Shah, Javed Ahmed and Mikel D. Rodriguez proposed an action MACH filter [25]
for action recognition, which can distinguish between different sports activities.
However, this approach needs a large learning set to build the template class of the
action corresponding to a specific sports activity. Debajyoti and AZM Chowdhury
proposed a method to classify shots based on optical flow vectors computed from each
video frame [26]. An average accuracy of 60% is reported using this approach.
Harikrishna and Sanjeev [27] used a neural network approach to classify cricket video
events such as boundary, four and out.
2.3 Motivation
Considering the shot classification approach proposed by Debajyoti and AZM Chowdhury
[28], which is based on motion vectors, and action recognition methods based on HMMs,
we propose a method to classify different cricket shots. The features we use are based
on player joint locations (also known as skeleton features), which can capture complete
player body movements. Along with joint locations, we use Space-Time Interest Points
(STIP) to represent spatial and temporal events, which improved the performance of the
classifiers. Using these feature vectors, the classification task is modeled using a
Hidden Markov Model (HMM), as shown in figure FC2.8, where $a_{ij}$ represents the
transition probabilities and $b_{ik}$ the observation probabilities. A separate HMM is
trained for each kind of cricket shot. Inference is based on the maximum likelihood of
the test data.
Figure FC2.8: Representation of HMM Model
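The inference rule, one HMM per shot type and the label of the most likely model as the prediction, can be sketched as follows. This is a simplified illustration that assumes discrete observation symbols and hand-picked toy parameters; the model values, the symbol encoding and the function names are hypothetical and do not correspond to the trained models reported later.

```python
def sequence_likelihood(model, observations):
    """Forward-algorithm likelihood P(O | lambda) for a discrete HMM (A, B, pi)."""
    A, B, pi = model
    states = range(len(pi))
    alpha = [pi[i] * B[i][observations[0]] for i in states]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in states) * B[j][symbol]
            for j in states
        ]
    return sum(alpha)


def classify_shot(models, observations):
    """Return the label whose HMM assigns the highest likelihood to the sequence."""
    return max(
        models,
        key=lambda label: sequence_likelihood(models[label], observations),
    )


# Hypothetical toy models: one (A, B, pi) triple per shot type.
shared_A = [[0.8, 0.2], [0.2, 0.8]]
shared_pi = [0.5, 0.5]
shot_models = {
    # This model mostly emits symbol 0.
    "cover_drive": (shared_A, [[0.9, 0.1], [0.8, 0.2]], shared_pi),
    # This model mostly emits symbol 1.
    "pull_shot": (shared_A, [[0.1, 0.9], [0.2, 0.8]], shared_pi),
}

print(classify_shot(shot_models, [0, 0, 0, 1]))
```

Adding a further shot type only requires training one more model and adding it to the dictionary; the decision rule itself does not change.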

CHAPTER 3
SYSTEM OVERVIEW
The overall functionality of the cricket shot classification system is divided into five
major parts.
1. Frames are extracted from videos and the player (location as a bounding box) is detected
in each frame.
2. Pose estimation is performed based on the identified player location for each
frame and the joint locations are captured.
3. Using these joint location values, training features are computed.
4. Using these training features, an HMM is trained for each type of shot.
5. Using these trained HMM models, classification is performed on test data.
These steps are illustrated in figure FC3.1 for HMM training and in figure FC3.2 for testing.
Each step is discussed in detail below.
Figure FC3.1: System overview for shot classification training
3.1 Player detection from video frames
For detecting players in video frames we used the YOLO object detector [29],
a state-of-the-art, real-time object detection system that can detect over 9000 object
categories. Each video frame is given to the object detector interface, which returns the
location and type of each object it has detected. Since we are interested only in person
detection, we filtered the results to keep only persons. Multiple players might be present
in a single frame. Since we are interested in the location of the batsman, we set filtering
conditions on the results to keep only the location of the striking batsman. The object
detector gives each location as bounding box coordinates
$B = [X_t, Y_l, X_w, Y_h]$, where $(X_t, Y_l)$ gives the starting location of the bounding box,
$X_w$ gives the width of the bounding box and $Y_h$ gives the height of the bounding box. The
center of the bounding box is computed as $X_c = \frac{X_t + X_w}{2}$ and
$Y_c = \frac{Y_l + Y_h}{2}$. The block diagram for player
detection is illustrated in figure FC3.3. We set the detection threshold to 0.25. Lower
values led to many duplicate and overlapping object
detections, so 0.25 was chosen as the threshold value.
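The filtering and center computation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis implementation: the `Detection` record, the helper names and the "largest person box" rule for choosing the striker are hypothetical, and the box is assumed to be stored as a top-left corner plus width and height. Under that assumption the midpoint is the corner plus half the extent; the printed expression coincides with it only if $X_w$ and $Y_h$ are read as the coordinates of the opposite corner.

```python
from typing import List, NamedTuple, Optional, Tuple


class Detection(NamedTuple):
    """Hypothetical detector output: label, score and an (x, y, w, h) box."""
    label: str
    score: float
    x: float  # left edge of the box
    y: float  # top edge of the box
    w: float  # box width
    h: float  # box height


def select_persons(dets: List[Detection], threshold: float = 0.25) -> List[Detection]:
    """Keep only person detections whose score reaches the threshold."""
    return [d for d in dets if d.label == "person" and d.score >= threshold]


def box_center(det: Detection) -> Tuple[float, float]:
    """Center of a box given as top-left corner plus width and height."""
    return det.x + det.w / 2.0, det.y + det.h / 2.0


def pick_striker(persons: List[Detection]) -> Optional[Detection]:
    """Assumed rule for illustration only: take the largest person box."""
    if not persons:
        return None
    return max(persons, key=lambda d: d.w * d.h)
```

The striker selection rule would need to be replaced by whatever filtering conditions suit the camera setup; the source does not specify them.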
Figure FC3.2: System overview for shot classification inference
Figure FC3.3: Block diagram for Player detection
3.2 Joint locations detection
For detecting the joint locations of players in video frames we used pose estimation
with the Iterative Error Feedback approach [30], which is a CNN-based hierarchical feature
extractor model with top-down feedback. This model was trained on the MPII
Human Pose dataset and the Leeds Sports Pose (LSP) dataset.
Figure FC3.4: Stick image illustration with detected joint locations
The input video frame, along with the player location detected in the previous step,
is passed as input to the pose estimation system. It outputs a set of 17 joint locations
as (X, Y) positions in the input image, as illustrated in figure FC3.4. The location of the
17th joint is the same as the center of the bounding box which we gave as input to the pose
estimation framework. In case of occlusions and hidden joints, those locations are
approximated by the feedback and correction steps of
the pose estimation framework.
3.3 Feature extraction
Joint locations extracted using the pose estimation framework serve as the base for generating
feature vectors for the shot detection system. Joint locations are image coordinates and are
specific to the image, so they can change based on the physical dynamics of the person. We
need to convert these values to relative features which represent a person-independent
feature set. To achieve this we derived a few features such as joint angles,
slope values between joints and distances between selected joints. These are explained
below.
3.3.1 Angle between joints
As illustrated in figure FC3.5b, we compute 10 different angles based on the
joint locations.
(a) Joint locations

(b) Considered angle between joints
Figure FC3.5: Illustration of Angle between joints
Since each shot is played in a different way, these angle values vary across shots
based on the player's action, so these features can be used to differentiate between shots.
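One way to compute such an angle is from three joint positions, taking the angle at the middle joint between the two segments that meet there. The sketch below is illustrative only: the function name and the choice of degrees are assumptions, and which joint triples make up the 10 angles is given by figure FC3.5b, not by this code.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> Optional[float]:
    """Angle in degrees at joint b between segments b->a and b->c.

    Returns None when either segment has zero length, for example when
    two joints were detected at the same pixel.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0.0 or n2 == 0.0:
        return None
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))
```

Because the angle depends only on the directions of the two segments, it does not change when the player appears larger or smaller in the frame, which is the person-independence property the features are meant to have.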
3.3.2 Slope and Distance based features
The following additional features are computed based on the joint locations.
1. Slope between Shoulder and Thigh (figure FC3.6)
2. Distance between Elbow and Thigh (figure FC3.6)
3. Distance between Elbow and Shoulder (figure FC3.7)
4. Distance between Hands and Head (figure FC3.8)
5. Distance between Hands and Body (figure FC3.9)
6. Distance between Legs (figure FC3.10)
Figure FC3.6: Slope between Shoulder and Thigh, Distance between Elbow and Thigh
Figure FC3.7: Distance between Elbow and Shoulder

Figure FC3.8: Distance between Hands and Head
Figure FC3.9: Distance between Hands and Body
Figure FC3.10: Distance between Legs
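The slope and distance features listed in this section reduce to two small geometric helpers on pairs of joint positions. The following is a sketch under assumptions: the function names are hypothetical, distances are plain Euclidean distances in pixels, and a vertical segment is reported as an infinite slope, which a practical feature extractor would probably clip or replace with an angle.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def slope(p: Point, q: Point) -> float:
    """Slope of the line through p and q; infinite for a vertical segment."""
    dx = q[0] - p[0]
    dy = q[1] - p[1]
    if dx == 0.0:
        return math.inf
    return dy / dx


def distance(p: Point, q: Point) -> float:
    """Euclidean distance between two joint positions."""
    return math.hypot(q[0] - p[0], q[1] - p[1])
```

For instance, the first feature would be `slope(shoulder, thigh)` and the last `distance(left_leg, right_leg)`, where the joint variables are placeholders for the corresponding pose estimation outputs.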

3.4 Training a shot detector
The problem of shot recognition or classification can be represented as a function
that maps a set of features corresponding to a video clip to one of the types of cricket shot.
Let $X_i = \{X_{i,j}\}_{j=1}^{k_i}$ be the feature vectors of video $i$. The set of labeled
videos can be depicted as tuples $\{(\vec{x}_i, y_i)\}_{i=1}^{N}$, where $N$ is the total
number of labeled video instances and $y_i \in C$, $C = \{C_k\}_{k=1}^{l}$, where $l$ is the
total number of shot types. A classifier can be
trained over the training set so that it maximizes the log likelihood of the labeled instances
being generated from the given model. Once a model is trained, it is used to classify a set of
unlabeled videos represented by $\{\vec{x}_i\}_{i=1}^{N_t}$, where $N_t$ represents the number
of test video sequences.
We used HMMs for modeling and testing different types of cricket shots. HMMs are
widely used in scenarios where time series data with spatio-temporal variations
needs to be learnt. A simple illustration is shown in figure FC2.8. We generated an HMM
for each type of shot, represented as $\lambda_i = (A, B, \pi)$. So $l$
different HMMs are learnt, one for each shot type:
$\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_l\}$.
The number of HMM states is selected empirically. A mixture of Gaussians is
used to model each observation. We used two mixtures per feature in our
evaluations. The model parameters are learnt so that they maximize the
likelihood of the observations $P(O \mid \lambda_i)$ on the given training data
set.
Once the models are learnt, we classify a shot by choosing the model with the maximum
log likelihood:
$\hat{\lambda} = \arg\max_{\lambda_a \in \text{AllShots}} P(O \mid \lambda_a)$
where $P(O \mid \lambda_a)$ is the likelihood of the observed shot sequence $O$ under model $\lambda_a$.
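The decision rule above can be illustrated with a small, self-contained sketch. It is deliberately simplified and is not the PMTK-based implementation used in this work: observations are assumed to be discrete symbols rather than Gaussian-mixture emissions, the model parameters are assumed to be already trained, and the class and function names are hypothetical. The log likelihood is computed with the scaled forward algorithm, and the predicted shot is the model with the highest score.

```python
import math
from typing import Dict, List, Sequence


class DiscreteHMM:
    """HMM lambda = (A, B, pi) with discrete observation symbols."""

    def __init__(self, pi: List[float], A: List[List[float]], B: List[List[float]]):
        self.pi = pi  # pi[s]: initial probability of state s
        self.A = A    # A[r][s]: transition probability from state r to s
        self.B = B    # B[s][k]: probability of emitting symbol k in state s

    def log_likelihood(self, obs: Sequence[int]) -> float:
        """log P(O | lambda) via the forward algorithm with per-step scaling."""
        if not obs:
            raise ValueError("observation sequence must not be empty")
        n = len(self.pi)
        alpha = [self.pi[s] * self.B[s][obs[0]] for s in range(n)]
        log_p = 0.0
        for t in range(len(obs)):
            if t > 0:
                alpha = [
                    sum(alpha[r] * self.A[r][s] for r in range(n)) * self.B[s][obs[t]]
                    for s in range(n)
                ]
            scale = sum(alpha)
            if scale == 0.0:
                return -math.inf  # sequence is impossible under this model
            alpha = [a / scale for a in alpha]
            log_p += math.log(scale)
        return log_p


def classify(obs: Sequence[int], models: Dict[str, DiscreteHMM]) -> str:
    """Return the shot type whose HMM gives the highest log likelihood."""
    return max(models, key=lambda name: models[name].log_likelihood(obs))
```

Scaling at every time step keeps the forward variables from underflowing on long sequences; the sum of the log scale factors equals the log likelihood of the whole sequence.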

3.5 Additional Features
Along with joint location features, we also considered Space Time Interest Point
(STIP) features. Local space-time features [2, 31] are widely used for
action recognition [32]. STIP features capture salient visual patterns in a space-time
image volume by extending the local spatial image descriptor to the space-time
domain. Obtaining local space-time features has two steps: spatio-temporal interest
point (STIP) detection followed by feature extraction. A video is given as input to the STIP
engine, which outputs a set of interest points that capture the spatio-temporal variations
in the video, as shown in figure FC3.11.
Figure FC3.11: Spatio Temporal Interest Points
Each interest point is described by a combination of HOG and HOF features. These STIP
features are used along with the joint location features and fed to the HMM framework for
learning.
CHAPTER 4
DATA SETS AND EXPERIMENTAL EVALUATION
We created two datasets for training, validation and testing purposes. The first dataset
is based on cricket videos downloaded from YouTube. The second dataset is our own,
which we created by recording multiple shots played by different people.
4.1 YouTube Dataset
We downloaded a few videos from YouTube [33] and extracted a few shots from them.
For this dataset we considered only two types of shots, Cover Drive and Pull
Shot. Each video has only one shot sequence. Details about the number of videos are
presented in table TC4.1.
Table TC4.1: YouTube dataset details
Type of Shot      No. of Videos    No. of Shot Sequences
Cover Drive       50               50
Pull Shot         50               50
4.2 IIIT-B Cricket Dataset
We created another dataset by recording videos of different people playing
cricket. It consists of four different kinds of shots, namely Cover Drive, Straight Drive,
Square Cut and Pull Shot. Details about the number of videos are listed in table
TC4.2.
Table TC4.2: IIIT-B dataset details
Type of Shot      No. of Videos    No. of Shot Sequences
Cover Drive       107              215
Pull Shot         105              211
Square Cut        106              212
Straight Drive    110              220
This dataset is a mix of videos with a single shot sequence and videos with multiple shot
sequences, i.e. videos in which the same shot is played multiple times. The number of
individual shot sequences is therefore larger than the number of video files. Screenshots
from this dataset are shown in figure FC4.1.
Figure FC4.1: Screen shots of IIIT-B Cricket dataset

4.3 Experiments and Results
For our classification tasks we experimented with SVM classifiers and HMM-based
classifiers. Along with these we also experimented with deep learning based models.
We used the standard Liblinear [34] library for SVM and the Probabilistic Modeling Toolkit
(PMTK) [35] for HMM. For deep learning based experiments we used the Long-term Recurrent
Convolutional Networks approach proposed by Donahue et al. [36].
4.3.1 Classification using SVM
In order to classify using SVM, we used the Bag of visual words model [37]. Training an
SVM classifier is performed in the following steps:
1. Separate the data set into one data set per class.
2. Features are extracted from videos as described in chapter 3.
3. The extracted features are then clustered using k-means [38] with N clusters in
order to form a visual codebook with N words. In our experiments we used 20
clusters.
4. A Bag of words [37] is then constructed for each example (video sequence) based
on the occurrences of the codewords in the given example.
5. The Bag of words features are then fed to a multi-class non-linear SVM [34] to
learn the classifier.
6. The test data is fed to the classifier to get the predicted class label.
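Step 4 above can be sketched as follows. The sketch assumes that the codebook (the k-means cluster centers from step 3) is already available as a list of vectors; the function names are hypothetical, and normalizing the histogram so that it sums to one is an added assumption that makes videos with different numbers of frames comparable.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_codeword(feature: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook entry closest to the feature (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(feature, codebook[k]))


def bag_of_words(features: List[Vector], codebook: List[Vector]) -> List[float]:
    """Normalized histogram of codeword occurrences for one video sequence."""
    hist = [0.0] * len(codebook)
    for feature in features:
        hist[nearest_codeword(feature, codebook)] += 1.0
    total = sum(hist)
    if total == 0.0:
        return hist  # no features extracted; return an all-zero histogram
    return [count / total for count in hist]
```

With 20 clusters, as used in the experiments, each video sequence is represented by a 20-dimensional histogram regardless of its length, and this fixed-length vector is what the SVM is trained on.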

4.3.2 Classification using HMM
We used the standard PMTK library [35] for classification experiments using HMM.
The classification task is performed in the following steps:
1. Separate the data set into one data set per class.
2. Features are extracted from videos as described in chapter 3.
3. Train one HMM per class and generate the model.
4. On the test data set, compare the likelihood under each model to classify the sequence.
4.3.3 Results with Youtube dataset
As a preprocessing step we extracted individual shots from the videos downloaded
from YouTube and labeled them with two classes of shots, namely Cover Drive and
Pull Shot. Table TC4.3 shows the number of training and test
videos.
Table TC4.3: Training and test data split with YouTube dataset
Type of Shot      Training Videos    Test Videos
Cover Drive       35                 5
Pull Shot         35                 5
Total             70                 10
From these videos features are extracted as described in chapter 3. These features
are used to train and validate SVM and HMM based classifiers. Accuracy comparison
values are shown in table TC4.4.
Table TC4.4: Accuracy comparison with YouTube dataset
Type of Shot      Accuracy with SVM (%)    Accuracy with HMM (%)
Cover Drive       80                       71.4
Pull Shot         80                       85.7
Total Accuracy    80                       78.5
For SVM with k-means clustering, we used twenty clusters, with which we
obtained an average accuracy of 80 percent. For HMM, we used five hidden states
and two Gaussian mixtures as HMM parameters, with which we obtained around 78 percent
accuracy. Confusion matrices are plotted to compare the performance of the two
classifier models. The overall performance of SVM (figure FC4.2) is slightly better than
that of HMM (figure FC4.3).
(a) Non normalized Confusion matrix (b) Normalized Confusion matrix
Figure FC4.2: Confusion matrix : with Youtube dataset and SVM

Figure FC4.3: Confusion matrix : with Youtube dataset and HMM
4.3.4 Results with IIIT-B dataset
The IIIT-B cricket dataset contains videos of four different shots, namely Cover
Drive, Pull Shot, Square Cut and Straight Drive. Table TC4.5 shows the
number of training and test videos.
Table TC4.5: Training and test data split with IIIT-B dataset
Type of Shot      Training Videos    Test Videos
Cover Drive       173                40
Pull Shot         171                40
Square Cut        172                40
Straight Drive    180                40
Total             696                160
The same experiments as for the YouTube dataset were conducted with this dataset.
Classification is performed using SVM and HMM with the same configuration parameters.
Results for the two classifiers are compared in table TC4.6.
Table TC4.6: Accuracy comparison with IIIT-B dataset
Type of Shot      Accuracy with HMM (%)    Accuracy with SVM (%)
Cover Drive       97.5                     67.4
Pull Shot         90                       67.7
Square Cut        87.5                     60.4
Straight Drive    90                       67.7
Total Accuracy    91.25                    65.8
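As a consistency check on the HMM column, the per-class accuracies can be converted back to counts of correctly classified test videos and recombined into the total. The sketch below assumes 40 test videos per class, as listed in table TC4.5; the function names are hypothetical.

```python
from typing import Dict

TEST_VIDEOS_PER_CLASS = 40  # from the training and test split table

# Per-class HMM accuracies (%) as reported for the IIIT-B dataset.
HMM_ACCURACY = {
    "Cover Drive": 97.5,
    "Pull Shot": 90.0,
    "Square Cut": 87.5,
    "Straight Drive": 90.0,
}


def correct_counts(accuracy: Dict[str, float], n_per_class: int) -> Dict[str, int]:
    """Number of correctly classified test videos implied by each accuracy."""
    return {shot: round(acc / 100.0 * n_per_class) for shot, acc in accuracy.items()}


def overall_accuracy(counts: Dict[str, int], n_per_class: int) -> float:
    """Total accuracy (%) over all classes with equal test set sizes."""
    return 100.0 * sum(counts.values()) / (n_per_class * len(counts))
```

The implied counts are 39, 36, 35 and 36 correct videos, that is 146 out of 160, which reproduces the reported total accuracy of 91.25%.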
Confusion matrices are plotted to compare the performance of the two classifier models.
Since this dataset is considerably larger than the YouTube dataset and contains more
variation, the HMM classifier (figure FC4.4) performs better than SVM (figure FC4.5).
Figure FC4.4: Confusion matrix : with IIIT-B dataset and HMM

Figure FC4.5: Confusion matrix : with IIIT-B dataset and SVM
4.3.5 Experiments with deep learning
Starting from the Long-term Recurrent Convolutional Network (LRCN) based activity
recognition model proposed in [36], we
tried some transfer learning experiments with our dataset. The base model was trained
on the UCF101 dataset [39]. UCF101 is an action recognition data set of realistic action
videos with 101 action categories. This pre-trained model can recognize the action type
as cricket batting or cricket bowling, but it cannot recognize which type of shot is played.
To achieve this we applied transfer learning [40] with our dataset and generated a
new model. With this data, we observed 60 percent classification accuracy.
Table TC4.7: Transfer learning experiment
Training data    Test data    Accuracy (%)
80               10           60
CHAPTER 5
FEEDBACK AND QUALITY ASSESSMENT
As a part of this thesis we limit the scope of the feedback and quality assessment
task to player expertise prediction only. We propose a system for assessing a player's
expertise rating based on how well they played different cricket shots. In our approach,
we trained a regression model from skeletal features (joint locations) and spatio-temporal
(STIP) features to the expertise scores assigned to each player. Ratings are given on a
scale of 1 to 5, where 1 is poor and 5 is best. These ground truth scores
are obtained from expert cricket players. For these experiments we used the IIIT-B cricket
dataset, which consists of four different shots played by 50 players.
5.1 Biomechanics of cricket shots
The stance [41] is the "ready" position when the batsman is about to face a delivery.
To strike a cricket ball, a balanced and stable base must first be created. An ideal stance
is "comfortable, relaxed and balanced," with the feet 40 cm apart, parallel and astride
the crease. The front shoulder should be pointing down the wicket, the head facing the
bowler, the weight equally balanced and the bat near the back toe. As the ball is about
to be released, the batsman lifts his bat up behind him in anticipation of playing a
stroke, and shifts his weight onto the balls of his feet. At this point the player is ready to
move swiftly into position to address the ball. Figure FC5.1 shows a few principles
that need to be considered to achieve better shot accuracy.
Figure FC5.1: Principles for better shot accuracy
According to the case study presented by Dr Paul Hurrion [42] on the optimal rotational
moments of body segments for better shot accuracy, all body segments should
accelerate and decelerate in the correct sequence with specific timing before impact.
The peaking sequence of the major segments for best energy transfer is: pelvis (stable
front ankle, knee and pelvis), thorax (chest / shoulders), both arms and wrists and finally
the bat handle. The motion should occur sequentially, with each peak speed being higher
and later (closer to impact) than the previous one. This pattern is necessary to efficiently
transfer energy and accelerate each body segment. Figure FC5.2 illustrates the use of the
body segment peak sequence to verify shot accuracy.
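The sequencing rule described above, each peak speed higher and later than the previous one, can be expressed as a simple check. This is an illustrative sketch only: it is not part of the system evaluated in this thesis, the function name is hypothetical, and the peak times and speeds in the usage note are made-up numbers rather than measurements.

```python
from typing import List, Tuple

# (segment name, time of peak speed in seconds, peak speed in arbitrary units)
Peak = Tuple[str, float, float]


def follows_peak_sequence(peaks: List[Peak]) -> bool:
    """True if every segment peaks later and faster than the one before it.

    The list must be given in the expected order: pelvis, thorax,
    arms and wrists, bat handle.
    """
    return all(
        later[1] > earlier[1] and later[2] > earlier[2]
        for earlier, later in zip(peaks, peaks[1:])
    )
```

For example, `follows_peak_sequence([("pelvis", 0.10, 3.0), ("thorax", 0.14, 4.5), ("arms", 0.18, 7.0), ("bat", 0.21, 12.0)])` holds, while swapping the peak times of the thorax and arms entries would break the sequence.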
Figure FC5.2: Body segments peak sequence to verify shot accuracy
5.2 Player expertise assessment
In this section we present our system for assessing the quality or expertise of a
player based on multiple shots played by that player. Our system learns a regression model
from spatio-temporal features and skeletal features and the corresponding ground truth
scores given by expert players. The set of labeled videos can be depicted as tuples
$\{(\vec{X}_i, Y_i)\}_{i=1}^{N}$, where $N$ is the total number of labeled video instances and
$Y_i \in \mathbb{R}$ denotes the ground truth quality score of the player in video $i$. With
these features we modeled
an expertise assessment system using support vector regression to predict an expertise
score. In addition, we modeled a binary classifier which categorizes the players
into two classes, namely Proficient and Learners.
5.2.1 Features
Human experts are typically trained over many years to develop complex rules to
evaluate sports actions. For machines to achieve this task, they must be
provided with similar rules as well. Accessing the same visual features that human
experts perceive is complex, so for our experiments we considered the following
skeleton-based features.
1. Angles between joints as described in section 3.3.1.
2. Slope and Distance based features as described in section 3.3.2.
3. Normalized joint locations with respect to location of head.
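The third feature can be sketched as a translation of every joint so that the head becomes the origin. The function name and the index of the head joint are assumptions, since the joint ordering of the pose estimation output is not listed here, and the sketch applies translation only, with no rescaling.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def normalize_to_head(joints: List[Point], head_index: int = 0) -> List[Point]:
    """Express all joint locations relative to the head joint.

    head_index is a placeholder; it must match the position of the head
    in the pose estimation output being used.
    """
    head_x, head_y = joints[head_index]
    return [(x - head_x, y - head_y) for x, y in joints]
```

After this step the features no longer depend on where the player stands in the frame, although they still depend on the player's apparent size.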
5.2.2 Expertise prediction using regression
Table TC5.1 shows the dataset split for expertise prediction using regression.
Since expert ratings for the players are not available at this point in time, we used
ratings given by players with good knowledge of cricket. We collected ratings from
multiple people and took the average of these ratings as the ground truth expertise
rating. Individual player expertise ratings are shown in Appendix A.
Table TC5.1: Training and test data split for player expertise prediction
Number of Players    Training Videos    Test Videos    Expertise Scale
50                   700                150            1-5
With all the features mentioned in section 5.2.1, we trained a regression model using
Support Vector Regression (SVR). Prediction accuracy is measured on the test
dataset. The best mean rank correlation we observed was 0.40 (higher is better).
Figure FC5.3: Expertise prediction results
Table TC5.2: Player expertise prediction results using regression
Kernel Type    Max Rank Correlation    Mean Rank Correlation
Linear         0.12                    0.07
RBF            0.51                    0.40
We trained the regression model with both linear and Radial Basis Function (RBF)
kernels. Results are plotted in figure FC5.3. As shown in table TC5.2, the regression model
with the RBF kernel performed better, with 0.40 mean rank correlation compared
to 0.07 with the linear kernel.
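For reference, a rank correlation between predicted and ground truth scores can be computed as shown below. The sketch assumes that the rank correlation reported here is Spearman's coefficient, that is the Pearson correlation of the ranks, with tied values receiving their average rank; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def spearman(predicted: Sequence[float], truth: Sequence[float]) -> float:
    """Spearman rank correlation between predictions and ground truth."""
    return pearson(ranks(predicted), ranks(truth))
```

A value of 1 means the predicted scores order the players exactly as the ground truth ratings do, 0 means no monotonic relationship, and -1 means the order is reversed.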
5.2.3 Player expertise level classification
We modeled a binary classifier which categorizes the players into two expertise
levels, namely Proficient and Learners. For this classification task, we modified the
dataset used for regression in section 5.2.2 as follows.
• Players with expertise rating 3, 4 and 5 are labeled as Proficient.
• Players with expertise rating 1 and 2 are labeled as Learners.
Table TC5.3 shows the dataset split after these changes.
Table TC5.3: Training and test data split for player expertise classification
Expertise Level    Training Videos    Test Videos
Proficient         415                87
Learner            285                71
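The relabeling rule used to build this split is a simple threshold on the 1 to 5 rating. A minimal sketch, with a hypothetical function name:

```python
def expertise_level(rating: int) -> str:
    """Map a 1-5 expertise rating to the binary expertise level."""
    if not 1 <= rating <= 5:
        raise ValueError("rating must be between 1 and 5")
    return "Proficient" if rating >= 3 else "Learner"
```

Applied to the averaged ratings in Appendix A, this assigns every video of a player to that player's expertise level.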
With these features we trained a support vector classifier. We
observed a total classification accuracy of 69.8%. Table TC5.4 shows the classification
results.
Table TC5.4: Expertise classification results
Expertise Level    Accuracy (%)
Proficient         73.5
Learner            66.1
Total Accuracy     69.8
The confusion matrix illustrating detailed classification results is shown in figure FC5.4.
Figure FC5.4: Confusion matrix for expertise classification
5.3 Summary of expertise assessment
With the regression model, we observed an average rank correlation of 0.40 for
player expertise prediction. Human experts who rate players generally achieve a
rank correlation of 0.96, so there is a lot of scope for improvement. With the
classification model we observed an accuracy of 70%. If the feature set is extended with more
visual features, accuracy can be improved and more expertise levels can be added.
CHAPTER 6
CONCLUSION AND FUTURE WORK
In this thesis work, we have presented an HMM-based model for cricket shot classification.
We have also proposed a regression-based approach for assessing a player's performance
rating. With the IIIT-B dataset we observed 91.25% accuracy for shot classification and 0.40
average rank correlation for expertise prediction. We performed all our experiments
on four classes of shots. This can be extended to any number of classes, subject to the
availability of data. For the feature set, we considered features based on joint
locations (skeletal features). To classify a shot or to estimate the rating of a
player more reliably, more features may have to be considered. Going forward, more descriptors
can be added to improve the accuracy of the system. This work can be extended to add tracking
of the cricket ball and bat, timing detection etc. Assessing the quality of actions in sports
is an important problem with many real-world applications. It can provide feedback on
how the player can improve. Although the quality of an action is a subjective measure,
ratings from independent human experts show a much higher correlation than those from
computer-based evaluation systems. This shows that there is a lot of scope for a computer
vision system to improve and learn from data.
Bibliography
[1] Karl Pearson. Pearson correlation coefficient.
[2] Ivan Laptev. On space-time interest points. International Journal of Computer
Vision, 64(2-3):107–123, 2005.
[3] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for
action recognition. In 2007 IEEE 11th International Conference on Computer
Vision, pages 1–8, Oct 2007.
[4] S. F. Wong, T. K. Kim, and R. Cipolla. Learning motion categories using both se-
mantic and structural information. In 2007 IEEE Conference on Computer Vision
and Pattern Recognition, pages 1–6, June 2007.
[5] Jingen Liu, S. Ali, and M. Shah. Recognizing human actions using multiple fea-
tures. In 2008 IEEE Conference on Computer Vision and Pattern Recognition,
pages 1–8, June 2008.
[6] M. D. Rodriguez, J. Ahmed, and M. Shah. Action MACH: a spatio-temporal maximum
average correlation height filter for action recognition. In 2008 IEEE Conference
on Computer Vision and Pattern Recognition, pages 1–8, June 2008.
[7] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by
visual cortex. In 2005 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’05), volume 2, pages 994–1000 vol. 2, June 2005.
[8] M. Brand, N. Oliver, and A. Pentland. Coupled hidden markov models for com-
plex action recognition. In Proceedings of IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pages 994–999, Jun 1997.
[9] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views
using 3d exemplars. In 2007 IEEE 11th International Conference on Computer
Vision, pages 1–7, Oct 2007.
[10] F. Martinez-Contreras, C. Orrite-Urunuela, E. Herrero-Jaraba, H. Ragheb, and
S. A. Velastin. Recognizing human actions using silhouette-based HMM. In
2009 Sixth IEEE International Conference on Advanced Video and Signal Based
Surveillance, pages 43–48, Sept 2009.
[11] C. Sminchisescu, A. Kanaujia, Zhiguo Li, and D. Metaxas. Conditional models
for contextual human motion recognition. In Tenth IEEE International Conference
on Computer Vision (ICCV’05), volume 2, pages 1808–1815, Oct 2005.
[12] Jaeyong Sung, C. Ponce, B. Selman, and A. Saxena. Unstructured human activity
detection from rgbd images. In 2012 IEEE International Conference on Robotics
and Automation, pages 842–849, May 2012.
[13] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks
for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):221–
231, January 2013.
[14] Hawk-Eye Innovations. Hawk-eye in cricket.
[15] Rashish Tandon and Dr. Amitabha. Semantic analysis of a cricket broadcast video.
[16] Dipen S. Rughwani. Semantic query processing on broadcast cricket videos.
[17] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J.
Comput. Vision, 60(2):91–110, November 2004.
[18] Timor Kadir, Paola Hobson, and Michael Brady. From salient features to scene
description. 2005.
[19] Jianbo Shi and Carlo Tomasi. Good features to track. Technical report, Ithaca,
NY, USA, 1993.
[20] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in
human-object interaction activities. In 2010 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, pages 17–24, June 2010.
[21] C. M. Bishop. Pattern recognition and machine learning, 2006, vol. 4.
[22] T. Yokoyama, T. Iwasaki, and T. Watanabe. Motion vector based moving object
detection and tracking in the mpeg compressed domain. In 2009 Seventh Inter-
national Workshop on Content-Based Multimedia Indexing, pages 201–206, June
2009.
[23] H. Patel and D. Thakore. Moving object tracking using kalman filter.
[24] Berthold K.P. Horn and Brian G. Schunck. Determining optical flow. Technical
report, Cambridge, MA, USA, 1980.
[25] Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. Action MACH: a spatio-temporal
maximum average correlation height filter for action recognition. In
Proceedings of IEEE International Conference on Computer Vision and Pattern
Recognition, 2008.
[26] D. Karmaker, A. Z. M. E. Chowdhury, M. S. U. Miah, M. A. Imran, and M. H.
Rahman. Cricket shot classification using motion vector. In 2015 Second International
Conference on Computing Technology and Information Management
(ICCTIM), pages 125–129, April 2015.
[27] N. Harikrishna, S. Satheesh, S. D. Sriram, and K. S. Easwarakumar. Temporal
classification of events in cricket videos. In 2011 National Conference on Communications
(NCC), pages 1–5, Jan 2011.
[28] D. Karmaker, A. Z. M. E. Chowdhury, M. S. U. Miah, M. A. Imran, and M. H.
Rahman. Cricket shot classification using motion vector. In 2015 Second International
Conference on Computing Technology and Information Management (ICCTIM),
pages 125–129, 2015.
[29] J. Redmon and A. Farhadi. YOLO9000: Better, Faster, Stronger. ArXiv e-prints,
December 2016.
[30] João Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. Human
pose estimation with iterative error feedback. CoRR, abs/1507.06550, 2015.
[31] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld.
Learning realistic human actions from movies. In Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[32] Ronald Poppe. A survey on vision-based human action recognition. Image and
vision computing, 28(6):976–990, 2010.
[33] Youtube. Youtube cricket videos.
[34] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector ma-
chines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27,
2011.
[35] Kevin Murphy Matt Dunham. Probabilistic modeling toolkit for matlab/octave.
[36] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Sub-
hashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent con-
volutional networks for visual recognition and description. CoRR, abs/1411.4389,
2014.
[37] Andrew Zisserman. Bag of visual words model.
[38] Python Opensource. K-means clustering.
[39] University of Central Florida. Ucf101 - action recognition data set.
[40] Berkeley. Caffe-transfer learning example.
[41] BBC. Cricket stance.
[42] Dr Paul Hurrion. Balance ,stability - the key to power and consistency in batting.
[43] Harris Drucker, Christopher JC Burges, Linda Kaufman, et al. Support vector
regression machines.
APPENDIX A
INDIVIDUAL PLAYER EXPERTISE RATINGS
Table TA1.1: Players expertise ratings (Players from 1 to 30)
Player       Rating1    Rating2    Average Rating    Player       Rating1    Rating2    Average Rating
Player 1     5          5          5                 Player 16    5          5          5
Player 10    4          5          5                 Player 25    2          3          3
Player 11    2          4          3                 Player 26    2          5          4
Player 12    1          2          2                 Player 27    2          5          4
Player 13    4          4          4                 Player 28    1          1          1
Player 14    3          3          3                 Player 29    5          5          5
Player 15    2          4          3                 Player 30    5          5          5
Table TA1.2: Players expertise ratings (Players from 31 to 50)
Player       Rating1    Rating2    Average Rating    Player       Rating1    Rating2    Average Rating
Player 31    3          2          3                 Player 41    4          4          4
Player 32    4          4          4                 Player 42    1          1          1
Player 33    3          3          3                 Player 43    1          1          1
Player 34    3          2          3                 Player 44    3          3          3
Player 35    2          3          3                 Player 45    1          1          1
Player 36    4          3          4                 Player 46    2          3          3
Player 37    3          4          4                 Player 47    4          2          3
Player 38    4          3          4                 Player 48    3          4          4
Player 39    1          3          2                 Player 49    4          3          4
Player 40    1          3          2                 Player 50    3          4          4

