Sports Video Analysis
PREDICTION SYSTEM
by
This is to certify that the thesis titled RGB Camera based Cricket shot classifica-
tion and player expertise prediction system submitted to the International Institute of
Information Technology, Bangalore, for the award of the degree of Master of Technol-
ogy is a bona fide record of the research work done by Sampath Kumar Ramayanapu,
SMT2014019, under my supervision. The contents of this thesis, in full or in parts, have
not been submitted to any other Institute or University for the award of any degree or
diploma.
Bengaluru,
The 13th of June, 2017.
Abstract
“Video analysis has many applications in sports. Over the years, a lot of research
has been done on the analysis of sports videos. New technological innovations
have made vision-based research more interesting and efficient than ever before.
Coaches and athletes use the medium extensively to measure and correct technique,
and to analyze team and individual performances. We live in a world that uses
wearables, sensors and other tools to measure how an individual player is performing.
Adding video-based methodologies allows us to see exactly what is happening in real
time. Learning through visual methods has been shown to help individuals
perform better compared with other methods. The advantage of video-based analysis
is that it enables players, coaches and trainers to re-evaluate a performance
at any time by replaying the videos. This kind of video analysis is applicable to any
sport. In this thesis, we present our techniques and results on automatic analysis
of cricket videos, which facilitates shot classification and feedback about the way shots
are played. We use human joint locations as training features; shot classification is
modeled using a Hidden Markov Model (HMM) and expertise prediction is modeled using
support vector regression (SVR). We evaluate the proposed method on
cricket videos collected from YouTube and on manually captured data. Results show
a shot classification accuracy of 91% and an expertise prediction performance of 0.40
mean rank correlation.”
Acknowledgements
Contents
Abstract
Acknowledgements
List of Figures
1 Introduction
1.5 Applications
2.1.2 Classification
2.1.3 SVM
2.1.4 HMM
2.1.5 Metrics
2.3 Motivation
3 System Overview
5.2.1 Features
Bibliography
List of Figures
FC1.3 Visualization of shots-Cover drive, Straight drive, Square cut, Pull shot
FC3.6 Slope between Shoulder and Thigh, Distance between Elbow and Thigh
List of Tables
TC5.1 Training and Test data split for player expertise prediction
TC5.3 Training and Test data split for player expertise classification
List of Abbreviations
CHAPTER 1
INTRODUCTION
Video analysis is a commonly used tool in modern sport that can provide a
training boost for individuals and teams. Coaches and trainers analyze video from live
action and training exercises, and the results of their analyses provide helpful
feedback for the players. Thanks to video analysis, players can gain a competitive edge,
correct faults and maximize their strengths. Sports academies and coaching centers
provide training to many players across the world, but not all of them are
equipped with real-time video analysis tools and technologies that could make
learning more efficient. A few licensed video analysis tools are available, but a
royalty has to be paid to use them, and their high cost puts them out of reach of
individual players. Today, with the lower cost of cameras and the prevalence of
smartphones and tablets, video can be captured easily anytime, anywhere. If simple
video analysis methods that can be deployed on laptops and smartphones are
implemented, they would be of great benefit to many people.
The focus of this thesis is on the application of video analysis to cricket videos. There
are different kinds of shots played in cricket. A lot of learning and practice is needed to
play a shot perfectly. A player has to select the appropriate shot based on the line,
length and speed of a delivery. Even expert batsmen lose their wicket or score fewer
runs because of poor shot selection. Feedback about the quality of a shot depends on the
type of shot played, the timing of the shot and a few other parameters such as body
posture. So, as a part of video analysis, detecting the type of shot that was played is
very important. Once the shot is detected, appropriate visual feedback can be given
based on the type of shot.
In cricket around 12 different kinds of shots are available, which can be played on
either side of the wicket. The selection of shot depends on the line, length and speed
of the ball and on other match conditions. The pitch map is shown in figure FC1.1. There
are three different lines of delivery of the ball.
• Middle stump
• Bouncer
• Short length
• Good length
• Full length
• Full toss
A batsman needs proper footwork to get into the best position to play a shot. The
different types of shots a batsman can play are named below and shown in figure FC1.2.
• Hook shot
• Pull shot
• Square cut
• Back defense
• Off drive
• Straight drive
• Cover drive
• Sweep shot
Drives are shots generally played off the front foot in front of the wicket (on
drive, straight drive, cover drive); cuts are back-foot shots (square cut, late cut) in
which the batsman cuts the ball; and the pull is a shot in which a short ball is pulled
to the leg side.
Figure FC1.3: Visualization of shots-Cover drive, Straight drive, Square cut, Pull shot
A player selects a shot based on factors such as the line and length of the ball. The
quality of the shot depends on many parameters, such as the timing of the shot,
appropriate body posture and the swing of the bat. In this thesis we have not considered
all of these parameters in determining the quality of the shots; we limit the scope of
feedback to predicting the player's expertise level based on the different shots
played.
A simple overview of the proposed system is shown in figure FC1.4. Each frame is
extracted from the videos and processed. The overall system functionality is explained
in the steps below.
1. Frames are extracted from videos and player(location as bounding box) is detected
in each frame
2. Pose estimation is performed based on the identified player location for each
frame and joint locations are captured. Using these joint location values train-
ing features are computed.
5. Suggestions/feedback about the quality of the shot will be generated based on the type
of shot.
• Chapter 2 provides a review of related work and the motivation, and briefly discusses
human pose estimation systems and action recognition methods.
• Chapter 4 provides details about the datasets used in this work, the experiments
performed and the summary of the results.
• Chapter 5 provides details about the proposed methods for shot quality assess-
ment.
• Chapter 6 provides summary of the entire thesis with concluding remarks and
directions for future work.
1.5 Applications
Video analysis has many applications in sports, and it can also be extended to
scientific domains: video analysis software can be used for biomechanics research
and in injury rehabilitation. A few use cases are listed here.
1. Modeling from the best: Analyzing video of the best player at your position or in
your sport will showcase the habits that player uses on a regular basis to
succeed. Once you have identified some techniques of the best players, you can
work them into your own game to improve it.
2. Injury prevention and recovery: Using video analysis you can study different
techniques and identify areas that must be changed to avoid injuring yourself in the
future.
3. Technique analysis: Video analysis is very useful for identifying and correcting
problems with a playing technique. Using video analysis we can measure many
parameters such as body posture, the angle at which the ball is thrown, the trajectory
of the ball and the swing of the bat. All these measures can help a player to enhance
his technique.
4. Enhanced game plan: Video analysis can also be used to prepare for upcoming
matches. Watching videos and analyzing the techniques of upcoming opponents
teaches you their strengths and weaknesses, and enables you to formulate a game
plan to deal with them in a better way.
CHAPTER 2
In this chapter, we present a brief background on RGB-based human pose estimation
methods and on action recognition using Hidden Markov Models (HMM) and Support Vector
Machines (SVM). We then discuss how these action recognition methods can be
extended to shot classification in cricket.
In this section we present an overview of machine learning concepts such as the
classification problem, supervised learning models such as SVM, statistical Markov
models such as HMM, and human pose estimation methods.
The goal of this learning is to optimize the cost function so that, given new input
data (X’), we can predict the output variables (Y’) with better accuracy. Supervised
learning problems can be further grouped into regression and classification problems.
In unsupervised learning we have only input data (X) and no corresponding desired
output variables. The goal of unsupervised learning algorithms is to model the
underlying structure or distribution of the data in order to learn more about it.
Unsupervised learning problems can be further grouped into clustering and association
problems.
2.1.2 Classification
\[ \text{loss} = \sum_{i} \bigl( y_i - \operatorname{sign}(f_{w,b}(x_i)) \bigr) \]
We need to minimize this loss function so that the classification accuracy increases.
Binary classification can be extended to more than two classes, which is known as
multi-class classification. It is illustrated in figure FC2.2b.
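As a rough illustration of the loss described above, the following Python sketch counts how many training examples a decision function classifies incorrectly. The linear form f_{w,b}(x) = w·x + b, the helper names and the toy data are assumptions made for illustration only; they are not taken from the thesis experiments.

```python
def sign(value):
    """Return +1 for non-negative scores and -1 otherwise."""
    return 1 if value >= 0 else -1


def linear_score(w, b, x):
    """Decision function f_{w,b}(x) = w . x + b (assumed linear form)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def misclassification_count(w, b, xs, ys):
    """Count examples whose predicted sign differs from the label in {-1, +1}."""
    return sum(1 for x, y in zip(xs, ys) if sign(linear_score(w, b, x)) != y)


# Toy data: three 2-D points with labels in {-1, +1}.
xs = [[2.0, 1.0], [1.0, 3.0], [0.0, 0.5]]
ys = [1, -1, 1]
errors = misclassification_count([1.0, -1.0], 0.0, xs, ys)
print("misclassified examples:", errors)
```

Lowering this count on the training data is what "minimizing the loss" amounts to for a binary classifier with labels in {-1, +1}.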
2.1.3 SVM
In general there exist multiple hyperplanes which can separate the given data, as shown
in figure FC2.3a. We need to define a criterion for choosing the hyperplane that gives
the best solution.
SVM algorithm is based on finding the hyperplane that gives the largest minimum
distance to the training examples. This distance is called margin. Therefore, the optimal
separating hyperplane maximizes the margin of training data. It is illustrated in FC2.3b.
\[ \vec{w} \cdot \vec{x} + b = 0 \]
The training examples closest to the hyperplane are called support vectors. The
labeled training pairs are represented as $\{(\vec{x}_i, y_i),\ i = 1, 2, \ldots, N\}$.
The support vectors lie on two hyperplanes parallel to the optimal hyperplane, with
equations
\[ \vec{w} \cdot \vec{x} + b = -1 \]
\[ \vec{w} \cdot \vec{x} + b = +1 \]
Maximizing the margin subject to the equations of the two support vector hyperplanes
leads to the following constrained optimization problem.
\[ \min_{\vec{w},\, b} \ \frac{1}{2} \lVert \vec{w} \rVert^{2} \quad \text{where } y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1, \quad i = 1, \ldots, N. \]
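To make the link between the objective and the margin concrete, the sketch below evaluates the constraint and the objective for a candidate hyperplane. It relies on the standard fact that the distance between the two support vector hyperplanes is 2/||w||, so a smaller ||w|| means a wider margin. The function names and the toy points are illustrative assumptions, and the code only checks a given (w, b); it does not solve the optimization problem.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def margin_width(w):
    """Distance between the hyperplanes w.x + b = -1 and w.x + b = +1."""
    return 2.0 / math.sqrt(dot(w, w))


def objective(w):
    """Value of (1/2) ||w||^2, the quantity being minimized."""
    return 0.5 * dot(w, w)


def satisfies_constraints(w, b, xs, ys):
    """Check y_i (w . x_i + b) >= 1 for every training pair."""
    return all(y * (dot(w, x) + b) >= 1.0 for x, y in zip(xs, ys))


# Toy, linearly separable data with labels in {-1, +1}.
xs = [[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0]]
ys = [1, -1, 1]
w, b = [1.0, 0.0], 0.0
feasible = satisfies_constraints(w, b, xs, ys)
width = margin_width(w)
print(feasible, width, objective(w))
```

Shrinking w further (for example to [0.5, 0.0]) would widen the margin but violate the constraint for the points at x = ±1, which is why those points act as support vectors in this toy example.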
2.1.4 HMM
\[ a_{ij} = P[\, q_t = S_i \mid q_{t-1} = S_j \,], \quad 1 \le i, j \le N \]
In the case of Hidden Markov Models, observations are probabilistic functions of the
state, and the underlying stochastic process is not observable, as shown in figure
FC2.4. An HMM has the following elements:
The parameter set of an HMM is given by λ = (A, B, π). Based on the above
specification, the following problems can be solved:
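One of the standard HMM problems is evaluating the likelihood P(O | λ) of an observation sequence, which is what the shot classifier later relies on. The sketch below is a minimal forward algorithm for a discrete-observation HMM. It is an illustration under assumptions: the thesis experiments use a toolkit and Gaussian mixture observations, whereas this example uses a small discrete emission table, made-up parameters, and the index convention of the transition equation above (A[i][j] is the probability of moving to state i from state j).

```python
def forward_likelihood(A, B, pi, observations):
    """Return P(O | lambda) for a discrete HMM using the forward algorithm.

    A[i][j]: probability of moving to state i from state j.
    B[i][k]: probability of emitting symbol k while in state i.
    pi[i]:   probability of starting in state i.
    """
    n_states = len(pi)
    # Initialisation: start in state i and emit the first symbol.
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    # Induction: propagate through the transitions, then emit the next symbol.
    for symbol in observations[1:]:
        alpha = [
            B[i][symbol] * sum(A[i][j] * alpha[j] for j in range(n_states))
            for i in range(n_states)
        ]
    # Termination: sum over the final states.
    return sum(alpha)


# Illustrative two-state model with two observation symbols (0 and 1).
A = [[0.7, 0.4],
     [0.3, 0.6]]   # each column sums to 1 under this convention
B = [[0.9, 0.1],
     [0.2, 0.8]]   # each row sums to 1
pi = [0.5, 0.5]
likelihood = forward_likelihood(A, B, pi, [0, 1])
print(likelihood)
```

For long sequences the products underflow, so practical implementations usually work with scaled or logarithmic values; that refinement is omitted here to keep the recursion visible.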
2.1.5 Metrics
The performance of a classifier can be measured using the confusion matrix; see
figure FC2.5 for an example of the confusion matrix for a two-class problem. This
matrix summarizes how examples from each class are assigned to the predicted
classes, using results from all experiments in the cross-validation process. Based on the
confusion matrix, the following performance measures can be computed:
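The sketch below shows how commonly used measures are derived from a two-class confusion matrix. Which measures the thesis lists is not visible in this text, so accuracy, precision and recall are assumed here as typical choices, and the counts are made-up numbers.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive common measures from the four cells of a 2x2 confusion matrix.

    tp/fn: positive examples predicted as positive/negative.
    fp/tn: negative examples predicted as positive/negative.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative counts only.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```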
For regression tasks we used Pearson’s correlation coefficient [1], which measures the
relation between predicted and actual values. The higher the correlation value, the
better the accuracy.
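For reference, Pearson's correlation coefficient between predicted and actual values can be computed as in the following sketch. The function name and the sample values are illustrative assumptions.

```python
import math


def pearson_correlation(predicted, actual):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    covariance = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    spread_p = sum((p - mean_p) ** 2 for p in predicted)
    spread_a = sum((a - mean_a) ** 2 for a in actual)
    return covariance / math.sqrt(spread_p * spread_a)


# Illustrative values: predictions that track the actual scores closely.
r = pearson_correlation([1.1, 2.0, 2.9, 4.2], [1.0, 2.0, 3.0, 4.0])
print(round(r, 3))
```

The value lies between -1 and +1; the function is undefined (division by zero) when either sequence is constant.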
In this chapter we review relevant related work in the field of video analysis and
Cricket shot classification.
The ability to recognize human activities and actions will enhance the capabilities of
a robot that interacts with humans. However, automatic detection of human activities
can be challenging due to the individual nature of the activities. A lot of research
has been done on action and activity recognition with different features and learning
methods. We discuss a few methods here.
Human action and activity recognition has been previously studied by a number
of different authors. One common approach is to use space time features to model
points of interest in video [2]. Several authors have supplemented these techniques by
adding more information to these features [3–5]. Other, less common approaches for
activity recognition include filtering techniques [6]. Hierarchical techniques for activity
recognition have been used as well, but these typically focus on neurologically inspired
visual cortex-type models [7]. Often these authors adhere faithfully to the models of
the visual cortex, using motion-direction sensitive cells such as Gabor filters in the first
layer [3]. Another class of techniques used for activity recognition is that of the hidden
Markov model (HMM). Early work by Brand, Oliver, and Pentland (1997) [8] utilized
coupled HMMs to recognize two-handed activities. Weinland et al. [9] utilize an HMM
together with a three-dimensional occupancy grid to model three dimensional humans.
Martinez-Contreras et al. (2009) [10] utilize motion templates together with HMMs to
recognize human activities. Sminchisescu et al. (2005) [11] utilized conditional
random fields and maximum-entropy Markov models for activity recognition, arguing that
these models overcome some of the limitations presented by hidden Markov models.
In the work presented by Jaeyong Sung and Colin Ponce [12], an RGBD sensor
(Microsoft Kinect) was used as the input sensor to capture 3D human pose, and a
hierarchical Maximum Entropy Markov Model (MEMM) was used for modeling the activities.
In recent times the use of deep learning and neural networks has also been increasing.
The work presented by Shuiwang Ji and Wei Xu [13] discusses the use of a 3D
Convolutional Neural Network (CNN) model for action recognition.
A lot of research is going on in cricket analysis. Hawk-Eye [14] is an advanced
coaching system for cricket. Rahish Tandon and Dr. Amitabha have proposed
semantic analysis of broadcast video [15], which involves the use of auxiliary cues to
detect events. Another approach performs shot boundary detection and shot
classification based on multi-scale spatio-temporal analysis of color and optical flow
features [16]. The research of David Lowe and Timor [17] on the SIFT feature
descriptor [18, 19] helps to identify the direction of objects; integrated with optical
flow, it can help to detect the direction of movement of the bat, ball and body parts.
Bangpeng Yao and Li Fei-Fei have studied the modeling of mutual context of object and
human pose in human-object interaction activities [20]. A machine also has to recognize
the human pose through an effective learning approach [21]. Ashwani Aggarwal and Susmit
Biswas have proposed a technique for object detection and motion estimation from MPEG
video using background subtraction [22]. Different filter-based tracking approaches
such as Kalman [23] and KLT [24] have been proposed with several modifications. These
have played an important role in the fields of object detection, tracking and action
recognition. Mubarak Shah, Javed Ahmed and Mikel D. Rodriguez have proposed an action
MACH filter [25] for action recognition, which can distinguish between different sports
activities. However, this approach requires a large learning set to build the template
class of the action corresponding to each specific sports activity. Debajyoti and AZM
Chowdhury proposed a method to classify shots based on optical flow vectors computed
from each video frame [26]. An average accuracy of 60% is reported using this approach.
In the work presented by Harikrishna and Sanjeev [27], a neural network approach is
used to classify cricket video events such as boundary, four and out.
2.3 Motivation
Considering the shot classification approach proposed by Debajyoti and AZM Chowdhury
[28], which is based on motion vectors, and action recognition methods based
on HMM, we proposed a method to classify different cricket shots. The features we
used are based on player joint locations (also known as skeleton features), which can
capture complete player body movements. Along with joint locations, we used Space Time
Interest Points (STIP) to represent spatial and temporal events, which improved the
performance of the classifiers. Using these feature vectors, the classification task is
modeled using a Hidden Markov Model (HMM) as shown in figure FC2.8, where $a_{ij}$
represents the transition probabilities and $b_{ik}$ represents the observation
probabilities. A separate HMM is trained for each kind of cricket shot. Inference is
done based on the maximum likelihood of the test data.
CHAPTER 3
SYSTEM OVERVIEW
The overall functionality for Cricket shot classification system is divided into five
major parts.
1. Frames are extracted from videos and player(location as bounding box) is detected
in each frame
2. Pose estimation is performed based on the identified player location for each
frame and joint locations are captured.
4. Using these training features an HMM is trained for each type of shot.
These steps are illustrated in figure FC3.1 for HMM training and in figure FC3.2 for
testing. Each step is discussed in detail below.
For detecting players in video frames we used the YOLO object detector [29],
a state-of-the-art, real-time object detection system that can detect over 9000 object
categories. Each video frame is given to the object detector interface, which returns
the location and type of the objects it has detected. Since we are interested only in
person detection, we filtered the results to keep only persons in frames. Multiple
players might be present in a single frame. Since we are interested in extracting the
location of the batsman, we set filtering conditions on the results to consider
only locations of the striking batsman. The object detector gives each location as
bounding box coordinates.
The bounding box is $B = [X_t, Y_l, X_w, Y_h]$, where $(X_t, Y_l)$ gives the starting
location of the bounding box, $X_w$ gives the width of the bounding box and $Y_h$ gives
its height. The center of the bounding box is computed as $X_c = X_t + \frac{X_w}{2}$
and $Y_c = Y_l + \frac{Y_h}{2}$. The block diagram for player detection is illustrated
in FC3.3. We set 0.25 as the threshold value for detecting objects. Any value lower
than this leads to many duplicate and overlapping object detections, so 0.25 is taken
as the optimal threshold value.
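The filtering step can be sketched as follows. The detection record format (a dict with 'label', 'confidence' and 'box' keys) is a hypothetical stand-in for whatever the detector interface returns, and the centre is taken as the starting corner plus half the width and height; the additional conditions used to single out the striking batsman are not specified in the text and are therefore not modeled.

```python
def person_centres(detections, threshold=0.25):
    """Return centres of 'person' boxes whose confidence meets the threshold.

    Each detection is a hypothetical dict with keys 'label', 'confidence'
    and 'box', where box = (x, y, width, height) and (x, y) is the corner
    at which the box starts.
    """
    centres = []
    for det in detections:
        if det["label"] != "person" or det["confidence"] < threshold:
            continue  # ignore non-person objects and weak detections
        x, y, width, height = det["box"]
        centres.append((x + width / 2, y + height / 2))
    return centres


# Illustrative detections for one frame.
frame_detections = [
    {"label": "person", "confidence": 0.90, "box": (100, 100, 100, 200)},
    {"label": "person", "confidence": 0.10, "box": (0, 0, 10, 10)},
    {"label": "sports ball", "confidence": 0.80, "box": (5, 5, 4, 4)},
]
centres = person_centres(frame_detections)
print(centres)
```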
For detecting the joint locations of players in video frames we used pose estimation
with the Iterative Error Feedback approach [30], which is a CNN-based hierarchical
feature extractor model with top-down feedback. This model was trained on the MPII
Human Pose dataset and the Leeds Sports Pose (LSP) dataset.
The input video frame, along with the player location detected in the previous step,
is passed as input to the pose estimation system. It outputs a set of 17 joint locations
as (X,Y) positions in the input image, as illustrated in figure FC3.4. The location of
the 17th joint is the same as the center of the bounding box given as input to the pose
estimation framework. In the case of occlusions and hidden joints, the locations are
approximated by the feedback and correction methods of the pose estimation framework.
The joint locations extracted using the pose estimation framework are the basis for
generating feature vectors for the shot detection system. Joint locations are image
coordinates and are specific to the image; they can change based on the physical
dynamics of the person. We need to convert these values to relative features that
represent a person-independent feature set. To achieve this we derived a few features
such as joint angles, slopes between different joints and distances between a few
joints. These are explained below.
Since each shot is played in a different way, these angle values across shots will vary
based on the player action. So these features can be used to differentiate each shot.
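A joint angle of the kind described above can be computed from three joint locations, as in the sketch below. The choice of joints (shoulder, elbow, wrist) and the coordinate values are illustrative assumptions; the exact set of angles used in the thesis is not listed in this text.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the segments b->a and b->c.

    Each argument is an (x, y) image coordinate. The result lies in
    [0, 180] and does not change if the pose is translated or scaled.
    """
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.degrees(math.atan2(abs(cross), dot))


# Hypothetical joint locations (pixels) for one frame.
shoulder, elbow, wrist = (120.0, 80.0), (150.0, 120.0), (190.0, 110.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)
print(round(elbow_angle, 1))
```

Because the angle depends only on the directions of the two segments, it is one way to obtain a feature that is less sensitive to the size and position of the player in the image.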
The following additional features are computed based on the joint locations.
Figure FC3.6: Slope between Shoulder and Thigh, Distance between Elbow and Thigh
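The two features named in the caption of figure FC3.6 can be sketched as follows. The optional `scale` argument (for example the bounding box height) is an assumption added here to illustrate how a distance could be made less dependent on the size of the player in the image; the text does not state whether such normalization was applied.

```python
import math


def slope(p, q):
    """Slope of the line through image points p and q (inf if vertical)."""
    dx = q[0] - p[0]
    dy = q[1] - p[1]
    if dx == 0:
        return math.inf
    return dy / dx


def distance(p, q, scale=1.0):
    """Euclidean distance between p and q, divided by an optional scale."""
    return math.hypot(q[0] - p[0], q[1] - p[1]) / scale


# Hypothetical joint locations (pixels) and bounding box height.
shoulder, elbow, thigh = (120.0, 80.0), (150.0, 120.0), (130.0, 200.0)
box_height = 240.0
features = [slope(shoulder, thigh), distance(elbow, thigh, scale=box_height)]
print(features)
```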
We used HMMs for modeling and testing different types of cricket shots. HMMs are
widely used in scenarios where time series data with spatio-temporal variations
needs to be learnt. A simple illustration is shown in figure FC2.8. We generated an HMM
for each type of shot, represented as $\lambda_i = (A, B, \pi)$. So $l$ different HMMs
are learnt, one per shot type: $\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_l\}$.
Once the models are learnt, we classify a shot by choosing the model with the maximum
log-likelihood for the observation sequence $O$.
\[ \hat{\lambda} = \arg\max_{\lambda_a \in \text{AllShots}} P(O \mid \lambda_a) \]
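The decision rule above can be sketched independently of the HMM toolkit. In the example below, `models` maps each shot label to a function returning the log-likelihood of an observation sequence under that shot's trained HMM; these scoring functions are hypothetical placeholders (constant values here), not the thesis models.

```python
def classify_shot(observation_sequence, models):
    """Return the label whose model gives the highest log-likelihood.

    `models` maps a shot label to a callable that returns
    log P(O | lambda) for that shot's trained HMM.
    """
    scores = {
        label: log_likelihood(observation_sequence)
        for label, log_likelihood in models.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Placeholder scoring functions standing in for trained per-shot HMMs.
models = {
    "cover_drive": lambda seq: -12.5,
    "pull_shot": lambda seq: -9.0,
    "square_cut": lambda seq: -15.2,
}
label, scores = classify_shot([[0.1, 0.4], [0.2, 0.3]], models)
print(label)
```

Comparing log-likelihoods rather than raw probabilities gives the same ranking while avoiding numerical underflow on long sequences.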
Along with joint location features, we also considered Space Time Interest Point
(STIP) features. Local space-time features [2, 31] are popular features for
action recognition [32]. STIP features capture salient visual patterns in a space-time
image volume by extending local spatial image descriptors to the space-time
domain. Obtaining local space-time features has two steps: spatio-temporal interest
point (STIP) detection followed by feature extraction. A video is given as input to the
STIP engine, which outputs a set of interest points that capture the spatio-temporal
variations in the video, as represented in figure FC3.11.
Each interest point is described by a combination of HOG and HOF features. These STIP
features are used along with the joint location features and fed to the HMM framework
for learning.
CHAPTER 4
We created two datasets for training, validation and testing purposes. The first
dataset is based on cricket videos downloaded from YouTube. The second dataset is our
own dataset, which we created by recording multiple shots played by different people.
We downloaded a few videos from YouTube [33] and extracted a few shots from them.
As a part of this dataset we considered only two types of shots, cover drive and pull
shot. Each video has only one shot sequence. Details about the number of videos are
presented in table TC4.1.
Cover Drive 50 50
Pull Shot 50 50
We created another dataset by recording videos of different people playing
cricket. It consists of four different kinds of shots, namely cover drive, straight
drive, square cut and pull shot. Details about the number of videos are listed in table
TC4.2.
This dataset is a mix of videos with a single shot sequence and videos with multiple
shot sequences, i.e. the same shot is played multiple times in one video. So if
individual shot sequences are counted, the count is higher than the number of video
files. Screenshots from this dataset are shown in figure FC4.1.
For our classification tasks we experimented with SVM classifiers and HMM-based
classifiers, as well as with deep learning based models.
We used the standard Liblinear [34] library for SVM and the Probabilistic Modeling
Toolkit (PMTK) [35] for HMM. For deep learning based experiments we used the Long-term
Recurrent Convolutional Networks based approach proposed by Lisa Anne [36].
In order to classify using SVM, we used the bag of visual words model [37]. Training an
SVM classifier is performed in the steps below.
1. Separate your data sets into the data sets for each class.
3. The features extracted are then clustered using k-means [38] with N clusters in
order to form a visual codebook with N words. In our experiments we used 20
clusters.
4. A Bag of words [37] is then constructed for each example (video sequence) based
on the occurrences of the codewords in the given example.
5. The Bag of words features are then fed to Multi-Class non-Linear SVM [34] to
learn classifier.
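The bag of words construction in the steps above can be sketched as follows, assuming the codebook (cluster centres) has already been produced by k-means. The tiny two-word codebook, the 2-D features and the normalization of the histogram to sum to one are illustrative assumptions; the experiments use 20 clusters and higher-dimensional descriptors.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


def nearest_codeword(feature, codebook):
    """Index of the codebook entry closest to the feature vector."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(feature, codebook[k]))


def bag_of_words(features, codebook):
    """Normalized histogram of codeword occurrences for one video sequence."""
    counts = [0] * len(codebook)
    for feature in features:
        counts[nearest_codeword(feature, codebook)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [count / total for count in counts]


# Illustrative codebook with two visual words and three local features.
codebook = [[0.0, 0.0], [10.0, 10.0]]
video_features = [[1.0, 1.0], [9.0, 9.0], [0.0, 2.0]]
histogram = bag_of_words(video_features, codebook)
print(histogram)
```

Each video sequence is thus reduced to a fixed-length vector, whatever its number of frames, which is what allows a standard SVM to be trained on sequences of different lengths.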
We used standard PMTK library [35] for classification experiments using HMM.
Classification task is performed as per below steps
1. Separate data sets into the data sets for each class.
4. On the test data set compare the likelihood of each model to classify the sequence.
Table TC4.3: Training and Test data split with YouTube dataset
Shot type      Training   Test
Cover Drive    35         5
Pull Shot      35         5
Total          70         10
From these videos features are extracted as described in chapter 3. These features
are used to train and validate SVM and HMM based classifiers. Accuracy comparison
values are shown in table TC4.4.
In the case of SVM with k-means clustering, we used twenty clusters, with which we
obtained an average accuracy of 80 percent. In the case of HMM, we used five hidden
states and two Gaussian mixtures as HMM parameters, with which we obtained around 78
percent accuracy. Confusion matrices are plotted to compare the performance of the two
classifier models. The overall performance of SVM (figure FC4.2) is slightly better
than that of HMM (figure FC4.3).
The IIIT-B cricket dataset contains videos of four different shots, namely cover
drive, pull shot, square cut and straight drive. Table TC4.5 shows the number of
training and test videos.
Table TC4.5: Training and Test data split with IIIT-B dataset
As with the YouTube dataset, the same experiments were conducted with this dataset.
Classification was performed using SVM and HMM with the same configuration parameters.
Comparison results for the two classifiers are given in table TC4.6.
Taking the LRCN-based activity recognition model proposed in [36] as the base model, we
tried some transfer learning experiments with our dataset. The base model was trained
on the UCF101 dataset [39], an action recognition dataset of realistic action
videos with 101 action categories. This pre-trained model can recognize the action type
as cricket batting or cricket bowling, but it cannot recognize which type of shot is
played. To achieve this we applied transfer learning [40] with our dataset and
generated a new model. With this data, we observed 60 percent classification accuracy.
80 10 60
CHAPTER 5
As a part of this thesis we limit the scope of the feedback and quality assessment
task to player expertise prediction. We proposed a system for assessing a player's
expertise rating based on how well they played different cricket shots. In our approach,
we trained a regression model from skeletal features (joint locations) and
spatio-temporal (STIP) features to the expertise scores assigned to each player. Ratings
are given on a scale of 1 to 5, where 1 is poor and 5 is best. These ground truth scores
were obtained from expert cricket players. For these experiments we used the IIIT-B
cricket dataset, which consists of four different shots played by 50 players.
The stance [41] is the "ready" position when the batsman is about to face a delivery.
To strike a cricket ball, a balanced and stable base must first be created. An ideal
stance is "comfortable, relaxed and balanced," with the feet 40 cm apart, parallel and
astride the crease. The front shoulder should be pointing down the wicket, the head
facing the bowler, the weight equally balanced and the bat near the back toe. As the
ball is about to be released, the batsman lifts his bat up behind him in anticipation
of playing a stroke, and shifts his weight onto the balls of his feet. At this point
the player is ready to move swiftly into position to address the ball. Figure FC5.1
below shows few principles
In the case study presented by Dr Paul Hurrion [42] about optimal rotational
moments of body segments to achieve better shot accuracy, all body segments should
accelerate and decelerate in the correct sequence with specific timing before impact.
The peaking sequence of the major segments for best energy transfer is: pelvis (stable
front ankle, knee and pelvis), thorax (chest / shoulders), both arms and wrists, and
finally the bat handle. The motion should occur sequentially, with each peak speed being
higher and later (closer to impact) than the previous one. This pattern is necessary to
efficiently transfer energy and accelerate each body segment. Figure FC5.2 illustrates
the use of the body segment peak sequence to verify shot accuracy.
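The peaking-sequence criterion described above can be expressed as a simple check, sketched below. The per-segment speed series (one value per frame, arbitrary units), the segment names used as dictionary keys and the function names are hypothetical; this is an illustration of the rule, not a procedure taken from the case study or from the thesis experiments.

```python
def peak(speeds):
    """Return (frame_index, peak_speed) for one segment's speed series."""
    index = max(range(len(speeds)), key=lambda i: speeds[i])
    return index, speeds[index]


def follows_peaking_sequence(segment_speeds, order):
    """True if each segment peaks later and higher than the one before it."""
    peaks = [peak(segment_speeds[name]) for name in order]
    return all(
        later[0] > earlier[0] and later[1] > earlier[1]
        for earlier, later in zip(peaks, peaks[1:])
    )


# Expected order of peaks, from the body core out to the bat.
ORDER = ["pelvis", "thorax", "arms", "bat"]

# Hypothetical speed series, one value per frame.
good_shot = {
    "pelvis": [1.0, 3.0, 2.0, 1.0, 1.0],
    "thorax": [1.0, 2.0, 4.0, 2.0, 1.0],
    "arms":   [0.0, 1.0, 3.0, 5.0, 2.0],
    "bat":    [0.0, 1.0, 2.0, 4.0, 7.0],
}
print(follows_peaking_sequence(good_shot, ORDER))
```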
In this section we present our system for assessing the quality or expertise of a
player based on multiple shots played by him. Our system learns a regression model
from spatio-temporal features and skeletal features and the corresponding ground truth
scores given by expert players. The set of labeled videos can be depicted as a set of
tuples $\{(\vec{X}_i, Y_i)\}_{i=1}^{N}$, where N is the total number of labeled video
instances and $Y_i \in \mathbb{R}$ denotes the ground truth quality score of the player
in video i. With these features we modeled
an expertise assessment system using support vector regression to predict an expertise
score. Along with that we modeled a binary classifier which can categorize the players
into two classes, namely Proficient and Learners.
5.2.1 Features
Human experts are typically trained over many years to develop complex rules for
evaluating sports actions. For machines to achieve this task they must be
provided with similar rules. Accessing the same visual features that human
experts perceive is complex, so for our experiments we considered the skeleton-based
features below.
Below table TC5.1 shows the dataset split for expertise prediction using regression.
Since expert ratings for the players are not available at this point of time, we considered
ratings given by players having good knowledge of cricket. We collected ratings from
multiple people and average of these ratings is considered as ground truth expertise
rating. Individual player expertise ratings are shown in Appendix A.
Table TC5.1: Training and Test data split for player expertise prediction
With all the features mentioned in section 5.2.1, we trained a regression model using
linear Support Vector Regression (SVR). Prediction accuracy is measured using test
dataset. We observed 0.40 mean rank correlation (Higher is better).
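Support vector regression is built around the epsilon-insensitive loss, which ignores prediction errors smaller than a tolerance ε and penalizes larger errors linearly. The sketch below illustrates that loss on made-up ratings; the value ε = 0.1 and the sample numbers are assumptions for illustration, not settings reported in the thesis.

```python
def epsilon_insensitive_loss(predicted, actual, epsilon=0.1):
    """Per-example SVR loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(p - a) - epsilon) for p, a in zip(predicted, actual)]


# Hypothetical predicted and ground truth expertise ratings (scale 1 to 5).
predicted_ratings = [3.0, 4.5, 2.0]
expert_ratings = [3.05, 4.0, 3.0]
losses = epsilon_insensitive_loss(predicted_ratings, expert_ratings)
print(losses)
```

In this example the first prediction falls inside the tolerance and contributes no loss, while the other two are penalized by the amount they exceed it.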
We trained regression models with both linear and Radial Basis Function (RBF)
kernels. Results are plotted in figure FC5.3. As shown in table TC5.2, the regression
model with the RBF kernel gave better accuracy, with 0.40 mean rank correlation compared
to 0.07 with the linear kernel.
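A rank correlation between predicted and ground truth ratings can be computed as sketched below. The text does not say which rank correlation was used, so Spearman's coefficient (the Pearson correlation of the ranks, with tied values sharing their average rank) is assumed here, and the sample ratings are made up.

```python
import math


def average_ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman_rank_correlation(predicted, actual):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rp, ra = average_ranks(predicted), average_ranks(actual)
    n = len(rp)
    mean_p, mean_a = sum(rp) / n, sum(ra) / n
    covariance = sum((p - mean_p) * (a - mean_a) for p, a in zip(rp, ra))
    spread = math.sqrt(sum((p - mean_p) ** 2 for p in rp)
                       * sum((a - mean_a) ** 2 for a in ra))
    return covariance / spread


# Hypothetical predicted scores and expert ratings for five players.
rho = spearman_rank_correlation([2.1, 3.4, 3.0, 4.8, 1.5], [2, 3, 4, 5, 1])
print(round(rho, 2))
```

Because only the ordering matters, a model can score well on this measure even if its predicted values are offset or scaled relative to the expert ratings.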
We modeled a binary classifier which can categorize the players into two expertise
levels namely Proficient and Learners. For this classification task, we modified the
dataset used for regression in section 5.2.2 as mentioned below.
Below table TC5.3 shows the dataset split after above changes.
Table TC5.3: Training and Test data split for player expertise classification
Class        Training   Test
Proficient   415        87
Learner      285        71
With these features we trained a classifier model using support vector classifier. We
observed total classification accuracy of 69.8%. Below table TC5.4 shows the classifi-
cation results.
Class        Accuracy (%)
Proficient   73.5
Learner      66.1
With the regression model, we observed an average rank correlation of 0.40 for
player expertise prediction. Human experts who rate players generally achieve a
rank correlation of 0.96, so there is a lot of scope for improvement. With the
classification model we observed an accuracy of 70%. If the feature set is extended
with more visual features, accuracy can be improved and more expertise levels can be
added.
CHAPTER 6
In this thesis, we have presented an HMM-based model for cricket shot classification
and proposed a regression-based approach for assessing a player's performance
rating. On the IIIT-B dataset, we observed 92% accuracy for shot classification and an
average rank correlation of 0.40 for expertise prediction. We performed all our
experiments on four classes of shots; the approach can be extended to any number of
classes, subject to the availability of data. For the feature set, we considered only
features based on joint locations (skeletal features). Classifying a shot or estimating
a player's rating more accurately may require additional features, so more descriptors
can be added in future work to improve the accuracy of the system. This work can also
be extended with tracking of the cricket ball and bat, timing detection, and so on.
Assessing the quality of actions in sports is an important problem with many real-world
applications, since it can provide feedback on how a player can improve. Although the
quality of an action is a subjective measure, independent human experts achieve a much
higher rank correlation than computer-based evaluation systems. This shows that there
is a lot of scope for computer vision systems to improve and learn from data.
Bibliography
[3] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for
action recognition. In 2007 IEEE 11th International Conference on Computer
Vision, pages 1–8, Oct 2007.
[4] S. F. Wong, T. K. Kim, and R. Cipolla. Learning motion categories using both se-
mantic and structural information. In 2007 IEEE Conference on Computer Vision
and Pattern Recognition, pages 1–6, June 2007.
[5] Jingen Liu, S. Ali, and M. Shah. Recognizing human actions using multiple fea-
tures. In 2008 IEEE Conference on Computer Vision and Pattern Recognition,
pages 1–8, June 2008.
[7] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by
visual cortex. In 2005 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’05), volume 2, pages 994–1000 vol. 2, June 2005.
44
[8] M. Brand, N. Oliver, and A. Pentland. Coupled hidden markov models for com-
plex action recognition. In Proceedings of IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pages 994–999, Jun 1997.
[9] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views
using 3d exemplars. In 2007 IEEE 11th International Conference on Computer
Vision, pages 1–7, Oct 2007.
[12] Jaeyong Sung, C. Ponce, B. Selman, and A. Saxena. Unstructured human activity
detection from rgbd images. In 2012 IEEE International Conference on Robotics
and Automation, pages 842–849, May 2012.
[13] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks
for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):221–
231, January 2013.
[15] Rashish Tandon and Dr. Amitabha. Semantic analysis of a cricket broadcast video.
[17] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J.
Comput. Vision, 60(2):91–110, November 2004.
45
[18] Timor Kadir, Paola Hobson, and Michael Brady. From salient features to scene
description. 2005.
[19] Jianbo Shi and Carlo Tomasi. Good features to track. Technical report, Ithaca,
NY, USA, 1993.
[20] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in
human-object interaction activities. In 2010 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, pages 17–24, June 2010.
[22] T. Yokoyama, T. Iwasaki, and T. Watanabe. Motion vector based moving object
detection and tracking in the mpeg compressed domain. In 2009 Seventh Inter-
national Workshop on Content-Based Multimedia Indexing, pages 201–206, June
2009.
[23] H. Patel and D. Thakore. Moving object tracking using kalman filter.
[24] Berthold K.P. Horn and Brian G. Schunck. Determining optical flow. Technical
report, Cambridge, MA, USA, 1980.
[25] Mikel D. Rodriguez, Javed Ahmed, and Mubarak Shah. Action mach: a spatio-
temporal maximum average correlation height filter for action recognition. In In
Proceedings of IEEE International Conference on Computer Vision and Pattern
Recognition, 2008.
[29] J. Redmon and A. Farhadi. YOLO9000: Better, Faster, Stronger. ArXiv e-prints,
December 2016.
[30] João Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra Malik. Human
pose estimation with iterative error feedback. CoRR, abs/1507.06550, 2015.
[31] Ivan Laptev, Marcin Marszalek, Cordelia Schmid, and Benjamin Rozenfeld.
Learning realistic human actions from movies. In Computer Vision and Pattern
Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
[32] Ronald Poppe. A survey on vision-based human action recognition. Image and
vision computing, 28(6):976–990, 2010.
[34] Chih-Chung Chang and Chih-Jen Lin. Libsvm: a library for support vector ma-
chines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27,
2011.
[35] Kevin Murphy Matt Dunham. Probabilistic modeling toolkit for matlab/octave.
[36] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Sub-
hashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent con-
volutional networks for visual recognition and description. CoRR, abs/1411.4389,
2014.
47
[42] Dr Paul Hurrion. Balance ,stability - the key to power and consistency in batting.
[43] Harris Drucker, Christopher JC Burges, Linda Kaufman, et al. Support vector
regression machines.
48
APPENDIX A
Player      Rater 1  Rater 2  Average    Player      Rater 1  Rater 2  Average
Player 1    5        5        5          Player 16   5        5        5
Player 2    3        2        3          Player 17   4        2        3
Player 3    1        1        1          Player 18   4        4        4
Player 4    3        3        3          Player 19   1        1        1
Player 5    2        3        3          Player 20   2        4        3
Player 6    3        2        3          Player 21   2        5        4
Player 7    2        4        3          Player 22   2        3        3
Player 8    4        3        4          Player 23   3        1        2
Player 9    4        3        4          Player 24   1        3        2
Player 10   4        5        5          Player 25   2        3        3
Player 11   2        4        3          Player 26   2        5        4
Player 12   1        2        2          Player 27   2        5        4
Player 13   4        4        4          Player 28   1        1        1
Player 14   3        3        3          Player 29   5        5        5
Player 15   2        4        3          Player 30   5        5        5
Player 31   3        2        3          Player 41   4        4        4
Player 32   4        4        4          Player 42   1        1        1
Player 33   3        3        3          Player 43   1        1        1
Player 34   3        2        3          Player 44   3        3        3
Player 35   2        3        3          Player 45   1        1        1
Player 36   4        3        4          Player 46   2        3        3
Player 37   3        4        4          Player 47   4        2        3
Player 38   4        3        4          Player 48   3        4        4
Player 39   1        3        2          Player 49   4        3        4
Player 40   1        3        2          Player 50   3        4        4
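The column headers of this table are not preserved in this copy. The three values in each row are consistent with two individual ratings followed by their mean, rounded half up to the nearest integer, which would match the averaged ground-truth rating described in Chapter 5. The sketch below checks that reading against all 50 rows; the interpretation of the columns and the helper name `round_half_up` are assumptions.

```python
import math

# (first value, second value, third value) for Players 1 to 50, copied from the table above.
ROWS = [
    (5, 5, 5), (3, 2, 3), (1, 1, 1), (3, 3, 3), (2, 3, 3),  # Players 1-5
    (3, 2, 3), (2, 4, 3), (4, 3, 4), (4, 3, 4), (4, 5, 5),  # Players 6-10
    (2, 4, 3), (1, 2, 2), (4, 4, 4), (3, 3, 3), (2, 4, 3),  # Players 11-15
    (5, 5, 5), (4, 2, 3), (4, 4, 4), (1, 1, 1), (2, 4, 3),  # Players 16-20
    (2, 5, 4), (2, 3, 3), (3, 1, 2), (1, 3, 2), (2, 3, 3),  # Players 21-25
    (2, 5, 4), (2, 5, 4), (1, 1, 1), (5, 5, 5), (5, 5, 5),  # Players 26-30
    (3, 2, 3), (4, 4, 4), (3, 3, 3), (3, 2, 3), (2, 3, 3),  # Players 31-35
    (4, 3, 4), (3, 4, 4), (4, 3, 4), (1, 3, 2), (1, 3, 2),  # Players 36-40
    (4, 4, 4), (1, 1, 1), (1, 1, 1), (3, 3, 3), (1, 1, 1),  # Players 41-45
    (2, 3, 3), (4, 2, 3), (3, 4, 4), (4, 3, 4), (3, 4, 4),  # Players 46-50
]


def round_half_up(x):
    """Round to the nearest integer, with .5 always rounding up."""
    return math.floor(x + 0.5)


# Player numbers whose third value differs from the rounded mean of the first two.
mismatches = [
    player
    for player, (first, second, listed) in enumerate(ROWS, start=1)
    if round_half_up((first + second) / 2) != listed
]
print(mismatches)  # []
```

Python's built-in `round` rounds halves to the nearest even integer (for example, `round(2.5)` gives 2), so it would not reproduce rows such as Player 2; an explicit half-up rule is used instead.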