
Extraction of Interesting Video Fragments from Soccer Matches

Maheen Rashid, Evgeny Toropov, Gaurav Singh December 11, 2013

1 Motivation

Soccer is probably the most popular televised sport in the world, and thousands of soccer matches are contested every weekend. Soccer highlights are posted online on video sharing websites like YouTube and Footytube, and these generate a tremendous number of hits. However, highlight compilation is done manually and is often tedious. We propose ways to automate soccer highlight video creation by extracting interesting video fragments, leveraging the structure of soccer broadcasts such as camera viewpoints, structured filming style, and discovered knowledge about the game structure. Key soccer events like a goal are often visually similar to uninteresting non-goal events, and this is the main challenge we need to overcome.

2 Related Work

Ekin et al. [2] use visual information for goal event detection. They use shots as the building block of their algorithm. Each video shot is classified as a long shot, a close-up shot, or an out-of-field shot. After classifying shots, they build on these to perform more complex visual classification tasks like slow motion detection and goal mouth detection using gradient features. While we do not follow their exact approach, we do use their algorithm as inspiration for the setup of ours. We also use shots as the building block of our algorithm and build on information from shot clusters for eventual detection of goal moments. However, we use our own approaches to detect shot changes, classify shots, and choose the features used for goal detection. For instance, we approach the problem of classifying shots in two ways: first, by calculating spatio-temporal features for video frames and then classifying using Bag of Words, and second, by unsupervised clustering of GIST feature responses calculated over representative video frames of each shot. In contrast to Ekin et al., we also leverage audio volume and pitch to determine video frames of interest.

3 Approach

Match videos have the advantage of being extremely structured in the number and type of video shots that are used. Unconventional viewpoints are rarely used, and the sequence of shots for certain match events is extremely predictable.

For example, in the case of a goal, the shots change from a long shot of the goal area, to a close-up of the players (goalkeeper, goal scorer celebrating), followed by one or two replays of the goal from different viewpoints, often in slow motion. This entire sequence is accompanied by an increase in volume and in the excitement of the match commentary. We used a multi-layer method for classification. Each match video is split into a number of short video sequences, where each sequence is a separate camera shot. Different features are extracted separately on each shot. Shot features are then concatenated together to form a fragment that is classified as either a goal or a non-goal. The input of our system is a full length video that is split into fragments of shots. For each fragment we predict whether it is a goal or a non-goal event.

3.1 Detecting Shot Changes

A camera shot change is characterized by a dramatic change in viewpoint or camera position. We observe that consecutive frames in a video sequence are unlikely to be dramatically different from one another unless there is a shot change. To measure frame-to-frame change, each image is represented by its RGB histograms concatenated together. The sum of differences between consecutive histograms is then taken to form a single-number measure of the difference between frames. Fig. 1 (a) is a visualization of the normalized difference between consecutive frames for a short 50 second clip. Detecting shot changes requires detecting anomalous peaks in this signal. These signals are not smooth, and shots are of varying lengths. To detect shot changes, we divide the signal into bins. Each bin is assigned the maximum variance of its signal window. This process is repeated for bins of different sizes (we fix bins to be 10, 20, 30, 40, and 50 frames long). Next, each bin is compared to its neighbouring bins, and a record is kept of bins that are at a local maximum. Finally, the regions of the signal for which bins of all sizes are at a local maximum are detected, and the maximum within the smallest bin is detected as a peak. Peaks are then thresholded to be greater than half the median value of all detected peaks in a sequence. These peaks are our shot changes. Fig. 1 is a visualization of our shot detection algorithm; Fig. 1 (d) shows some results. A typical video segment of a goal and its replays takes 1-2 minutes and consists of 10-20 shots. An issue with this simple shot change detection scheme is the gradual change of shots used as an artistic effect (see Fig. 2). In this case the difference in color histograms is not as pronounced, and the detector usually misses the shot change.
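The sketch below illustrates this idea, assuming OpenCV and NumPy are available; the multi-scale binning described above is condensed into a single-scale local-maximum search followed by the same half-median peak threshold, so the function names and the window size are illustrative rather than the exact implementation.

```python
import cv2
import numpy as np

def histogram_difference_signal(video_path, bins=32):
    """Difference of concatenated, normalized RGB histograms between consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    diffs, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # One histogram per channel, concatenated and normalized to sum to 1
        hist = np.concatenate([
            cv2.calcHist([frame], [c], None, [bins], [0, 256]).ravel()
            for c in range(3)])
        hist /= hist.sum() + 1e-8
        if prev_hist is not None:
            diffs.append(np.abs(hist - prev_hist).sum())
        prev_hist = hist
    cap.release()
    return np.array(diffs)

def detect_shot_changes(diffs, window=25):
    """Crude stand-in for the multi-scale binning: keep local maxima whose value
    exceeds half the median value of all detected peaks."""
    peaks = [i for i in range(window, len(diffs) - window)
             if diffs[i] == diffs[i - window:i + window + 1].max()]
    if not peaks:
        return []
    thresh = 0.5 * np.median(diffs[peaks])
    return [i for i in peaks if diffs[i] > thresh]
```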

Figure 1: From Top to Bottom: Difference of Histogram Signal, Binned Signals using different sized windows, Detecting windows at local maxima for all sized bins, Frames from detected shots

Figure 2: Gradual shot change usually missed by detector

3.2 Clustering shots based on Frame Features

As mentioned before, sequences of interesting events often involve shots of certain types strung together in a certain order. This observation is made in previous work as well [2]. To learn these templates of shots it is necessary to be able to automatically classify shots based on viewpoint, or the subject of the shot. We cluster shots by extracting features on individual shot frames. Rather than represent each shot with a single frame, we use three evenly spaced frames from each shot. This makes clustering robust to bad shot boundary detection and anomalous frames. Fig. 3 shows an example of bad shot boundary detection that is made less harmful by using multiple frames.

Figure 3: Three Frames selected from one Shot

We extract GIST and HOG features on the shot frames, and cluster frames using K-means. Ideally clusters would generalize across videos and would represent what kind of play was happening. Using K=4 we can see that this tends to happen. In Fig. 4 one cluster corresponds to close-up shots of players on the field, one to audience shots or close-up shots of players with a non-green background, and two clusters correspond to long shots with varying amounts of audience.
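A sketch of how such clustering might look in code, assuming scikit-image's HOG and scikit-learn's K-means; GIST has no single standard Python implementation, so `gist_descriptor` below is a placeholder for whichever implementation is plugged in, and the resize and HOG parameters are our assumptions.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.cluster import KMeans

def frame_descriptor(gray_frame, gist_descriptor=None):
    """HOG (and optionally GIST) descriptor of one representative frame."""
    small = resize(gray_frame, (128, 128), anti_aliasing=True)
    feat = hog(small, orientations=8, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2))
    if gist_descriptor is not None:
        feat = np.concatenate([feat, gist_descriptor(small)])
    return feat

def cluster_shots(shot_frames, k=4):
    """shot_frames: list of lists, three grayscale frames per shot.
    Each shot is represented by the mean descriptor of its three frames."""
    X = np.array([np.mean([frame_descriptor(f) for f in frames], axis=0)
                  for frames in shot_frames])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return km, X
```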

Figure 4: Result of clustering frames from shots for K-means clustering K=4. Columns represent clusters

We also tried clustering on the basis of LAB and RGB spatial histograms. This gave qualitatively poor performance, as there can be dramatic color variation of the grass depending on shadows and time of day. This can be seen in the first cluster in Fig. 4. Additionally, team shirt colors affect clustering done using color features, so such clusters do not generalize across matches. GIST and HOG combined perform better than either of them alone. A possible reason is that GIST captures the visual quality of the scene as a whole (how open the scene is) without zooming in on the finer details, which HOG is better able to capture.

3.2.1 Soft Assignment of Shots

While clustering based on GIST and HOG is qualitatively impressive, it is not perfect. Therefore, we assign shot membership to clusters in a soft manner. Rather than represent cluster membership by a single number, we use a feature vector of length K (in our case K=4) containing the distances of each shot to the cluster centres.
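A short sketch of this soft membership feature, reusing the fitted K-means model and shot descriptor matrix from the previous sketch:

```python
def soft_assignment(km, X):
    """Distance of each shot descriptor to every cluster centre.
    Returns a (num_shots, K) matrix used in place of hard cluster labels."""
    return km.transform(X)
```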

3.2.2 Group Shot Clusters across Videos

To compare clusters across videos we need to know which cluster corresponds to which kind of shot. In other words, we need some method of identifying which cluster corresponds to a long shot, which corresponds to an out-of-field shot, and so on. We tried a number of approaches to robustly label our clusters. Initially we tried to rank clusters by measuring the distance of pixels in each cluster to the mode color. Ideally the mode color should correspond to the dominant contiguous background color in each frame. Clusters that correspond to long shots and feature large regions of even color would therefore have a small distance to the mode color, whereas clusters that correspond to out-of-field shots, which feature large color variation across shots, should have a large distance from the mode color. However, this approach gave poor results. The mode color often corresponded to colors that were frequent but not contiguous; for instance, the mode color for audience shots was often black. In addition, the background color can vary greatly in both RGB and HSV space even if the variation is not perceptually obvious. We then attempted to classify shots based on deviation from the mean color of the cluster. Since shot clustering is not perfect, this also proved ineffective at separating long shots from close-up shots. The approach that eventually worked was to use the ratio of the sum of image gradient magnitudes to the total number of pixels in the cluster. Fig. 5 shows how shots with large even regions, such as close-up shots of players on a grass background, have a low magnitude-to-pixel ratio, whereas audience shots tend to have a high ratio. The success of this approach is not surprising considering that shot clustering itself was most successful with gradient based, rather than color based, features.
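A possible implementation of the gradient magnitude ratio and the cluster ranking it induces, again assuming OpenCV; the helper names and data layout are illustrative:

```python
import cv2
import numpy as np

def gradient_magnitude_ratio(gray_frame):
    """Sum of gradient magnitudes divided by the number of pixels; even regions
    (grass, close-ups) give low values, busy audience shots give high values."""
    gx = cv2.Sobel(gray_frame, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray_frame, cv2.CV_32F, 0, 1)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    return mag.sum() / mag.size

def rank_clusters(cluster_to_frames):
    """cluster_to_frames: dict mapping cluster id -> list of grayscale frames.
    Returns cluster ids sorted from most even (long / close-up) to busiest (audience)."""
    scores = {c: np.mean([gradient_magnitude_ratio(f) for f in frames])
              for c, frames in cluster_to_frames.items()}
    return sorted(scores, key=scores.get)
```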

Figure 5: Top: Audience shot has higher image gradient magnitude, Below: Close-up of player has smaller image gradient magnitude

3.3 Clustering shots based on Video features

As an alternative to the previous method, we attempted to apply the Bag of Words (BoW) algorithm to cluster shots based on 2D+time video features. Our hope was to follow the usual BoW pipeline for video as described, for example, in [5]. We used the HOG/HOF detector and descriptor because it has been evaluated to perform just as well as other detectors, and a little better than other descriptors. We used the code available at [3].

Figure 6: HOG-HOF descriptor (image from I. Laptev's talk)

Following the BoW approach, descriptors were collected from video shots. Every descriptor has 90 HOG + 72 HOF = 162 bins. For an average video shot we collected around 5000 descriptors. The overall pool of descriptors was clustered into 250 clusters using k-means, and every cluster became a visual word. Then the histogram of visual words was calculated for every shot. At this point, in order to check the model, we tried to cluster the shot histograms into several classes using k-means. The result was not encouraging: the output clusters did not relate to any similar activities, and could only describe dominant color. The problem is probably the wide variety of low-level activities despite the very small number of high-level activities. No high-level knowledge was applied in the BoW pipeline with the spatio-temporal features due to lack of time, and therefore the model performed poorly. Compare the images from one class of the KTH Human Actions dataset with examples of celebrations, which we would like to have as a class too (Fig. 7).

Figure 7: Images of celebration class in soccer are much more diverse than images of standing or waving classes of KTH dataset
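For reference, a minimal sketch of the visual-word histogram step described above, assuming the 162-dimensional HOG/HOF descriptors have already been extracted with the code from [3] and pooled into arrays:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, num_words=250):
    """all_descriptors: (N, 162) array of HOG/HOF descriptors pooled over all shots.
    Returns the fitted visual vocabulary."""
    return KMeans(n_clusters=num_words, n_init=4, random_state=0).fit(all_descriptors)

def shot_histogram(vocab, shot_descriptors):
    """L1-normalized histogram of visual words for one shot."""
    words = vocab.predict(shot_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-8)
```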

3.4 Detecting Slow motion in a shot

Slow motion replays are a strong indicator of when an event of interest occurs in a video. An intuitive approach to detecting slow motion sequences is to compare the rate of change of motion in a shot with that of a non-slow-motion sequence. However, detected motion is dependent on viewpoint: a long shot of the field may have as much motion as a slow motion close-up of a goal. Because of this it is necessary to compare slow motion sequences to other sequences taken from the same or a similar viewpoint; in the context of the previous sections, this means belonging to the same shot cluster. A way of detecting slow motion replays would be to add an optical flow measure to the shot cluster features. These two features combined would help define a unique signature for slow motion events. In addition, optical flow by itself is a strong indicator of the type of shot in a football video. We found that high values of optical flow corresponded to close-up shots of players and managers, while low values corresponded to long shots of the field (see Fig. 8).

Figure 8: Left image shows close shot with high overall Optical flow, Right image shows long shot of ground with low overall Optical flow

We calculate an optical flow measure for a given shot by calculating the average magnitude of optical flow between consecutive frames. When we plot this optical flow measure for all shots we can see rapid changes in optical flow across individual shots, which is expected as football telecasts often switch from long shots to close-ups. We also see correspondences between close shots and local maxima in the plot (Fig. 9), and between long shots and local minima (Fig. 10).

Figure 9: Highlighted point shows Optical flow corresponding to close shot in Fig. 8

Figure 10: Highlighted point shows Optical flow corresponding to long shot in Fig. 8
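A minimal sketch of this per-shot measure using OpenCV's dense Farneback optical flow; the specific flow algorithm and its parameters are our assumptions, not mandated by the approach.

```python
import cv2
import numpy as np

def shot_optical_flow(frames):
    """Average magnitude of dense optical flow between consecutive grayscale frames."""
    mags = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            pyr_scale=0.5, levels=3, winsize=15,
                                            iterations=3, poly_n=5, poly_sigma=1.2,
                                            flags=0)
        mags.append(np.linalg.norm(flow, axis=2).mean())
    return float(np.mean(mags)) if mags else 0.0
```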

3.5 Detecting certain objects in a shot

Some objects may be easy to detect in single frames, and they point to particular events in the fragment. One example is a goal net shown at large scale (see Fig. 11). Because of its periodic structure and consistent color, it may be easy to detect in a frame. At the same time, if the net is shown at large scale then there has probably been a shot on goal or at least a dangerous situation near the goalpost. These events are definitely of interest.

Figure 11: Left: Frame with goal net, Right: Frame without goal net

In our project we looked at detecting goal nets and team shirt colors with a view to using them as indicators for interesting shots.

3.5.1 Goal Net Detection

The net can be detected with horizontal and vertical negative bi-modal filters applied to a frame (see Fig. 12).

Figure 12: Bimodal kernel. Distance between the two modes varies from 10 to 128 pixels

The image is convolved with the horizontal kernel and the sum of absolute values is taken. This is the response of the filter at the current distance between modes. Fig. 13 shows the response as a function of distance for an image with and without a goal net. The maximum distance between modes is set to 128 pixels; at that distance the periods gradually become blurred.

Figure 13: Filter response is periodic for images with net (left) and without goal net (right)

To detect the periodic structure, Fourier analysis is used; see the Fast Fourier Transform of the responses in Fig. 14. The peak of the real values of the Fast Fourier Transform is found on the interval (minimum possible frequency, maximum possible frequency). The minimum possible frequency is set to correspond to a 5 pixel period, and the maximum possible frequency is set so that the response contains 3 full periods. After that the mean Fast Fourier Transform value is compared to the peak value. The ratio of those values (typically 0.05 with a net vs 0.2 without) is the output metric.
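A rough sketch of this detector under stated assumptions: the exact kernel values of Fig. 12 are not reproduced here, so the zero-mean, two-dip kernel below is an assumption, and the sketch uses the FFT magnitude rather than its real part.

```python
import cv2
import numpy as np

def bimodal_response_curve(gray, min_d=10, max_d=128):
    """Filter response as a function of the distance between the two negative modes.
    The kernel shape (two dips balanced by a central positive value) is assumed."""
    responses = []
    for d in range(min_d, max_d + 1):
        kernel = np.zeros((1, d + 1), np.float32)
        kernel[0, 0] = kernel[0, -1] = -1.0
        kernel[0, (d + 1) // 2] = 2.0          # keep the kernel zero-mean
        resp = cv2.filter2D(gray.astype(np.float32), -1, kernel)
        responses.append(np.abs(resp).sum())
    return np.array(responses)

def net_metric(curve, min_period=5, min_full_periods=3):
    """Ratio of mean FFT magnitude to its peak inside the plausible frequency band
    (between 3 full periods over the curve and a 5-sample period); small ratios
    (around 0.05) suggest the periodic structure of a net."""
    spectrum = np.abs(np.fft.rfft(curve - curve.mean()))
    lo = min_full_periods
    hi = max(lo + 1, len(curve) // min_period)
    band = spectrum[lo:hi + 1]
    return band.mean() / (band.max() + 1e-8)
```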


Figure 14: Fourier Transform of filter response for images with net (left) and without goal net (right)

This metric alone is not enough to differentiate between frames with and without the goal net. Two numbers, the metrics for the horizontal and vertical filters, are taken from every frame and analyzed in the context of a video segment. Next, several frames from a video segment need to be extracted for processing. The problem here is to extract the least blurred frames. A simple and effective way to estimate blurriness is to convolve the image with the Laplacian filter [0,-1,0; -1,4,-1; 0,-1,0]. The sum of the absolute values of the convolution is a metric of sharpness. It is scene dependent: bright images with many details will be judged sharper than dark images with few details. See Fig. 15 for how sharpness changes from shot to shot in a video. Therefore, this sharpness metric should be used very locally. We use a sliding window of 1 second with a step size of 0.5 seconds and detect a peak of sharpness within each window. After this we are left with about 7 percent of the frames of the original video segment for processing.
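A small sketch of the sharpness metric and the sliding-window frame selection; the frame rate and the exact windowing details are illustrative.

```python
import cv2
import numpy as np

LAPLACIAN = np.array([[0, -1, 0],
                      [-1, 4, -1],
                      [0, -1, 0]], dtype=np.float32)

def sharpness(gray):
    """Sum of absolute Laplacian responses; higher means sharper (scene dependent)."""
    return np.abs(cv2.filter2D(gray.astype(np.float32), -1, LAPLACIAN)).sum()

def sharpest_frames(frames, fps=25, win_sec=1.0, step_sec=0.5):
    """Pick the sharpest frame inside each 1 s window, sliding with a 0.5 s step."""
    scores = [sharpness(f) for f in frames]
    win, step = int(win_sec * fps), int(step_sec * fps)
    picks = set()
    for start in range(0, max(1, len(frames) - win + 1), step):
        window = scores[start:start + win]
        picks.add(start + int(np.argmax(window)))
    return sorted(picks)
```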

Figure 15: Sharpness metric changes from shot to shot in video

The result of the net detection algorithm is shown in Fig. 16. Note that the horizontal and vertical kernels detect nets in different frames. Therefore, the two curves should be used independently. We used simple thresholding to find the peaks.


Figure 16: Goal net detector responses for different non-blurred frames. Green and blue correspond to the horizontal and vertical kernel respectively

3.5.2 Team Color Detection

Team shirt colors can be an important indicator of when significant events happen in matches. Goalkeepers and referees wear shirts of a different color from the team colors, and these key people are prominent in shots that often coincide with significant match events such as goals, near misses and bookings. In order to detect team colors, we use frames from the shot clusters that correspond to close-up shots on a green background; these are the shot clusters with the lowest image gradient magnitude to pixel ratio. For each of the three frames selected from every shot in this cluster, the dominant colors in the image are detected. We detect four dominant colors by clustering in HSV color space. The centers of the clusters are chosen as the representative dominant colors in the frame. Fig. 17 shows the dominant colors detected on a single frame.
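The per-frame dominant color step might look like the following sketch; the pixel subsampling rate and the use of scikit-learn's K-means are our choices for illustration.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def dominant_colors(frame_bgr, k=4, subsample=16):
    """Four dominant colors of a frame found by k-means in HSV space.
    Pixels are subsampled for speed; returns the cluster centres as HSV triples."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    pixels = hsv.reshape(-1, 3)[::subsample].astype(np.float32)
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)
    return km.cluster_centers_
```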


Figure 17: A frame and detected dominant colors

We make the following crucial observation: team colors are more likely to recur across shots. The other colors are dubbed anomalous colors. In order to find anomalous colors, we perform a second stage of clustering over the dominant colors found in individual frames of the entire match. The resulting clusters are shown in Fig. 18, alongside frames from the corresponding match. The second and third clusters from the left correspond closely to the team colors, while the first cluster corresponds to the background color.

Figure 18: Clustering over dominant colors found in individual frames of entire match

Rather than attempt to detect the team colors directly, we use two simple heuristics to detect the anomalous cluster. First, the cluster with the most members belongs to the background. Second, the cluster that is least uniform, i.e. has the highest spread, is the anomalous cluster. In Fig. 18, the detected green cluster is the first one, and the anomalous cluster is the last one. Once the anomalous cluster is detected, we rank how anomalous a shot is by summing the distances of its non-background dominant colors to the anomalous cluster center. Fig. 19 and Fig. 20 show examples of shots detected as anomalous in different videos. Fig. 19 shows that this approach is not only good at detecting goalkeepers and referees, but also at detecting unusual poses such as players lying down, since these poses cause more of the non-primary team color (the shorts and socks color) to be present in the image frame. However, as Fig. 20 shows, the approach is sensitive to bad close-up shot clustering, shot fading (which causes bad shot boundary detection), and non-uniform background color.


Figure 19: Good outcomes of Anomalous shot detection using color information. Goalkeepers, referees, injuries are detected

Figure 20: Not so useful outcomes of Anomalous shot detection using color information. Shot fading, non-uniform backgrounds cause confusion

3.6 Audio Volume and Pitch

Goals and near misses are often accompanied by cheers from the audience and excited commentary from the match commentators. This increase in the overall level of excitement translates to qualitative changes in the audio volume and pitch of a match. Our initial approach to using audio to detect goals was to detect changes in volume levels. However, this approach was not robust: volume levels are very sensitive to audience chanting and whistling, which do not necessarily correspond to goal moments. Furthermore, goal events may occur when the audience is generally noisy, so they might not be accompanied by a detectable shift in volume. As an alternative we implemented the approach described in [4].


The paper uses a Composite Fourier Transform (CFT) to filter out noise and detect periods of sustained excitement. First the Fast Fourier Transform of the audio signal is calculated. Then the FFT of the magnitude of the result within the 100-400 Hz band is calculated. The magnitude of this second result is the peak profile of the CFT. A sliding window of 100 msec with 75 percent overlap is used to get the CFT peak profile of the entire audio signal. Dense peaks correspond to increases in pitch and volume, i.e. to the audio excitement level. Sustained increases in excitement correspond to horizontal striations in the CFT profile. A 2D bump detecting kernel is used to highlight points in the audio signal where these horizontal striations occur.
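A sketch of the CFT profile computation under the stated parameters (100 ms windows, 75 percent overlap, 100-400 Hz band), assuming the audio has been extracted to a WAV file; the mono mixdown and the omission of the bump-detection step are our simplifications.

```python
import numpy as np
from scipy.io import wavfile

def cft_profile(wav_path, win_ms=100, overlap=0.75, band=(100, 400)):
    """Composite Fourier Transform profile of an audio track (after [4]):
    FFT of each window, keep the 100-400 Hz magnitudes, then FFT again.
    Columns of the returned array are the per-window CFT peak profiles."""
    rate, audio = wavfile.read(wav_path)
    if audio.ndim > 1:                      # mix down to mono
        audio = audio.mean(axis=1)
    win = int(rate * win_ms / 1000)
    step = max(1, int(win * (1 - overlap)))
    freqs = np.fft.rfftfreq(win, d=1.0 / rate)
    band_mask = (freqs >= band[0]) & (freqs <= band[1])
    columns = []
    for start in range(0, len(audio) - win, step):
        spectrum = np.abs(np.fft.rfft(audio[start:start + win]))
        cft = np.abs(np.fft.rfft(spectrum[band_mask]))
        columns.append(cft)
    return np.array(columns).T
```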

Figure 21: Audio excitement detection process on a minute long audio clip

Fig. 21 shows the process on a minute long audio clip. Top: the raw audio signal. Middle: the CFT profile of a sliding window over the signal; the horizontal striations visible in the profile correspond to commentary. Bottom: the peak profile of the clip, where dense peaks correspond to periods of sustained excitement. The minimum and maximum values of the peak profile are used to form thresholds against which every second's excitement is judged. At low thresholds all seconds in the match are classified as exciting, and at higher thresholds only a few seconds of the match are classified as exciting. We set the cut-off for excitement to be half of the difference between the minimum and maximum density peak values. For all thresholds above this value, the highest threshold at which a second remains exciting is recorded. This is used to form the excitement-per-second profile of the match. Actual exciting moments in the match remain exciting at high thresholds and show sustained excitement over some time. Fig. 22 shows the excitement-per-second profile of a match. The red lines correspond to goal moments and occur near clusters of exciting seconds, or peaks. The peaks near the centre of the x axis and towards the beginning and end of the match correspond to half-time replays and to pre-match and post-match analysis, when there is excited commentary. The other clusters of excitement correspond closely to near misses.


Figure 22: Audio excitement per second of one match. Red lines correspond to goal events, and light blue lines correspond to other events in the match such as bookings and substitutions.

4 Dataset Used

We used a combination of full length matches and goal highlights for the dataset. We used full length matches from the English Premier League 2013-2014, and goal clips from the 2010 Soccer World Cup as well as earlier league seasons in England and Spain. We annotate the full length matches in our dataset by parsing the online match commentary available on the BBC Football website [1]. The match events are accurate to the second, and can be automatically collected for all the match reports listed on the website. Website parsing was done using Python and the Beautiful Soup package.
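A sketch of what such a scraper might look like; the CSS class names below are purely illustrative placeholders, since the actual BBC match report markup has to be inspected before writing real selectors.

```python
import requests
from bs4 import BeautifulSoup

def fetch_match_events(report_url):
    """Collect (timestamp, text) pairs from a match report page.
    The selectors and class names are hypothetical stand-ins."""
    soup = BeautifulSoup(requests.get(report_url).text, "html.parser")
    events = []
    for item in soup.select(".live-text-commentary, .commentary-item"):  # hypothetical classes
        time_tag = item.find(class_="timestamp")       # hypothetical class
        text_tag = item.find(class_="comment-text")    # hypothetical class
        if time_tag and text_tag:
            events.append((time_tag.get_text(strip=True),
                           text_tag.get_text(strip=True)))
    return events
```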

5 Overall Goal Detection Results

5.1 Training and Test Set

We use four full matches and 111 goals as the training set for each test match. Testing is done on five full length matches.

5.2 Features

We use shot cluster features (clustering based on GIST and HOG) and audio features for classification of shots. Audio features for a shot are represented as the sum of excitement levels during the shot. Shot cluster features are represented using a five-element vector: the distance measurements to each of the four semantic clusters, and the duration of the shot in seconds.

5.3 Training and Testing

For training, five continuous shots are combined to form a fragment and each fragment is labeled as goal or non-goal. A sliding window with a step size of one shot is used to form these fragment-based feature vectors. We train a Support Vector Machine with a Radial Basis Function (RBF) kernel, and use grid search to set the hyperparameters C and γ.
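A compact sketch of this training step using scikit-learn; the specific grid values and the class weighting are illustrative assumptions, not the exact settings used.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def train_fragment_classifier(fragment_features, labels):
    """fragment_features: one row per five-shot fragment (concatenated per-shot
    cluster distances, shot durations and audio excitement sums);
    labels: 1 for goal fragments, 0 otherwise."""
    param_grid = {"C": [0.1, 1, 10, 100],
                  "gamma": [1e-3, 1e-2, 1e-1, 1]}
    grid = GridSearchCV(SVC(kernel="rbf", class_weight="balanced"),
                        param_grid, cv=3)
    grid.fit(np.asarray(fragment_features), np.asarray(labels))
    return grid.best_estimator_
```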

Because we use sliding windows, a single goal causes five fragments to be labeled as positive examples. Therefore, during evaluation positive predictions within a five-fragment neighborhood are counted as a single goal prediction, and contiguous fragment predictions are coalesced into one positive goal prediction. For determining true positives, predictions within a five-fragment window of a ground truth goal are counted as valid detections. The following table summarizes our results:

Table 1: Overall goal detection results for five matches

Match number     1   2   3   4   5
Actual Goals     4   1   4   0   5
Predicted Goals  3   4   9  12  10
True positives   1   1   0   0   2

5.4 Problems

Initially we attempted to train and test on a smaller dataset consisting entirely of full length matches. However, despite using cross validation to set the weight bias for positive examples, we were unable to obtain reasonable results. Adding more positive examples in the form of goal clips is what eventually made our model work. Another problem that made testing and training extremely time consuming was the pre-processing of videos so that features could be extracted from them. Video codec problems made it hard to create a dataset of full length matches that could be cut up into smaller shots which could, in turn, actually be read in as images and processed. Despite having the annotations for all the English Premier League matches of this year, we were only able to use the annotations for five matches. This also prevented us from incorporating more features and doing more detailed analysis during testing and training.

6 Conclusion

During the course of this project we have explored how bias in soccer match filming can be exploited to extract relevant information about the game. The limited range of shots, as well as the uniform background of the football field, provided us with cues that we used to cluster shots. Through this process we learned to rely on gradient based features over color based features for image clustering. We also attempted to use spatio-temporal features for shot clustering and found that while these features may capture information in time, they could not adequately capture the complexity of camera and subject motion in soccer matches. This is also indicative of the simplicity of the dataset these features were originally evaluated on, and points to the problem of dataset bias and the difficulty of generalizing across different domains. We attempted slow motion detection using optical flow and found that on its own optical flow is inadequate for detecting the amount of motion. However, it provides a good measure of the type of shot (long shot or close-up shot).

Inspired by object detection and the relatively constrained environment in which objects occur in soccer matches, we explored how objects of interest can be found. For goal net detection we developed a feature specifically aimed at detecting goal nets at different scales. For detecting anomalous events, we built on information from shot clustering and used unsupervised clustering in combination with color based features and simple heuristics to detect infrequently occurring frames. We found that by developing features and approaches specific to a constrained domain one can achieve promising results. Finally, we moved away from visual features and implemented a system to detect the level of audio excitement in a match. The use of audio to accompany visual features is similar to multi-modal approaches to semi-supervised learning in object detection that use textual information. Implementing the audio feature taught us a lot about a domain of video processing that has great pragmatic value, as well as how to merge features from different domains for training. During testing we dealt with the problems of pre-processing large, unwieldy data and balancing the ratio of positive and negative examples. Future work on this project could include the incorporation of the features excluded during testing in this iteration, along with the use of larger datasets across different football leagues and tournaments.

References

[1] BBC. BBC Football, December 2013. URL http://www.bbc.com/sport/0/football/.

[2] A. Ekin, A. M. Tekalp, and R. Mehrotra. Automatic soccer video analysis and summarization. IEEE Transactions on Image Processing, 12(7):796-807, 2003. ISSN 1057-7149. doi: 10.1109/TIP.2003.812758.

[3] Ivan Laptev. Ivan Laptev's page, November 2013. URL http://www.di.ens.fr/laptev/download.html.

[4] Kongwah Wan and Changsheng Xu. Robust soccer highlight generation with a novel dominant-speech feature extractor. In IEEE International Conference on Multimedia and Expo (ICME 2004), volume 1, pages 591-594, 2004. doi: 10.1109/ICME.2004.1394261.

[5] Heng Wang, Muhammad Muneeb Ullah, Alexander Klaser, Ivan Laptev, and Cordelia Schmid. Evaluation of local spatio-temporal features for action recognition. In British Machine Vision Conference (BMVC), 2009.

