
IET Computer Vision

Research Article

Using mel-frequency audio features from footstep sound and spatial segmentation techniques to improve frame-based moving object detection

ISSN 1751-9632
Received on 12th May 2017
Revised 19th September 2017
Accepted on 1st November 2017
E-First on 13th February 2018
doi: 10.1049/iet-cvi.2017.0209
www.ietdl.org

Aditya Roshan1, Yun Zhang1
1Department of Geodesy and Geomatics Engineering, University of New Brunswick, Fredericton, Canada
E-mail: roshan.aditya@unb.ca

Abstract: Moving object detection in video streams is a challenging and integral part of computer vision, used in surveillance, traffic and site monitoring, and navigation. Compared with background-based techniques, the frame differencing technique is computationally inexpensive. However, frame differencing detects only the boundary of a moving object. Under changing light conditions, shadows, poor contrast between object and background, or slow object motion, the detection rate of the frame differencing technique drops, because the number of noisy frames and frames with a missing/partially detected object increases. Morphological operations with a large kernel size fail to remove this noise, as they may also remove the boundary (or part) of the moving object. In this study, the authors propose a methodology to improve the frame differencing technique using the footstep sound generated by a moving object. Audio recorded with the video system is processed and footstep sound is detected using audio features computed as mel-frequency cepstral coefficients. The number of frames within each footstep sound is counted and processed. Spatial segmentation is used to find the moving object in noisy frames. A missing or partially detected object is recovered by modelling an ellipse using the moving object from neighbouring frames.

1 Introduction

Moving object detection in video streams is an integral part of computer vision. An object of interest is detected in the video (or image sequence) using a computer program. This topic also has great significance in surveillance, traffic, site monitoring, face detection, robotics, navigation, infrastructure and military applications [1, 2]. Moving object detection techniques can be divided into two categories: background-based and frame-based techniques. Background-based techniques are affected by light change, shadows, clutter and camouflage, and are computationally expensive. On the other hand, frame-based techniques are computationally inexpensive, less complex, less susceptible to light change and less affected by shadows [3]. Frame differencing and optical flow are the two most commonly used frame-based techniques of moving object detection. Frame differencing is computationally less expensive than optical flow. Though computationally less complex and fast in processing, the frame differencing technique has not been used much, mainly because of poor outputs. This technique detects only the boundary of the moving object. For slow-moving objects, it either fails to detect or partially detects the moving object, which makes it hard to determine whether the object has been detected. It is also affected by poor contrast between the object and its background (stationary or dynamic), camouflage, clutter, noise and shadows. Multiple light sources used in indoor conditions cause multiple shadows and reflections. Under all these conditions, the frame differencing technique produces a noisy object, where the noise is distributed all over the frame, making it quite difficult to find the moving object. This paper is focused on improvements of the frame differencing technique for detection of a moving object (a human). Human detection is referred to as moving object detection in the rest of the paper.

In certain conditions, a moving object (a human) produces footstep sound as it moves. Most modern video recording camera systems come with an inbuilt microphone; some cameras come with an omnidirectional and some with a stereo (or bidirectional) microphone, which records sound present in the surroundings. Presence of footstep sound indicates the presence of an object, and this sound is used to improve the frame differencing technique whenever it fails to detect a moving object due to various reasons. The camera's inbuilt microphone system records sound which is synchronised with the video. The recorded sound is processed and footsteps are detected. Presence of footstep sound is used as confirmation of the presence of a moving object in video frames even if the moving object detection technique has failed to detect the object. After this confirmation, frames adjacent to the frame of interest are used to model the moving object (Fig. 1).

Audio amplitude and background noise vary from one camera scene to another. For successful detection of footstep sound, high-frequency background noise is filtered out using a low-pass filter. Audio features are computed from the processed audio data as mel-frequency cepstral coefficients (MFCC) [4]. The coefficients include audio features of all frequencies. Lower-order MFCC coefficients, corresponding to low-frequency audio features, are used and processed to detect footstep sound. The number of video frames within each detected footstep sound is counted and analysed for the presence of the moving object.

The proposed methodology improves the moving object detection rate of the frame differencing technique. Object modelling is performed using spatial segmentation and neighbourhood frames to model the moving object in noisy frames or frames with a missing/partially detected object. A partially detected object refers to any extent of incomplete detection of the moving object. As a result, applying footstep sound yields a greater number of frames with a detected moving object. The approach can also be extended to other frame- and background-based moving object techniques.

2 Literature review

A significant amount of research has been conducted to date focusing on the detection of moving objects using videos from different cameras. However, very limited work is available on the application of footstep sound to improve the shortcomings of the frame differencing method. Background subtraction [5] and frame
IET Comput. Vis., 2018, Vol. 12 Iss. 3, pp. 341-349 341
© The Institution of Engineering and Technology 2018
Fig. 1  Process showing the improvement of frame differencing (FD) moving object detection technique using footstep sound recorded with video recording
system

differencing [6] are two basic techniques used for moving object detection. Background subtraction uses a video frame without an object as the background and subtracts each incoming frame from the background. The frame differencing technique computes the difference between two consecutive frames and finds the moving object. Due to the presence of noise, slow-moving objects, shadows, changing background, poor contrast between object and background, dynamic background, clutter and camouflage [7], final outcomes were not useful. These basic techniques have been further developed. Optical flow [8], the Gaussian mixture model [9] and wavelet transform based methods [10] are some of the other most commonly used moving object detection techniques. There might be various reasons to choose one technique over another, but due to different variables and conditions, it is hard to find a single technique suitable for all conditions. Researchers are still working on improving these techniques. Recently, Zhou and Jin [11] used two-dimensional principal component analysis to update the background frame, and Subudhi et al. [12] used a spatio-temporal multilayer compound Markov random field for histogram-based moving object detection.

A video-only moving object detection technique uses different properties of video frames to detect the moving object. Some use two or three consecutive frames to detect the change, and others use a static background with incoming frames. Detailed literature on most of the basic moving object detection techniques is available [5, 7, 13, 14]. The Gaussian mixture model is one of the most popular background subtraction techniques because of its ability to adapt to light change and shadows [13]. It has been further improved by using colour information [15, 16]. An automatic graph cut method has been introduced to detect a moving object, which uses the frame difference between consecutive frames as a seed (or initial detection) [6]. This method fails when there is no seed due to a slow-moving object, or when the seed region is large due to noise in the frame. A neural network has been implemented to learn about background change and track the moving object [17, 18]; this requires retraining of the network when the scenario is altered. Researchers have also fused two or more techniques to obtain improved final outputs [19–21].

Besides using videos, researchers have also considered sound as a variable for moving object detection. Although sound-only techniques of moving object detection have not been used in computer vision, they are used to detect a falling person and provide medical help to the elderly [22]. A sound array is used to detect and track a moving object in traffic, and to monitor pedestrians [22]. Drawbacks of sound-only methods are that the precise location of a moving object is unknown, and tracking an object requires a larger network of audio sensors. Sound sensors are also used with video in multimodal systems to track the speaking person in the camera scene such that the camera is focused on the person [23]. Multimodal systems have been further developed by using multiple video and audio sensors to detect and track the moving object in the scene [24]. Other sensors such as infrared video, radar, and light detection and ranging have also been used with optical imaging systems in multimodal systems [25, 26]. With the increasing number of sensors in multimodal systems, there is also an increase in the complexity of these systems. Neither multimodal systems nor video-only techniques have addressed the problem of missing or partially detected moving objects.

3 Methodology

First, the frame differencing technique of moving object detection is used to detect a moving object in video frames. Then, sound recorded with the video recording system is processed and footstep sound is detected. The duration, start and end time of each detected footstep are calculated and used to improve the moving object detection rate of the frame differencing technique. Spatial segmentation is developed to find the moving object in noisy frames, and a missing or partially detected object is modelled as a complete object using frames neighbouring the frame under consideration.

3.1 Moving object detection

The frame differencing technique subtracts two consecutive frames and applies a threshold to the absolute difference:

    Output pixel value = 1, if |f_t − f_{t−1}| > Th
                         0, otherwise                    (1)

where f_t is the frame at time t, f_{t−1} is the frame at time t − 1, and Th is the threshold. The threshold value plays a critical role [27] and should be updated with changes in the camera scene. Different image thresholding techniques exist for multi-band (or colour) and grey images [28]. The grey-image histogram based Otsu method of automatic thresholding [29] and manual thresholding are popular in moving object detection. Manual selection of the threshold is critical, as it may differ for different objects and camera scenes. Only one manual threshold is used for the entire dataset, whereas each absolute difference of consecutive frames uses a different threshold when the automatic Otsu method is applied. Both the manual and Otsu thresholding methods are applied to the dataset, and results are discussed in Section 4.

3.1.1 Morphology: Random noise is present in original video frames and its spatial location changes from frame to frame. For a given threshold value, the amplitude of noise also changes from one frame to another depending on scene conditions. These unwanted noise pixels are removed (or minimised) using morphological operations. The morphological operation most commonly used in moving object detection is erosion followed by dilation [27]. A small square kernel of 3 × 3 is used in the morphological operations so that the partial or complete boundary of the detected object is least affected.

3.1.2 Ellipse fitting: Most moving object detection techniques represent a moving object as a bounding rectangle. However, a moving object is better represented as an ellipse than a rectangle. A least-square ellipse is fitted to the group of pixels detected as a

Fig. 2  Audio processing and video frame counting for each peak location in input sound

Fig. 3  Video 1 footstep sound waveform and detection of peak locations corresponding to footstep sound using first MFCC coefficients
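The thresholding rule of (1) in Section 3.1 can be sketched in a few lines. The paper's implementation uses Matlab; the following is an illustrative Python re-expression (not the authors' code), with grey-value frames represented as 2D lists:

```python
def frame_difference_mask(ft, ft_prev, th):
    """Eq. (1): output pixel is 1 where |f_t - f_{t-1}| > Th, else 0."""
    return [[1 if abs(a - b) > th else 0
             for a, b in zip(row_t, row_prev)]
            for row_t, row_prev in zip(ft, ft_prev)]

# two tiny 2x3 grey frames; only the strongly changed pixel exceeds Th
prev = [[10, 10, 10],
        [10, 10, 10]]
curr = [[10, 80, 10],
        [10, 10, 12]]
mask = frame_difference_mask(curr, prev, th=25)
# mask == [[0, 1, 0], [0, 0, 0]]
```

As the paper notes, only boundary pixels of a moving object typically change enough between consecutive frames to exceed Th, which is why the technique detects the object's outline rather than its full extent.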

moving object. In order to find a bounding ellipse, object boundary pixels are found as convex hull points. The Matlab command ‘convexhull’ is used to find the object boundary pixels, and a least-square ellipse E(x, y, a, b, θ) is modelled [30]. For the ellipse E, x is the x centre, y is the y centre, a is the semi-major axis, b is the semi-minor axis and θ is the orientation angle. When the pixels detected as a moving object are randomly distributed all over the frame, the frame is considered noisy and a noise removal algorithm (Section 3.4) is applied.

3.2 Audio processing for footstep sound detection

Sound recorded with the video is extracted and processed to detect footsteps (Fig. 2). Noise and unwanted sounds are also recorded. Initially, high-frequency noise is suppressed using a low-pass filter, and audio features of the filtered audio wave are extracted as MFCC coefficients. These MFCC coefficients are used to detect footstep sounds, their locations (or times) and their durations with start and end timestamps. The significance of the first MFCC coefficient in Fig. 2 is explained in the following sections.

3.2.1 Low-pass filter: High-frequency background noise recorded with the camera microphone system is reduced by applying a low-pass filter to the audio data. A low-pass filter with a rational transfer function [31] is applied using the command ‘filter’ in Matlab. The filter uses 0.05 as the numerator coefficient, which is equivalent to averaging the audio signal with a window size of 20 samples.

3.2.2 MFCC coefficients: Audio features are extracted from the filtered audio data as MFCC coefficients, computed using Matlab [32]. Audio features are computed for each 200 ms audio frame (the audio is divided into audio frames). The 13 (0–12) MFCC coefficients representing audio features are computed for each frame. These coefficients are arranged in increasing order of frequency, where the zeroth coefficient is the average of the 12 coefficients. Lower-order coefficients (e.g. the 1st MFCC coefficient) represent low-frequency audio features and higher-order coefficients (e.g. the 12th MFCC coefficient) represent high-frequency audio features [4].

3.2.3 First MFCC coefficients: A normally moving human produces a footstep sound every 0.25–1 s. In terms of frequency, footsteps are generated at 1–4 Hz. This is low-frequency sound, so the lower-order MFCC coefficients are useful. The first MFCC coefficients of the audio features are used to detect footstep sound. A plot of time versus the first MFCC coefficients shows that the lower peaks (below the average of the first coefficients) correspond to the times when footstep sound is produced (Fig. 3). These peaks are detected, and their start and end times are calculated as described in the next section.

3.2.4 Peak locations: The average of the first MFCC coefficients is calculated and the coefficient values below the average are found. These coefficients form the peaks of the lower half of a sinusoidal wave (Fig. 3), starting from the average line of the coefficients. As shown in Fig. 3, start and end timestamps for each peak are computed at the average line. The coefficients are discrete in time and may not have a value exactly at the average line. In these cases, the start and/or end timestamps of peaks are estimated by finding the intersection point of two lines: the average line and the line joining the two adjacent coefficients above and below the average. The duration of a footstep sound is computed using the start and end timestamps of the detected peaks.

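Section 3.2.1 equates the Matlab filter's 0.05 numerator coefficient with a 20-sample averaging window. A minimal causal moving-average sketch of that smoothing step, in illustrative Python rather than the paper's Matlab:

```python
def moving_average(x, window=20):
    """Causal moving average: each output sample is the mean of the
    current and up to (window - 1) previous input samples, mirroring
    the 20-sample averaging described in Section 3.2.1."""
    out = []
    acc = 0.0
    for i, v in enumerate(x):
        acc += v
        if i >= window:
            acc -= x[i - window]      # drop the sample leaving the window
        out.append(acc / min(i + 1, window))
    return out

# a unit impulse is smeared into a low, flat response of height 1/window,
# which is how the filter suppresses short high-frequency spikes
impulse = [1.0] + [0.0] * 39
smoothed = moving_average(impulse, window=20)
# smoothed[19] == 0.05; smoothed[25] == 0.0 (impulse has left the window)
```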
Table 1 Camera and audio details
Camera             iPhone4, 5 MP still camera [34]   iPhone5s, 8 MP still camera [35]
video resolution   568 × 320                         1920 × 1080
video frame rate   30 fps                            30 fps
audio bit rate     65 kbps                           63 kbps
audio sample rate  44.1 kHz                          44.1 kHz
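The peak-location step of Section 3.2.4 (below-average runs of the first MFCC coefficients, with start/end timestamps interpolated at the average-line crossings) can be sketched as follows. This is an illustrative Python rendering under stated assumptions: the function name is hypothetical, and one coefficient per 0.2 s audio frame is assumed, as in Section 3.2.2:

```python
def footstep_peaks(coeffs, frame_s=0.2):
    """Detect below-average runs of first-MFCC values (Section 3.2.4).
    Start/end timestamps are linearly interpolated where the series
    crosses its own average line, as in Fig. 3."""
    avg = sum(coeffs) / len(coeffs)
    peaks, start = [], None
    for i, c in enumerate(coeffs):
        below = c < avg
        if below and start is None:
            if i == 0:
                start = 0.0
            else:  # downward crossing between samples i-1 and i
                t = (coeffs[i - 1] - avg) / (coeffs[i - 1] - c)
                start = (i - 1 + t) * frame_s
        elif not below and start is not None:
            t = (coeffs[i - 1] - avg) / (coeffs[i - 1] - c)  # upward crossing
            peaks.append((start, (i - 1 + t) * frame_s))
            start = None
    if start is not None:  # run still open at the end of the signal
        peaks.append((start, (len(coeffs) - 1) * frame_s))
    return peaks

# one dip below the average (which is 0 here) gives one footstep,
# with interpolated start/end times of 0.25 s and 0.75 s
peaks = footstep_peaks([1, 1, -3, -3, 1, 1, 1, 1])
```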

3.3 Video frames counting

Video is captured at 30 frames/s and audio is sampled at 44.1 kHz. The audio sampling rate is 1470 times higher than the video frame rate. The timestamp of each frame in the video is calculated using the video frame rate. The start and end timestamps of each footstep are known, and these are then fused with the video frame timestamps. This gives the number of video frames which fall within the start and end timestamps of each footstep sound (Fig. 2).

3.4 Missing, partially and noisy detected object

Due to slow movement of an object, light change, random noise, light sources, threshold and other camera scene conditions, the output object can be partially detected, not detected at all, noisy, or a combination thereof. Under these conditions, it is difficult to confirm the presence of a moving object in certain frames. However, the presence of footstep sound helps to detect a moving object. This section describes the methodology used to improve the detection of a moving object in frames with a noisy, missing/partially detected object.

3.4.1 Missing/partially detected moving object: It is natural that an object does not move at constant speed over time, and the object movement itself is non-uniform. The slow and non-uniform movement of an object and/or a non-ideal threshold value used in image thresholding results in a missing/partially detected object. When the presence of a footstep sound is confirmed and a frame f_t with a missing/partially detected object is found within the time duration of the footstep sound, the object in frame f_t is modelled with the help of neighbouring frames f_{t−1} and f_{t+1} with a fully detected object. Ellipses E_{t−1}(x, y, a, b, θ) and E_{t+1}(x, y, a, b, θ) are fitted to the points detected as the moving object in frames f_{t−1} and f_{t+1}, respectively. Next, the object in f_t is modelled as the average ellipse E_t(x, y, a, b, θ) of E_{t−1} and E_{t+1}. If frame f_{t+1} does not contain a fully detected object or does not fall within the duration of the present footstep sound, the next frame with a completely detected object is used in modelling the moving object as an ellipse E_t in frame f_t.

3.4.2 Noisy output: Reflections and multiple shadows due to multiple light sources close to the object cause rapid light changes in the camera scene. These generate noisy outputs which cannot be completely removed using the morphological operations applied during moving object detection (Section 3.1). Inappropriate threshold values can also generate noisy outputs. A spatial segmentation technique is developed to remove this noise.

Morphological operations: A large structuring element (5 × 5 or 7 × 7 or higher) is required to remove the accumulated noisy pixels. However, a frame-based moving object detection technique detects only the object boundary, and applying a large kernel will make the detected object disappear (or partially disappear). Due to this limitation, a 3 × 3 kernel is used for the morphological operations, and salt-and-pepper noise is further reduced.

Spatial segmentation: Due to the limitations of morphological operations, noise is not completely removed from the frame of a detected object. It is assumed that noise is distributed across the frame while multiple pixels belonging to the moving object are close to each other. Pixels in a noisy frame are spatially segmented using K-means clustering [33].

Clustering and weighting: K-means clustering is applied to the detected pixels and the image frame is divided into 20 clusters. The number of pixels in each cluster and their mean location (row, column) are determined. A weight (2) is assigned to each cluster based on the number of pixels it contains. With the assumptions made earlier, the three clusters with maximum weight are found. All the pixels of these three clusters are combined and their x and y standard deviations (s_x and s_y) are computed:

    W_cluster = (Number of detected pixels in the cluster) / (Total number of detected pixels in the frame)    (2)

Noise removal: A cluster belonging to boundary pixels of the object has more weight than clusters containing only noise pixels. If the whole object is not within one cluster, part of the object is contained in the neighbouring clusters. The maximum weight among all clusters is found (W_max), and the two-dimensional standard deviations (s_x and s_y) are used to find the clusters which belong to the moving object (3):

    Moving object cluster = Yes, if cluster weight > 0.05 × W_max and cluster distance < 3 × D_th
                            No, otherwise                                                          (3)

where the threshold distance D_th = sqrt(s_x^2 + s_y^2). The threshold weight is 5% of the maximum weight, and the threshold distance is three times the two-dimensional standard deviation of the three clusters with maximum weight.

4 Results and discussion

4.1 Camera system

Videos of a moving object were captured in indoor conditions using iPhone4 and iPhone5s mobile cameras (Table 1). The phones' inbuilt microphone systems were used to capture the audio generated by the moving object.

4.2 Camera scene

Four indoor scenes were selected and video with a single moving object was captured. Regular fluorescent light tubes were used as the light source. The object in each camera scene started moving from the far end and moved towards the camera in a zig-zag pattern at normal speed (Fig. 4). Video of a moving object in outdoor conditions can also be used if the object's footstep sound is detectable above the background noise (traffic, wind, music etc.).

4.3 Manual and Otsu thresholding for frame differencing

The frame differencing technique of moving object detection is applied to the captured videos. As discussed in Section 3.1, the threshold plays an important role in moving object detection. Both manual (trial and error) and automatic (Otsu method) thresholds are applied (Fig. 5) and the moving object detection results are evaluated. A different threshold value for each absolute difference of consecutive frames is used in the Otsu method (determined

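The frame-counting step of Section 3.3 amounts to mapping a footstep's start/end times onto 30 fps frame timestamps. A minimal Python sketch (the function name is hypothetical; frame k is assumed to occur at time k / fps):

```python
import math

def frames_in_footstep(start_s, end_s, fps=30):
    """Section 3.3: indices of the video frames whose timestamps fall
    within a footstep's start and end times; frame k occurs at k / fps."""
    first = math.ceil(start_s * fps)
    last = math.floor(end_s * fps)
    return list(range(first, last + 1))

# a footstep lasting from 0.25 s to 0.75 s covers frames 8-22 at 30 fps
covered = frames_in_footstep(0.25, 0.75)
```

Each frame index returned here is then checked against the frame differencing output, giving the per-footstep counts reported in Table 2.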
Fig. 4  Camera scenes in captured videos
(a) Video 1: object moving in indoor room 1, (b) Video 2: object moving in indoor room 2, (c) Video 3: object moving in indoor hallway 1, (d) Video 4: object moving in indoor
hallway 2
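The ellipse-averaging step of Section 3.4.1 (modelling the object in f_t from the ellipses fitted in f_{t−1} and f_{t+1}) reduces to a parameter-wise mean. An illustrative Python sketch; note that naively averaging θ ignores angle wrap-around, a simplification assumed here for small orientation changes between consecutive frames:

```python
def average_ellipse(e_prev, e_next):
    """Section 3.4.1: model the missing object in frame f_t as the
    parameter-wise average of ellipses E_{t-1} and E_{t+1}.
    An ellipse is the tuple (x, y, a, b, theta)."""
    return tuple((p + q) / 2 for p, q in zip(e_prev, e_next))

# the object moved right between the neighbouring frames, so the
# modelled ellipse for the middle frame sits midway between them
e_t = average_ellipse((100.0, 50.0, 20.0, 10.0, 0.0),
                      (110.0, 50.0, 20.0, 10.0, 0.2))
# e_t is centred at x = 105.0, with the same axes and theta = 0.1
```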

Fig. 5  Frame differencing moving object detection results using manual and Otsu thresholding. Row 1: manual threshold and row 2: Otsu threshold (frames
from left to right: video 1, frame 30; video 1, frame 184; video 2, frame 30 and video 2, frame 254). Row 3: manual threshold and row 4: Otsu threshold
(frames from left to right: video 3, frame 147; video 3, frame 482; video 4, frame 132 and video 4, frame 214)
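The cluster weighting and selection rules of (2) and (3) in Section 3.4.2 can be sketched as below. This illustrative Python assumes the K-means step has already run: each cluster is summarised by its pixel count and its distance from the centroid of the three heaviest clusters, and s_x, s_y are the standard deviations computed from those three clusters:

```python
from math import hypot

def select_object_clusters(clusters, sx, sy):
    """Eqs. (2)-(3): weight each cluster by its share of detected pixels
    (2), then keep clusters heavier than 5% of the maximum weight and
    closer than 3 * D_th, where D_th = sqrt(sx^2 + sy^2) (3).
    `clusters` maps id -> (pixel_count, distance)."""
    total = sum(n for n, _ in clusters.values())
    weights = {k: n / total for k, (n, _) in clusters.items()}
    w_max = max(weights.values())
    d_th = hypot(sx, sy)
    return sorted(k for k, (n, dist) in clusters.items()
                  if weights[k] > 0.05 * w_max and dist < 3 * d_th)

# heavy, nearby clusters survive; the light, far-away noise cluster
# fails both the weight and the distance test and is dropped
kept = select_object_clusters(
    {"A": (200, 0.0), "B": (150, 10.0), "C": (5, 80.0)}, sx=4.0, sy=3.0)
# kept == ["A", "B"]
```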

automatically using the Otsu algorithm [29]), whereas a single threshold value is used for all absolute differences of consecutive frames in manual thresholding. The optimal manual threshold is selected such that the object is detected successfully with minimal noise. The Otsu method requires less user interaction and determines the threshold based on the frame difference, but the results show that the object detection rate is better using manual thresholding than Otsu thresholding. Sample frames from the four videos are shown in Fig. 5. Compared with manual thresholding, a large number of output frames are noisy and partially detected using the Otsu threshold. As a result, manual thresholding is used with the frame differencing technique and the final outputs are analysed based on it.

4.4 Footstep sound and frame counting

The audio signal (Fig. 3) is extracted from each of the four videos and processed for footstep sound detection. Noise is removed from each audio signal by applying a low-pass filter, and audio features are computed as MFCC coefficients. The first MFCC coefficient values are processed using the methodology discussed in Sections 3.2 and 3.3. For each audio signal, the start and end timestamps of each footstep sound are fused with the timestamps of the video frames detected using the frame differencing technique. Thereafter the number of frames during each footstep sound is counted (Table 2). Based on this analysis, 166 out of 354, 260 out of 605, 205 out of 576 and 167 out of 441 frames fall within the duration of detected footstep sounds for videos 1 to 4, respectively. This confirms the presence of a moving object in these frames. Further, only 60, 176, 91 and 77 frames out of 166, 260, 205 and 167 frames, respectively, are detected with a complete moving object. This shows that 64, 32, 56 and 54% of the frames within the duration of footstep sounds for videos 1 to 4, respectively, are either noisy or have a missing/partially detected object, or a combination thereof (Fig. 6). These frames are further processed and the missing or partially detected object is modelled.

4.5 Modelling moving object in frames with partially detected, missing object and noisy output

There are frames with a missing object, a partially detected object and noisy output. An object is modelled in the frames which have a missing or partially detected moving object and which fall during the presence of footstep sound. Fig. 7 shows three consecutive frames each from videos 1 to 4. All three frames occurred during the presence of footstep sound, and the frame differencing technique partially detected the moving object in frame 167 of video 1, 349 of video 2, 360 of video 3 and 172 of video 4. In most such cases, the object in these frames is ignored as noise or eliminated by the morphological operations. As discussed in Section 3.4, the partially detected object in frame 167 of video 1 is modelled as a complete object (Fig. 7) using the object detected in the neighbouring frames 166 and 168. When one (or both) of the frames neighbouring the frame under consideration (e.g. frames 359 and 361 of video 3 in Fig. 7) is noisy, the noise is
Fig. 6  Frame differencing output frame analysis based on fully detected object falling within the duration of detected footstep sound

Table 2 Frame counts in videos during footstep sound and frame differencing detection analysis
Description Video 1 Video 2 Video 3 Video 4
video duration, s 11.07 20.10 19.12 13.62
number of footsteps 20 32 35 25
total number of video frames 354 605 576 441
number of frames within footstep 166 260 205 167
number of frames with complete object within footstep 60 176 91 77
failure rate (%) of frame differencing within footstep 64% 32% 56% 54%
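The failure rates in the last row of Table 2 follow directly from the two frame counts above them (the fraction of within-footstep frames without a completely detected object), which a short Python check reproduces:

```python
# (frames within footstep sound, frames with a complete object) per Table 2
counts = {"Video 1": (166, 60), "Video 2": (260, 176),
          "Video 3": (205, 91), "Video 4": (167, 77)}

# failure rate = share of within-footstep frames lacking a complete object
failure = {v: round(100 * (1 - complete / within))
           for v, (within, complete) in counts.items()}
# failure == {"Video 1": 64, "Video 2": 32, "Video 3": 56, "Video 4": 54}
```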

first removed using spatial segmentation. If the subsequent frame fails to yield an ellipse or does not contain a complete object, the next available frame with a detected object is used. For real-time application, moving object detection will be delayed until the next frame with a complete moving object is detected.

A noisy output frame generated by the frame differencing technique makes it difficult to find the moving object present in the frame. The noise is removed using the spatial segmentation discussed in Section 3.4. Fig. 8 shows the noisy frames 187, 3, 570 and 314 from videos 1 to 4, respectively. These frames occurred during detected footstep sound. If the noise is not removed, the object is detected as the entire image frame containing all the noisy pixels. After applying morphological operations and spatial segmentation, the noise is removed and an ellipse is fitted to represent the moving object.

4.6 Moving object detection

Using the frame differencing technique, 64, 32, 56 and 54% of the frames (within the presence of footstep sound) of videos 1 to 4, respectively, are found to be noisy or to have missing/partially detected outputs, or a combination thereof. Footstep sound is used to confirm the presence of a moving object in these frames. The moving object is then modelled as discussed in Section 4.5. Fig. 9 shows video frames with the moving object modelled as an ellipse. All the frames are extracted during the presence of footstep sound in the respective videos. Neighbourhood frames with a completely detected object are used to model the object (as an ellipse) in frames with a missing/partially detected object. For noisy frames, the object is extracted and modelled as an ellipse using spatial segmentation.

As discussed in Section 2, similar problems of missing/partially detected moving objects and noisy outputs also exist with other frame-based methods of moving object detection. Video frames with missing, partial and noisy outputs using optical flow and wavelet transform based moving object detection techniques are shown in Fig. 10.

Missing, partial and noisy outputs from optical flow and wavelet transform based techniques can also be improved using footstep sound. While different moving object detection techniques have different complexities and take different processing times, the processing steps and computation time for the application of footstep sound remain the same. For video 1, out of the 166 frames occurring during the presence of footstep sound, a complete object is detected in 105 frames using the optical flow based technique, in 117 frames using the wavelet transform based technique, and in 60 frames using the frame differencing technique. With the application of

Fig. 7  Moving object detection and modelling of missing/partially/noisy moving object using neighbouring frames. Three consecutive frames within the
presence of footstep sound are shown. Object in the middle frame is modelled using previous and next frames. Rows 1 and 2: video 1, frames 166, 167 and
168. Rows 3 and 4: video 2, frames 348, 349 and 350. Rows 5 and 6: video 3, frames 359, 360 and 361. Rows 7 and 8: video 4, frames 171, 172 and 173

footstep sound, a complete moving object is detected in all 166 frames of video 1. A similar analysis applied to videos 2, 3 and 4 leads to the conclusion that, on average, the frame differencing technique is improved by 52%, compared with 31% for optical flow and 45% for wavelet transform based techniques (Table 3).

Thus, the frame differencing technique performs poorly in moving object detection compared with other frame-based techniques, but this improves significantly with the application of footstep sound. Detection of footstep sound makes it certain that the object is present in the camera scene, and the results are based only on the duration of detected footstep sound. The remaining frames, whose timestamps do not coincide with footstep sound, are not considered. The performance and results may vary among different video and scene conditions.

5 Conclusions

Frame differencing based moving object detection is improved by using footstep sound when the former fails to detect the moving object, partially detects the object, or when the frame is noisy. By computing the start and end timestamps of each footstep sound, it is made certain that the moving object is present irrespective of the frame differencing results. The video frames within the duration of footstep sound are found and analysed for a missing/partially detected object and noisy outputs. A missing/partially detected object in frame f_t is modelled as a complete object using the ellipses from the neighbourhood frames f_{t−1} and f_{t+1}. An object is modelled as an ellipse fitted using the least-square technique on the pixels detected as the moving object. When the object is not detected in the neighbouring frames, the next available frame with a fitted ellipse (completely detected object) is used. For frames with noisy outputs, the object is found by removing the noise with spatial segmentation. The application of spatial segmentation to remove noise and model the missing object can also be extended to frames between two consecutive footsteps if the presence of a moving object is confirmed; as a result, the frame differencing technique can be further improved. The methodology discussed in this paper used a camera with an inbuilt microphone, and the application of footstep sound depends on its being generated by the moving object. There are circumstances when footstep sound is generated by off-scene objects. The accuracy and certainty of the presence of a moving object in a camera scene can be further improved by using advanced directional microphones, which can record sound only from the camera scene. This may require additional footstep sound analysis, sensors or computation power to confirm the presence of a moving object in the camera scene. The discussed method can also be used with other moving object

Fig. 8  Noisy frames and noise removal using spatial segmentation. Row 1: frame 187 of video 1, row 2: frame 3 of video 2, row 3: frame 570 of video 3, and row 4: frame 314 of video 4. Column 1: frame differencing output frame, column 2: spatially segmented pixels, column 3: frames after removing noise, and column 4: object modelled after noise removal
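The spatial segmentation used in Fig. 8 to separate the object from noise can be realised in several ways; the paper's references include k-means++ clustering. As an illustrative stand-in (not the paper's exact algorithm, and with function names of our own choosing), a connected-component pass that keeps only the largest group of foreground pixels conveys the same idea:

```python
from collections import deque

def largest_component(pixels):
    # Group foreground pixels into 8-connected components and keep
    # only the largest one; smaller components are treated as noise.
    remaining = set(pixels)
    best = set()
    while remaining:
        seed = remaining.pop()
        comp = {seed}
        queue = deque([seed])
        while queue:
            x, y = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    n = (x + dx, y + dy)
                    if n in remaining:
                        remaining.remove(n)
                        comp.add(n)
                        queue.append(n)
        if len(comp) > len(best):
            best = comp
    return best

# A 5-pixel blob (the object) plus two isolated noise pixels
mask = [(10, 10), (10, 11), (11, 10), (11, 11), (12, 11), (0, 0), (30, 5)]
clean = largest_component(mask)
print(sorted(clean))
```

On the synthetic mask above, the two isolated pixels are discarded and the 5-pixel blob survives, mirroring columns 2 and 3 of Fig. 8.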

Fig. 9  Object modelled as an ellipse for frames with a missing/partially detected object and noisy outputs. Row 1: video 1, frames 27, 74, 199, 234. Row 2: video 2, frames 260, 308, 316, 385. Row 3: video 3, frames 250, 270, 419, 540. Row 4: video 4, frames 21, 133, 150, 332
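The ellipse modelling shown in Fig. 9 fits an ellipse to the detected pixels by least squares. A sketch of one common algebraic formulation (fit the conic a·x² + b·xy + c·y² + d·x + e·y = 1, then read off the centre) follows; this is our illustration, not the exact fit_ellipse routine the paper uses:

```python
import numpy as np

def fit_ellipse_lstsq(xs, ys):
    # Fit the conic a*x^2 + b*x*y + c*y^2 + d*x + e*y = 1 to the
    # detected pixel coordinates by ordinary least squares.
    x = np.asarray(xs, dtype=float)
    y = np.asarray(ys, dtype=float)
    design = np.column_stack([x * x, x * y, y * y, x, y])
    coeffs, *_ = np.linalg.lstsq(design, np.ones_like(x), rcond=None)
    return coeffs  # (a, b, c, d, e)

def ellipse_centre(coeffs):
    # Centre of the conic: the stationary point of the quadratic form,
    # obtained by solving a 2x2 linear system.
    a, b, c, d, e = coeffs
    return np.linalg.solve(np.array([[2 * a, b], [b, 2 * c]]),
                           np.array([-d, -e]))

# Synthetic "detected pixels" on an ellipse centred at (40, 25)
t = np.linspace(0, 2 * np.pi, 100, endpoint=False)
cx, cy = ellipse_centre(fit_ellipse_lstsq(40 + 12 * np.cos(t),
                                          25 + 7 * np.sin(t)))
print(round(float(cx), 2), round(float(cy), 2))  # 40.0 25.0
```

In the paper's setting, the fitted ellipses from frames ft−1 and ft+1 would then be combined to model the missing object in frame ft.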

detection techniques (including background-based), depending on the application, algorithm complexity and payload. This technique can be extended to multiple moving objects in the camera scene which produce footstep sound at a high frequency. Higher frequency footstep sound can be analysed using higher-order MFCC coefficients.

6 Acknowledgment

This research was sponsored by the Atlantic Innovation Fund (AIF) of the Atlantic Canada Opportunities Agency (ACOA) with Yun Zhang as the Principal Investigator.

Fig. 10  Output frames of moving object detection using the optical flow and wavelet transform based techniques. Sample video frames within the duration of footstep sound from videos 1 to 4. Row 1: output frames using optical flow; row 2: output frames using wavelet transform
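Selecting the video frames that fall within a detected footstep's duration reduces to mapping the sound's start and end timestamps to frame indices. A minimal sketch, assuming a fixed frame rate (the function name is ours, not from the paper):

```python
def frames_within_footstep(t_start, t_end, fps, n_frames):
    # Indices of video frames whose timestamps fall inside one detected
    # footstep sound (t_start..t_end in seconds), clipped to the video.
    first = max(0, int(t_start * fps))
    last = min(n_frames - 1, int(t_end * fps))
    return list(range(first, last + 1))

# 30 fps video, footstep sound detected from 1.0 s to 1.5 s
frames = frames_within_footstep(1.0, 1.5, 30, 500)
print(frames[0], frames[-1], len(frames))  # 30 45 16
```

Only these frames are analysed further; frames outside every footstep interval are ignored, as described in the text.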

Table 3 Improvements to the frame-based moving object detection techniques when footstep sound is detected

                                               Frame differencing     Optical flow           Wavelet transform
video                                          1    2    3    4       1    2    3    4       1    2    3    4
object detected frames within footstep sound   60   176  91   77      105  202  116  129     117  142  83   92
video frames within footstep sound             166  260  205  167     166  260  205  167     166  260  205  167
improvement, %                                 64   32   56   54      37   22   43   23      30   45   60   45
average improvement, %                         52                     31                     45
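The improvement figures in Table 3 correspond to the share of footstep-sound frames in which the visual technique alone missed the object, i.e. the frames the audio cue helps recover. A short Python check (the function name is ours) reproduces the frame differencing row:

```python
def improvement_pct(detected, total):
    # Share of footstep-sound frames in which the visual technique
    # alone missed the object (the frames the audio cue recovers).
    return round(100 * (total - detected) / total)

# Frame differencing, videos 1-4: (detected, total) pairs from Table 3
fd = [(60, 166), (176, 260), (91, 205), (77, 167)]
per_video = [improvement_pct(d, t) for d, t in fd]
print(per_video)                               # [64, 32, 56, 54]
print(round(sum(per_video) / len(per_video)))  # 52
```

The same computation on the optical flow and wavelet transform columns yields the quoted 31% and 45% averages.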

