
Expert Systems With Applications 183 (2021) 115400


Car crash detection using ensemble deep learning and multimodal data
from dashboard cameras
Jae Gyeong Choi a,1, Chan Woo Kong a,2, Gyeongho Kim a,3, Sunghoon Lim a,b,*,4

a Department of Industrial Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
b Institute for the 4th Industrial Revolution, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea

ARTICLE INFO

Keywords: Dashboard camera; Car crash; Multimodal data; Deep learning; Ensemble technique

ABSTRACT

Due to the increase in motor vehicle accidents, there is a growing need for high-performance car crash detection systems. The authors of this research propose a car crash detection system that uses both video data and audio data from dashboard cameras in order to improve car crash detection performance. While most existing car crash detection systems depend on single modal data (i.e., video data or audio data only), the proposed car crash detection system uses an ensemble deep learning model based on multimodal data (i.e., both video and audio data), because different types of data extracted from one information source (e.g., dashboard cameras) can be regarded as different views of the same source. These different views complement one another and improve detection performance, because one view may have information that the other view does not contain. In this research, deep learning techniques, gated recurrent unit (GRU) and convolutional neural network (CNN), are used to develop a car crash detection system. A weighted average ensemble is used as an ensemble technique. The proposed car crash detection system, which is based on multiple classifiers that use both video and audio data from dashboard cameras, is validated using a comparison with single classifiers that use video data or audio data only. Car accident YouTube clips are used to validate this research. The experimental results indicate that the proposed car crash detection system performs significantly better than single classifiers. It is expected that the proposed car crash detection system can be used as part of an emergency road call service that recognizes traffic accidents automatically and allows immediate rescue after transmission to emergency recovery agencies.

1. Introduction

Motor vehicle accidents have consistently had serious consequences, including loss of life and property (Durduran, 2010; Gang & Zhuping, 2011). In particular, while the number of traffic accidents is in relative decline, traffic accidents are still one of the leading causes of death worldwide (Heron, 2018; Sarraf & McGuire, 2020). There are two major technical solutions that help resolve this issue: collision avoidance systems (CASs) (Habib & Ridella, 2017; Milanés et al., 2012b; Yadav et al., 2020) and emergency road call services (Smirnov et al., 2013; White et al., 2011).

First, a CAS, which can be installed in motor vehicles, senses the distances between vehicles through sensors, then adjusts the accelerator or automatically applies the brakes for emergency braking (Chang et al., 2010; Milanés et al., 2012a). Installing a front CAS in every motor vehicle is legislated in Europe and Japan, while installing a lane departure warning system in large-sized buses and trucks is legislated in the Republic of Korea (Ministry of Land, Infrastructure and Transport, Republic of Korea, 2018). The sensors used to prevent collisions include radar sensors, light detection and ranging (LIDAR) sensors, ultrasonic wave sensors, camera vision sensors, and sensor fusion setups that mix different kinds of sensors (Yadav et al., 2020). A front CAS contributes more to safety than other types of CASs (e.g., a back CAS, a lane departure warning system). For example, the Insurance Institute for Highway Safety announced statistics showing that a front CAS can reduce rear-end collisions by 40 percent (Habib & Ridella, 2017).

* Corresponding author at: Department of Industrial Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea. E-mail address: sunghoonlim@unist.ac.kr (S. Lim).
1 ORCID ID: 0000-0002-4733-1937.
2 ORCID ID: 0000-0001-7137-0592.
3 ORCID ID: 0000-0001-7486-8628.
4 ORCID ID: 0000-0001-9534-7397.

https://doi.org/10.1016/j.eswa.2021.115400
Received 7 July 2020; Received in revised form 7 June 2021; Accepted 8 June 2021
Available online 13 June 2021
0957-4174/© 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).

Previously, developing CASs had only been possible for major companies that could invest heavily in sensor and related software development and employ the necessary experts. However, due to the advancement of artificial intelligence, high-tech startup companies (e.g., Drive.ai (Drive.ai, 2020), comma.ai (comma.ai, 2020)) can also develop CASs based on deep learning and off-the-shelf sensors.

Second, an emergency road call service recognizes traffic accidents automatically and allows immediate rescue after transmission to emergency recovery agencies. This service is very useful, because more than 70 percent of traffic fatalities are drivers and passengers. The death rate can decrease by 6 percent if the time to contact emergency medical services is shortened (White et al., 2011). In particular, e-call terminal installation is legislated in Europe and Russia (Smirnov et al., 2013). In the event of an accident, an emergency road call service is activated, which requires car crash detection systems to operate correctly. In this research, a car crash detection system is developed as part of an emergency road call service in order to recognize traffic accidents automatically and allow immediate rescue.

Existing systems detect accidents by analyzing sound data or video data. They use machine learning/deep learning techniques, such as support vector machines (SVMs), Gaussian mixture models (GMMs), and learning vector quantization (LVQ) classifiers, (1) to determine whether or not video information indicates that an accident has occurred or (2) to classify various sounds (e.g., window breaking, screaming, tire skidding) that may occur during an accident. However, using one type of data, either video data or audio data, may not fully utilize the information that the data provide. This potentially causes a car crash detection system to perform poorly.

The proposed car crash detection system can improve this performance by using both video and audio data, which provide different but complementary views of a single accident. By dividing the data from one source into different types and distinguishing whether a motor vehicle is involved in a collision, missing information from one type of data can be supplemented by the other type. Deep learning techniques (i.e., gated recurrent unit (GRU) and convolutional neural network (CNN)) are used to develop the proposed car crash detection system, and a weighted average ensemble is used as an ensemble technique. The proposed car crash detection system can be applied in various areas, including insurance companies, police departments, and legal proceedings.

This paper is structured as follows: Section 2 provides the literature review for this research. Section 3 introduces the proposed car crash detection system using both video and audio data from dashboard cameras. Section 4 introduces the case studies, and Section 5 provides the experimental results and discussion. Section 6 concludes the paper.

2. Literature review

The literature review section illustrates literature related to (1) car crash detection using video data and (2) car crash detection using audio data.

2.1. Car crash detection using video data

Various machine learning techniques have been widely used to detect car crashes based on different kinds of video data. Ki (2007) proposes a vision-based traffic accident detection system using charge coupled device (CCD) cameras in order to detect, record, and report traffic accidents automatically. Vishnu and Rajalakshmi (2016) exploit linear discriminant analysis (LDA) and SVM for monitoring traffic using live video files from surveillance cameras. Ravindran et al. (2016) propose a novel supervised learning model based on machine vision techniques and five SVMs trained with histogram of oriented gradients (HOG) and gray-level co-occurrence matrix (GLCM) features, which successfully detects road accidents from static images. Arceda and Riveros (2018) propose a car crash detection system that combines a violent flow (ViF) descriptor and SVMs using closed-circuit television (CCTV) video data, in which crashes are detected through a CNN.

State-of-the-art deep learning techniques have recently been applied to car crash detection using video data. Chan et al. (2017) offer a dynamic-spatial-attention (DSA) recurrent neural network (RNN) model to predict car accidents from dashboard cameras. Their model is trained to distribute soft-attention to object candidates, gathering subtle cues dynamically as well as modeling the temporal dependencies of all cues to robustly predict accidents. Naidenov and Sysoev (2019) develop a car accident detecting system based on CNN using video capture recordings. Yao et al. (2019) present an unsupervised deep learning framework for traffic accident detection using dashboard cameras. In particular, their approach can detect traffic accidents by predicting traffic participant trajectories as well as their future locations.

CrashCatcher (Wagner-Kaiser, 2020) is one well-known car crash detection model that uses video data from dashboard cameras. A hierarchical recurrent neural network (HRNN), which is based on two different layers of long short-term memory (LSTM), is used to train the model. In particular, the first layer analyzes a time series of video data from dashboard cameras. The second layer encodes the results of the first layer and trains the model using the labeled video data (i.e., crash data and non-crash data).

2.2. Car crash detection using audio data

Existing studies indicate that not only video data but also different kinds of audio data on roads can be applied to detect car accidents (Crocco et al., 2016). Carletti et al. (2013) use a bag-of-words method for event detection using audio data, which can be applied to car crash detection. The first-level features of their method are computed on a short time interval, which is similar to the words of a text. The second-level features characterize a longer time interval, which is based on the actual sounds to be recognized. Foggia et al. (2015, 2016) propose a novel method to detect road accidents by analyzing sounds captured by microphones in order to identify hazardous situations on roads, such as tire skidding and car crashes. First, their method extracts a set of features that can identify the discriminant properties of the events of interest. Then, a bag-of-words approach is also exploited to detect both short and sustained events. Saggese et al. (2016) develop a sound analysis method based on the K-means algorithm and SVM in order to detect audio events in surveillance applications, including car crash detection.

In particular, CNNs provide high performance for sound classification using spectrogram images (Grill & Schlüter, 2015; Wyse, 2017). Salamon and Bello (2017) analyze spectrogram images through a CNN architecture with localized kernels for classifying environmental sounds, such as air conditioners, car horns, children playing, dogs barking, drilling, engines idling, gun shots, jackhammers, sirens, and street music. Khamparia et al. (2019) also use CNN as well as the tensor deep stacking network (TDSN) to analyze the spectrogram images of environmental sounds. Crashzam is a novel car crash detection system that uses two different types of audio data: (1) audio features and (2) spectrogram images. An ensemble machine learning technique (i.e., random forest) is used to combine the two types of audio data for car crash classification (Sammarco & Detyniecki, 2018).

While several existing studies based on the machine learning techniques above consider video data or audio data in order to detect car crashes, limited contributions have been made toward considering both video data and audio data on roads for car crash detection. Such consideration is important for improving the performance of car crash detection, because one type of data may have information that the other type of data does not contain. The main contribution of this research is to provide an ensemble deep learning-based car crash detection system that considers both video data and audio data from dashboard cameras in order to improve the performance of car crash detection.


3. Method

Fig. 1 outlines this research. First, data acquisition and data preprocessing are implemented using video data and audio data from dashboard cameras. Then, car crash detection classifiers are developed using video data and audio data, respectively. In particular, a CNN-and-GRU-based classifier is developed for car crash detection using video data from dashboard cameras. In addition, GRU-based and CNN-based classifiers are developed for car crash detection using two different types of audio data from dashboard cameras: audio features and spectrogram images, respectively. An ensemble model (i.e., a weighted average ensemble) is then developed in order to combine the three different classifiers. Finally, a classification performance evaluation is conducted by comparing the proposed car crash detection system with car crash detection models that use only one type of data.
3.1. Car crash detection using video data from dashboard cameras

A CNN-and-GRU-based classifier is proposed for car crash detection using video data from dashboard cameras. A convolutional recurrent neural network (CRNN), which is a CNN-and-GRU-based classifier, is designed for image-based sequence recognition tasks that require capturing spatial and temporal elements (Shi et al., 2017). This combined architecture has been applied to video classification as well as audio processing, such as speech recognition and music classification (Wu et al., 2015; Bartz et al., 2017; Choi et al., 2017). In particular, CNN extracts features from each frame of a video file independently, then pools their predictions across the whole video file. Because such a classifier ignores the full temporal footprint, CNN's output is connected to a time distributed layer encoded in GRU. GRU is a gating mechanism in recurrent neural networks, like LSTM, but has fewer parameters than LSTM, as it reduces the calculations for updating hidden states (Cho et al., 2014). The main concept is to capture temporal ordering and long-range dependencies in the features passing through CNN.
Since one video file consists of more than 120 frames, converting hundreds of video files into frames and returning an array that contains all values at once is likely to cause out-of-memory errors. In order to prevent these errors, a data generator is applied for the proposed CNN-and-GRU-based classifier that uses video data from dashboard cameras. Through a data generator, only one value is returned per call, which solves the memory shortage issue. However, a plain generator does not take the sequence of the frames into account, which could cause video classification problems. As a more efficient way to address the memory shortage issue and also correctly order the frames, extracting some of the frames that capture the entire flow of the sequence, rather than using the entire frame set, can be utilized. Making a set of sparsely sequential frames per video through a video generator satisfies all the above requirements (Ferlet, 2020), as illustrated in the sketch below. The required output shape of the video generator is (N, F, W, H, C), where N is the batch size, F is the number of frames per sequence, W and H are the width and height, and C is the number of channels, respectively. Fig. 2 shows two examples of sequence video generators for the proposed car crash detection classifier using video data from dashboard cameras, with a batch size of 1, 5 frames per sequence, a width and height of (122, 122), and 3 channels containing red, green, and blue (RGB) color levels.
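As an illustration of this sparse sampling strategy, the following sketch extracts a fixed number of evenly spaced frames per clip and yields batches of shape (N, F, W, H, C). It is a minimal stand-in for the video generator of Ferlet (2020), assuming OpenCV and NumPy are available; the frame count and target size match the Fig. 2 example but are otherwise illustrative.

```python
import cv2  # OpenCV, used here to decode the dashboard camera clips
import numpy as np

def sample_frames(video_path, n_frames=5, size=(122, 122)):
    """Extract n_frames evenly spaced frames so the sparse sequence
    still covers the entire flow of the clip, in temporal order."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, n_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, size) / 255.0)  # 0-1 RGB levels
    cap.release()
    return np.stack(frames)  # shape: (F, W, H, C)

def video_batch_generator(video_paths, labels, batch_size=1):
    """Yield one (N, F, W, H, C) batch per call instead of materializing
    every decoded frame at once, avoiding out-of-memory errors."""
    while True:
        for start in range(0, len(video_paths), batch_size):
            clips = [sample_frames(p) for p in video_paths[start:start + batch_size]]
            yield np.stack(clips), np.asarray(labels[start:start + batch_size])
```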

Fig. 1. Overview of this research.


Fig. 2. Two examples of sequence video generators for the proposed car crash detection model using video data from dashboard cameras.

3.2. Car crash detection using audio data from dashboard cameras

In this research, two different types of audio data from dashboard cameras (i.e., audio features and spectrogram images) are used for car crash detection.

3.2.1. Car crash detection based on audio features

In each audio signal, different audio features that contain valuable sound information can be extracted from the time (temporal) domain and the frequency (spectral) domain. Table 1 illustrates and describes the audio features most widely used for audio feature analysis that are considered in this research (Sammarco & Detyniecki, 2018). The proposed GRU-based classifier, which is robust not only to short-term dependencies but also to long-term dependencies, is developed to classify car crashes using these temporal features. The input data layer is reformatted into three-dimensional vectors comprised of samples, time stamps, and audio features to match the architecture of the GRU-based classifier. These vectors are normalized to a 0–1 range and labeled as 0 or 1 for binary classification.

Table 1
Audio features with descriptions.

Feature ID | Domain | Feature Name | Description
1 | Time Domain | Zero Crossing Rate (ZCR) | The rate of sign-changes of the signal during the duration of a particular frame
2 | Time Domain | Energy | The sum of squares of the signal values, normalized by the respective frame length
3 | Time Domain | Entropy | The entropy of sub-frames' normalized energies, which can be interpreted as a measure of abrupt changes
4 | Frequency Domain | Spectral Centroid (SC) | The center of gravity of the spectrum
5 | Frequency Domain | Spectral Spread (SS) | The second central moment of the spectrum
6 | Frequency Domain | Spectral Entropy (SE) | The entropy of the normalized spectral energies for a set of sub-frames
7 | Frequency Domain | Spectral Flux (SF) | The squared difference between the normalized magnitudes of the spectra of two successive frames
8 | Frequency Domain | Spectral Roll-off (SR) | The frequency below which 90% of the magnitude distribution of the spectrum is concentrated
9–21 | Frequency Domain | Mel Frequency Cepstral Coefficients (MFCCs) | A cepstral representation in which the frequency bands are not linear but distributed according to the mel-scale
22–33 | Frequency Domain | Chroma Vector (CV) | A 12-element representation of the spectral energy, where the bins represent the 12 equal-tempered pitch classes of western-type music
34 | Frequency Domain | Chroma Deviation | The standard deviation of the 12 chroma coefficients
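As an illustration, the sketch below computes a representative subset of the Table 1 features with librosa and stacks them into the (time stamps, features) layout expected by the GRU-based classifier. The sampling rate and hop length (chosen so a clip yields roughly 20 time stamps per second) are assumptions, and the entropy-based features and spectral flux are omitted for brevity; this is not the exact configuration of Sammarco & Detyniecki (2018).

```python
import librosa
import numpy as np

def extract_audio_features(path, sr=22050, hop_length=1102):
    """Build a (time stamps, features) matrix for one clip. A hop of
    about sr/20 gives roughly 20 frames per second, so a 4 s clip
    yields on the order of 81 time stamps."""
    y, sr = librosa.load(path, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y, hop_length=hop_length)
    energy = librosa.feature.rms(y=y, hop_length=hop_length)  # energy proxy
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop_length)
    spread = librosa.feature.spectral_bandwidth(y=y, sr=sr, hop_length=hop_length)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.9,
                                               hop_length=hop_length)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length)
    feats = np.vstack([zcr, energy, centroid, spread, rolloff, mfcc, chroma]).T
    # Min-max normalize each feature to the 0-1 range used in this research.
    mn, mx = feats.min(axis=0), feats.max(axis=0)
    return (feats - mn) / (mx - mn + 1e-8)
```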
3.2.2. Car crash detection based on spectrogram images

From each audio signal, a spectrogram image is extracted. The original spectrogram images are then converted to mel-spectrogram images and normalized to a 0–1 range. Fig. 3 shows an example of a mel-spectrogram image that illustrates a car crash. On the mel-spectrogram image, the X-coordinates indicate temporal information, the Y-coordinates indicate frequencies based on mel-scales, and amplitudes based on log-scales are expressed with different colors.

A CNN-based classifier is then developed for classifying car crashes using the spectrogram images of the audio data. In particular, a two-dimensional CNN is used for the proposed CNN-based classifier, because it handles the visual patterns of space better than a one-dimensional CNN. The input data layer is reformatted into four-dimensional tensors to match the architecture of the proposed CNN-based classifier. The input tensor is composed of batch size, height, width, and channels, respectively.
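A minimal sketch of this preprocessing step is shown below, assuming librosa is available; it converts one clip into a log-amplitude mel-spectrogram normalized to a 0–1 range. For simplicity the sketch returns a single-channel array, whereas the case studies use three-channel (color-rendered) spectrogram images.

```python
import librosa
import numpy as np

def mel_spectrogram_image(path, sr=22050, n_mels=128):
    """Convert one audio clip into a normalized mel-spectrogram
    'image' suitable as input to a 2D CNN."""
    y, sr = librosa.load(path, sr=sr)
    # Mel-scaled power spectrogram, then log (dB) amplitude as in Fig. 3.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)
    # Min-max normalize to the 0-1 range before feeding the CNN.
    img = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    return img[..., np.newaxis]  # (height, width, 1) channel axis for Conv2D
```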
3.3. Ensemble model

A weighted average ensemble is used as an ensemble model in order to combine the different types of data (i.e., video data from dashboard cameras, audio features of audio data from dashboard cameras, and spectrogram images of audio data from dashboard cameras) for the proposed car crash detection system. A weighted average ensemble is an extension of an averaging ensemble, where the contribution of each base classifier to the final prediction is weighted by the performance of the base classifier on validation data, so the weights indicate the percentage of trust or expected performance from each base classifier (Acar, 2015). An ensemble model takes every single output of the base classifiers as a training instance in order to find the appropriate weights that maximize the output of the ensemble model. In many cases, the performance of a weighted average ensemble is better than the performance of a simple average ensemble, which allocates a uniform weight to each classifier (Escalante et al., 2018). In this research, weight optimization is performed with randomized search based on the Dirichlet distribution on validation data.
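The following sketch illustrates this randomized Dirichlet weight search, assuming the validation predictions of each base classifier are available as probability arrays; the number of trials and the use of scikit-learn's roc_auc_score are illustrative choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def search_ensemble_weights(val_preds, val_labels, n_trials=1000, seed=0):
    """Randomized search over Dirichlet-distributed weight vectors.
    val_preds: list of per-classifier probability arrays on validation data."""
    rng = np.random.default_rng(seed)
    preds = np.stack(val_preds)                      # (n_classifiers, n_samples)
    best_weights, best_auc = None, -np.inf
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(len(val_preds)))   # non-negative, sums to 1
        ensemble = np.tensordot(w, preds, axes=1)    # weighted average prediction
        auc = roc_auc_score(val_labels, ensemble)
        if auc > best_auc:
            best_weights, best_auc = w, auc
    return best_weights, best_auc
```

Once the search finishes, the selected weights are fixed and applied to the base classifiers' test-set predictions.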
4. Applications

This section introduces two case studies involving real-world car accident YouTube clips that include video data and audio data from dashboard cameras. In particular, YouTube clips from various channels, such as Car Crashes Time (2020) and J Utah (2020), are collected by the authors for these case studies. They consist of positive clips, which contain car crashes, and negative clips, which do not contain car crashes. Fig. 4 shows sample screenshots of negative (left) and positive (right) clips, respectively. The second case study also uses YouTube clips illustrating situations that are near crashes in order to identify whether or not the proposed car crash detection system can classify crashes and near crashes, while the first case study does not use them. The lengths and video frame rates of all clips are set to 4 s and 30 fps, respectively, through data preprocessing. The values of the audio features of each clip are normalized to a 0–1 range.

4.1. Case Study 1 (not containing clips with near crashes)

A total of 300 YouTube clips consisting of 150 positive clips and 150 negative clips is used for the first case study. The 150 positive and 150 negative clips are partitioned into training (60%), validation (20%), and test (20%) sets for five-fold cross-validation.

4.2. Case Study 2 (containing clips with near crashes)

In addition to the 300 YouTube clips used in the first case study, an additional 100 clips illustrating situations that are near crashes (i.e., negative clips) and an additional 100 positive clips, for a total of 500 YouTube clips, are used for the second case study. The 250 positive and 250 negative clips, including the 100 clips with near crashes, are also partitioned into training (60%), validation (20%), and test (20%) sets for five-fold cross-validation.
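One plausible reading of this 60/20/20 five-fold scheme is to rotate five equally sized folds so that each fold serves once as the test set and once as the validation set; the rotation rule below is an assumption for illustration, not the authors' stated procedure.

```python
import numpy as np

def five_fold_splits(n_samples, seed=0):
    """Rotate five 20% folds: in each iteration three folds train (60%),
    one validates (20%), and one tests (20%)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, 5)
    for k in range(5):
        test = folds[k]
        val = folds[(k + 1) % 5]
        train = np.concatenate(
            [folds[i] for i in range(5) if i not in (k, (k + 1) % 5)])
        yield train, val, test
```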


Fig. 3. An example of a mel-spectrogram image illustrating car crashes.

Fig. 4. Sample screenshots of negative (left) and positive (right) clips.

A CNN-and-GRU-based classifier, which is used for car crash detection using video data from dashboard cameras in these case studies, mainly consists of three parts (i.e., CNN, a time distributed layer, and GRU). First, CNN is composed of five Conv2D layers with a rectified linear unit (ReLU) activation function, four batch normalization layers with 0.9 momentum, three max pooling layers, and a global max pooling layer, which reduces the outputs to one dimension by taking only the maximum values from the last convolution. The input shape of CNN is (112, 112, 3), which indicates the width and height of the image and the number of channels, respectively. The output shape is (16, 512), which represents the batch size and the filter size of the last layer in CNN, respectively. Second, CNN is injected into the time distributed layer. The time distributed layer, which requires an input shape of (16, 112, 112, 3) indicating the batch size, width, height, and number of channels, respectively, is connected to a GRU layer for treating time series. Finally, the rest consists of a GRU layer and five dense layers with ReLU and softmax activation functions, with three dropout layers. For the loss function, binary focal loss is implemented in Keras as a custom loss function with γ set to 2 and α set to 0.25. Binary focal loss lowers the loss of the class that is relatively easy to classify, which has the net effect of putting more training emphasis on the data that are hard to classify. It dramatically reduces the loss from about 0.8 to 0.1. The input shape of the GRU layer is (16, 20, 512), which are the batch size, the number of frames, and the filter size of the last layer in CNN, respectively. The output is fully connected to the two different classes (i.e., positive and negative) in order to predict the result.
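A condensed Keras sketch of this classifier is given below. The Conv2D filter counts and dense-layer widths are assumptions (the text specifies the layer types and counts but not all sizes), and the five dense layers with a two-way softmax are collapsed into a shorter head with a single sigmoid output; the binary focal loss follows the stated γ = 2 and α = 0.25.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def binary_focal_loss(gamma=2.0, alpha=0.25):
    """Custom Keras loss that down-weights easy examples so training
    emphasizes clips that are hard to classify."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss

def build_frame_cnn():
    """Per-frame extractor: five Conv2D layers, four batch normalization
    layers, three max pooling layers, and a global max pooling layer
    that turns each frame into a single 512-d vector."""
    inp = layers.Input(shape=(112, 112, 3))
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inp)
    x = layers.BatchNormalization(momentum=0.9)(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization(momentum=0.9)(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization(momentum=0.9)(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization(momentum=0.9)(x)
    x = layers.Conv2D(512, 3, padding="same", activation="relu")(x)
    x = layers.GlobalMaxPooling2D()(x)
    return models.Model(inp, x)

def build_crnn(n_frames=20):
    """TimeDistributed applies the frame CNN to each of the n_frames;
    the GRU then captures the temporal ordering across frames."""
    clip = layers.Input(shape=(n_frames, 112, 112, 3))
    x = layers.TimeDistributed(build_frame_cnn())(clip)
    x = layers.GRU(64)(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    x = layers.Dense(1, activation="sigmoid")(x)  # crash probability
    model = models.Model(clip, x)
    model.compile(optimizer="adam", loss=binary_focal_loss(gamma=2.0, alpha=0.25))
    return model
```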
A GRU-based classifier, which is used for car crash detection using audio features of audio data from dashboard cameras in these case studies, consists of two GRUs with 64 memory cells, three dropout layers with dropout rates of 0.2, and two dense layers with ReLU and softmax activation functions, respectively. Adam and binary cross-entropy are used as the optimizer and the loss function, respectively. The batch size, the number of epochs, and the learning rate are set to 16, 50, and 0.01, respectively. The input shape of GRU for the first case study is (300, 81, 34), which indicates the number of samples, the number of time stamps, and the number of features, respectively. In a similar manner, the input shape of GRU for the second case study is (500, 81, 34). There are 20 time stamps per second and each sample is about 4 s long, so the number of time stamps is set to 81.

A CNN-based classifier, which is used for car crash detection using spectrogram images of audio data from dashboard cameras in these case studies, consists of two Conv2Ds, two max pooling layers, two dropout layers with dropout rates of 0.25 and 0.5, respectively, and two dense layers with ReLU and softmax activation functions, respectively. Adam and binary cross-entropy are used as the optimizer and the loss function, respectively. The batch size, the number of epochs, and the learning rate are set to 16, 50, and 0.01, respectively. The input shape for the first case study is expressed as (300, height, width, 3), which are the number of samples, the height of the spectrogram image, the width of the spectrogram image, and the number of channels, respectively. In a similar manner, the input shape for the second case study is expressed as (500, height, width, 3).
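For completeness, minimal Keras sketches of the two audio classifiers are shown below, following the layer counts and hyperparameters stated above; the dense-layer widths and the spectrogram input size are assumptions.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import Adam

def build_audio_feature_gru(n_timestamps=81, n_features=34):
    """GRU-based classifier over the (time stamps, features) matrices of
    Section 3.2.1: two GRUs with 64 memory cells, three dropout layers
    with rate 0.2, and two dense layers (ReLU, then softmax)."""
    model = models.Sequential([
        layers.Input(shape=(n_timestamps, n_features)),
        layers.GRU(64, return_sequences=True),
        layers.Dropout(0.2),
        layers.GRU(64),
        layers.Dropout(0.2),
        layers.Dense(32, activation="relu"),    # width is an assumption
        layers.Dropout(0.2),
        layers.Dense(2, activation="softmax"),  # expects one-hot labels
    ])
    model.compile(optimizer=Adam(learning_rate=0.01), loss="binary_crossentropy")
    return model

def build_spectrogram_cnn(height=128, width=173, channels=3):
    """2D CNN over the mel-spectrogram images of Section 3.2.2; the input
    height/width depend on how the spectrograms are rendered."""
    model = models.Sequential([
        layers.Input(shape=(height, width, channels)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.25),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),   # width is an assumption
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer=Adam(learning_rate=0.01), loss="binary_crossentropy")
    return model
```

Both sketches would be trained with a batch size of 16 for 50 epochs, as stated above.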


The performance of the proposed car crash detection system using all three types of data (i.e., video data from dashboard cameras, audio features of audio data from dashboard cameras, and spectrogram images of audio data from dashboard cameras) is compared with the performance of models using only one or two types of data. In these case studies, Methods 1, 2, and 3 are defined as the results of a CNN-and-GRU-based classifier using video data only, the results of a GRU-based classifier using audio features of audio data only, and the results of a CNN-based classifier using spectrogram images of audio data only, respectively. Method 4 is defined as the results of a weighted average ensemble based on a CNN-and-GRU-based classifier using video data and a GRU-based classifier using audio features of audio data. Method 5 is defined as the results of a weighted average ensemble based on a CNN-and-GRU-based classifier using video data and a CNN-based classifier using spectrogram images of audio data. Method 6 is defined as the results of a weighted average ensemble based on a GRU-based classifier using audio features of audio data and a CNN-based classifier using spectrogram images of audio data. Finally, Method 7 is defined as the results of the proposed car crash detection system, which uses a weighted average ensemble on the basis of a CNN-and-GRU-based classifier using video data, a GRU-based classifier using audio features of audio data, and a CNN-based classifier using spectrogram images of audio data. Table 2 illustrates the seven methods and the existing state-of-the-art car crash detection models that are used for the case studies. A Receiver Operating Characteristic - Area Under the Curve (ROC-AUC) score is used as the performance measure in these case studies.

Table 2
Seven methods and existing state-of-the-art car crash detection models that are used for the case studies.

Method | Model
1 | A CNN-and-GRU-based classifier using video data only
2 | A GRU-based classifier using audio features of audio data only
3 | A CNN-based classifier using spectrogram images of audio data only
4 | A CNN-and-GRU-based classifier using video data and a GRU-based classifier using audio features of audio data
5 | A CNN-and-GRU-based classifier using video data and a CNN-based classifier using spectrogram images of audio data
6 | A GRU-based classifier using audio features of audio data and a CNN-based classifier using spectrogram images of audio data
7 (proposed) | A CNN-and-GRU-based classifier using video data, a GRU-based classifier using audio features of audio data, and a CNN-based classifier using spectrogram images of audio data
Ghosh et al. | An existing state-of-the-art unimodal car crash detection model (Ghosh et al., 2019), which is a CNN-and-LSTM-based classifier using video data only, for comparison with Method 1
Sammarco & Detyniecki (A) | An existing state-of-the-art unimodal car crash detection model (Sammarco & Detyniecki, 2018), which is a random forest classifier using audio features of audio data only, for comparison with Method 2
Sammarco & Detyniecki (B) | An existing state-of-the-art unimodal car crash detection model (Sammarco & Detyniecki, 2018), which is a random forest classifier using spectrogram images of audio data only, for comparison with Method 3

5. Results and discussion

Table 3
The classification results of Case Study 1 (not containing YouTube clips with near crashes) and Case Study 2 (containing YouTube clips with near crashes).

Method | ROC-AUC (Case Study 1) | ROC-AUC (Case Study 2)
1 | 87.80 | 79.89
2 | 94.41 | 83.69
3 | 97.13 | 87.16
4 | 96.93 | 86.82
5 | 97.96 | 89.04
6 | 98.11 | 88.84
7 (proposed) | 98.60 | 89.86
Ghosh et al. | 83.33 | 72.22
Sammarco & Detyniecki (A) | 90.00 | 82.00
Sammarco & Detyniecki (B) | 95.56 | 87.00

Table 3 and Fig. 5 show the classification results of Case Study 1 and Case Study 2 based on ROC-AUC. According to Table 3 and Fig. 5, for both Case Study 1, which does not contain YouTube clips with near crashes, and Case Study 2, which contains YouTube clips with near crashes, the proposed car crash detection system (i.e., Method 7) provides high classification performance (i.e., ROC-AUC = 98.60 for Case Study 1 and ROC-AUC = 89.86 for Case Study 2). It is expected that the proposed car crash detection system can be used as part of an emergency road call service that recognizes traffic accidents automatically and allows immediate rescue, because the proposed car crash detection system can classify crashes and non-crashes, including near crashes, with high accuracy. On the other hand, the classification results of Case Study 2 are lower than the classification results of Case Study 1 for all methods. Future work will therefore improve the performance of the classification of crashes and near crashes.

Table 3 and Fig. 5 also indicate the classification performances of Method 1. The classification performances of the CNN-and-GRU-based classifier that uses video data from dashboard cameras only (i.e., ROC-AUC = 87.80 for Case Study 1 and ROC-AUC = 79.89 for Case Study 2) are substantially worse than the classification performances of the other models (i.e., Methods 2–7). In contrast, the classification performances of Method 2, the GRU-based classifier that uses audio features from dashboard cameras only (i.e., ROC-AUC = 94.41 for Case Study 1 and ROC-AUC = 83.69 for Case Study 2), and the classification performances of Method 3, the CNN-based classifier that uses spectrogram images from dashboard cameras only (i.e., ROC-AUC = 97.13 for Case Study 1 and ROC-AUC = 87.16 for Case Study 2), are significantly higher than the classification performances of Method 1 for both case studies. It is therefore concluded that both types of audio data from dashboard cameras (i.e., audio features and spectrogram images) are more significant for car crash detection in this research than video data from dashboard cameras. It is postulated that extracting 11 different types of hand-crafted audio features already known as significant features for processing audio data for car crash detection with the GRU-based classifier (see Table 1), as well as using the preprocessing and feature extraction of the CNN-based classifier with spectrogram images (i.e., converting the original spectrogram images to mel-spectrogram images), improves classification performances. On the other hand, it is known that the task of understanding video data does not have well-performing approaches, because the pixels representing objects in video data include both temporal and spatial components, whereas audio processes only include temporal factors (Conway et al., 2021). In short, video representation is a much more complex task and thus still does not have as high a level of understanding as that of audio data. Further investigation is necessary.
Table 3 and Fig. 5 also illustrate that the classification performances of Method 4, the ensemble model that uses video data and audio features from dashboard cameras (i.e., ROC-AUC = 96.93 for Case Study 1 and ROC-AUC = 86.82 for Case Study 2), Method 5, the ensemble model that uses video data and spectrogram images from dashboard cameras (i.e., ROC-AUC = 97.96 for Case Study 1 and ROC-AUC = 89.04 for Case Study 2), and Method 6, the ensemble model that uses audio features and spectrogram images from dashboard cameras (i.e., ROC-AUC = 98.11 for Case Study 1 and ROC-AUC = 88.84 for Case Study 2), are better than the classification performances of the base classifiers (i.e., Methods 1, 2, and 3). In addition, the differences between the classification performances of Method 7, the proposed car crash detection system that uses video data, audio features, and spectrogram images (i.e., ROC-AUC = 98.60 for Case Study 1 and ROC-AUC = 89.86 for Case Study 2), and the classification performances of Methods 5 and 6 are not significant. Based on these results, it is postulated that using base classifiers that provide high performances (e.g., Method 3) is substantial for developing an ensemble model that provides high performance (e.g., Methods 5, 6, and 7).

An additional experiment is conducted to compare the performances of Methods 1, 2, 3, and 7 with several state-of-the-art deep learning models for car crash detection. To the best of our knowledge, multimodal car crash detection models that use both video and audio data have not been implemented in existing research, so three different state-of-the-art unimodal car crash detection models, one model that only utilizes video data (Ghosh et al., 2019) and two models that only utilize audio data (Sammarco & Detyniecki, 2018), are implemented using the same dataset used in this research (see Table 2). Table 3 and Fig. 6 also show the performance comparison results of Methods 1, 2, 3, and 7 as well as the existing state-of-the-art unimodal car crash detection models.


Fig. 5. The classification results of Case Study 1 (not containing YouTube clips with near crashes) and Case Study 2 (containing YouTube clips with near crashes).

Fig. 6. Performance comparison with existing state-of-the-art unimodal car crash detection models.

Table 3 and Fig. 6 indicate that all the base classifiers utilizing video data or audio data (i.e., Methods 1, 2, and 3) proposed in this research show higher classification performances for both Case Study 1 and Case Study 2 compared with the existing state-of-the-art unimodal car crash detection models. The results of the state-of-the-art unimodal car crash detection model that only uses video data (Ghosh et al., 2019) also show that video-based classification performances are worse than audio-based classification performances. In addition, especially in Case Study 2, the video-based classifier's lower performances (i.e., Method 1) can be attributed to its ambiguity in analyzing YouTube clips that include near crashes as well as the difficulty of interpreting video data. In turn, the proposed car crash detection system that uses three different data types (i.e., ROC-AUC = 98.60 for Case Study 1 and ROC-AUC = 89.86 for Case Study 2) has the most significant classification performances for car crash detection compared with any other model. Therefore, the proposed car crash detection system sets state-of-the-art records for car crash detection.


6. Conclusions and future work

The objective of this research is to propose a car crash detection system, based on ensemble deep learning and multimodal data from dashboard cameras, for an emergency road call service that recognizes traffic accidents automatically and allows immediate rescue after transmission to emergency recovery agencies. In particular, the proposed car crash detection system uses video data from dashboard cameras and two different types of audio data from dashboard cameras, audio features and spectrogram images, in order to improve the performance of car crash detection.

The proposed research comprises four main steps. First, data acquisition and data preprocessing using video data, audio features of audio data, and spectrogram images of audio data from dashboard cameras are conducted. Car crash detection classifiers using video data, audio features of audio data, and spectrogram images of audio data are then developed based on GRU and CNN. A weighted average ensemble is used as an ensemble technique for combining the three different classifiers. Finally, the classification performance of the proposed car crash detection system that uses three different types of data (i.e., video data and two different types of audio data) is compared with the classification performances of the base classifiers that use one type of data only and the ensemble models that use two different types of data.

Case studies involving real-world car accident YouTube clips are used to verify the proposed car crash detection system. The performance of the proposed car crash detection system that uses three different types of data (i.e., video data from dashboard cameras, audio features of audio data from dashboard cameras, and spectrogram images of audio data from dashboard cameras) is better than the performances of the base classifiers that use one type of data only and the ensemble models that use two different types of data to classify crashes and non-crashes, including near crashes. Also, the additional experiment's results demonstrate that the proposed car crash detection system establishes state-of-the-art classification performances.

The authors will develop an advanced version of the proposed car crash detection system to improve the performance of the classification of crashes and near crashes. In particular, the authors also expect to enhance the base classifier that only uses video data from dashboard cameras. Currently, this base classifier uses a frame-level feature extraction method combined with video-level understanding, with GRU incorporating temporal information. To further improve the video-based stream, a more intuitive way of applying CNN, namely a three-dimensional CNN with three-dimensional filters (i.e., two-dimensional filters for images and a one-dimensional filter for sequences), can be considered. Due to realistic limitations (e.g., data collection, computation resources, hardware constraints on dashboard cameras), applying a three-dimensional CNN would require more time to be realized (Yao et al., 2019). In addition, rather than the weighted average ensemble that is used in this research, an alternative ensemble technique can be considered to improve the performance of car crash detection in the future. The authors will also consider how the proposed deep learning ensemble system that uses video and audio data can be applied to other domains containing multimodal data, such as safety management in construction and manufacturing.

CRediT authorship contribution statement

Jae Gyeong Choi: Conceptualization, Methodology, Formal analysis, Investigation, Resources, Data curation, Validation, Visualization, Writing - original draft, Writing - review & editing. Chan Woo Kong: Formal analysis, Investigation, Resources, Data curation, Validation, Visualization. Gyeongho Kim: Investigation, Resources, Data curation, Validation, Visualization. Sunghoon Lim: Conceptualization, Methodology, Validation, Writing - original draft, Writing - review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1F1A1059346).

References

Acar, E. (2015). Effect of error metrics on optimum weight factor selection for ensemble of metamodels. Expert Systems with Applications, 42(5), 2703–2709. https://doi.org/10.1016/j.eswa.2014.11.020
Arceda, V. M., & Riveros, E. L. (2018). Fast car crash detection in video. In 2018 XLIV Latin American Computer Conference (CLEI) (pp. 632–637). https://doi.org/10.1109/CLEI.2018.00081
Bartz, C., Herold, T., Yang, H., & Meinel, C. (2017, November). Language identification using deep convolutional recurrent neural networks. In International Conference on Neural Information Processing (pp. 880–889). Springer, Cham.
Car Crashes Time. (2020, April 25). https://www.youtube.com/user/CarCrashesTime.
Carletti, V., Foggia, P., Percannella, G., Saggese, A., Strisciuglio, N., & Vento, M. (2013). Audio surveillance using a bag of aural words classifier. In 2013 10th IEEE International Conference on Advanced Video and Signal Based Surveillance (pp. 81–86). https://doi.org/10.1109/AVSS.2013.6636620
Chan, F.-H., Chen, Y.-T., Xiang, Y., & Sun, M. (2017). Anticipating accidents in dashcam videos. In Computer Vision – ACCV 2016 (Vol. 10114, pp. 136–153). Springer International Publishing. https://doi.org/10.1007/978-3-319-54190-7_9
Chang, B. R., Tsai, H. F., & Young, C.-P. (2010). Intelligent data fusion system for predicting vehicle collision warning using vision/GPS sensing. Expert Systems with Applications, 37(3), 2439–2450. https://doi.org/10.1016/j.eswa.2009.07.036
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv:1406.1078. http://arxiv.org/abs/1406.1078.
Choi, K., Fazekas, G., Sandler, M., & Cho, K. (2017, March). Convolutional recurrent neural networks for music classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2392–2396). IEEE.
comma.ai. (2020, April 25). https://comma.ai.
Conway, A. M., Durbach, I. N., McInnes, A., & Harris, R. N. (2021). Frame-by-frame annotation of video recordings using deep neural networks. Ecosphere, 12(3). https://doi.org/10.1002/ecs2.3384
Crocco, M., Cristani, M., Trucco, A., & Murino, V. (2016). Audio surveillance: A systematic review. ACM Computing Surveys, 48(4), 1–46. https://doi.org/10.1145/2871183
Drive.ai. (2020, April 25). https://drive.ai.
Durduran, S. S. (2010). A decision making system to automatic recognize of traffic accidents on the basis of a GIS platform. Expert Systems with Applications, 37(12), 7729–7736. https://doi.org/10.1016/j.eswa.2010.04.068
Escalante, H. J., Escalera, S., Guyon, I., Baró, X., Güçlütürk, Y., Güçlü, U., & van Gerven, M. (2018). Explainable and interpretable models in computer vision and machine learning. Springer International Publishing.
Ferlet, P. (2020). Keras sequence video generators (Version 1.0.13) [Source code]. https://github.com/metal3d/keras-video-generators.
Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., & Vento, M. (2016). Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Transactions on Intelligent Transportation Systems, 17(1), 279–288. https://doi.org/10.1109/TITS.2015.2470216
Foggia, P., Saggese, A., Strisciuglio, N., Vento, M., & Petkov, N. (2015). Car crashes detection by audio analysis in crowded roads. In 2015 12th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1–6). https://doi.org/10.1109/AVSS.2015.7301731
Gang, R., & Zhuping, Z. (2011). Traffic safety forecasting method by particle swarm optimization and support vector machine. Expert Systems with Applications, 38(8), 10420–10424. https://doi.org/10.1016/j.eswa.2011.02.066
Ghosh, S., Sunny, S. J., & Roney, R. (2019, March). Accident detection using convolutional neural networks. In 2019 International Conference on Data Science and Communication (IconDSC) (pp. 1–6). IEEE.
Grill, T., & Schlüter, J. (2015). Structural segmentation with convolutional neural networks MIREX submission. 3.
Habib, K., & Ridella, S. (2017). Automatic vehicle control systems (p. 13).
Heron, M. (2018). Deaths: Leading causes for 2016. 77.
J Utah. (2020, April 25). https://www.youtube.com/channel/UCBcVQr-07MH-p9e2kRTdB3A.


Khamparia, A., Gupta, D., Nguyen, N. G., Khanna, A., Pandey, B., & Tiwari, P. (2019). Sound classification using convolutional neural network and tensor deep stacking network. IEEE Access, 7, 7717–7727. https://doi.org/10.1109/ACCESS.2018.2888882
Ki, Y.-K. (2007). Accident detection system using image processing and MDR. 5.
Milanés, V., Llorca, D. F., Villagrá, J., Pérez, J., Parra, I., González, C., & Sotelo, M. A. (2012a). Vision-based active safety system for automatic stopping. Expert Systems with Applications, 39(12), 11234–11242. https://doi.org/10.1016/j.eswa.2012.03.047
Milanés, V., Pérez, J., Godoy, J., & Onieva, E. (2012b). A fuzzy aid rear-end collision warning/avoidance system. Expert Systems with Applications, 39(10), 9097–9107. https://doi.org/10.1016/j.eswa.2012.02.054
Ministry of Land, Infrastructure and Transport, Republic of Korea. (2018, January 7). https://www.molit.go.kr/USR/NEWS/m_71/dtl.jsp?lcmspage=7&id=95080994.
Naidenov, A., & Sysoev, A. (2019). Developing car accident detecting system based on machine learning algorithms applied to video recordings data. 1–12.
Ravindran, V., Viswanathan, L., & Rangaswamy, S. (2016). A novel approach to automatic road-accident detection using machine vision techniques. International Journal of Advanced Computer Science and Applications, 7(11). https://doi.org/10.14569/IJACSA.2016.071130
Saggese, A., Strisciuglio, N., Vento, M., & Petkov, N. (2016). Time-frequency analysis for audio event detection in real scenarios. In 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 438–443). https://doi.org/10.1109/AVSS.2016.7738082
Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3), 279–283. https://doi.org/10.1109/LSP.2017.2657381
Sammarco, M., & Detyniecki, M. (2018). Crashzam: Sound-based car crash detection. In Proceedings of the 4th International Conference on Vehicle Technology and Intelligent Transport Systems (pp. 27–35). https://doi.org/10.5220/0006629200270035
Sarraf, R., & McGuire, M. P. (2020). Integration and comparison of multi-criteria decision making methods in safe route planner. Expert Systems with Applications, 154, 113399. https://doi.org/10.1016/j.eswa.2020.113399
Shi, B., Bai, X., & Yao, C. (2017). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2298–2304.
Smirnov, A., Kashevnik, A., Shilov, N., Makklya, A., & Gusikhin, O. (2013). Context-aware service composition in cyber physical human system for transportation safety. In 2013 13th International Conference on ITS Telecommunications (ITST) (pp. 139–144). https://doi.org/10.1109/ITST.2013.6685535
Vishnu, V. M., & Rajalakshmi, M. (2016). Road side video surveillance in traffic scenes using map-reduce framework for accident analysis. Biomedical Research, 257–266.
Wagner-Kaiser, R. (2020, April 25). CrashCatcher. https://github.com/rwk506/CrashCatcher.
White, J., Thompson, C., Turner, H., Dougherty, B., & Schmidt, D. C. (2011). WreckWatch: Automatic traffic accident detection and notification with smartphones. Mobile Networks and Applications, 16(3), 285–303. https://doi.org/10.1007/s11036-011-0304-8
Wu, Z., Wang, X., Jiang, Y. G., Ye, H., & Xue, X. (2015). Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM International Conference on Multimedia (pp. 461–470).
Wyse, L. (2017). Audio spectrogram representations for processing with convolutional neural networks. arXiv:1706.09559. http://arxiv.org/abs/1706.09559.
Yadav, R., Dahiya, P. K., & Mishra, R. (2020). Comparative analysis of automotive radar sensor for collision detection and warning system. International Journal of Information Technology, 12(1), 289–294. https://doi.org/10.1007/s41870-018-0167-3
Yao, G., Lei, T., & Zhong, J. (2019). A review of convolutional-neural-network-based action recognition. Pattern Recognition Letters, 118, 14–22.
Yao, Y., Xu, M., Wang, Y., Crandall, D. J., & Atkins, E. M. (2019). Unsupervised traffic accident detection in first-person videos. arXiv:1903.00618. http://arxiv.org/abs/1903.00618.
