Car crash detection using ensemble deep learning and multimodal data
from dashboard cameras
Jae Gyeong Choi a,1, Chan Woo Kong a,2, Gyeongho Kim a,3, Sunghoon Lim a,b,*,4

a Department of Industrial Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
b Institute for the 4th Industrial Revolution, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea
Keywords: Dashboard camera; Car crash; Multimodal data; Deep learning; Ensemble technique

Abstract

Due to the increase in motor vehicle accidents, there is a growing need for high-performance car crash detection systems. The authors of this research propose a car crash detection system that uses both video data and audio data from dashboard cameras in order to improve car crash detection performance. While most existing car crash detection systems depend on single modal data (i.e., video data or audio data only), the proposed car crash detection system uses an ensemble deep learning model based on multimodal data (i.e., both video and audio data), because different types of data extracted from one information source (e.g., dashboard cameras) can be regarded as different views of the same source. These different views complement one another and improve detection performance, because one view may have information that the other view does not contain. In this research, deep learning techniques, gated recurrent unit (GRU) and convolutional neural network (CNN), are used to develop a car crash detection system. A weighted average ensemble is used as an ensemble technique. The proposed car crash detection system, which is based on multiple classifiers that use both video and audio data from dashboard cameras, is validated using a comparison with single classifiers that use video data or audio data only. Car accident YouTube clips are used to validate this research. The experimental results indicate that the proposed car crash detection system performs significantly better than single classifiers. It is expected that the proposed car crash detection system can be used as part of an emergency road call service that recognizes traffic accidents automatically and allows immediate rescue after transmission to emergency recovery agencies.
1. Introduction

Motor vehicle accidents have consistently had serious consequences, including loss of life and property (Durduran, 2010; Gang & Zhuping, 2011). In particular, while the number of traffic accidents is on the relative decline, traffic accidents are still one of the leading causes of death worldwide (Heron, 2016; Sarraf & McGuire, 2020). There are two major technical solutions that help resolve this issue: collision avoidance systems (CASs) (Habib & Ridella, 2017; Milanés et al., 2012b; Yadav et al., 2020) and emergency road call services (Smirnov et al., 2013; White et al., 2011).

First, a CAS, which can be installed in motor vehicles, senses the distances between vehicles through sensors, then adjusts the accelerator or automatically applies the brakes for emergency braking (Chang et al., 2010; Milanés et al., 2012a). Installing a front CAS in every motor vehicle is mandated by legislation in Europe and Japan, while installing a lane departure warning system in large-sized buses and trucks is mandated in the Republic of Korea (Ministry of Land, Infrastructure and Transport, Republic of Korea, 2018). The sensors used to prevent collisions include radar sensors, light detection and ranging (LIDAR) sensors, ultrasonic wave sensors, camera vision sensors, and sensor fusion that mixes different kinds of sensors (Yadav et al., 2020). The performance of a front CAS is more significant than that of other types of CASs (e.g., a back CAS, a lane departure warning system). For example, the Insurance Institute for Highway Safety announced statistics showing that a front CAS can reduce rear-end collisions by 40 percent (Habib & Ridella, 2017).
* Corresponding author at: Department of Industrial Engineering, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea.
E-mail address: sunghoonlim@unist.ac.kr (S. Lim).
1 ORCID ID: 0000-0002-4733-1937.
2 ORCID ID: 0000-0001-7137-0592.
3 ORCID ID: 0000-0001-7486-8628.
4 ORCID ID: 0000-0001-9534-7397.
https://doi.org/10.1016/j.eswa.2021.115400
Received 7 July 2020; Received in revised form 7 June 2021; Accepted 8 June 2021
Available online 13 June 2021
0957-4174/© 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
J.G. Choi et al. Expert Systems With Applications 183 (2021) 115400
Previously, developing CASs had only been possible for major companies that could heavily invest and employ experts in sensor and related software development. However, due to the advancement of artificial intelligence, high-tech startup companies (e.g., Drive.ai (Drive.ai, 2020), comma.ai (comma.ai, 2020)) can also develop CASs based on deep learning and off-the-shelf sensors.

Second, an emergency road call service recognizes traffic accidents automatically and allows immediate rescue after transmission to emergency recovery agencies. This service is very useful, because more than 70 percent of traffic fatalities are drivers and passengers. The death rate can decrease by 6 percent if the time to contact emergency medical service is shortened (White et al., 2011). In particular, e-call terminal installation is legislated in Europe and Russia (Smirnov et al., 2013). In the event of an accident, an emergency road call service is activated, which requires car crash detection systems to operate correctly. In this research, a car crash detection system is developed as part of an emergency road call service in order to recognize traffic accidents automatically and allow immediate rescue.

Existing systems detect accidents through analyzing sound data or video data. They use machine learning/deep learning techniques, such as support vector machines (SVMs), Gaussian mixture models (GMMs), and learning vector quantization (LVQ) classifiers, (1) to determine whether or not video information indicates that an accident has occurred or (2) to classify various sounds (e.g., window breaking, screaming, tire skidding) that may occur during an accident. However, using one type of data, either video data or audio data, may not fully utilize the information that the data give. This potentially causes the car crash detection system to perform poorly.

The proposed car crash detection system can improve this performance by using both video and audio data, which provide different but complementary views, available in a single accident. By dividing the data from one source into different types and distinguishing whether a motor vehicle is involved in a collision, missing information from one type can be supplemented by the other type of data. Deep learning techniques (i.e., gated recurrent unit (GRU) and convolutional neural network (CNN)) are used to develop the proposed car crash detection system, and a weighted average ensemble is used as an ensemble technique. The proposed car crash detection system can be applied to various areas, including insurance companies, police departments, and legal proceedings.

This paper is structured as follows: Section 2 provides the literature review for this research. Section 3 introduces the proposed car crash detection system using both video and audio data from dashboard cameras. Section 4 introduces case studies, and Section 5 provides the experimental results and discussion. Section 6 concludes the paper.

2. Literature review

The literature review section illustrates literature related to (1) car crash detection using video data and (2) car crash detection using audio data.

2.1. Car crash detection using video data

Various machine learning techniques have been widely used to detect car crashes based on different kinds of video data. Ki (2007) proposes a vision-based traffic accident detection system using charge-coupled device (CCD) cameras in order to detect, record, and report traffic accidents automatically. Vishnu and Rajalakshmi (2016) exploit linear discriminant analysis (LDA) and SVM for monitoring traffic using live video files from surveillance cameras. Ravindran et al. (2016) propose a novel supervised learning model based on machine vision techniques and five SVMs trained with histogram of oriented gradients (HOG) and gray-level co-occurrence matrix (GLCM) features, which successfully detects road accidents from static images. Arceda and Riveros (2018) propose a car crash detection system that combines a violent flow (ViF) descriptor and SVMs using closed-circuit television (CCTV) video data, which are detected through CNN.

State-of-the-art deep learning techniques have recently been applied to car crash detection using video data. Chan et al. (2017) offer a dynamic-spatial-attention (DSA) recurrent neural network (RNN) model to predict car accidents from dashboard cameras. Their model is trained to distribute soft attention to object candidates, gathering subtle cues dynamically as well as modeling the temporal dependencies of all cues to robustly predict accidents. Naidenov & Sysoev (2019) develop a car accident detection system based on CNN using video capture recordings. Yao et al. (2019) present an unsupervised deep learning framework for traffic accident detection using dashboard cameras. In particular, their approach can detect traffic accidents by predicting traffic participant trajectories as well as their future locations.

CrashCatcher (Wagner-Kaiser, 2020) is one well-known car crash detection model that uses video data from dashboard cameras. A hierarchical recurrent neural network (HRNN), which is based on two different layers of long short-term memory (LSTM), is used to train the model. In particular, the first layer analyzes a time series of video data from dashboard cameras. The second layer encodes the results of the first layer and trains the model using the labeled video data (i.e., crash data and non-crash data).

2.2. Car crash detection using audio data

Existing studies indicate that not only video data but also different kinds of audio data on roads can be applied to detect car accidents (Crocco et al., 2016). Carletti et al. (2013) use a bag-of-words method for event detection using audio data, which can be applied to car crash detection. The first-level features of their method are computed on a short time interval, which is similar to the words of a text. The second-level features characterize a longer time interval, which is based on the actual sounds to be recognized. Foggia et al. (2015, 2016) propose a novel method to detect road accidents by analyzing sounds captured by microphones in order to identify hazardous situations on roads, such as tire skidding and car crashes. First, their method extracts a set of features that can identify the discriminant properties of the events of interest. Then, a bag-of-words approach is also exploited to detect both short and sustained events. Saggese et al. (2016) develop a sound analysis method based on the K-means algorithm and SVM in order to detect audio events in surveillance applications, including car crash detection.

In particular, CNNs provide high performance for sound classification using spectrogram images (Grill & Schlüter, 2015; Wyse, 2017). Salamon & Bello (2017) analyze spectrogram images through a CNN architecture with localized kernels for classifying environmental sounds, such as air conditioners, car horns, children playing, dogs barking, drilling, engines idling, gun shots, jackhammers, sirens, and street music. Khamparia et al. (2019) also use CNN as well as the tensor deep stacking network (TDSN) to analyze the spectrogram images of environmental sounds. Crashzam is a novel car crash detection system that uses two different types of audio data: (1) audio features and (2) spectrogram images. An ensemble machine learning technique (i.e., random forest) is used to combine the two different types of audio data for car crash classification (Sammarco & Detyniecki, 2018).

While several existing studies based on the machine learning techniques above consider video data or audio data in order to detect car crashes, limited contributions have been made to consider both video data and audio data on roads for car crash detection. Such consideration is important to improve the performance of car crash detection, because one type of data may have information that the other type of data does not contain. The main contribution of this research is to provide an ensemble deep learning-based car crash detection system that considers both video data and audio data from dashboard cameras in order to improve the performance of car crash detection.
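Several of the audio-based studies above feed spectrogram images of road sounds to CNNs. As a simplified, hypothetical sketch of how such an image can be produced (a plain short-time Fourier magnitude matrix in NumPy; the frame and hop sizes are illustrative, and the mel scaling and log amplitudes discussed later in this paper are omitted):

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft=256, hop=128):
    """Short-time Fourier magnitude spectrogram of a 1-D audio signal.

    Returns an array of shape (n_fft // 2 + 1, n_frames) that can be
    rendered as an image and fed to a 2-D CNN classifier.
    """
    window = np.hanning(n_fft)
    frames = [signal[start:start + n_fft] * window
              for start in range(0, len(signal) - n_fft + 1, hop)]
    # rfft of each windowed frame; transpose so rows are frequency bins
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1)).T

# one second of a synthetic 440 Hz tone sampled at 8 kHz
sr = 8000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (129, 61)
```

In practice a dedicated audio library is typically used for the mel-scaled version; this sketch only illustrates the underlying windowed-FFT idea.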
3. Method

Fig. 1 outlines this research. First, data acquisition and data preprocessing are implemented using video data and audio data from dashboard cameras. Then, car crash detection classifiers are developed using video data and audio data, respectively. In particular, a CNN-and-GRU-based classifier is developed for car crash detection using video data from dashboard cameras. In addition, GRU-based and CNN-based classifiers are developed for car crash detection using two different types of audio data from dashboard cameras: audio features and spectrogram images, respectively. An ensemble model (i.e., a weighted average ensemble) is then developed in order to combine the three different classifiers. Finally, a classification performance evaluation is conducted by comparing the proposed car crash detection system with the car crash detection models that use only one type of data.

3.1. Car crash detection using video data from dashboard cameras

A CNN-and-GRU-based classifier is proposed for car crash detection using video data from dashboard cameras. A convolutional recurrent neural network (CRNN), which is a CNN-and-GRU-based classifier, is designed for image-based sequence recognition tasks related to capturing spatial and temporal elements (Shi et al., 2017). This combined architecture (i.e., a CNN-and-GRU-based classifier) has been applied to video classification as well as audio processing, such as speech recognition and music classification (Wu et al., 2015; Bartz et al., 2017; Choi et al., 2017). In particular, the CNN extracts features from each frame of a video file independently, then pools their predictions across the whole video file. As the classifier ignores the full temporal footprint, the CNN's output is connected to a time-distributed layer encoded in GRU. GRU is a gating mechanism in recurrent neural networks like LSTM but has fewer parameters than LSTM, as it reduces the calculations for updating hidden states (Cho et al., 2014). The main concept is to capture temporal ordering and long-range dependencies passing through the CNN.

Since one video file consists of more than 120 frames, converting hundreds of video files into frames and returning an array that contains all values at once is likely to cause out-of-memory errors. In order to prevent these errors, a data generator is applied for the proposed CNN-and-GRU-based classifier that uses video data from dashboard cameras. Through a data generator, only one value is returned for each call, which solves the memory shortage issue. However, the generator does not take the sequence of the frames into account, which could cause video classification problems. As a more efficient way to address the memory shortage issue and also correctly order the frames, extracting some of the frames in the video that contain the entire flow in the sequence, rather than using the entire frame set, can be utilized. Making a set of the sparsely sequential frames per video through a video generator satisfies all the above requirements (Ferlet, 2020). The needed shape of the video generator is (N, F, W, H, C), where N is the batch size, F is the number of frames for a sequence, W and H are width and height, and C is the number of channels, respectively. Fig. 2 shows two examples of sequence video generators for the proposed car crash detection classifier using video data from dashboard cameras, with the batch size as 1, the number of frames as 5, width and height as (122, 122), and the number of channels as 3 containing red, green, and blue (RGB) color levels.

3.2. Car crash detection using audio data from dashboard cameras

In this research, two different types of audio data from dashboard cameras (i.e., audio features and spectrogram images) are used for car crash detection.

3.2.1. Car crash detection based on audio features

In each audio signal, different audio features that contain valuable sound information can be extracted from the time (temporal) domain and the frequency (spectral) domain. Table 1 illustrates and describes the audio features most widely used for audio feature analysis considered in this research (Sammarco & Detyniecki, 2018). The proposed GRU-based classifier, which is robust not only for short-term dependencies but also for long-term dependencies, is developed to classify car crashes using those temporal features. The input data layer is reformatted into three-dimensional vectors comprised of samples, time stamps, and audio features to match the architecture of the GRU-based classifier. These vectors are normalized to a 0–1 range and labeled as 0 or 1 for binary classification.

3.2.2. Car crash detection based on spectrogram images

In each audio signal, a spectrogram image is extracted. The original spectrogram images are then converted to mel-spectrogram images and normalized to a 0–1 range. Fig. 3 shows an example of a mel-spectrogram image that illustrates car crashes. On the mel-spectrogram image, X-coordinates indicate temporal information, Y-coordinates indicate frequencies based on mel-scales, and amplitudes based on log-scales are expressed with different colors.

A CNN-based classifier is then developed for classifying car crashes using spectrogram images of audio data. In particular, a two-dimensional CNN is used for the proposed CNN-based classifier, because it handles the visual patterns of space better than a one-dimensional CNN. The input data layer is reformatted into four-dimensional tensors for the architecture of the proposed CNN-based classifier. The input tensor is composed of batch size, height, width, and channels, respectively.

3.3. Ensemble model

A weighted average ensemble is used as an ensemble model in order to combine different types of data (i.e., video data from dashboard cameras, audio features of audio data from dashboard cameras, spectrogram images of audio data from dashboard cameras) for the proposed
Fig. 2. Two examples of sequence video generators for the proposed car crash detection model using video data from dashboard cameras.
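The GRU update described in Section 3.1 (a gating mechanism with fewer parameters than LSTM) follows the equations of Cho et al. (2014). A minimal NumPy sketch of that update, with random placeholder weights and biases omitted, rather than the actual Keras layer the authors use:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU update (Cho et al., 2014): update gate z, reset gate r,
    candidate state h_tilde, then interpolation with the previous state.
    Biases are omitted for brevity."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate hidden state
    return z * h + (1.0 - z) * h_tilde         # new hidden state

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
params = [rng.standard_normal(shape) * 0.1
          for shape in [(n_in, n_hid), (n_hid, n_hid)] * 3]
h = np.zeros(n_hid)
for x in rng.standard_normal((5, n_in)):  # a length-5 input sequence
    h = gru_step(x, h, params)
print(h.shape)  # (8,)
```

Because the new state is a convex combination of the old state and a tanh candidate, the hidden values stay bounded, and the update gate decides how much history to keep at each step.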
with 0.9 momentum, three max pooling layers, and a global max pooling layer, which reduce the number of outputs to one dimension by getting only maximum values from the last convolution. The input shape of the CNN is (112, 112, 3), which indicates the width, the height of the image, and the number of channels, respectively. The output shape is (16, 512), which represents the batch size and the filter size of the last layer in the CNN, respectively. Second, the CNN is injected into the time-distributed layer. The time-distributed layer, which needs an input shape of (16, 112, 112, 3), which indicates the batch size, width, height, and the number of channels, respectively, is connected to a GRU layer for treating time series. Finally, the rest consists of a GRU layer and five dense layers with ReLU and softmax activation functions with three dropout layers. For the loss function, binary focal loss is implemented in Keras as a custom loss function with γ as 2 and α as 0.25. Binary focal loss is used to lower the loss of the class that is relatively easy to classify. It has the net effect of putting more training emphasis on the data that are hard to classify. It dramatically reduces loss from about 0.8 to 0.1. The input shape of the GRU layer is (16, 20, 512), which are the batch size, the number of frames, and the filter size of the last layer in the CNN, respectively. The output is fully connected to the two different classes (i.e., positive and negative) in order to predict the result.

A GRU-based classifier, which is used for car crash detection using audio features of audio data from dashboard cameras in these case studies, consists of two GRUs with 64 memory cells, three dropout layers with dropout rates of 0.2, and two dense layers with ReLU and softmax activation functions, respectively. Adam and binary cross-entropy are used as the optimizer and the loss function, respectively. The batch size, the number of epochs, and the learning rate are set to 16, 50, and 0.01, respectively. The input shape of the GRU for the first case study is (300, 81, 34), which indicates the number of samples, the number of time stamps, and the number of features, respectively. In a similar manner, the input shape of the GRU for the second case study is (500, 81, 34). Each second consists of 20 time stamps, and the length of each sample is about 4 s, so the number of time stamps is set to 81.

A CNN-based classifier, which is used for car crash detection using spectrogram images of audio data from dashboard cameras in these case studies, consists of two Conv2Ds, two max pooling layers, two dropout layers with dropout rates of 0.25 and 0.5, respectively, and two dense layers with ReLU and softmax activation functions, respectively. Adam and binary cross-entropy are used as the optimizer and the loss function, respectively. The batch size, the number of epochs, and the learning rate are set to 16, 50, and 0.01, respectively. The shape for the first case study is expressed as (300, height, width, 3), which are the number of samples, the height of the spectrogram image, the width of the spectrogram image, and the number of channels, respectively. In a similar manner, the shape for the second case study is expressed as (500, height, width, 3).

The performance of the proposed car crash detection system using all three types of data (i.e., video data from dashboard cameras, audio features of audio data from dashboard cameras, spectrogram images of audio data from dashboard cameras) is compared with the performance of the models using only one or two types of data. In these case studies, Methods 1, 2, and 3 are defined as the results of a CNN-and-GRU-based classifier using video data only, the results of a GRU-based classifier using audio features of audio data only, and the results of a CNN-based classifier using spectrogram images of audio data only, respectively. Method 4 is defined as the results of a weighted average ensemble based on a CNN-and-GRU-based classifier using video data and a GRU-based classifier using audio features of audio data. Method 5 is defined as the results of a weighted average ensemble based on a CNN-and-GRU-based classifier using video data and a CNN-based classifier using spectrogram images of audio data. Method 6 is defined as the results of a weighted average ensemble based on a GRU-based classifier using audio features of audio data and a CNN-based classifier using spectrogram images of audio data. Finally, Method 7 is defined as the results of the proposed car crash detection system, which uses a weighted average
Fig. 5. The classification results of Case Study 1 (not containing YouTube clips with near crashes) and Case Study 2 (containing YouTube clips with near crashes).
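The weighted average ensembles that define Methods 4–7 can be sketched as follows; the weights and per-clip probabilities below are hypothetical, since the fitted weight values are not stated in this excerpt:

```python
import numpy as np

def weighted_average_ensemble(probs, weights):
    """Fuse per-classifier crash probabilities with a weighted average.

    probs:   (n_classifiers, n_samples) predicted crash probabilities
    weights: one non-negative weight per classifier (normalized here)
    Returns the fused probabilities and hard 0/1 predictions.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    fused = w @ np.asarray(probs)
    return fused, (fused >= 0.5).astype(int)

# hypothetical outputs of the three base classifiers for four clips:
# video (CNN-and-GRU), audio features (GRU), spectrogram images (CNN)
probs = [[0.9, 0.2, 0.6, 0.4],
         [0.8, 0.1, 0.4, 0.7],
         [0.7, 0.3, 0.5, 0.6]]
fused, labels = weighted_average_ensemble(probs, weights=[0.5, 0.25, 0.25])
print(labels)  # [1 0 1 1]
```

Dropping one row of `probs` and using two weights gives the two-classifier ensembles (Methods 4–6); all three rows correspond to Method 7.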
Fig. 6. Performance comparison with existing state-of-the-art unimodal car crash detection models.
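The ROC-AUC metric used for these comparisons can be computed from its rank-based definition: the probability that a randomly chosen crash clip is scored above a randomly chosen non-crash clip (ties count half). The labels and scores below are made up for illustration:

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()   # correctly ordered pairs
    ties = (pos[:, None] == neg[None, :]).sum()     # tied pairs count half
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = [1, 1, 1, 0, 0, 0]                # crash / non-crash labels
s = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]    # hypothetical fused probabilities
print(roc_auc(y, s))  # 0.8888888888888888
```

Multiplying by 100 gives the percentage scale used in the reported results (e.g., ROC-AUC = 98.60).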
higher classification performances for both Case Study 1 and Case Study 2 compared with existing state-of-the-art unimodal car crash detection models. The results of a state-of-the-art unimodal car crash detection model that only uses video data (Ghosh et al., 2019) also show that video-based classification performances are worse than audio-based classification performances. In addition, especially in Case Study 2, the video-based classifier's lower performances (i.e., Method 1) can be attributed to its ambiguity in analyzing YouTube clips that include near crashes as well as its difficulty in interpreting them. In turn, the proposed car crash detection system that uses three different data types (i.e., ROC-AUC = 98.60 for Case Study 1 and ROC-AUC = 89.86 for Case Study 2) has the most significant classification performances for car crash
detection compared with any other models. Therefore, the proposed car Validation, Visualization. Sunghoon Lim: Conceptualization, Method
crash detection system sets state-of-the-art records for car crash ology, Validation, Writing - original draft, Writing - review & editing,
detection. Supervision, Project administration, Funding acquisition.
The objective of this research is to propose a car crash detection The authors declare that they have no known competing financial
system, based on ensemble deep learning and multimodal data from interests or personal relationships that could have appeared to influence
dashboard cameras, for an emergency road call service that recognizes the work reported in this paper.
traffic accidents automatically and allows immediate rescue after
transmission to emergency recovery agencies. In particular, the pro Acknowledgments
posed car crash detection system uses video data from dashboard cam
eras and two different types of audio data from dashboard cameras, This work was supported by the National Research Foundation of
audio features and spectrogram images, in order to improve the per Korea (NRF) grant funded by the Korea government (MSIT) (No.
formance of car crash detection. 2019R1F1A1059346).
The proposed research is comprised of four main steps. First, data
acquisition and data preprocessing using video data, audio features of References
audio data, and spectrogram images of audio data from dashboard
camera are conducted. Car crash detection classifiers using video data, Acar, E. (2015). Effect of error metrics on optimum weight factor selection for ensemble
of metamodels. Expert Systems with Applications, 42(5), 2703–2709. https://doi.org/
audio features of audio data, and spectrogram images of audio data are 10.1016/j.eswa.2014.11.020
then developed based on GRU and CNN. A weighted average ensemble is Arceda, V. M., & Riveros, E. L. (2018). Fast car Crash Detection in Video. XLIV Latin
used as an ensemble technique for combining the three different clas American Computer Conference (CLEI), 2018, 632–637. https://doi.org/10.1109/
CLEI.2018.00081
sifiers. Finally, the classification performance of the proposed car crash Bartz, C., Herold, T., Yang, H., & Meinel, C. (2017, November). Language identification
detection system that uses three different types of data (i.e., video data using deep convolutional recurrent neural networks. In International conference on
and two different types of audio data) is compared with the classification neural information processing (pp. 880-889). Springer, Cham.
Car Crashes Time. (2020, April 25). https://www.youtube.com/user/CarCrashesTime.
performances of the base classifiers that use one type of data only and
Carletti, V., Foggia, P., Percannella, G., Saggese, A., Strisciuglio, N., & Vento, M. (2013).
the ensemble models that use two different types of data. Audio surveillance using a bag of aural words classifier. In 2013 10th IEEE
Case studies involving real-world car accident YouTube clips are International Conference on Advanced Video and Signal Based Surveillance (pp. 81–86).
used to verify the proposed car crash detection system. The performance https://doi.org/10.1109/AVSS.2013.6636620
Chan, F.-H., Chen, Y.-T., Xiang, Y., & Sun, M. (2017). Anticipating Accidents in Dashcam
of the proposed car crash detection system that uses three different types Videos. In Computer Vision – ACCV 2016 (Vol. 10114, pp. 136–153). Springer
of data (i.e., video data from dashboard cameras, audio features of audio International Publishing. doi: 10.1007/978-3-319-54190-7_9.
data from dashboard cameras, spectrogram images of audio data from Chang, B. R., Tsai, H. F., & Young, C.-P. (2010). Intelligent data fusion system for
predicting vehicle collision warning using vision/GPS sensing. Expert Systems with
dashboard cameras) is better than the performances of the base classi Applications, 37(3), 2439–2450. https://doi.org/10.1016/j.eswa.2009.07.036
fiers that use one type of data only and the ensemble models that use two Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &
different types of data to classify crashes and non-crashes, including Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for
Statistical Machine Translation. ArXiv:1406.1078 [Cs, Stat]. http://arxiv.org/abs/
near crashes. Also, the additional experiment’s results demonstrate that 1406.1078.
the proposed car crash detection system establishes state-of-the-art Choi, K., Fazekas, G., Sandler, M., & Cho, K. (2017, March). Convolutional recurrent
classification performances. neural networks for music classification. In 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) (pp. 2392-2396). IEEE.
The authors will develop an advanced version of the proposed car comma.ai. (2020, April 25). https://comma.ai.
crash detection system for improving the performance of the classification of crashes and near crashes. In particular, it is also expected to enhance the base classifier that only uses video data from dashboard cameras. Currently, this base classifier uses a frame-level feature extraction method combined with video-level understanding, with a GRU incorporating temporal information. To further improve the video-based stream, a more intuitive application of CNNs, namely a three-dimensional CNN with three-dimensional filters (i.e., two-dimensional filters for images and a one-dimensional filter for sequences), can be considered. Due to practical limitations (e.g., data collection, computation resources, hardware constraints on dashboard cameras), applying a three-dimensional CNN would require more time to realize (Yao et al., 2019). In addition, an alternative ensemble technique, rather than the weighted average ensemble used in this research, can be considered to further improve car crash detection performance in the future. The authors will also consider how the proposed deep learning ensemble system that uses video and audio data can be applied to other domains containing multimodal data, such as safety management in construction and manufacturing.

CRediT authorship contribution statement

Jae Gyeong Choi: Conceptualization, Methodology, Formal analysis, Investigation, Resources, Data curation, Validation, Visualization, Writing - original draft, Writing - review & editing. Chan Woo Kong: Formal analysis, Investigation, Resources, Data curation, Validation, Visualization. Gyeongho Kim: Investigation, Resources, Data curation,

Conway, A. M., Durbach, I. N., McInnes, A., & Harris, R. N. (2021). Frame-by-frame annotation of video recordings using deep neural networks. Ecosphere, 12(3). https://doi.org/10.1002/ecs2.3384
Crocco, M., Cristani, M., Trucco, A., & Murino, V. (2016). Audio surveillance: A systematic review. ACM Computing Surveys, 48(4), 1–46. https://doi.org/10.1145/2871183
Drive.ai. (2020, April 25). https://drive.ai.
Durduran, S. S. (2010). A decision making system to automatic recognize of traffic accidents on the basis of a GIS platform. Expert Systems with Applications, 37(12), 7729–7736. https://doi.org/10.1016/j.eswa.2010.04.068
Escalante, H. J., Escalera, S., Guyon, I., Baró, X., Güçlütürk, Y., Güçlü, U., & Gerven, M. V. (2018). Explainable and interpretable models in computer vision and machine learning. Springer International Publishing.
Ferlet, P. (2020). Keras Sequence Video Generators source code (Version 1.0.13) [Source code]. https://github.com/metal3d/keras-video-generators.
Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., & Vento, M. (2016). Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Transactions on Intelligent Transportation Systems, 17(1), 279–288. https://doi.org/10.1109/TITS.2015.2470216
Foggia, P., Saggese, A., Strisciuglio, N., Vento, M., & Petkov, N. (2015). Car crashes detection by audio analysis in crowded roads. In 2015 12th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1–6). https://doi.org/10.1109/AVSS.2015.7301731
Gang, R., & Zhuping, Z. (2011). Traffic safety forecasting method by particle swarm optimization and support vector machine. Expert Systems with Applications, 38(8), 10420–10424. https://doi.org/10.1016/j.eswa.2011.02.066
Ghosh, S., Sunny, S. J., & Roney, R. (2019, March). Accident detection using convolutional neural networks. In 2019 International Conference on Data Science and Communication (IconDSC) (pp. 1–6). IEEE.
Grill, T., & Schlüter, J. (2015). Structural segmentation with convolutional neural networks MIREX submission. 3.
Habib, K., & Ridella, S. (2017). Automatic vehicle control systems (p. 13).
Heron, M. (2018). Deaths: Leading causes for 2016. 77.
J Utah. (2020, April 25). https://www.youtube.com/channel/UCBcVQr-07MH-p9e2kRTdB3A.
J.G. Choi et al. Expert Systems With Applications 183 (2021) 115400
Khamparia, A., Gupta, D., Nguyen, N. G., Khanna, A., Pandey, B., & Tiwari, P. (2019). Sound classification using convolutional neural network and tensor deep stacking network. IEEE Access, 7, 7717–7727. https://doi.org/10.1109/ACCESS.2018.2888882
Ki, Y.-K. (2007). Accident detection system using image processing and MDR. 5.
Milanés, V., Llorca, D. F., Villagrá, J., Pérez, J., Parra, I., González, C., & Sotelo, M. A. (2012a). Vision-based active safety system for automatic stopping. Expert Systems with Applications, 39(12), 11234–11242. https://doi.org/10.1016/j.eswa.2012.03.047
Milanés, V., Pérez, J., Godoy, J., & Onieva, E. (2012b). A fuzzy aid rear-end collision warning/avoidance system. Expert Systems with Applications, 39(10), 9097–9107. https://doi.org/10.1016/j.eswa.2012.02.054
Ministry of Land, Infrastructure and Transport, Republic of Korea. (2018, January 7). https://www.molit.go.kr/USR/NEWS/m_71/dtl.jsp?lcmspage=7&id=95080994.
Naidenov, A., & Sysoev, A. (2019). Developing car accident detecting system based on machine learning algorithms applied to video recordings data. 1–12.
Ravindran, V., Viswanathan, L., & Rangaswamy, S. (2016). A novel approach to automatic road-accident detection using machine vision techniques. International Journal of Advanced Computer Science and Applications, 7(11). https://doi.org/10.14569/IJACSA.2016.071130
Saggese, A., Strisciuglio, N., Vento, M., & Petkov, N. (2016). Time-frequency analysis for audio event detection in real scenarios. In 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 438–443). https://doi.org/10.1109/AVSS.2016.7738082
Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3), 279–283. https://doi.org/10.1109/LSP.2017.2657381
Sammarco, M., & Detyniecki, M. (2018). Crashzam: Sound-based car crash detection. In Proceedings of the 4th International Conference on Vehicle Technology and Intelligent Transport Systems (pp. 27–35). https://doi.org/10.5220/0006629200270035
Sarraf, R., & McGuire, M. P. (2020). Integration and comparison of multi-criteria decision making methods in safe route planner. Expert Systems with Applications, 154, 113399. https://doi.org/10.1016/j.eswa.2020.113399
Shi, B., Bai, X., & Yao, C. (2017). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2298–2304.
Smirnov, A., Kashevnik, A., Shilov, N., Makklya, A., & Gusikhin, O. (2013). Context-aware service composition in cyber physical human system for transportation safety. In 2013 13th International Conference on ITS Telecommunications (ITST) (pp. 139–144). https://doi.org/10.1109/ITST.2013.6685535
Vishnu, V. M., & Rajalakshmi, M. (2016). Road side video surveillance in traffic scenes using map-reduce framework for accident analysis. Biomedical Research, 257–266.
Wagner-Kaiser, R. (2020, April 25). CrashCatcher. https://github.com/rwk506/CrashCatcher.
White, J., Thompson, C., Turner, H., Dougherty, B., & Schmidt, D. C. (2011). WreckWatch: Automatic traffic accident detection and notification with smartphones. Mobile Networks and Applications, 16(3), 285–303. https://doi.org/10.1007/s11036-011-0304-8
Wu, Z., Wang, X., Jiang, Y. G., Ye, H., & Xue, X. (2015). Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM International Conference on Multimedia (pp. 461–470).
Wyse, L. (2017). Audio spectrogram representations for processing with convolutional neural networks. ArXiv:1706.09559 [Cs]. http://arxiv.org/abs/1706.09559.
Yadav, R., Dahiya, P. K., & Mishra, R. (2020). Comparative analysis of automotive radar sensor for collision detection and warning system. International Journal of Information Technology, 12(1), 289–294. https://doi.org/10.1007/s41870-018-0167-3
Yao, G., Lei, T., & Zhong, J. (2019). A review of convolutional-neural-network-based action recognition. Pattern Recognition Letters, 118, 14–22.
Yao, Y., Xu, M., Wang, Y., Crandall, D. J., & Atkins, E. M. (2019). Unsupervised traffic accident detection in first-person videos. ArXiv:1903.00618 [Cs]. http://arxiv.org/abs/1903.00618.