Professional Documents
Culture Documents
Sound Event Detection Transactions
Sound Event Detection Transactions
Sound Event Detection Transactions
net/publication/348774367
Article in IEEE/ACM Transactions on Audio Speech and Language Processing · January 2021
DOI: 10.1109/TASLP.2021.3054313
CITATIONS READS
48 63
3 authors:
Kai Yu
Shanghai Jiao Tong University
306 PUBLICATIONS 6,519 CITATIONS
SEE PROFILE
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Heinrich Dinkel on 28 January 2021.
Abstract—Sound event detection (SED) is the task of tagging model: Fully supervised SED and weakly supervised SED
the absence or presence of audio events and their corresponding (WSSED). Fully supervised approaches, which potentially
interval within a given audio clip. While SED can be done perform better than weakly supervised ones, require manual
using supervised machine learning, where training data is fully
labeled with access to per event timestamps and duration, time-stamp labeling. However, manual labeling is a significant
our work focuses on weakly-supervised sound event detection hindrance for scaling to large datasets due to the expensive
(WSSED), where prior knowledge about an event’s duration is labor cost. This paper primarily focuses on WSSED, which
unavailable. Recent research within the field focuses on improving only has access to clip event labels during training yet requires
segment- and event-level localization performance for specific to predict onsets and offsets at the inference stage.
datasets regarding specific evaluation metrics. Specifically, well-
performing event-level localization work requires fully labeled Challenges such as the Detection and Classification of
development subsets to obtain event duration estimates, which Acoustic Scenes and Events (DCASE) exemplify the difficul-
significantly benefits localization performance. Moreover, well- ties in training robust SED systems. DCASE challenge datasets
performing segment-level localization models output predictions are real-world recordings (e.g., audio with no quality control
at a coarse-scale (e.g., 1 second), hindering their deployment on and lossy compression), thus containing unknown noises and
datasets containing very short events (< 1 second). This work
proposes a duration robust CRNN (CDur) framework, which scenarios. Specifically, in each challenge since 2017, at least
aims to achieve competitive performance in terms of segment- one task was primarily concerned with WSSED. Most pre-
and event-level localization. In the meantime, this paper proposes vious work focuses on providing single target task-specific
a new post-processing strategy named “Triple Threshold” and solutions for WSSED on either tagging-, segment- or event-
investigates two data augmentation methods along with a label level. Tagging-level solutions are often capable of localizing
smoothing method within the scope of WSSED. Event-, Segment-,
and Tagging-F1 scores are utilized as primary evaluation metrics. event boundaries, yet their temporal consistency is subpar to
Evaluation of our model is done on three publicly available segment- and event-level methods. This has been seen during
datasets: Task 4 of the DCASE2017 and 2018 datasets, as well the DCASE2017 challenge, where no single model could win
as URBAN-SED. Our model outperforms other approaches on both tagging and localization subtasks. Solutions optimized
the DCASE2018 and URBAN-SED datasets without requiring for segment level often utilize a fixed target time resolution
prior duration knowledge. In particular, our model is capable
of similar performance to strongly-labeled supervised models on (e.g., 1 Hz), inhibiting fine-scale localization performance
the URBAN-SED dataset. Lastly, we run a series of ablation (e.g., 50 Hz). Lastly, successful event-level solutions require
experiments to reveal that without post-processing, our model’s prior knowledge about each events’ duration to obtain tempo-
localization performance drop is significantly lower compared rally consistent predictions. Previous work in [5] showed that
with other approaches. successful models such as the DCASE2018 task 4 winner are
Index Terms—weakly supervised sound event detection, con- biased towards predicting tags from long-duration clips, which
volutional neural networks, recurrent neural networks, semi- might limit themselves from generalizing towards different
supervised duration estimation
datasets (e.g., deploy the same model on a new dataset)
since new datasets possibly contain short or unknown duration
I. I NTRODUCTION events. In contrast, we aim to enhance WSSED performance,
OUND event detection (SED) research classifies and
S localizes particular audio events (e.g., dog barking, alarm
ringing) within an audio clip, assigning each event a label
specifically in duration estimation regarding short, abrupt
events, without a pre-estimation of each respective event’s
individual weight.
along with a start point (onset) and an endpoint (offset).
Label assignment is usually referred to as tagging, while the
II. R ELATED W ORK
onset/offset detection is referred to as localization. SED can
be used for query-based sound retrieval [1], smart cities, and Most current approaches within SED and WSSED uti-
homes [2], [3], as well as voice activity detection [4]. Unlike lize neural networks, in particular convolutional neural net-
common classification tasks such as image or speaker recogni- works [6], [7] (CNN) and convolutional recurrent neural
tion, a single audio clip might contain multiple different sound networks [5], [4] (CRNN). CNN models generally excel at
events (multi-output), sometimes occurring simultaneously audio tagging [8], [9] and scale with data, yet falling behind
(multi-label). In particular, the localization task escalates the CRNN approaches in onset and offset estimations [10].
difficulty within the scope of SED, since different sound events Apart from different modeling methods, many recent works
have various time lengths, and each occurrence is unique. propose other approaches for the localization conundrum. A
Two main approaches exist to train an effective localization plethora of temporal pooling strategies are proposed, aiming to
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 2
It should be evident that within WSSED, localizing an event similar to a chunking mechanism [20], which summarizes
is much more complicated than tagging, since for tagging, parts of a time-sequence. 3) By subsampling in the time-
training and evaluation criteria match, while for localization, frequency domain via L-norm pooling, abstract time-frequency
important information such as duration is missing, meaning representations can be learned by the model.
that a model needs to “learn” this information without guid- Please note that, in essence, subsampling and pooling are
ance. Due to this mismatch between training and evaluation, identical operations. In our work, the term subsampling refers
post-processing is vital to improve performance [5]. The to intermediate operations within a local kernel (i.e., 2 × 2) on
main idea in our work extends to the notion of duration a spectrogram’s time-frequency domain. In contrast, pooling
robustness [5]. We refer to duration robustness as the capability refers to reducing a time variable signal to a single value (i.e.,
to precisely tag and, more importantly, estimate on- and off- 500 probabilities 7→ 1 probability).
sets of the aforementioned tagged event. A duration robust Subsampling is commonly done using average or max
model should also be capable of predicting on- and off-sets operators. However, previous work in [5] showed that L-norm
for both short and long duration events. subsampling largely benefits duration robustness, specifically
Our approach’s framework can be seen in Figure 1, where enabling the detection of short, sporadic events. L-norm sub-
the backbone CRNN with temporal pooling and subsampling sampling within a local kernel with size K is defined as:
layers learns clip-level event probabilities during training. sX
For inference, we propose triple thresholding to enhance the Lp (x) = p xp ,
duration robustness. x∈K
Our proposed model, which we further refer to as CDur where, Lp reduces to mean subsampling for a norm factor of
(CRNN duration, see Table I), consists of a five-layer CNN p = 0 and to max subsampling for p = inf. Lp -norm subsam-
followed by a Gated Recurrent Unit (GRU). The model is pling is an interpolation between mean and max operations,
based on the results in [5], yet it is modified to enhance perfor- preserving temporal consistency similar to the mean operation
mance further. Specifically, CDur utilizes double convolution while also extracting the most meaningful feature similar to
blocks and three subsampling stages instead of five. Moreover, the max operation. In line with [5], we exclusively set p = 4
our model experiences substantial gains by utilizing a batch- (referred to as L4-Sub) for CDur. An ablation study on the
norm, convolution, activation block structure, especially when subsampling factor is also provided in Section VI-B.
paired with Leaky rectified linear unit (LeakyReLU). The Subsampling is done in three stages (s1 , s2 , s3 ), where
abstract representations extracted by the CNN front-end are each stage subsamples the temporal dimension by a factor of
then processed by a bidirectional GRU with 128 hidden units. sj , j ∈ [1, 2, 3]. The time resolution is overall subsampled
by a factor of v = 4, with the three subsampling stages
TABLE I being (2, 2, 1). At each of the three subsampling stages, the
T HE CD UR ARCHITECTURE USED IN THIS WORK . O NE BLOCK REFERS TO
AN INITIAL BATCH NORMALIZATION , THEN A CONVOLUTION , AND
frequency dimension is reduced by a factor of 4, thus after the
LASTLY, A L EAKY R E LU ( SLOPE -0.1) ACTIVATION . A LL CONVOLUTIONS last stage the input frequency dimension is reduced by a factor
USE PADDING IN ORDER TO PRESERVE THE INPUT SIZE . T HE NOTATION of 64. Additionally, a linear upsampling operation is added
t ↑ / ↓ d REPRESENTS UP / DOWN - SAMPLING TIME DIMENSION BY t AND
THE FREQUENCY DIMENSION BY d. T HE MODEL HAS TWO OUTPUTS : O NE
after the final parameter layer, which restores the original
CLIP - LEVEL , WHICH CAN BE UPDATED DURING TRAINING , AND ONE temporal input resolution Tv → T . We investigated utilizing
FRAME - LEVEL USED FOR EVALUATION . upsampling during training and parameterized upsampling
Layer Parameter operations (e.g., transposed convolutions) but did not observe
Block1 32 Channel, 3 × 3 Kernel any performance gains. Due to the above remarks, linear
L4-Sub 2↓4 upsampling is only utilized during inference. Therefore, CDur
Block2 128 Channel, 3 × 3 Kernel outputs predictions at a resolution of 50 Hz (or 20 ms/frame),
Block3 128 Channel, 3 × 3 Kernel
L4-Sub 2↓4 enabling short sound detection.
Block4 128 Channel, 3 × 3 Kernel
Block5 128 Channel, 3 × 3 Kernel
L4-Sub 1↓4
B. Temporal Pooling
Dropout 30% CDur outputs a frame-level probability yt (e) ∈ [0, 1] for
BiGRU 128 Units each input feature xt at time t. With temporal pooling, for
Linear E Units
LinSoft Upsample 4 ↑ 1 event e, its event probability at clip-level y(e) ∈ [0, 1]
Output Clip-level Frame-level is an aggregated probability of frame-level outputs y(e) =
aggt∈T (yt (e)). As introduced, temporal pooling is one of
the significant contributors to tagging and localization per-
A. Temporal subsampling formance. In our work, we exclusively use linear softmax
We propose the use of subsampling towards practical, gen- (LinSoft), introduced in [11] (Equation (1)).
eralized sound event detection, based on the following three PT
yt (e)2
reasons: 1) Subsampling discourages disjoint predictions, e.g., y(e) = PtT (1)
short (frame), consecutive zero-one outputs [0, 1, 0, 1, . . .] → t yt (e)
[0, 0, 0, 0, . . .]. 2) Reduces the number of time-steps a recurrent In contrast to the attention pooling approach [8], [9], [6],
neural network (RNN) needs to remember; therefore, it is [17], as well as AutoPool [7], LinSoft is a weighted average
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4
algorithm that is not learned. Instead, it can be interpreted as Triple Threshold: One possible disadvantage of double
a self-weighted average algorithm. thresholding is that it does not consider the clip-level output
CDur has two outputs: 1) The clip-level aggregate y(e) and probability. We propose triple thresholding, which extends
2) The frame-level sequence prediction yt (e). Since only clip- towards double thresholding by incorporating an additional
level labels are provided during training, only the clip-level threshold on clip-level φclip . This threshold firstly removes all
aggregate y(e) can be back-propagated and model parameters frame-level (yt (e)) probabilities which were not predicted on
updated. The frame-level output yt (e) is solely utilized for clip-level (y(e)). Triple thresholding is defined as Equation (3).
evaluation. (
yt (e), if y(e) > φclip
yt (e) =
0, otherwise (3)
C. Post-processing
ȳt (e) = dt(yt (e))
Since WSSED is a multi-label multi-class classification task,
with overlapping events, post-processing is required during Therefore, we first remove events not predicted by the model
inference to transform a per event probability estimate yt (e) on clip level before proceeding with double threshold post-
to a binary representation ȳt (e), which determines whether a processing. The default φclip threshold is set to be 0.5, since
label is present (ȳt (e) = 1) or absent (ȳt (e) = 0). It can help this is also the most reasonable choice when assessing tagging
smoothen or enhance model performance by removing noisy performance. Experiments denoted as +Triple, utilize φclip =
outputs (e.g., single frame outputs). 0.5, φlow = 0.2, φhi = 0.75.
We categorize post-processing algorithms into two Median Filtering: Median filtering is a conventional hard
branches: 1) Probability-level post-processing such as double label post-processing method, meaning it is applied after a
and triple thresholding; 2) Hard label post-processing such thresholding operation. Let Y = (y1 (e), . . . , yt (e), . . . , yT (e))
as median filtering. Most post-processing methods, such as be a probability sequence for event e, φbin be a thresh-
median filtering or double thresholding, only consider frame- old and let its corresponding binary sequence be Ȳ =
level outputs for post-processing. In our work, we propose a (ȳ1 (e), . . . , ȳt (e), . . . , ȳT (e)) with the relation:
novel triple thresholding technique that considers clip-level (
and frame-level outputs. Though this work exclusively 1, if yt (e) > φbin
ȳt (e) =
utilizes double and triple thresholding, we compare it with 0, otherwise.
median filtering since it is the most common post-processing
algorithm for WSSED. A median filter then acts on the sequence Ȳ by computing
Double Threshold: Double threshold [5], [14] is a proba- the median value within a window of size ω. In practice,
bility smoothing technique defined via two thresholds φhi , φlow . a median filter
will remove any event segments with frame
duration ≤ ω2 and merge two segments with distance ≤ ω2 .
The algorithm first sweeps over an output probability sequence
and marks all values larger than φhi as being valid predictions. In our view, median filtering skews model performance since
Then it enlarges the marked frames by searching for all it can delete or insert non-existing model predictions, thus
adjacent, continuous predictions (clusters) being larger than making fair model comparisons unfeasible. Many successful
φlow . An arbitrary point in time t belongs to a cluster between models utilize an event-specific median filter, where ω(e) is
two time steps t1 , t2 if the following holds: estimated relative to each event’s duration within the labeled
development set.
1, ∃(t1 , t2 ) s.t., t1 ≤ t ≤ t2
IV. E XPERIMENTAL S ETUP
cluster(t) = and ∀t0 ∈ [t1 , t2 ], φlow ≤ yt0 ≤ φhi
All deep neural networks were implemented in Py-
0, otherwise.
Torch [21], front-end feature extraction utilized librosa [22]
The double threshold algorithm (dt(·)) can be seen in Equa- and data pre-processing used gnu-parallel [23]. Even though
tion (2). a plethora of front-features exist, log-Mel spectrograms are
most commonly used by the SED community due to their low
memory footprint (compared to spectrograms) and excellent
1,
if yt (e) > φhi performance [24]. If not further specified, all our experiments
1, if yt (e) > φlow use 64-dimensional log-Mel power spectrograms (LMS) front-
ȳt (e) = dt(yt (e)) = (2) end features. Each LMS sample was extracted by a 2048
and cluster(t(e)) = 1
point Fourier transform every 20 ms with a Hann window
0,
otherwise.
size of 40 ms using the librosa library [22]. During training,
One potential benefit of double thresholding is that its pa- zero padding to the longest sample-length within a batch is
rameters are less susceptible to duration variations and can applied, whereas a batch-size of 1 is utilized during infer-
effectively remove additional post-training optimization (e.g., ence, meaning no padding. A batch-size of 64 is utilized
window size for median filter) [5]. However, double thresh- during training for all experiments. Moreover, training uses
olding relies on robust predictions from the underlying model, AdamW [25], [26] optimization with a starting learning rate
since it cannot, contrary to median filtering, remove erroneous of 1e-4, and successive learning rate reduction if the cross-
predictions (e.g., yt (e) > φhi , see Section VI). validation loss did not improve for three epochs. Training
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 5
2.78
er 2.16 orn g 2.10
ion k_h 6.28 gin
it r uc ren) _ri
n
nd t
rn_ _(si
ll
_co rn 1.30 3.10 _be 4.97
air _ho ce cle rm nd
er
_ho air ulan bicy 6.77 Ala Ble
car am
b
bu
s
2.25 5.86 t 1.61
ing ca
r Ca
ay
_pl 3.95
en 1.95 m 0.62
ildr ark la r
he
s
ch r_a 3.39 Dis
g_b ca g_by
do n 8.57
2.08 sa si ren g 1.50
ll ing r_p si Do
dri ca nse_ n) 6.11
2.26 efe sire 7.23 7.97
ling il_ k_( le
d
rus
h
_id civ _truc rcyc thb
ine e to ) 7.79 oo
ng 1.48 _fir mo siren r_t 7.79
e ho
t ine r_( s 3.11 a ve ing
n_s ng _sh Fry
gu _ e _ca eep ric
2.15 fire lice _b 2.54 ect 5.22
r po rsing ming El ate
r
me e a 3.38 _w
kh
am rev scre ard ing
o n 1.48
jac en
2.23 teb n 5.36 Ru
n
ch
sir ska trai Sp
ee
rn
2.05
2.22 ho r 8.37
usi
c in_ 5.59 ne
tra truck ea
tre
et_
m
0 1 2.00
2 3 4 5 6 7 8 9 10 0 1 2 3 4 4.92
5 6 7 8 9 10 u um
_cl 0 1 2.03
2 3 4 5 6 7 8 9 10
s duration duration V ac duration
400
Figure 3. 200
URBAN-SED: URBAN-SED [28] is a sound event de-
100
tection dataset within an urban setting, having E = 10 50
event labels (see Figure 2a). This dataset’s source material 0 2
1 2 3 4
is the UrbanSound8k dataset [29] containing 27.8 hours of # Events per clip
data split into 10-second clip segments. The URBAN-SED
dataset encompasses 10,000 soundscapes generated using the Fig. 3. Number of events in a clip for each evaluation dataset.
Scaper soundscape synthesis library [28], being split into
6000 training, 2000 validation and 2000 evaluation clips.
The training set contains mostly 10-second excerpts, which and neglect the validation one. An essential characteristic of
are weakly-labeled, whereas each clip contains between one URBAN-SED is that since both the audio and annotations were
and nine events. The evaluation dataset is strongly labeled, generated computationally, the annotations are guaranteed to
containing up to nine events per clip, as indicated in Figure 3. be correct and unbiased from human perception. As it can
In our work, we only utilize the training and evaluation dataset be seen in Figure 2a, the evaluation set of the URBAN-SED
dataset seems to be artificially truncated due to the soundscape
1 Source code is available https://github.com/RicherMans/CDur composition.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 6
DCASE2017: The DCASE2017 task 4 – Large-scale SpecAug: Recently, SpecAug, a cheap yet effective
weakly supervised sound event detection for smart cars dataset data augmentation method for spectrograms, has been intro-
is also utilized. The dataset consists of 10-second clips split duced [31]. SpecAug randomly sets time-frequency regions to
into training (51,172 clips), development (488 clips), and zero within an input log-Mel spectrogram with D (here 64)
evaluation subsets (1103 clips). Development and evaluation number of frequency bins and T frames. Time modification is
sets are strongly labeled, whereas the training set is weakly- applied by masking γt times ηt consecutive time frames, where
labeled. For our work, we only utilize the training and eval- ηt is chosen to be uniformly distributed between [t0 , t0 + ηt0 ]
uation subsets and neglect the development one. Different and t0 is uniformly distributed within [0, T − ηt ). Frequency
from URBAN-SED, DCASE2017 is subsampled from Au- modification is applied by masking γf times ηf consecutive
dioSet [30], whereas E = 17 different car-related events frequency bins [f0 , f0 + ηf ), where ηf is randomly chosen
need to be estimated. However, as seen from the evaluation from a uniform distribution in the range of [0, ηf 0 ] and
distribution in Figure 2b, each event’s duration is distributed f0 is uniformly chosen from the range [0, D − ηf ). For all
much more naturally (between 2.5 s and 8.5 s). In particular, experiments labeled +SpecAug we set γt = 2, ηt0 = 60, γf =
the event “civil defense siren” seems to be the event with 2, ηf 0 = 12.
the most substantial duration variance, having, on average, a Time Shifting: Another beneficial augmentation method
duration of 8.57 s, yet some samples have a duration of < 1 s. for WSSED is time-shifting. The goal is to encourage the
In our view, the difficulty within this dataset is the ambiguous model to learn coherent predictions. Effectively, given a
labels, e.g., “car” and “car passing by” or the four types of clip of multiple audio frames X = [x1 , . . . , xT ], X ∈
sirens. Please note that due to this dataset’s naturalness, it RT ×D , time rolling of length ηsh will shift (and wrap
might contain non-target events, which URBAN-SED might around) the entire sequence by ηsh frames to X0 =
not. Most clips in this dataset contain one distinct event, as [xηsh , . . . , xT , . . . , x1 , . . . , xηsh −1 ]. For each audio-clip, we
indicated in Figure 3. draw ηsh from a normal distribution N (0, 10), meaning that
DCASE2018: This dataset was utilized during the we randomly either shift the audio clip forward or backward
DCASE2018 Task4 challenge – Large-scale weakly labeled by ηsh frames. Time shifting is labeled as +Time in the
semi-supervised sound event detection in domestic environ- experiments.
ments and contains E = 10 unique events. Different from the Label Smoothing: As our default training criterion binary
other datasets used in our work, the training data is partially cross entropy (BCE), defined as:
unlabelled/incomplete. While the entire training dataset con-
L(y, ŷ) = −ŷ log(y) + (1 − ŷ) log(1 − y), (4)
sists of 55,990 clips, being similarly sized as the DCASE2017
dataset, only 1578 clips contain labels. The rest 54,412 clips between the clip-level prediction y(e) and the ground truth
are split into in-domain data (14,413 clips), where it is ŷ(e) is utilized. BCE encourages the model to learn the
guaranteed that training events occur within the subset, and provided training ground truth labels. However, in many real-
out-domain data (39,999 clips), which might contain unknown world WSSED datasets, the ground truth labels are not neces-
labels. Even though the data is drawn from AudioSet, the sarily correct and open to interpretation (e.g., is child babbling
ten events have been manually annotated. The evaluation data considered Speech?). Since labels in WSSED are often noisy
duration distribution can be seen in Figure 2c. Most clips in and thus unreliable, the BCE criterion (Equation (4)) wrongly
this dataset contain one or two distinct events, as indicated in encourages overfitting towards the training ground truth labels,
Figure 3. Compared to other datasets, DCASE2018 has the which might, in turn, decrease inference performance. Label
largest average difference duration-wise between the shortest smoothing [32] (LS) is a commonly used technique for regular-
(dishes, 0.62 s) and longest (vacuum cleaner, 8.37 s) events, ization, which relaxes the BCE criterion to assume the ground
respectively. Moreover, it also hosts the largest variance within truth itself is noisy. Label smoothing modifies the ground truth
an event, e.g., Speech can be shorter than 1 s, but also 10 label (ŷ(e)) for each event e ∈ [1, . . . , E] by introducing a
s long, indicated by a large number of outliers (Figure 2c). smoothing constant , as seen in Equation (5).
Our work utilizes the training (weak) subset of the dataset
1
for model estimation if not otherwise specified. It should be ŷLS (e) = LS(ŷ(e)) = (1 − )ŷ(e) + (5)
noted that the evaluation data is identical to the DCASE2019 E
challenge, which, in addition to DCASE2018, adds synthetic In our work, we exclusively set = 0.1 for all experiments
hard labeled data for training. labeled +LS.
clip into multiple fixed sized segments [33]. Seg-F1 can TABLE II
be seen as a coarse localization metric since precise time- URBAN-SED RESULTS USING OUR PROPOSED CD UR MODEL . +All
REFERS TO UTILIZING ALL AUGMENTATIONS . F1 SCORES ARE
stamps are not required. MACRO - AVERAGED . R REPRESENTS THE OUTPUT TIME RESOLUTION FOR
3) Event-F1 score. This metric measures on- and off-set A RESPECTIVE APPROACH IN H Z .
overlap between prediction and ground truth thus is not Approach R Tagging-F1 Seg-F1 Event-F1
bound to a time-resolution (like Seg-F1). The Event-F1 Base-CNN [28] 1 - 56.00 -
specifically describes a model’s capability to estimate a SoftPool [7] 2.69 63.00 49.20 -
duration (i.e., predict on- and off-set). MaxPool [7] 2.69 74.30 46.30 -
AutoPool [7] 2.69 75.70 50.40 -
Audio tagging is done by thresholding the clip-level output Multi-Branch [13] 50 - 61.60 -
(after LinSoft) y(e) of each event with a fixed threshold of Supervised SED [34] 43.1 - 64.70 -
φtag = 0.5 in order to obtain a many-hot vector, which is then Ours 76.09 62.83 19.92
+LS 75.00 61.69 18.74
evaluated. The threshold φtag is fixed since this work focuses +Time 76.13 62.61 19.69
on improving Seg- and Event-F1 performance. Segment and 50
+SpecAug 76.49 64.19 20.58
event F1-scores are tagging dependent, meaning that they re- +All 77.13 64.75 21.73
+All +Triple 77.13 64.75 22.54
quire at least the correct prediction of an event to be assessed.
Thus, we perceive audio tagging as the least difficult metric
to improve since its optimization directly correlates with the
observed clip-level training criterion. Since our work mainly by default double thresholding with φhi = 0.75, φlow = 0.2 and
focuses on duration robust estimation, Event-F1 [33] is used BCE (Equation (4)) as the criterion.
as our primary evaluation metric, which requires predictions
to be smooth (contiguous) and penalizes irregular or disjoint
predictions. To loosen the strictness of this measure, a flexible A. URBAN-SED
time onset (time collar, t-collar) of 200ms, as well as an offset
of at most 20% of the events‘ duration, is considered valid. In Table II, we compare previous approaches on the
Furthermore, we use Seg-F1 [33] with a segment size of 1 URBAN-SED corpus to our CDur approach.
second. Lastly, in theory, two Tagging-F1 scores exist. One Our baseline CDur result can be seen to outperform all
can be retrieved from the frame-level prediction output (i.e., compared approaches in terms of Tagging-F1. Moreover, its
max1:T yt (e)), while the other can be calculated after temporal Seg-F1 score is also significantly higher (62.83%) than all
pooling (i.e., y(e)). We exclusively report the Tagging-F1 other previous WSSED approaches. Even though CDur is ca-
score from the clip-level output (y(e)) of the model in this pable of estimating clip- and segment-level events, it falls short
work. Note that the clip-level output is unaffected by post- of providing a competitive Event-F1 performance. We believe
processing. that the comparatively low Event-F1 performance stems from
Since each proposed F1-score metric is a summarization the nature of urban auditory scenes, where most events occur
of individual scores, two main averaging approaches exist. in a much more random fashion compared to, e.g., domestic
Micro scores are averaged across the number of instances (e.g., ones. Moreover, the performance discrepancy between Seg-
number of samples), whereas macro scores are averaged across F1 and Event-F1 further exemplifies the difficulty in WSSED
each event (e.g., first compute F1 score on sample basis per to obtain fine-scale onset and offset estimates. When adding
event, then compute the average of all scores). additional augmentation methods, the performance further
The sed eval toolbox [33] is utilized for score calculation improves (77.13 Tagging-F1, 64.75 Seg-F1). Most notably,
(Seg-F1 and Event-F1 scores). Each respective dataset is our augmented CDur also outperforms the fully-supervised
evaluated using the following default evaluation metrics: SED system in [34] in terms of Seg-F1 (64.70%). Lastly, by
DCASE2017: The DCASE2017 challenge originally con- further replacing double thresholding with triple thresholding
sisted of two subtasks (tagging and localization). The default as the post-processing method, the Event-F1 score improves
evaluation metric on the DCASE2017 dataset is 1 second from 21.73 to 22.54%. We provide a per event breakdown of
segment-level micro F1-score. We, therefore, report all our our model performance in Figure 4 and compare CDur to the
metrics on the micro-level (Event, Segment, Tagging). next best (supervised SED) model from [34]. As seen, CDur
DCASE2018: All metrics for the DCASE2018 dataset are performs evenly across all ten events in terms of Tagging-
macro-averaged, and the primary metric during the challenge F1, averaging ≈ 70% across most events. A similar obser-
was Event-F1. vation can also be made regarding the Seg-F1 score, where
Urban-SED: The default evaluation metric on this dataset most events obtain a score of ≈ 65%. Even though CDur
is 1 second segment-level macro F1-scores, thus on average, only slightly outperforms the supervised approach from [34],
only two segments need to be estimated for each event (see analyzing the per event Seg-F1 scores reveals that CDur
Figure 2a). achieves an F1 score of 71.9 compared to 66.0 on the shortest
event within the dataset (car horn, here car). Event-F1 results
V. R ESULTS are also shown to be evenly distributed, reaching from the
In this section, we report and compare our results on the lowest 12.0% for “dog” to 41.5% for “jackhammer”. Future
three publicly available datasets (see Section IV-A). If not oth- work is still required to exclusively estimate very short events
erwise specficied, all our reported results in this section utilize effectively, such as in this dataset.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 8
F1
40 31.0 37.6 33.3
24.6
F1
40 20
7.2 8.8
0 0.0 0.0
20
Segment-F1
0 80 74.9 79.2 84.7
Segment-F1 67.9 64.1 65.5 67.0
77.9 78.3 60 58.1 55.8 57.6 53.2 55.9 58.1 61.2
80
71.9
52.2 51.7 48.3 54.4 47.0
68.1 71.5 68.1 69.9 68.2 37.8 38.2 34.933.5 38.6
42.3 39.1
F1
66.0 63.2 60.3 66.0 66.3 64.5 59.7 61.1
40 32.431.2
60 59.1 56.9 26.4 25.1
20
47.8 48.3 9.4
0 0.0 0.5 1.3
40
F1
Event-F1
20 51.7
40 34.2
0 29.6
26.1
F1
Event-F1 18.4 21.4
20 14.8 16.1
41.5 9.3 9.6 12.4 13.3
40 7.7 7.0
1.5 3.7 4.4
0
30 29.5 29.3
fire_engine_fire_truck_(siren)
screaming
air_horn_truck_horn
reversing_beeps
skateboard
ambulance_(siren)
motorcycle
police_car_(siren)
civil_defense_siren
train_horn
car_alarm
bicycle
car_passing_by
train
truck
car
bus
24.8
19.4 19.2
F1
Fig. 4. CDur per event F1 score results on the URBAN-SED dataset. Segment- Fig. 5. CDur per event evaluation F1-scores on DCASE2017. The scores are
F1 scores are compared to the next best (supervised) approach [34].The events compared to the winning SED model [18] from the DCASE2017 challenge.
are sorted from (left) short to (right) long average duration. The events are sorted from (left) short to (right) long average duration.
TABLE III
DCASE2017 RESULTS COMPARED TO OUR APPROACH . F1- SCORES ARE of at least 1 s (1 Hz), matching the segment-level criterion.
MICRO AVERAGED . B EST RESULTS HIGHLIGHTED IN BOLD . R ESULTS Besides, CDur is the only approach preserving acceptable
UNDERLINED ARE FUSION SYSTEMS . R REPRESENTS THE OUTPUT TIME
RESOLUTION FOR A RESPECTIVE APPROACH IN H Z .
performance, yet with a high time-resolution of 50 Hz. A
closer look at the per event F1 scores in Figure 5 reveals
Approach R Tagging-F1 Seg-F1 Event-F1 that CDur has difficulties predicting ambiguous events such
MaxPool [7] 2.69 25.70 25.20 - as “car passing by” and “car”. Regarding joint tagging and
AutoPool [7] 2.69 45.40 42.50 -
Stacked CRNN [35] 50 43.30 48.90 - localization performance, CDur shows promising results for
Fusion GCRNN [36] 24 55.60 51.80 - the events “air horn” (2.78 s), “civil defense siren” (8.57 s)
GCRNN [36] 24 54.20 47.50 - and “motorcycle” (7.23 s), achieving ≈ 68% Tagging-F1, ≈
GCCaps [37] 24 58.60 46.30 -
Winner SED [18] 1 52.60 55.50 - 65% Seg-F1 and ≈ 38.5% Event-F1 average scores. Notably,
Ours 52.39 46.12 15.12 the best comparable SED model [18] can be seen to miss some
+LS 52.07 46.73 15.60 events in terms of Seg-F1. Those events are “car passing by”,
+Time 49.83 46.60 16.15
50 “ambulance” and “bus”. In stark contrast, CDur’s worst Seg-
+SpecAug 55.07 49.94 15.46
+All 55.29 50.79 15.26 F1 performance is also on “car passing by”, but CDur still
+All +Triple 55.29 49.93 15.73 manages a Seg-F1 score of 9.4% (against 0.0%). Compared
with other approaches, one distinct feature of CDur is that
it does not miss any event, regarding all utilized metrics.
B. DCASE2017
Moreover, performance for the shortest duration events “air
We here provide the results on the DCASE2017 dataset, horn” (2.78 s, Event-F1 34.2%), “train horn” (2.05 s, Event-
where all reported metrics are micro averaged. Regarding the F1 26.1%), “screaming” (2.54 s, Event-F1 18.4%) further
DCASE2017 dataset results in Table III, it can be observed exemplifies our model’s capability in detecting short duration
that even though the dataset contains the most available events.
training data, Tagging- (52.39%), Seg- (46.12%), and Event-F1
(15.12%) performance is sub-optimal. After adding augmen-
tation to CDur training, Tagging- (55.29%), Seg- (50.79%) C. DCASE2018
and Event-F1 (15.26%) performance significantly increases. For these experiments, we utilize two given DCASE2018
However, by replacing double threshold with triple threshold, subsets for model training, being the common training subset
no gains can be observed. We believe that our performance (weak) and the unlabeled in-domain subset. Weak labels were
on this dataset could improve by increasing the number of estimated on the in-domain dataset for further training using
trainable parameters since the most comparative approach to our best performing CDur model (+All, 70.56 Tagging-F1) by
our in [36] utilized at least six convolutional layers. Moreover, thresholding the clip-level output with a conservative value of
note that our approach is a single model for both Tagging- φtag = 0.75. Since not all clips obtained a label (no event has
and Seg-F1 evaluation, while the best comparable model [36] a probability higher than 0.75), we report that our predicted
trained two networks with different time resolutions for each in-domain subset consists of 9266/14412 clips. Note that other
respective task. Other approaches are specialized for one approaches estimating labels for the in-domain data are likely
individual task (tagging, localization), e.g., the winning local- to be different from ours (such as [6], [38], [17], [10]). We
ization system [18], a large CNN fusion model, outperforms verify that our predicted labels are usable by training CDur
our method only regarding Seg-F1. This is to be expected, exclusively on the generated in-domain labels (Table IV).
since [18] outputs their predictions on a coarse time resolution Further, we refer to “Weak+” as the merged dataset of “Weak”
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 9
CDur cATP-SDS
and the predicted “In-domain” data, containing 10844 semi- Tagging-F1
87.0 86.4
noisy clips. 80
68.4 68.4
77.6 76.2
67.3 67.2
75.7 72.0
62.3 61.4 66.7 65.5
60 56.5 58.3 61.2
55.4 54.7
40.8
F1
40
TABLE IV 20
F1
40 36.1
20
Approach Data R Tagging- Seg-F1 Event-F1
0
F1 Event-F1
60 58.6
Hybrid-CRNN [10] Weak+ 50 - - 25.40 50.0 47.6 52.0 51.2 52.4
50 45.7
43.6 41.9 42.1
Second’18 [38] Weak+ 50 - - 29.90 40 36.4 36.5 36.8 37.4
30
F1
Winner’18 [17] Weak+ 16 - - 32.40 21.6 21.6 23.3 24.0
20 15.8 15.1
CRNN [5] Weak 25 - - 32.50 10
0
Multi-Branch [13] Weak 50 - - 34.60 Dishes Speech Dog Cat Alarm Blender Water Frying Electric Vacuum
cATP-SDS [6] Weak+ 50 65.20 - 38.60 Short Average Event Duration Long
14
32
Event-F1
Event-F1
Event-F1
20
30
12
28
10 18
low=0.3
low=0.1 26 hi=0.75
hi=0.9 F1=25.23
F1=8.43 low=0.1
8 24
hi=0.9
F1=16.15
16
0.0 0.1 0.25 0.5 0.75 0.9 0.0 0.1 0.25 0.5 0.75 0.9 0.0 0.1 0.25 0.5 0.75 0.9
clip clip clip
Our proposed setting with φclip = 0.5 and φhi = whereas Attention pooling [11] uses a per timestep, per event
0.75, φlow = 0.2 can be seen to indeed perform best on weight wt (e) ∈ [0, 1].
the DCASE2018 dataset (see Figure 7b). However, for the
other two datasets, a lower threshold of φlow = 0.1 seem to TABLE V
be favorable on the DCASE2017 and URBAN-SED datasets, T EMPORAL POOLING ALTERNATIVES TO L IN S OFT INVESTIGATED IN THIS
WORK . S OFT- AND M AX POOLING FUNCTIONS ARE PARAMETER - FREE ,
culminating in Event-F1 scores of 18.97 (DCASE2017, micro) WHILE AUTO AND ATTENTION POOLING ARE PARAMETERIZED .
and 24.95 (URBAN-SED, macro) respectively. Again, please
note that considerable gains can be obtained by optimizing our Temporal pooling Formulation
exp yt (e)
y(e) = T
P
thresholds if we choose to optimize towards a specific dataset. Soft t yt (e) PT exp yj (e)
j
For example, our best reported DCASE2017 model (micro Max y(e) = max yt (e)
Event-F1 15.73) can be improved to 18.97, by modifying Pt exp (α(e)yt (e))
Auto y(e) = T t yt (e) PT exp (α(e)yj (e))
φlow = 0.1. The same observation can be made for URBAN- PT j
tPwt (e)yt (e)
SED (22.54 → 24.95). Attention y(e) = T w (e)
j j
(A) YbUDfTkJCg0o_20.000_30.000
in Event-F1 score when removing double/triple thresholding.
However, even though event-level performance is negatively
affected, segment-level performance is enhanced for both
our approaches. This is likely due to default post-processing
(B) Ground truth
method using a conservative threshold choice of φhi = 0.75 Speech
Frying
(Seg-Precision 74.51%, Seg-Recall 56.80%), while φbin = 0.5 Dishes
(C) Ours
enhances recall performance (Seg-Precision 73.70%, Seg- 1.00 Dishes
Recall 58.35%), thus improving Seg-F1. Conservative thresh- Frying
0.75 Speech
probability
old values have also been reported to impact Event-F1 perfor- 0.50
mance [39] positively. Therefore, we can observe an absolute
0.25
increase from 0.9 to 3.0% Seg-F1 for our models. In contrast to 0.10
our approach, cATP-SDS [6] seems to be strongly dependent (D) cATP-SDS
on the post-processing method. Its event and segment level 1.00
probability
weak+ (34.07 → 2.74) training sets. The reason for this 0.50
behavior is their dependence on clip-level post-processing, 0.25
which effectively filters out false-positives. Different from 0.10
cATP-SDS, CDur does not require post-processing in order 0 50 100 150 200 250 300 350 400 450 500
Frame (20ms)
to effectively localize sounds and estimate each sound’s re-
spective duration. Fig. 8. (A) A sample clip comparison on the (B) DCASE2018 evaluation
URBAN-SED: Since most previous work, such as [7] only dataset between (C) CDur and (D) cATP-SDS. Best viewed in color.
utilized segment-level micro F1 scores. We reimplemented
those models to compare the Event-F1 performance to ours.
We also utilized the same training schema (Pitch augmenta- randomly sampled. Regarding DCASE2018, our sampled
tion, binarization post-processing) as in [7]. These results show clip (bUDfTkJCg0o) contains three different events (speech
that our reimplementations perform worse in terms of Tagging- [green], frying [blue], dishes [red]). We compare the proba-
F1 performance, but the Seg-F1 score improves compared to bility outputs yt (e) of CDur (Event-F1 39.42) against cATP-
the original publication (see Table II). It can be expected that SDS (Event-F1 34.07). The comparison can be seen in Fig-
an improved Seg-F1 can correlate with an improved Event-F1. ure 8. At first glance, it can be seen that both methods are
incapable of producing perfect results. Therefore we focus on
TABLE X the errors made by each individual approach. In particular,
P OST- PROCESSING ABLATION (-Post) RESULTS COMPARING [7] ( NO for the speech event (green), one can notice that cATP-
POST- PROCESSING ) APPROACHES WITH CD UR ON URBAN-SED.
M ODELS WITH A “O UR ” PREFIX ARE REIMPLEMENTATIONS FOR METRIC SDS predictions exhibit a peaking behavior with no apparent
COMPARISON PURPOSES . notion of the speech duration. Therefore, cATP-SDS requires
Approach Tagging-F1 Seg-F1 Event-F1
median filtering to remove false-positives and connect (or
SoftPool CNN [7] 63.00 49.20 -
remove) its disjoint predictions. In contrast, CDur seems to
MaxPool CNN [7] 74.30 46.30 - predict onset and offsets accurately without the need for post-
AutoPool CNN [7] 75.70 50.40 - processing. This behavior is in accord with our previous
Our SoftPool CNN [7] 59.86 45.80 2.45
Our MaxPool CNN [7] 74.34 58.76 13.04
observation in Table IX. Additionally, the “dishes” event
Our AutoPool CNN [7] 72.74 56.68 6.89 (a sharp cling sound of a fork hitting porcelain) is hard
Ours (Best) 77.13 64.75 21.73 to estimate for both models. However, cATP-SDS predicts
-Post 77.13 63.86 14.54 dishes as an always present background noise, meaning that
it failed to learn the characteristics of the short and sharp
Please note that all models reported in [7] only utilize a
“dishes” sound. Lastly, we also provide five distinct samples
binary threshold strategy, identical to our (-Post) result, as
for the shortest duration events within DCASE2018 (“dishes”,
post-processing. By removing our post-processing, we can
“cat”,“dog”,“speech”,“alarm bell ringing”) in Figure 9. These
compare both approaches from a common point of view. As it
samples further demonstrate that CDur is capable of detecting
can be seen in Table X, CDur is capable of outperforming the
and accurately predicting short and sporadic events, such as
best localization model (MaxPool CNN), in terms of segment
“alarm bell ringing”, “cat”, and “dishes”. Our second sampled
and Event-F1, as well as capable of outperforming the best
clip (soundscape test unimodal585) is from the URBAN-SED
tagging model (AutoPool CNN).
dataset (see Figure 10). We compare CDur to a reimplementa-
tion of [7], using our best performing Event-F1 MaxPool CNN
D. Quantitative results model (Table II). Note that our reimplementation uses our
Since our metrics are threshold-dependent, meaning that LMS features with 50 Hz (20 ms/frame) frame rate, whereas
only hard labels were evaluated, we provide a quantita- the original work used 43 Hz (23 ms/frame). At first glance,
tive analysis to strengthen the point of our models’ dura- one can observe that this dataset is indeed artificial since the
tion robustness. Two distinct clips containing at least three spectrogram appears to contain only the target events, without
events from DCASE 2018 and URBAN-SED datasets are any other natural background noises. As this sample shows, the
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 13
Ground truth Ground truth Ground truth Ground truth Ground truth
he ying ech
Dis Fr Spe
he ech
Dis Spe
ing
s
Ca
Do
(A) (B) (C) (D) (E)
ing
ll_r
Fig. 9. Predictions for five samples for each of the shortest duration events (“dishes”, “cat”,“dog”,“speech”,“alarm bell ringing”) in DCASE2018. Best viewed
in color.
(C) Ours
CDur is then compared to other approaches in terms of
1.00 Segment, Event, and Tagging performance. Experiments con-
dog_bark
il
engine_idling
probability
gun_shot
0.50 jackhammer imply promising performance. The DCASE2018 results show
0.25 that CDur can outperform previous SOTA models in terms of
0.10 Tagging- (69.11), Event- (39.42), and Seg-F1 (63.53). Besides,
(D) MaxPool CNN
1.00 on the URBAN-SED dataset CDur outperforms supervised
0.75 methods in terms of Seg- (64.75) and Tagging-F1 (77.13).
probability
0.50
A series of ablation experiments reveal the models’ inherent
0.25
capability to correctly localize sounds with effective onset and
0.10 offset estimation. In our future work, we would like to inves-
0 50 100 150 200 250 300 350 400 450 500 tigate the new polyphonic scene detection score (psds) [39] as
Time (20ms)
a metric, which seems to provide promising insights towards
a model’s duration robustness and overall performance given
Fig. 10. (A) A sample clip comparison on the (B) URBAN-SED evaluation a specific scenario.
dataset between (C) CDur and (D) MaxPool CNN. Best viewed in color.
ACKNOWLEDGMENT
MaxPool CNN model is indeed capable of sound localization, This work has been supported by the Major Program of
specifically for events with a duration of ≈ 1.5 s, such as National Social Science Foundation of China (No.18ZDA293).
“jackhammer”, “gun shot” and “children playing”. However, Experiments have been carried out on the PI supercomputer
it seems to struggle with longer events, such as “dog bark”, at Shanghai Jiao Tong University.
at which it exhibits a peaking behavior, chunking the event
into small pieces (see orange line). On the contrary, CDur can
R EFERENCES
predict and localize both short and long events for this sample.
Specifically, MaxPool CNN was unable to notice the short [1] F. Font, G. Roma, and X. Serra, Sound Sharing and Retrieval.
Springer International Publishing, 2018, pp. 279–301. [Online].
“engine idling” event (around 800 ms), yet CDur predicted its Available: https://doi.org/10.1007/978-3-319-63450-0{ }10
presence. We believe that this is due to the low time-resolution [2] J. P. Bello, C. Mydlarz, and J. Salamon, Sound Analysis in Smart
of the MaxPool CNN model (320 ms/frame, 3.125 Hz), which Cities. Cham: Springer International Publishing, 2018, pp. 373–397.
[Online]. Available: https://doi.org/10.1007/978-3-319-63450-0{ }13
could skip over the presence of short events. [3] S. Krstulović, Audio Event Recognition in the Smart Home. Cham:
Springer International Publishing, 2018, pp. 335–371. [Online].
VII. C ONCLUSION Available: https://doi.org/10.1007/978-3-319-63450-0{ }12
[4] H. Dinkel, Y. Chen, M. Wu, and K. Yu, “Voice activity detection in the
This work proposed CDur, a duration robust sound event de- wild via weakly supervised sound event detection,” mar 2020. [Online].
tection CRNN model. CDur aims to be as flexible as possible Available: http://arxiv.org/abs/2003.12222
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 14
[5] H. Dinkel and K. Yu, “Duration Robust Weakly Supervised Sound Event D. Ellis, R. Yamamoto, S. Seyfarth, E. Battenberg, V. Morozov,
Detection,” in ICASSP 2020 - 2020 IEEE International Conference on R. Bittner, K. Choi, J. Moore, Z. Wei, S. Hidaka, Nullmightybofo,
Acoustics, Speech and Signal Processing (ICASSP). IEEE, may 2020, P. Friesch, F.-R. Stöter, D. Hereñú, T. Kim, M. Vollrath, and
pp. 311–315. [Online]. Available: https://ieeexplore.ieee.org/document/ A. Weiss, “librosa/librosa: 0.7.2,” jan 2020. [Online]. Available:
9053459/ https://zenodo.org/record/3606573
[6] L. Lin, X. Wang, H. Liu, and Y. Qian, “Specialized Decision Surface [23] O. Tange, “Gnu parallel - the command-line power tool,” ;login: The
and Disentangled Feature for Weakly-Supervised Polyphonic Sound USENIX Magazine, vol. 36, no. 1, pp. 42–47, Feb. 2011. [Online].
Event Detection,” IEEE/ACM Transactions on Audio Speech and Available: http://www.gnu.org/s/parallel
Language Processing, vol. 28, pp. 1466–1478, may 2020. [Online]. [24] E. Cakir and T. Virtanen, “End-to-End Polyphonic Sound Event
Available: http://arxiv.org/abs/1905.10091 Detection Using Convolutional Recurrent Neural Networks with
[7] B. McFee, J. Salamon, and J. P. Bello, “Adaptive Pooling Operators for Learned Time-Frequency Representation Input,” Proceedings of the
Weakly Labeled Sound Event Detection,” IEEE/ACM Transactions on International Joint Conference on Neural Networks, vol. 2018-July,
Audio Speech and Language Processing, vol. 26, no. 11, pp. 2180–2193, 2018. [Online]. Available: https://arxiv.org/pdf/1805.03647.pdf
nov 2018. [25] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,”
[8] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. 7th International Conference on Learning Representations, ICLR 2019,
Plumbley, “PANNs: Large-Scale Pretrained Audio Neural Networks p. 13, Feb 2019. [Online]. Available: https://arxiv.org/pdf/1711.05101.
for Audio Pattern Recognition,” dec 2019. [Online]. Available: pdf
http://arxiv.org/abs/1912.10211 [26] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
[9] Q. Kong, C. Yu, Y. Xu, T. Iqbal, W. Wang, and M. D. Plumbley, in 3rd International Conference on Learning Representations, ICLR
“Weakly labelled audioset tagging with attention neural networks,” 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track
IEEE/ACM Transactions on Audio Speech and Language Processing, Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available:
vol. 27, no. 11, pp. 1791–1802, mar 2019. [Online]. Available: http://arxiv.org/abs/1412.6980
http://arxiv.org/abs/1903.00765 [27] K. Sechidis, G. Tsoumakas, and I. Vlahavas, “On the stratification
[10] S. Kothinti, K. Imoto, D. Chakrabarty, G. Sell, S. Watanabe, of multi-label data,” Machine Learning and Knowledge Discovery in
and M. Elhilali, “Joint Acoustic and Class Inference for Weakly Databases, pp. 145–158, 2011.
Supervised Sound Event Detection,” ICASSP, IEEE International [28] J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello,
Conference on Acoustics, Speech and Signal Processing - Proceedings, “Scaper: A library for soundscape synthesis and augmentation,” in IEEE
vol. 2019-May, pp. 36–40, nov 2019. [Online]. Available: https: Workshop on Applications of Signal Processing to Audio and Acoustics,
//arxiv.org/abs/1811.04048 vol. 2017-October. Institute of Electrical and Electronics Engineers
[11] Y. Wang, J. Li, and F. Metze, “A Comparison of Five Multiple Inc., dec 2017, pp. 344–348.
Instance Learning Pooling Functions for Sound Event Detection with [29] J. Salamon, C. Jacoby, and J. P. Bello, “A dataset and taxonomy for
Weak Labeling,” ICASSP, IEEE International Conference on Acoustics, urban sound research,” in Proceedings of the 22nd ACM International
Speech and Signal Processing - Proceedings, vol. 2019-May, pp. Conference on Multimedia, ser. MM ’14. New York, NY, USA:
31–35, oct 2019. [Online]. Available: http://arxiv.org/abs/1810.09050 Association for Computing Machinery, 2014, p. 1041–1044. [Online].
[12] C.-C. Kao, M. Sun, W. Wang, and C. Wang, “A Comparison of Pooling Available: https://doi.org/10.1145/2647868.2655045
Methods on LSTM Models for Rare Acoustic Event Classification,” [30] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence,
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and
Speech and Signal Processing (ICASSP), pp. 316–320, feb 2020. human-labeled dataset for audio events,” in ICASSP, IEEE International
[Online]. Available: http://arxiv.org/abs/2002.06279 Conference on Acoustics, Speech and Signal Processing - Proceedings.
[13] Y. Huang, X. Wang, L. Lin, H. Liu, and Y. Qian, “Multi-Branch Institute of Electrical and Electronics Engineers Inc., jun 2017, pp. 776–
Learning for Weakly-Labeled Sound Event Detection,” ICASSP 2020 - 780.
2020 IEEE International Conference on Acoustics, Speech and Signal [31] D. S. Park, W. Chan, Y. Zhang, C. C. Chiu, B. Zoph, E. D. Cubuk,
Processing (ICASSP), pp. 641–645, feb 2020. [Online]. Available: and Q. V. Le, “Specaugment: A simple data augmentation method for
http://arxiv.org/abs/2002.09661 automatic speech recognition,” Proceedings of the Annual Conference of
[14] Q. Kong, Y. Xu, I. Sobieraj, W. Wang, and M. D. Plumbley, the International Speech Communication Association, INTERSPEECH,
“Sound Event Detection and Time-Frequency Segmentation from vol. 2019-September, pp. 2613–2617, apr 2019. [Online]. Available:
Weakly Labelled Data,” IEEE/ACM Transactions on Audio Speech and http://dx.doi.org/10.21437/Interspeech.2019-2680
Language Processing, vol. 27, no. 4, pp. 777–787, apr 2019. [Online]. [32] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking
Available: http://arxiv.org/abs/1804.04715 the inception architecture for computer vision,” in Proceedings of the
[15] T. Pellegrini and L. Cances, “Cosine-similarity penalty to discriminate IEEE conference on computer vision and pattern recognition, 2016, pp.
sound classes in weakly-supervised sound event detection,” Proceedings 2818–2826.
of the International Joint Conference on Neural Networks, vol. 2019- [33] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for polyphonic
July, jan 2019. [Online]. Available: http://arxiv.org/abs/1901.03146 sound event detection,” Applied Sciences, vol. 6, no. 6, p. 162, 2016.
[16] W. Liu, Y. Wen, Z. Yu, and M. Yang, “Large-margin softmax loss for [Online]. Available: http://www.mdpi.com/2076-3417/6/6/162
convolutional neural networks.” in ICML, vol. 2, no. 3, 2016, p. 7. [34] I. Martin-Morato, A. Mesaros, T. Heittola, T. Virtanen, M. Cobos, and
[17] L. JiaKai, “Mean teacher convolution system for dcase 2018 task 4,” F. J. Ferri, “Sound Event Envelope Estimation in Polyphonic Mixtures,”
DCASE2018 Challenge, Tech. Rep., September 2018. in ICASSP, IEEE International Conference on Acoustics, Speech and
[18] D. Lee, S. Lee, Y. Han, and K. Lee, “Ensemble of convolutional neural Signal Processing - Proceedings, vol. 2019-May. Institute of Electrical
networks for weakly-supervised sound event detection using multiple and Electronics Engineers Inc., may 2019, pp. 935–939.
scale input,” in Proceedings of the Detection and Classification of [35] S. Adavanne and T. Virtanen, “Sound event detection using weakly
Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017, pp. labeled dataset with stacked convolutional and recurrent neural network,”
74–79. in Proceedings of the Detection and Classification of Acoustic Scenes
[19] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez, “Solving the and Events 2017 Workshop (DCASE2017). Tampere University of
multiple instance problem with axis-parallel rectangles,” Artificial In- Technology. Laboratory of Signal Processing, 2017, pp. 12–16.
telligence, vol. 89, no. 1-2, pp. 31–71, jan 1997. [36] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley, “Surrey-CVSSP system
[20] F. Zhai, S. Potdar, B. Xiang, and B. Zhou, “Neural models for sequence for DCASE2017 challenge task4,” DCASE2017 Challenge, Tech. Rep.,
chunking,” in Proceedings of the Thirty-First AAAI Conference on September 2017.
Artificial Intelligence, ser. AAAI’17. AAAI Press, 2017, p. 3365–3371. [37] T. Iqbal, Y. Xu, Q. Kong, and W. Wang, “Capsule routing for
[21] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, sound event detection,” European Signal Processing Conference,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, vol. 2018-Septe, pp. 2255–2259, 2018. [Online]. Available: https:
E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, //arxiv.org/pdf/1806.04699.pdfhttp://arxiv.org/abs/1806.04699
L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high- [38] Y. L. Liu, J. Yan, Y. Song, and J. Du, “Ustc-nelslip system for dcase
performance deep learning library,” in Advances in Neural Information 2018 challenge task 4,” DCASE2018 Challenge, Tech. Rep., September
Processing Systems. Curran Associates, Inc., 2019, vol. 32, pp. 2018.
8026–8037. [Online]. Available: http://arxiv.org/abs/1912.01703 [39] C. Bilen, G. Ferroni, F. Tuveri, J. Azcarreta, and S. Krstulovic, “A
[22] B. McFee, V. Lostanlen, M. McVicar, A. Metsai, S. Balke, C. Thomé, Framework for the Robust Evaluation of Sound Event Detection,” in
C. Raffel, A. Malek, D. Lee, F. Zalkow, K. Lee, O. Nieto, J. Mason, ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 15
Speech and Signal Processing (ICASSP). IEEE, may 2019, pp. 61–65.
[Online]. Available: http://arxiv.org/abs/1910.08440