BJÖRN LINDQVIST
Abstract
Onset detection is a research area in music analysis concerned with
determining when audio events occur in digital audio streams. The
area has grown considerably in recent years and several new methods
for onset detection have been developed, gradually improving upon
the state of the art. Currently, the top spot is held by Schlüter
and Böck, who in 2014 presented a detector based on a convolutional
neural network (CNN) that attained an F-score of 90.3% (precision
91.7%, recall 88.9%) on a commonly used dataset [1].
But in 2018, another pair of researchers, Gong and Serra, failed to
reproduce this result. They obtained an F-score of only 86.67%
(precision and recall values weren’t reported) [2], a significantly
worse result than Schlüter and Böck’s. In comparison, a 2013 detector
based on a recurrent neural network (RNN), also designed by Schlüter
and Böck, attained an F-score of 87.3% [3].
Gong and Serra’s result casts doubt on the 90.3% figure reported by
Schlüter and Böck. We therefore try to shed some light on what the
state-of-the-art performance in musical onset detection really is by
posing and answering the question: can Schlüter and Böck’s results
be reproduced? We do this by implementing the proposed networks and
seeing whether they can achieve the same performance on the same
dataset Schlüter and Böck used.
Our answer is “Maybe – but we were unable to!”, which is perhaps the
only possible result, since one cannot prove a negative. We trained
the CNN architecture three times and obtained F-scores of 85.0%, 85.8%
and 85.6%, about five percentage points less than Schlüter and Böck’s
90.3%. We also tried to replicate their result for the RNN and obtained
F-scores of 86.3%, 86.3% and 86.3%. These too are worse than Schlüter
and Böck’s, albeit the difference is only one percentage point.
Some details were missing from the short articles, authored by the
researchers mentioned above, that we read to understand the methods.
We therefore had to make educated guesses about parameters such as
the learning rate, so our implementations are not exact replicas of
the authors’. This could partially explain our worse results.
Nevertheless, we believe that our work is worthwhile because it
demonstrates how infuriatingly difficult it is for deep learning
researchers to reproduce each other’s work.
Summary
Onset detection is a field within music analysis concerned with
determining when events in audio data occur. The field has grown
considerably in recent years and several new methods for onset
detection have been developed, gradually improving the best achievable
detection performance. The record is held by Schlüter and Böck, who in
2014 presented a detection method based on a convolutional neural
network (CNN) that achieved an F-score of 90.3% (with precision 91.7%
and recall 88.9%) on a commonly used dataset [1].
But in 2018, another pair of researchers, Gong and Serra, failed to
replicate this result. Their F-score landed at 86.67% (they reported
neither precision nor recall) [2], a much worse result than Schlüter
and Böck’s. For comparison, in 2013 Schlüter and Böck presented a
detection method based on a recurrent neural network (RNN) with an
F-score of 87.3% [3].
Gong and Serra’s result calls Schlüter and Böck’s 90.3% figure into
question. We therefore try to find out whether the latter result holds
up. We do this by implementing the proposed networks and seeing whether
we can reach the same detection performance on the same dataset that
Schlüter and Böck used: can their result be replicated?
Our answer is “Maybe – but we could not!” Possibly, that is the only
certain statement that can be made, since we failed to replicate their
result. The three convolutional networks we trained obtained F-scores
of 85.0%, 85.8% and 85.6%, roughly five percentage points lower than
Schlüter and Böck’s 90.3%. We also tried to replicate their result for
the recurrent neural network and there obtained F-scores of 86.3%,
86.3% and 86.3%. These too are worse than Schlüter and Böck’s, but the
difference is only one percentage point.
Detailed information was missing from the short articles, authored by
the researchers mentioned above, that we read to understand the
methods. We therefore had to guess certain details, such as
learning-rate parameters, and so we cannot guarantee that the
implementations we evaluated are exact copies of the authors’. This
may partly explain our worse results.
We nevertheless believe that our work is valuable, since it shows how
remarkably difficult it is to replicate results in the field of deep
learning.
Contents
1 Introduction 1
1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 3
2.1 Precision, recall and F-score . . . . . . . . . . . . . . . . . 3
2.2 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Onset detection methods . . . . . . . . . . . . . . . . . . . 5
2.4 Detection using supervised learning . . . . . . . . . . . . 6
2.5 RNN-based onset detector . . . . . . . . . . . . . . . . . . 7
2.5.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . 7
2.5.2 Peak-picking . . . . . . . . . . . . . . . . . . . . . 8
2.5.3 Initialization . . . . . . . . . . . . . . . . . . . . . . 8
2.5.4 Training regimen . . . . . . . . . . . . . . . . . . . 8
2.5.5 Improvements and Madmom implementation . . 8
2.6 CNN-based onset detector . . . . . . . . . . . . . . . . . . 9
2.6.1 Audio preprocessing . . . . . . . . . . . . . . . . . 10
2.6.2 Peak-picking . . . . . . . . . . . . . . . . . . . . . 10
2.6.3 Training regimen . . . . . . . . . . . . . . . . . . . 11
2.6.4 2014 architecture . . . . . . . . . . . . . . . . . . . 11
2.7 The Böck dataset . . . . . . . . . . . . . . . . . . . . . . . 11
2.8 Madmom . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Methods and Implementations 13
3.3.1 Normalization . . . . . . . . . . . . . . . . . . . . . 16
3.3.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . 17
3.3.3 Batching . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.4 Peak-picking . . . . . . . . . . . . . . . . . . . . . 18
3.3.5 Network layers and optimizer . . . . . . . . . . . 19
3.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Results 22
5 Discussion 24
6 Conclusion 26
6.1 Directions for future research . . . . . . . . . . . . . . . . 27
Bibliography 28
Chapter 1
Introduction
1.1 Outline
The rest of this thesis is structured as follows: in chapter 2 we briefly
describe what onset detection is, the field’s state of the art and the two
detector designs we have evaluated. We also introduce the Böck dataset
– one of the few freely available, moderately sized onset datasets. In
chapter 3 we describe our reimplementations of Schlüter and Böck’s
detectors, including how and why they differ from the designs described
in the referenced articles, as well as our evaluation method. Then follows
a chapter presenting our results, a discussion and finally a short
conclusion.
Chapter 2
Background
2.1 Precision, recall and F-score

$$P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}, \qquad F = \frac{2PR}{P + R}.$$
1 The F-score is sometimes denoted F1-score, but we omit the suffix.
Table 2.1: Onset detectors and their performance on the Böck dataset. Three
variants of the RNN-based and three of the CNN-based detector along with
the SuperFlux detector, relying on traditional signal processing, are shown.
All detectors were developed by Böck and colleagues.
Note that the F-score is the harmonic mean of the precision and
the recall. In classification there is often a tradeoff between recall
and precision. When such tradeoffs can be made, the F-score can be
optimized by setting the classification threshold in such a way that
recall and precision are roughly equal.
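As a concrete illustration, the three metrics can be computed from the
raw counts as follows (a minimal sketch of our own):

def prf(tp, fp, fn):
    # Precision: fraction of predicted onsets that are correct.
    p = tp / (tp + fp)
    # Recall: fraction of annotated onsets that were found.
    r = tp / (tp + fn)
    # F-score: the harmonic mean of precision and recall.
    f = 2 * p * r / (p + r)
    return p, r, f

# Example: 90 true positives, 10 false positives and 15 false
# negatives give P = 0.90, R = 0.857 and F = 0.878.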
Figure 2.1: The three data processing steps demonstrated using the
SpectralFlux algorithm. The raw waveform is first converted into a power
spectrogram, then the ODF is applied and onsets are predicted using peak
picking (vertical dotted lines).
2.5.1 Preprocessing
Figure 2.2 shows the preprocessing steps the RNN used. The signal
was first resampled to mono at 44.1 kHz and three magnitude spectro-
grams were calculated, each with a hop size of 10 ms and window
sizes of 11.61, 23.22 and 46.44 ms respectively. The three spectrograms
were then filtered using a Bark-scale filterbank and the logarithmic
magnitude of the linear representation was calculated. According to
Böck, Krebs, and Schedl [16], the logarithmic magnitude yielded better
results than the linear representation in many cases.
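In Python, these steps look roughly as follows. This is an illustrative
sketch only: librosa is an assumed dependency and the Bark filterbank
stage is stubbed out (madmom provides a real implementation).

import numpy as np
import librosa

SR = 44100                       # target sample rate
HOP = 441                        # 10 ms hop at 44.1 kHz
WIN_SIZES = [512, 1024, 2048]    # ~11.61, 23.22 and 46.44 ms windows

def rnn_features(filename):
    # Resample to mono 44.1 kHz.
    y, _ = librosa.load(filename, sr = SR, mono = True)
    specs = []
    for win in WIN_SIZES:
        # Magnitude spectrogram for this window size.
        S = np.abs(librosa.stft(y, n_fft = win, hop_length = HOP))
        # A Bark-scale filterbank would be applied here; this sketch
        # keeps the linear frequency bins.
        # Logarithmic magnitude; the offset avoids log(0).
        specs.append(np.log10(S + 1.0).T)
    return specs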
2.5.2 Peak-picking
A standard peak-picking function was used to find peaks in the output
from the RNN. The output was smoothed with a Hamming window
of seven frames, and local maxima exceeding 0.25 were reported as
onsets. The threshold 0.25 was chosen to optimize the F-score on the
training set [3].
If two onsets were detected less than 30 ms apart, they were com-
bined by keeping the first (left) one.
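Our implementation uses madmom’s peak_picking (see section 3.2.2); the
following is a simplified numpy sketch of our own of the procedure
described above:

import numpy as np

def pick_peaks(act, smooth = 7, threshold = 0.25,
               combine = 0.03, fps = 100):
    # Smooth the activations with a normalized Hamming window.
    win = np.hamming(smooth)
    act = np.convolve(act, win / win.sum(), mode = 'same')
    onsets = []
    for i in range(1, len(act) - 1):
        # A local maximum above the threshold counts as an onset.
        if act[i] > threshold and act[i - 1] <= act[i] >= act[i + 1]:
            t = i / fps
            # Onsets closer than 30 ms are combined; keep the first.
            if not onsets or t - onsets[-1] >= combine:
                onsets.append(t)
    return onsets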
2.5.3 Initialization
The weights of the network were initialized randomly using a Gaussian
distribution with 0 mean and 0.1 standard deviation [3].
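In Keras, which we use for our reimplementations, this corresponds to
an initializer along the following lines (our mapping of the
description, not code from [3]):

from keras.initializers import RandomNormal

# Gaussian weights: zero mean, standard deviation 0.1.
init = RandomNormal(mean = 0.0, stddev = 0.1)
# Used as, e.g., SimpleRNN(25, kernel_initializer = init).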
• The frame sizes were changed from 512, 1024 and 2048 samples
to 1024, 2048 and 4096 samples, corresponding to periods of
23.22, 46.44 and 92.88 ms respectively.
The logarithmic filterbank and the doubled frame sizes increased the
dimensionality of the input vectors from 144 to 266.
2.6.2 Peak-picking
The article for the 2013 architecture did not detail what peak-picking
procedure was used.
1. Rectified linear units (ReLU) were used instead of tanh for the
convolutional layers.
Peak picking worked exactly the same as for the RNN (see sec-
tion 2.5.2), except that five frames of smoothing were used instead of
seven. The threshold value was chosen by maximizing the F-score
using grid search.
Similarly to the 2013 architecture, the 2014 architecture was trained
for 100 epochs with gradient descent, mini-batches of size 256 and a
fixed learning rate of 0.05. Additionally, a momentum of 0.45 was used,
linearly increased to 0.9 between epochs 10 and 20.
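In Keras, such a momentum schedule can be expressed as a callback. The
sketch below is our interpretation; the original authors used a
different framework:

from keras import backend as K
from keras.callbacks import Callback

class MomentumRamp(Callback):
    # Linearly raise SGD momentum from 0.45 to 0.9 over epochs 10-20.
    def on_epoch_begin(self, epoch, logs = None):
        if epoch <= 10:
            m = 0.45
        elif epoch >= 20:
            m = 0.9
        else:
            m = 0.45 + (0.9 - 0.45) * (epoch - 10) / 10.0
        K.set_value(self.model.optimizer.momentum, m)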
2.7 The Böck dataset
We refer to this dataset as the Böck dataset. Each sound file in it is
between a few seconds and one minute in length. In total, the collection
is 102 minutes long and contains 25,927 annotated note onsets [1].4 It is
one of very few onset datasets available to researchers, and therefore
many onset detectors have been benchmarked against it.
The dataset was created using manual annotation, rather than
using synthesized sounds generated from MIDI, meaning that the
annotations can be off by a few tens of milliseconds from the actual
onsets. According to Böck [17], the accuracy of a manual annotation is
typically ±2 ms for percussive sounds and up to ±10 ms for soft onsets
produced by strings or woodwinds.
We compensate for this lack of accuracy when evaluating the detec-
tors by considering a 50 ms window centered around each predicted
onset. If an annotation falls within this window, we consider the onset
correctly predicted, but a single window cannot match more than one
annotation. This method was also used by Böck and colleagues.
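In code, the matching rule looks roughly like the following sketch; the
greedy closest-first matching is our own assumption:

def match_onsets(predicted, annotated, window = 0.025):
    tp, used = 0, set()
    for p in sorted(predicted):
        # Closest unused annotation within +-25 ms, if any.
        candidates = [(abs(a - p), i)
                      for i, a in enumerate(annotated)
                      if i not in used and abs(a - p) <= window]
        if candidates:
            used.add(min(candidates)[1])  # one onset per window
            tp += 1
    fp = len(predicted) - tp
    fn = len(annotated) - tp
    return tp, fp, fn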
For each audio file in the dataset there is a text file listing the
onset times for that file in seconds, one onset per line.
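Parsing such a file is trivial; a hypothetical loader:

def load_onsets(path):
    # One onset time (in seconds) per line.
    with open(path) as f:
        return [float(line) for line in f if line.strip()]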
2.8 Madmom
Madmom5 is a free audio signal processing library that plays an
essential role in our replication efforts. It was created by Böck to
accompany his doctoral thesis and other articles in music information
retrieval [18]. In particular, it contains implementations of the RNN
and CNN detectors evaluated in this thesis, as well as pre-trained
models for them.
Using Madmom’s source code, we were able to deduce details such
as parameter settings not explicitly stated in the published research.
The library has also made it significantly easier to write the code
required for our replication.
4 https://drive.google.com/file/d/1ICEfaZ2r_cnqd3FLNC5F_UOEUalgV7cv/view
5 https://madmom.readthedocs.io/en/latest/
Chapter 3
Methods and Implementations
RNN_Onset_Detection.ipynb
2 https://github.com/bjourne/onset-replication
The first group constitutes a testing harness for training and evaluating
onset detectors on the Böck dataset. The second group – cnn and rnn –
contains our CNN and RNN implementations. The user can configure
the environment by editing the config module and can then train and
evaluate the architectures by running the main module. For example,
to train the CNN for at most 100 epochs for each fold:
$ python main.py -n cnn -t 0:8 --epochs 100
3.2.1 Preprocessing
Listing 3.1: Implementation of the audio preprocessing for the RNN. The
input to the preprocessing function is a filename and the output a
two-dimensional numpy array of shape (f, 266), where f is the number of
frames in the file.
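A sketch of this preprocessing, built on madmom’s spectrogram classes,
follows; the exact class name and keyword arguments are our best
reading and should be checked against the madmom documentation and
our repository.

import numpy as np
from madmom.audio.spectrogram import LogarithmicFilteredSpectrogram

def preprocess_x(filename):
    # One log-filtered magnitude spectrogram per frame size, 100 fps.
    specs = [LogarithmicFilteredSpectrogram(filename,
                                            frame_size = fs,
                                            fps = 100,
                                            num_channels = 1)
             for fs in (1024, 2048, 4096)]
    # Stack the frequency bins of the three spectrograms so that each
    # frame becomes one 266-dimensional feature vector.
    n = min(len(s) for s in specs)
    return np.hstack([np.asarray(s)[:n] for s in specs])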
3.2.2 Peak-picking
import numpy as np
from madmom.features.onsets import peak_picking
from madmom.utils import combine_events

def postprocess_y(y):
    # Report local maxima in the network output as onsets.
    onsets = peak_picking(y,
                          threshold = 0.35,
                          smooth = 7,
                          pre_avg = 0, post_avg = 0,
                          pre_max = 1.0, post_max = 1.0)
    # Convert frame indices to seconds (100 frames per second).
    onsets = onsets.astype(float) / 100.0
    # Combine onsets closer than 30 ms, keeping the first one.
    onsets = combine_events(onsets, 0.03, 'left')
    return np.asarray(onsets)
from keras.models import Sequential
from keras.layers import Bidirectional, Dense, Masking, SimpleRNN
from keras.optimizers import SGD

def model():
    m = Sequential()
    # Mask padded frames; each frame is a 266-dimensional vector.
    m.add(Masking(input_shape = (None, 266)))
    # Three bidirectional recurrent layers, 25 units per direction.
    m.add(Bidirectional(SimpleRNN(units = 25,
                                  return_sequences = True)))
    m.add(Bidirectional(SimpleRNN(units = 25,
                                  return_sequences = True)))
    m.add(Bidirectional(SimpleRNN(units = 25,
                                  return_sequences = True)))
    # Sigmoid output: per-frame onset probability.
    m.add(Dense(units = 1, activation = 'sigmoid'))
    optimizer = SGD(lr = 0.01, clipvalue = 5, momentum = 0.9)
    m.compile(loss = 'binary_crossentropy',
              optimizer = optimizer,
              metrics = ['binary_accuracy'])
    return m
3.3.1 Normalization
3.3.2 Preprocessing
3.3.3 Batching
3.3.4 Peak-picking
def postprocess_y(y):
    # Same post-processing as for the RNN, but with five frames of
    # smoothing and a higher threshold. The parameters below the
    # truncation point in the original listing follow the RNN version;
    # they are our reconstruction.
    onsets = peak_picking(y,
                          threshold = 0.54,
                          smooth = 5,
                          pre_avg = 0, post_avg = 0,
                          pre_max = 1.0, post_max = 1.0)
    onsets = onsets.astype(float) / 100.0
    onsets = combine_events(onsets, 0.03, 'left')
    return np.asarray(onsets)

3.3.5 Network layers and optimizer
from keras.models import Sequential
from keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from keras.optimizers import SGD

def model():
    m = Sequential()
    # Input: 15 frames x 80 frequency bins x 3 spectrogram channels.
    m.add(Conv2D(10, (7, 3), input_shape = (15, 80, 3),
                 padding = 'valid', activation = 'relu'))
    m.add(MaxPooling2D(pool_size = (1, 3)))
    m.add(Conv2D(20, (3, 3),
                 padding = 'valid', activation = 'relu'))
    m.add(MaxPooling2D(pool_size = (1, 3)))
    m.add(Dropout(0.5))
    m.add(Flatten())
    m.add(Dense(256, activation = 'sigmoid'))
    # Sigmoid output: onset probability for the center frame.
    m.add(Dense(1, activation = 'sigmoid'))
    optimizer = SGD(lr = 0.05, momentum = 0.8, clipvalue = 5)
    m.compile(loss = 'binary_crossentropy',
              optimizer = optimizer,
              metrics = ['binary_accuracy'])
    return m
3.4 Training
We used eight-fold cross-validation to evaluate both networks. We
split the dataset of 321 files into eight evenly sized sets of 40 or
41 files and trained eight models – one for each set. Each model was
given six of the eight sets as training data, and the remaining two
were used for validation and testing respectively, so that no model
was trained on its testing data.
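A sketch of this split (our own illustration, with the file list
abstracted to indices):

import numpy as np

def make_folds(n_files = 321, n_folds = 8, seed = 1234):
    # Shuffle file indices once, then cut them into eight folds.
    idx = np.random.RandomState(seed).permutation(n_files)
    return np.array_split(idx, n_folds)

# For model k: test = fold k, validation = fold (k + 1) % 8,
# training = the remaining six folds.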
This cross-validation setup was slightly different from the one
described in [4] in which 75% of the data was used for training, 15%
for validation and 10% for testing. Since there was no explanation for
why the validation and testing sets were of unequal size, and because
they were of equal size in the ISMIR 2018 tutorial, we surmised that
the discrepancy was a calculation error on their part.
The eight models were trained with 20 epochs of patience. That is,
we continued training them until no improvement on their respective
validation sets had been observed for 20 epochs. This is the same
regimen that Böck and colleagues used to train the RNN, but not the
CNN, for which each fold was trained for exactly 100 epochs. Again,
we deviated due to limited computing resources. From cursory
inspection, it seemed that 20 epochs of patience was sufficient and
that the models would not have improved had we trained them longer.
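In Keras, this regimen is a standard EarlyStopping callback (the
monitored quantity below is our assumption):

from keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 20 epochs.
early_stop = EarlyStopping(monitor = 'val_loss', patience = 20,
                           restore_best_weights = True)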
3.5 Evaluation
To evaluate the two architectures, we trained three models for each
architecture and fold, 3 × 2 × 8 = 48 models in total.
For each of the metrics precision, recall and F-score, we computed
the arithmetic mean of the three values. To make the performance
numbers comparable, we did not reshuffle the folds between runs.
An onset was considered detected if there was a ground truth
label within ±25 ms of the predicted peak, as described in section 2.7.
Furthermore, peaks were also combined if they occurred within 30 ms
of each other.
Chapter 4
Results
Our results are presented in tables 4.1 and 4.2. The tables show the
precision, recall and F-score for each fold for each of the three
models trained, as well as these metrics calculated over the whole
dataset.
The values vary quite a bit depending on the fold, which is likely
an artifact of the comparatively small dataset – each fold only contains
40 or 41 files.
A surprising result is that the RNN architecture outperformed the
CNN architecture, with a mean F-score of 86.3% versus 85.6%.
#     P_R1  R_R1  F_R1   P_R2  R_R2  F_R2   P_R3  R_R3  F_R3   P_R   R_R   F_R
0     91.4  85.8  88.5   92.7  84.0  88.1   91.5  85.9  88.6   91.9  85.2  88.4
1     92.3  88.5  90.3   91.8  88.4  90.1   92.9  87.9  90.3   92.3  88.3  90.2
2     92.1  73.2  81.6   88.5  77.0  82.3   91.6  74.6  82.2   90.7  74.9  82.0
3     85.2  85.4  85.3   84.3  86.2  85.2   87.3  86.0  86.6   85.6  85.9  85.7
4     91.3  84.5  87.8   91.2  84.3  87.6   86.7  88.0  87.4   89.7  85.6  87.6
5     88.8  84.4  86.6   87.3  86.5  86.9   88.5  84.7  86.5   88.2  85.2  86.7
6     95.1  74.1  83.3   95.3  74.4  83.6   96.3  73.5  83.4   95.6  74.0  83.4
7     88.9  84.7  86.7   90.7  82.6  86.5   88.3  85.3  86.8   89.3  84.2  86.7
Tot   90.7  82.3  86.3   90.2  82.7  86.3   90.3  83.0  86.5   90.4  82.9  86.3
Table 4.1: Precision, recall and F-scores for three fully trained RNN models.
#     P_C1  R_C1  F_C1   P_C2  R_C2  F_C2   P_C3  R_C3  F_C3   P_C   R_C   F_C
0     97.7  74.6  84.6   97.3  80.2  87.9   97.7  75.1  84.9   97.6  76.6  85.8
1     96.3  81.7  88.4   97.1  79.3  87.3   95.4  84.5  89.6   96.3  81.8  88.4
2     96.9  68.6  80.3   96.1  73.4  83.3   96.0  70.1  81.0   96.3  70.7  81.5
3     93.5  82.5  87.7   94.0  85.2  89.4   91.0  84.8  87.8   92.8  84.2  88.3
4     94.0  77.3  84.8   92.9  79.3  85.6   92.8  80.9  86.4   93.2  79.2  85.6
5     95.3  79.4  86.6   96.2  76.2  85.0   94.0  83.4  88.4   95.2  79.7  86.7
6     97.2  72.8  83.2   97.0  73.8  83.8   97.1  72.8  83.2   97.1  73.1  83.4
7     94.1  77.4  84.9   93.9  76.7  84.4   94.0  77.2  84.8   94.0  77.1  84.7
Tot   95.6  76.5  85.0   95.5  77.9  85.8   94.7  78.4  85.8   95.3  77.8  85.6
Table 4.2: Precision, recall and F-scores for three fully trained CNN models.
Chapter 5
Discussion
Yet even taking these caveats into account, the difference is hard to
explain. Cursory experiments suggest that the CNN’s peak-picking
threshold is indeed set too high and that an F-score of perhaps 88%
would be achievable by optimizing it. However, that is still more than
two percentage points short of 90.3%.
Another interesting result is the lack of improvement for the RNN.
The architecture we have evaluated has one extra bidirectional layer
Chapter 6
Conclusion
In this research, we have tried but failed to replicate the results de-
scribed in Schlüter and Böck [1] and Böck, Schlüter, and Widmer [3].
Due to differences in training regimen, implementation architecture
and quality, we cannot claim that our failure is caused by the original
research being impossible to replicate.
What we can claim, though, is that the original research lacked
important details, which made our replication attempt harder than it
otherwise could have been. For example, for the RNN the articles did
not mention what hyperparameter settings were used for training, and
for the CNN the fuzziness feature was not explained in enough detail
for us to reimplement it.
Even though the omission of these details complicated our repli-
cation attempt, we do not believe that the original authors should be
faulted. Documenting every little detail seems like a nigh impossible
task, and there would always be something missing. Researchers also
have to abide by publishing rules that limit the length of their articles.
We also note that without Böck’s Madmom library and his ISMIR 2018
tutorial, our replication attempt would have been orders of magnitude
harder.
We therefore claim that our replication attempt has shown that
published articles alone are not sufficient to replicate state-of-the-art
results in onset detection. Whether more detailed articles would help
improve the situation is an open question. In our opinion, the answer
is no: to replicate, you need access to executable source code;
explanations written in natural language are not enough.
Bibliography

[1] Jan Schlüter and Sebastian Böck. “Improved musical onset detection
with Convolutional Neural Networks”. In: ICASSP. 2014, pp. 6979–6983.

[2] Rong Gong and Xavier Serra. “Towards an efficient deep learning
model for musical onset detection”. In: CoRR abs/1806.06773 (2018).
arXiv: 1806.06773. URL: http://arxiv.org/abs/1806.06773.

[3] Sebastian Böck, Jan Schlüter, and Gerhard Widmer. “Enhanced peak
picking for onset detection with recurrent neural networks”. In:
Proceedings of the 6th International Workshop on Machine Learning and
Music, Prague, Czech Republic. 2013.

[4] Sebastian Böck et al. “Online real-time onset detection with recurrent
neural networks”. In: Proceedings of the 15th International Conference on
Digital Audio Effects (DAFx-12), York, UK. 2012.

[5] Matthew Hutson. “Artificial intelligence faces reproducibility crisis”.
2018.

[6] Juan Pablo Bello et al. “A tutorial on onset detection in music signals”.
In: IEEE Transactions on Speech and Audio Processing 13.5 (2005),
pp. 1035–1047.

[7] David Martin Powers. “Evaluation: from precision, recall and
F-measure to ROC, informedness, markedness and correlation”. 2011.

[8] Sebastian Böck and Gerhard Widmer. “Maximum filter vibrato
suppression for onset detection”. In: Proceedings of the 16th International
Conference on Digital Audio Effects (DAFx-13), Maynooth, Ireland. 2013.

[9] Jan Schlüter and Sebastian Böck. “Musical onset detection with
convolutional neural networks”. In: 6th International Workshop on
Machine Learning and Music (MML), Prague, Czech Republic. 2013.