
DEGREE PROJECT IN COMPUTER ENGINEERING,

FIRST CYCLE, 15 CREDITS


STOCKHOLM, SWEDEN 2019

Reproducing the state of the art in onset detection using neural networks

BJÖRN LINDQVIST

KTH ROYAL INSTITUTE OF TECHNOLOGY


SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Reproducing the state of the art in onset detection using neural networks

BJÖRN A. LINDQVIST BJOLIN2@KTH.SE

Bachelor in Computer Science


Date: June 15, 2019
Supervisor: Jeanette Hällgren Kotaleski
Examiner: Örjan Ekberg
School of Electrical Engineering and Computer Science
Swedish title: Replikering av de bästa resultaten inom området
ansatsdetektion med neurala nätverk

Abstract
Onset detection is a research area in music analysis concerned with
determining when audio events occur in digital audio streams. The
area has grown considerably in recent years and several new methods
for onset detection have been developed, gradually improving
upon the state of the art. Currently, the top spot is held by Schlüter
and Böck, who in 2014 presented a detector based on a convolutional
neural network (CNN) that attained an F-score of 90.3% (precision
91.7%, recall 88.9%) on a commonly used dataset [1].
But in 2018, another pair of researchers, Gong and Serra, failed to
reproduce this result. They obtained an F-score of only 86.67%
(precision and recall values were not reported) [2], significantly
worse than Schlüter and Böck's result. In comparison, a 2013 detector
based on a recurrent neural network (RNN), also designed by Schlüter
and Böck, achieved an F-score of 87.3% [3].
Gong and Serra's result casts doubt on the 90.3% figure reported by
Schlüter and Böck. We therefore try to shed some light on the question
of what the state of the art performance in musical onset detection
really is by posing and answering the question: can Schlüter and Böck's
results be reproduced? We do this by implementing the proposed
networks and seeing whether they can achieve the same performance on the
same dataset Schlüter and Böck used.
Our answer is "Maybe – but we were unable to!", which is perhaps
the only result possible, since a negative cannot be proven. We trained the
CNN architecture three times and obtained F-scores of 85.0%, 85.8%
and 85.6%, about five percentage points less than Schlüter and Böck's
90.3%. We also tried to replicate their result for the RNN and obtained
the F-scores 86.3%, 86.3% and 86.3%. These too are worse than Schlüter
and Böck's, although the difference is only about one percentage point.
Some details were missing from the short articles, authored by the
aforementioned researchers, that we read to understand the methods.
We therefore had to make educated guesses about parameters such as the
learning rate, so our implementations are not exact replicas of the
authors'. This could partially explain our worse results.
Nevertheless, we believe that our work is worthwhile because it
demonstrates how infuriatingly difficult it is for deep learning
researchers to reproduce each other's work.

Sammanfattning
Onset detection is an area of music analysis concerned with determining
when events in audio data occur. The field has grown considerably in
recent years and several new methods for onset detection have been
developed, gradually improving the best detection performance.
The record is held by Schlüter and Böck, who in 2014 presented a
detection method based on a convolutional neural network (CNN) that
achieved an F-score of 90.3% (with precision 91.7% and recall 88.9%)
on a commonly used dataset [1].
But in 2018 another pair of researchers, Gong and Serra, failed to
replicate this result. Their F-score landed at 86.67% (they reported
neither precision nor recall) [2], a much worse result than Schlüter
and Böck's. As a comparison, Schlüter and Böck presented a detection
method in 2013 based on a recurrent neural network (Recurrent Neural
Network, RNN) with an F-score of 87.3% [3].
Gong and Serra's result calls Schlüter and Böck's 90.3% figure into
question. We therefore try to find out whether the latter's result holds
up. We do this by implementing the proposed networks and seeing
whether we can reach equally good detection performance on the same
dataset that Schlüter and Böck used – can their result be replicated?
Our answer is "Maybe – but we could not!" This is possibly the only
safe statement, since we failed to replicate their result. The three
convolutional networks we trained obtained F-scores of 85.0%, 85.8%
and 85.6%, about five percentage points lower than Schlüter and Böck's
90.3%. We also tried to replicate their result for the recurrent neural
network and obtained F-scores of 86.3%, 86.3% and 86.3%. These too
are worse than Schlüter and Böck's, but the difference is only one
percentage point.
Detailed information was missing from the short articles by the
above-mentioned authors that we read to understand the methods. We
therefore had to guess certain details, such as the learning rate, and so
we cannot guarantee that the implementations we evaluated are exact
copies of the authors'. This may partially explain our worse results.
We nevertheless believe that our work is valuable because it shows
how remarkably difficult it is to replicate results in the field of deep
learning.
Contents

1 Introduction 1
1.1 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background 3
2.1 Precision, recall and F-score . . . . . . . . . . . . . . . . . 3
2.2 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Onset detection methods . . . . . . . . . . . . . . . . . . . 5
2.4 Detection using supervised learning . . . . . . . . . . . . 6
2.5 RNN-based onset detector . . . . . . . . . . . . . . . . . . 7
2.5.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . 7
2.5.2 Peak-picking . . . . . . . . . . . . . . . . . . . . . 8
2.5.3 Initialization . . . . . . . . . . . . . . . . . . . . . . 8
2.5.4 Training regimen . . . . . . . . . . . . . . . . . . . 8
2.5.5 Improvements and Madmom implementation . . 8
2.6 CNN-based onset detector . . . . . . . . . . . . . . . . . . 9
2.6.1 Audio preprocessing . . . . . . . . . . . . . . . . . 10
2.6.2 Peak-picking . . . . . . . . . . . . . . . . . . . . . 10
2.6.3 Training regimen . . . . . . . . . . . . . . . . . . . 11
2.6.4 2014 architecture . . . . . . . . . . . . . . . . . . . 11
2.7 The Böck dataset . . . . . . . . . . . . . . . . . . . . . . . 11
2.8 Madmom . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3 Methods and implementations 13


3.1 Implementation structure . . . . . . . . . . . . . . . . . . 13
3.2 RNN implementation . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Preprocessing . . . . . . . . . . . . . . . . . . . . . 14
3.2.2 Peak-picking . . . . . . . . . . . . . . . . . . . . . 15
3.2.3 Network layers and optimizer . . . . . . . . . . . 16
3.3 CNN implementation . . . . . . . . . . . . . . . . . . . . 16


3.3.1 Normalization . . . . . . . . . . . . . . . . . . . . . 16
3.3.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . 17
3.3.3 Batching . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.4 Peak-picking . . . . . . . . . . . . . . . . . . . . . 18
3.3.5 Network layers and optimizer . . . . . . . . . . . 19
3.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Results 22

5 Discussion 24

6 Conclusion 26
6.1 Directions for future research . . . . . . . . . . . . . . . . 27

Bibliography 28
Chapter 1

Introduction

By leveraging deep learning, researchers have improved upon the state
of the art performance of musical onset detection. A convolutional
neural network (CNN) architecture was created by Schlüter and Böck
[1] in 2014, attaining an F-score of 90.3% (precision 91.7%, recall 88.9%)
on the Böck dataset. Five years later, this result still defines the field's
state of the art.
In 2018, Gong and Serra [2] tried to reproduce their result but only
obtained an F-score of 86.67% (precision and recall values were not reported),
a significantly worse result than Schlüter and Böck's. In comparison, a
2013 detector based on a recurrent neural network (RNN), created by
Böck, Schlüter, and Widmer [3], achieved an F-score of 87.3% [4].
Their replication failure casts doubt on the 90.3% figure reported by
Schlüter and Böck, and in this thesis we therefore try to shed some light
on the situation by attempting to reproduce Schlüter and Böck's result
ourselves. We do this by training two onset detectors, one based on
the RNN architecture and one on the CNN architecture, and evaluating
their respective performances.
The necessity of researchers being able to reproduce each other's
results has been emphasized many times before. The observed high
rate of reproduction failures in artificial intelligence (of which deep
learning is a subfield) has been described by some as a "crisis" [5]. We
therefore believe that our attempt at reproducing the state of the art in
onset detection is a worthwhile contribution to the field.


1.1 Outline
The rest of this thesis is structured as follows: in chapter 2 we briefly
describe what onset detection is, the field's state of the art and the two
detector designs we have evaluated. We also introduce the Böck dataset
– one of the few freely available, moderately sized onset datasets. In
chapter 3 we describe our reimplementations of Schlüter and Böck's
detectors, including how and why they differ from the designs de-
scribed in the referenced articles, as well as our evaluation method. Then
follow chapters presenting and discussing our results and finally a short conclusion.
Chapter 2

Background

Onset detection refers to the detection of sound events in audio signals.


Detection means locating the time offset in the signal at which the
event occurs. Examples of events could be the sound of footsteps,
knocks on doors, yells or any kind of audio that a human would
interpret as a discrete sound with a clearly delineated onset.
In a musical context, onsets are defined as the starting point of notes.
Detection of such onsets has a wide range of uses in musicology and is
used in, for example, tempo and meter estimation, genre classification
and music fingerprinting [1]. It can also be used to facilitate edit
operations like cut-and-paste in audio editing software [6].
For a human, onset detection isn't particularly challenging. Recog-
nizing the drum beat in a song is trivial and with only a little training
almost everyone can hear the onset of notes produced by other instru-
ments. But as is often the case, what is easy for a human can be nigh
impossible for a computer!

2.1 Precision, recall and F-score


As is customary when evaluating binary classifiers, we will primarily
use the three metrics precision, P, recall, R, and F-score, F [7].1 These
are all derived from the counts of true positives, tp, false positives,
fp, and false negatives, fn:

$$P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}, \qquad F = \frac{2PR}{P + R}.$$

1 The F-score is sometimes denoted the F1-score, but we omit the subscript.
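As a concrete illustration (not part of the evaluated implementations), these metrics can be computed from the raw counts as follows:

def prf(tp, fp, fn):
    # Precision, recall and F-score from true positive, false positive and
    # false negative counts. Degenerate cases are mapped to 0.0.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Example: 100 predicted onsets of which 90 match annotations,
# and 10 annotated onsets were missed.
print(prf(90, 10, 10))   # (0.9, 0.9, 0.9)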


Detector                    F-score  Prec.  Rec.   Year  Article

RNN/OnsetDetector.2010      0.826    0.857  0.796  2012  [3]
RNN/OnsetDetector.2011      0.866    0.906  0.830  2012  [3]
RNN/OnsetDetector.2011      0.873    0.892  0.855  2012  [3]
SuperFlux                   0.836    0.883  0.793  2013  [8]
CNN                         0.885    0.905  0.866  2013  [9]
CNN+Dropout                 0.890    0.909  0.871  2014  [1]
CNN+Dropout+Fuzzy           0.899    0.914  0.885  2014  [1]
CNN+Dropout+Fuzzy+ReLU      0.903    0.917  0.889  2014  [1]

Table 2.1: Onset detectors and their performance on the Böck dataset. Three
variants of the RNN-based and three of the CNN-based detector along with
the SuperFlux detector, relying on traditional signal processing, are shown.
All detectors were developed by Böck and colleagues.

Note that the F-score is the harmonic mean of the precision and
the recall. Often in classification there is a tradeoff between recall
and precision. When such a tradeoff can be made, the F-score can be
maximized by setting the classification threshold in such a way that
recall and precision are roughly equal.

2.2 State of the art


Some of the best performing onset detectors published are presented in
table 2.1. They all use deep learning, with the exception of SuperFlux
– an enhanced version of the widely used SpectralFlux onset detection
algorithm [8]. The best one is the CNN detector developed by Schlüter
and Böck [1], reaching an F-score of 90.3% and surpassing their own 87.3%
record from 2013 [3]. This detector also won the annual MIREX Audio
Onset Detection (MIREX AOD) competition in 2016, 2017 and 2018.2
We describe the RNN and CNN detectors in detail in section 2.5
and section 2.6.

Figure 2.1: The three data processing steps demonstrated using the Spec-
tralFlux algorithm. The raw waveform is first converted into a power spectro-
gram, then the ODF is applied and onsets are predicted using peak picking
(vertical dotted lines).

2.3 Onset detection methods


The first onset detection methods relied on traditional signal process-
ing, for example by analyzing changes of spectral energy [10, 6], pitch
[11] or phase [12] accompanying an onset. More sophisticated onset
detection methods have been based on probabilistic models and have
incorporated neural networks with magnitude spectrograms as input
features [8].
Most onset detection methods work in three steps: preprocessing,
onset detection and peak-picking [6], as shown in figure 2.1.

1. First the raw signal is converted into a format amenable to
onset detection. One commonly used format is the short-time
Fourier transform of the signal.
2 See e.g. https://nema.lis.illinois.edu/nema_out/mirex2018/results/aod/summary.html

2. Then the onset detection function (ODF) is applied. It is a func-
tion R^n → R whose peaks are intended to coincide with the
times of note onsets. ODFs usually have much lower sampling
rates (e.g. 100–200 Hz) compared to the audio signal [10]. The
ODF can be either hand-crafted or learned from data.

3. Finally, peak-picking (PP) is performed. It consists of picking
local maxima in the ODF, subject to various constraints [10].
These constraints and thresholds have a large impact on the
number of false positives and false negatives reported, meaning
that the peak-picking function must be tuned to reach optimal
performance [10].
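To make the three steps concrete, here is a minimal NumPy sketch of a SpectralFlux-style pipeline. It is an illustration only: the frame size, hop size and threshold are arbitrary choices, and the actual detectors evaluated in this thesis are described in chapter 3.

import numpy as np

def spectral_flux_odf(signal, frame_size = 2048, hop_size = 441):
    # Step 1: preprocessing -- magnitude spectrogram of the windowed signal
    # (assumes the signal is at least one frame long).
    n_frames = 1 + (len(signal) - frame_size) // hop_size
    window = np.hanning(frame_size)
    frames = np.stack([signal[i*hop_size : i*hop_size + frame_size] * window
                       for i in range(n_frames)])
    mags = np.abs(np.fft.rfft(frames, axis = 1))
    # Step 2: ODF -- summed half-wave rectified frame-to-frame differences.
    diff = np.maximum(mags[1:] - mags[:-1], 0)
    return diff.sum(axis = 1)

def pick_peaks(odf, threshold = 0.3, min_dist = 3):
    # Step 3: peak-picking -- local maxima above a threshold, not too close together.
    odf = odf / (odf.max() + 1e-9)
    peaks = []
    for i in range(1, len(odf) - 1):
        if odf[i] >= threshold and odf[i] >= odf[i-1] and odf[i] > odf[i+1]:
            if not peaks or i - peaks[-1] >= min_dist:
                peaks.append(i)
    return np.array(peaks)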

2.4 Detection using supervised learning


Inspired by the successes of supervised learning and deep learning
in particular, researchers have tried to use the technology for musical
onset detection.
Lacoste and Eck designed two variants of a feed-forward network
which they named SINGLE-NET and MULTI-NET, winning the 2005
edition of the MIREX AOD [13]. In 2010, Eyben et al. suggested using
RNNs for onset detection [14]. The idea was iterated upon by Böck
et al. [4] in 2012 and further refined by Böck, Schlüter, and Widmer [3]
in 2013 in which the trio presented two detectors – one for offline and
one for online (realtime) use.
A CNN for onset detection was presented by Schlüter and Böck
[9] in 2013 and improved up in 2014, also by Schlüter and Böck [1].
They found that the method outperformed state of the art and also
required less manual preprocessing. Cakır et al. put forward a very
promising design based on combining convolutional and recurrent
layers (CRNN) in 2017 [15]. However, they did not benchmark their
design on the Böck dataset and have not competed in the MIREX AOD,
making it hard to evaluate their performance claims.
In the following sections we describe the aforementioned RNN and
CNN designs.

Layer  Type             Weights   Activation

0      Bidi-RNN         144 × 25  tanh
1      Bidi-RNN         50 × 25   tanh
2      Fully-Connected  50 × 1    sigmoid

Table 2.2: Architecture of the network used for onset detection.

2.5 RNN-based onset detector


The RNN-based onset detector, described in three articles by Eyben et
al. [14], Böck, Schlüter, and Widmer [3], and Böck et al. [4] respectively,
was based on a two-layer bidirectional recurrent neural network (BRNN)
(four layers in total, since the bidirectionality doubles them) with 25 tanh
units per direction in each layer, as shown in table 2.2. According to
Böck, Schlüter, and Widmer [3], it achieved an F-score of 87.3%
(precision 89.2%, recall 85.5%) on the Böck dataset.
Note that our RNN implementation differs in important aspects
from the one described in the following (see section 2.5.5).

2.5.1 Preprocessing

Figure 2.2: Preprocessing steps for the RNN.

Figure 2.2 shows the preprocessing steps the RNN used. The signal
was first resampled to mono at 44.1 kHz and three magnitude spec-
trograms were calculated, each with a hop size of 10 ms and window
sizes of 11.61, 23.22 and 46.44 ms respectively. The three spectrograms
were then filtered using a Bark scale filterbank and the logarithmic
magnitude of the linear representation was calculated. According to
Böck, Krebs, and Schedl [16], the logarithmic magnitude yielded better
results than using the linear representation in many cases.

The final step of the preprocessing consisted of calculating "dif-
ferences" between consecutive frames of the spectrograms and then
concatenating them. This created a sequence of 144-dimensional vec-
tors, one for each frame in the audio file, which was then fed to the
RNN.

2.5.2 Peak-picking
A standard peak-picking function was used to find peaks in the output
from the RNN. The output was smoothed with a Hamming window
of seven frames and local maxima exceeding 0.25 were reported as
onsets. The threshold 0.25 was chosen to optimize the F-score on the
training set [3].
If two onsets were detected less than 30 ms apart, they were com-
bined by picking the first (left) one.

2.5.3 Initialization
The weights of the network were initialized randomly using a Gaussian
distribution with 0 mean and 0.1 standard deviation [3].

2.5.4 Training regimen


The network was trained using 8-fold cross-validation. To prevent
overfitting, training was aborted if no performance improvement on the
validation set was detected for 20 epochs [14]. Other hyperparameter
settings were not described in the referenced articles.

2.5.5 Improvements and Madmom implementation


After the publication of the 2013 article, Böck added the detector to
Madmom (see section 2.8) and improved it in the following ways:

• The signal preprocessing was changed to use a logarithmically
spaced filterbank with six bands per octave instead of Bark
filtering [17].

• The detection threshold was increased from 0.25 to 0.35.3


3 We learned about these modifications by reading the Madmom library’s source
code.

Layer  Type                 Filter dims  Input shape  Activation

0      Batch-normalization               15 × 80 × 3
1      Convolutional        10, 7 × 3    15 × 80 × 3  tanh
2      Max-pooling          1 × 3        9 × 78 × 10
3      Convolutional        20, 3 × 3    9 × 26 × 10  tanh
4      Max-pooling          1 × 3        7 × 24 × 20
5      Flatten                           7 × 8 × 20
6      Fully-connected                   1 × 1120     sigmoid
7      Fully-connected                   1 × 256      sigmoid

Table 2.3: The 2013 architecture.

• The frame sizes were changed from 512, 1024 and 2048 samples
to 1024, 2048 and 4096 samples, corresponding to periods of
23.22, 46.44 and 92.88 ms respectively.3

• One extra bidirectional layer was added.3

• The frame overlap factor was changed from 0.5 to 0.25.3

The logarithmic filterbank and the doubled frame sizes caused the
input vectors to grow in dimensionality from 144 to 266.

2.6 CNN-based onset detector


Böck and colleagues published two articles about their CNN-based
onset detector, Schlüter and Böck [9] in 2013 and Schlüter and Böck [1]
in 2014. As expected, these architectures were quite different, with the
2014 one outperforming the 2013 one. We refer to these two variants
as the 2013 architecture and the 2014 architecture.
The architecture we have implemented and evaluated is a hybrid
of the 2013 and 2014 ones, but in this section we describe the 2013
architecture unless otherwise noted. The reason we did not use
the 2014 architecture verbatim is that the fuzziness feature was
difficult to implement properly (see section 3.3.5). Incidentally, the
2013 architecture is also the one used in Madmom.
The layers of the 2013 architecture are shown in table 2.3. The inputs
to the network were three-channel spectrogram excerpts, centered on
the frame to classify, and the outputs scalars, 0 to 1, representing the
probability of an onset.

2.6.1 Audio preprocessing

Figure 2.3: Preprocessing steps for the CNN.

Figure 2.3 shows the preprocessing steps, common to both the
2013 and 2014 architectures. They were similar to those used in the
RNN, with the major difference being that the CNN used 80-band
mel-filters from 27.5 Hz to 16 kHz rather than a Bark-scale filterbank.
Another difference was that the spectrograms were stacked in depth
rather than concatenated. This way, the shape of the data for each audio
file became (f, 80, 3), where f is the number of frames in the audio,
loosely resembling the shape of three-channel RGB images.
Excerpts of the spectrogram data were then created using 7 frames
of preceding and succeeding context. This was done by letting a
15-frame-wide window slide over the audio data, which also entailed
inserting 7 frames of padding at the beginning and the end of it.
According to Schlüter and Böck [1], this was roughly the same amount
of contextual data that the RNN used. Thus, for each file, f training
samples of shape (15, 80, 3) were created. Each of those corresponded
to 10 ms of audio data and the network's job was to determine whether
they contained one or more onsets.
Frequency bands were normalized to zero mean and unit variance.
The constants for this normalization step were computed on a hold-out
set.

2.6.2 Peak-picking
The article for the 2013 architecture did not detail what peak-picking
procedure was used.

2.6.3 Training regimen


The researchers trained the network for 100 epochs with gradient
descent, mini-batches of size 256 and a fixed learning rate of 0.05. They
used eight-fold cross-validation to obtain all results and to prevent
overfitting.

2.6.4 2014 architecture


In 2014 Schlüter and Böck [1] published a major redesign of the CNN
detector they first wrote about in 2013. It increased the F-score from
88.5% to 90.3%, setting the current world record. The biggest differ-
ences were:

1. Rectified linear units (ReLU) were used instead of tanh for the
convolutional layers.

2. A 50% dropout layer was inserted after the convolutional layers.

3. The training data was augmented so that any frames immediately
before or after an onset would be counted as a positive example.
The idea behind this was to mitigate the effects of both "soft
onsets" and annotations that were slightly off. These samples
would only be weighted by 25% during training. They called this
feature fuzziness.
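The article does not spell out how the original positive and negative frames are weighted relative to the fuzzy ones (see section 3.3.5). One possible reading of the scheme, with the original frames assumed to keep full weight, is sketched below; this is our interpretation, not code from the referenced implementation.

import numpy as np

def fuzzy_targets(onset_frames, n_frames, fuzz_weight = 0.25):
    # Frame-level targets and per-sample loss weights. Frames adjacent to an
    # annotated onset frame are also labeled positive but are down-weighted.
    # Assumption: the original positive and negative frames keep weight 1.0.
    y = np.zeros(n_frames)
    w = np.ones(n_frames)
    y[onset_frames] = 1
    for f in onset_frames:
        for g in (f - 1, f + 1):
            if 0 <= g < n_frames and y[g] == 0:
                y[g] = 1
                w[g] = fuzz_weight
    return y, w

In Keras, weights like these could be passed to training through the sample_weight mechanism.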

Peak picking worked exactly the same as for the RNN (see sec-
tion 2.5.2) except that five frames of smoothing were used instead of
seven. The threshold value was chosen by maximizing the F-score
using grid search.
Similarly to the 2013 architecture, the 2014 architecture was trained
for 100 epochs with gradient descent, mini-batches of size 256 and a
fixed learning rate of 0.05. Additionally, a momentum of 0.45 was used,
linearly increased to 0.9 between epochs 10 and 20.

2.7 The Böck dataset


The dataset used both by Böck and colleagues and in this research con-
sists of 321 audio files, mostly in 44.1 kHz mono FLAC format, containing
a wide variety of monophonic and polyphonic music. Since the dataset
was compiled by Böck and others, we refer to

it as the Böck dataset. Each sound file in it is between a few seconds and
one minute in length. In total, the collection is 102 minutes long and
contains 25,927 annotated note onsets [1].4 It is one of very few onset
datasets available to researchers and therefore many onset detectors
have been benchmarked against it.
The dataset was created using manual annotation, rather than
using synthesized sounds generated from MIDI, meaning that the
annotations can be off by a few tens of milliseconds from the actual
onsets. According to Böck [17], the accuracy of a manual annotation is
typically ±2 ms for percussive sounds and up to ±10 ms for soft onsets
produced by instruments such as string or woodwind instruments.
We compensate for this lack of accuracy when evaluating the detec-
tors by considering a 50 ms window centered around each predicted
onset. If an annotation falls within this window we consider the onset
correctly predicted. However, one window can’t cover more than one
onset. This method was also used by Böck and colleagues.
For each audio file in the dataset there is a text file listing the offsets
of the onsets for that audio file, in fractions of seconds, with one offset
per line.
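A sketch of this matching rule (an illustration, not the evaluation code from our implementation): annotation files can be read with np.loadtxt, and each annotation may justify at most one predicted onset within ±25 ms.

import numpy as np

def evaluate_onsets(predicted, annotated, window = 0.025):
    # Count true/false positives and false negatives. Each annotation can be
    # matched to at most one predicted onset within +/- `window` seconds.
    annotated = sorted(annotated)
    used = [False] * len(annotated)
    tp = 0
    for p in sorted(predicted):
        for i, a in enumerate(annotated):
            if not used[i] and abs(p - a) <= window:
                used[i] = True
                tp += 1
                break
    return tp, len(predicted) - tp, len(annotated) - tp

# The annotation files contain one onset time in seconds per line, so e.g.
# annotated = np.loadtxt('some_file.onsets.txt')   (hypothetical file name)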

2.8 Madmom
Madmom5 is a free audio signal processing library that plays an
essential role in our replication efforts. It was created by Böck to
accompany his doctoral thesis and other articles in music information
retrieval [18]. In particular, it contains implementations of the RNN
and CNN detectors evaluated in this thesis, as well as ready-trained
models for them.
Using Madmom’s source code, we were able to deduce details such
as parameter settings not explicitly stated in the published research.
The library has also made it significantly easier to write the code
required for our replication.

4 https://drive.google.com/file/d/1ICEfaZ2r_cnqd3FLNC5F_UOEUalgV7cv/view
5 https://madmom.readthedocs.io/en/latest/
Chapter 3

Methods and implementations

We have implemented two variants of the detector architectures de-
scribed in section 2.5 and section 2.6 – the CNN and the RNN – and
have evaluated their performance on the Böck dataset. Clearly, this
method depended on the original research being described in suffi-
cient detail to allow reimplementation. In our opinion the articles did
lack some details (see section 2.5.5), but with the source code for the
Madmom library and Böck's ISMIR 2018 tutorial,1 we have mostly
been able to fill in the gaps wherever necessary.
In order to ensure that our replication attempts have been carried
out correctly, and to facilitate "replication of our replication," we have
opted to use source code listings to illustrate our methods. This way,
ambiguities are hopefully kept to a minimum, making it easier for
readers to spot errors. However, for brevity's sake, we have omitted
import statements in code listings. We have also published our source
code online.2
The implementation tools we have used are Python, Keras, NumPy
and TensorFlow, all very popular in the machine learning world. We
have also used the audio signal processing library Madmom.

3.1 Implementation structure


Our implementation consists of six Python modules divided into two
groups:
1 https://github.com/slychief/ismir2018_tutorial/blob/master/Part_3b_RNN_Onset_Detection.ipynb
2 https://github.com/bjourne/onset-replication


• The testing harness: config, evaluation, main and preprocessing.

• Architecture implementations: cnn and rnn.

The first group constitutes a testing harness for training and evaluating
onset detectors on the Böck dataset. The second group – cnn and rnn –
contains our CNN and RNN implementations. The user can configure
the environment by editing the config module and can then train and
evaluate the architectures by running the main module. For example,
to train the CNN for at most 100 epochs for each fold:
$ python main.py -n cnn -t 0:8 --epochs 100

Or to evaluate each of the eight folds in the RNN:


$ python main.py -n rnn -e 0:8

In the following sections, we describe the pertinent details of the RNN


and CNN implementations.

3.2 RNN implementation


In this section we describe the interesting parts of our RNN imple-
mentation: first the code for preprocessing, or feature extraction, then
postprocessing, or peak-picking, and finally the network implementa-
tion.

3.2.1 Preprocessing

def preprocess_sig(sig, frame_size):
    frames = FramedSignal(sig,
                          frame_size = frame_size,
                          fps = 100)
    stft = ShortTimeFourierTransform(frames)
    filt = FilteredSpectrogram(stft, num_bands = 6)
    spec = np.log10(5*filt + 1)
    # Calculate difference spectrogram with ratio 0.25
    diff_frames = _diff_frames(0.25,
                               frame_size = frame_size,
                               hop_size = 441,
                               window = np.hanning)
    init = np.empty((diff_frames, spec.shape[1]))
    init[:] = np.inf
    spec = np.insert(spec, 0, init, axis = 0)
    diff_spec = spec[diff_frames:] - spec[:-diff_frames]
    np.maximum(diff_spec, 0, out = diff_spec)
    diff_spec[np.isinf(diff_spec)] = 0
    diff_spec = np.hstack((spec[diff_frames:], diff_spec))
    return diff_spec

def preprocess_x(filename):
    sig = Signal(filename, sample_rate = 44100, num_channels = 1)
    frame_sizes = [1024, 2048, 4096]
    D = [preprocess_sig(sig, fs) for fs in frame_sizes]
    return np.hstack(D)

Listing 3.1: Implementation of the audio preprocessing for the RNN. Input
to the preprocessing function is a filename and the output a two-dimensional
numpy array of shape (f, 266), where f is the number of frames in the file.

def preprocess_y(anns, n_frames):
    y = anns[:np.searchsorted(anns, (n_frames - 0.5) / 100)]
    q = np.zeros(n_frames)
    idx = np.unique(np.round(y * 100).astype(np.int))
    q[idx] = 1
    return q

Listing 3.2: Annotation preprocessing for the RNN.


Listing 3.1 and listing 3.2 show how we have implemented the RNN's
preprocessing. preprocess_y transforms anns from a list of time offset
annotations to a binary target vector whose length is equal to the
number of frames in the file.

3.2.2 Peak-picking

def postprocess_y(y):
    onsets = peak_picking(y,
                          threshold = 0.35,
                          smooth = 7,
                          pre_avg = 0, post_avg = 0,
                          pre_max = 1.0, post_max = 1.0)
    onsets = onsets.astype(np.float) / 100.0
    onsets = combine_events(onsets, 0.03, 'left')
    return np.asarray(onsets)

Listing 3.3: The RNN's peak-picking procedure.


Our peak-picking works almost exactly as described in section 2.5.2.
The only difference is that we increased the threshold from 0.25 to 0.35
as suggested in section 2.5.5.

3.2.3 Network layers and optimizer

def model():
    m = Sequential()
    m.add(Masking(input_shape = (None, 266)))
    m.add(Bidirectional(SimpleRNN(units = 25,
                                  return_sequences = True)))
    m.add(Bidirectional(SimpleRNN(units = 25,
                                  return_sequences = True)))
    m.add(Bidirectional(SimpleRNN(units = 25,
                                  return_sequences = True)))
    m.add(Dense(units = 1, activation = 'sigmoid'))
    optimizer = SGD(lr = 0.01, clipvalue = 5, momentum = 0.9)
    m.compile(loss = 'binary_crossentropy',
              optimizer = optimizer,
              metrics = ['binary_accuracy'])
    return m

Listing 3.4: The network layers.


Listing 3.4 reflects the changes described in section 2.5.5. Since the
hyperparameter settings were not described in the referenced
articles, we instead used those we found in Böck's 2018 ISMIR tutorial.

3.3 CNN implementation


In this section we describe the interesting parts of our CNN implemen-
tation.

3.3.1 Normalization

all_data = np.concatenate([d.x for d in D])
mean = np.mean(all_data, axis = 0)
std = np.std(all_data, axis = 0)
D = [AudioSample((d.x - mean) / std, d.y, d.a, d.name)
     for d in D]

Listing 3.5: Normalization of CNN samples.


As mentioned in section 2.6.1, the 2013 architecture required the input
data to be normalized to zero mean and unit variance. In contrast
to Schlüter and Böck [1], we used all available data to calculate these
parameters rather than calculating them on a hold-out set. We reasoned
that, since the referenced article didn't describe what data was used to
form the hold-out set, it would be incorrect of us to guess because we
could inadvertently harm the network's performance. Calculating the
parameters using all data should favor the network.

3.3.2 Preprocessing

def preprocess_sig(sig, frame_size):
    frames = FramedSignal(sig, frame_size = frame_size, fps = 100)
    stft = ShortTimeFourierTransform(frames)
    filt = FilteredSpectrogram(stft,
                               filterbank = MelFilterbank,
                               num_bands = 80,
                               fmin = 27.5, fmax = 16000,
                               norm_filters = True,
                               unique_filters = False)
    log_filt = LogarithmicSpectrogram(filt,
                                      log = np.log,
                                      add = np.spacing(1))
    return log_filt

def preprocess_x(filename):
    sig = Signal(filename, sample_rate = 44100, num_channels = 1)
    D = [preprocess_sig(sig, fs) for fs in [2048, 1024, 4096]]
    D = np.dstack(D)
    # Pad left and right with 7 frames
    s = np.repeat(D[:1], 7, axis = 0)
    e = np.repeat(D[-1:], 7, axis = 0)
    D = np.concatenate((s, D, e))
    return D

Listing 3.6: Implementation of the audio preprocessing for the CNN.


preprocess_x takes a filename and returns a numpy array of shape
(f + 14, 80, 3), where f is the number of frames in the file.

The preprocessing followed the algorithm outlined in section 2.6.1.
Seven frames were concatenated to the beginning and the end of the
data to facilitate the creation of sliding windows at a later stage.

3.3.3 Batching

def get_sample(self, i):
    at = 0
    for d in self.D:
        n_frames = len(d.y)
        assert len(d.x) == n_frames + 14
        if at + n_frames > i:
            ofs = i - at
            return d.x[ofs:ofs+15], d.y[ofs]
        at += n_frames
    raise IndexError('Index `%d` is wrong...' % i)

def __getitem__(self, idx):
    start = idx * self.batch_size
    end = min(start + self.batch_size, self.n_samples)
    inds = self.indices[start:end]
    samples = [self.get_sample(i) for i in inds]
    xs = np.array([s[0] for s in samples])
    ys = np.array([s[1] for s in samples])
    return xs, ys

Listing 3.7: Creation of mini batches of samples.


The RNN was trained by feeding the three spectrograms with different
window sizes directly to the network. The CNN, on the other hand,
was trained on mini-batches of 256 spectrogram excerpts. The Python
class ArchSequence in the module cnn generates such excerpts. Its
two relevant methods, __getitem__ and get_sample, are shown
in listing 3.7.

3.3.4 Peak-picking

def postprocess_y(y):
    onsets = peak_picking(y,
                          threshold = 0.54,
                          smooth = 5,
                          pre_avg = 0, post_avg = 0,
                          pre_max = 1.0, post_max = 1.0)
    onsets = onsets.astype(np.float) / 100.0
    onsets = combine_events(onsets, 0.03, 'left')
    return np.asarray(onsets)

Listing 3.8: The CNN's peak-picking procedure.


The article for the 2013 architecture didn’t mention what peak-picking
parameters were used. The article for the 2014 architecture stated
that the parameters were found by optimizing the F-score using grid
search.
While we could have run our own grid search, we decided not to,
since the peak-picking parameters should be considered an integral
part of the onset detector. Instead we used the value 0.54 for the
threshold and five frames of smoothing, since those are the values we
found in the Madmom library's source code.
Note that this parameter setting is likely not optimal because it was
tuned for the 2013 architecture and not the 2014 one.

3.3.5 Network layers and optimizer

def model():
    m = Sequential()
    m.add(Conv2D(10, (7, 3), input_shape = (15, 80, 3),
                 padding = 'valid', activation = 'relu'))
    m.add(MaxPooling2D(pool_size = (1, 3)))
    m.add(Conv2D(20, (3, 3), input_shape = (9, 26, 10),
                 padding = 'valid', activation = 'relu'))
    m.add(MaxPooling2D(pool_size = (1, 3)))
    m.add(Dropout(0.5))
    m.add(Flatten())
    m.add(Dense(256, activation = 'sigmoid'))
    m.add(Dense(1, activation = 'sigmoid'))
    optimizer = SGD(lr = 0.05, momentum = 0.8, clipvalue = 5)
    m.compile(loss = 'binary_crossentropy',
              optimizer = optimizer,
              metrics = ['binary_accuracy'])
    return m

Listing 3.9: The network layers.


The 2014 architecture (see section 2.6.4) used a momentum starting at
0.45 that was linearly increased to 0.9 between epochs 10 and 20.
However, due to limited computing resources we needed to be
able to stop and resume the training process, and hyperparameter
scheduling works poorly in combination with resumption in Keras. We
therefore decided to fix the momentum at 0.8, arguing that it would
be a reasonable compromise.
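For reference, and setting the resumption issue aside, such a schedule could be expressed as a Keras callback roughly like the sketch below. It assumes a Keras 2 SGD optimizer whose momentum is a backend variable; we did not use this in our experiments.

from keras import backend as K
from keras.callbacks import Callback

class MomentumSchedule(Callback):
    # Linearly increase SGD momentum from `lo` to `hi` between epochs `start` and `end`.
    def __init__(self, lo = 0.45, hi = 0.9, start = 10, end = 20):
        super(MomentumSchedule, self).__init__()
        self.lo, self.hi, self.start, self.end = lo, hi, start, end

    def on_epoch_begin(self, epoch, logs = None):
        if epoch <= self.start:
            m = self.lo
        elif epoch >= self.end:
            m = self.hi
        else:
            frac = (epoch - self.start) / float(self.end - self.start)
            m = self.lo + frac * (self.hi - self.lo)
        K.set_value(self.model.optimizer.momentum, m)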
Furthermore, we used ReLU activations and 50% dropout, as ad-
vised by the article for the 2014 architecture.
One big difference between our implementation and the 2014 ar-
chitecture is that we did not implement fuzziness (see section 2.6.4).
The researchers augmented the input data so that samples adjacent to
positive samples were themselves seen as positive samples, but only
weighted 25%. But they did not explain how the original positive and
negative samples should be weighted. We decided it would be safer
to omit the fuzziness feature than to guess.

3.4 Training
We used eight-fold cross-validation to evaluate both networks.
We split the dataset of 321 files into eight evenly sized sets of size 40
or 41 and trained eight models – one for each set. The models were
given six of the eight sets as training data and the other two were used
for validation and testing respectively, so that no model was trained
on its testing data.
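Concretely, the split can be sketched as follows; the exact pairing of which held-out fold serves as validation versus test for a given model is an illustrative choice here.

def fold_split(k, n_folds = 8):
    # Model k: one fold for testing, one for validation, the remaining six for
    # training. The particular pairing below is illustrative.
    test = k
    val = (k + 1) % n_folds
    train = [f for f in range(n_folds) if f not in (test, val)]
    return train, val, test

# fold_split(0) -> ([2, 3, 4, 5, 6, 7], 1, 0)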
This cross-validation setup was slightly different from the one
described in [4] in which 75% of the data was used for training, 15%
for validation and 10% for testing. Since there was no explanation for
why the validation and testing sets were of unequal size, and because
they were of equal size in the ISMIR 2018 tutorial, we surmised that
the discrepancy was a calculation error on their part.
The eight models were trained with 20 epochs of patience. That is,
we continued training them until no improvement on their respective
validation sets could be observed for 20 epochs. This is the same
regimen that Böck and colleagues used to train the RNN but not the
CNN, in which each fold was trained for exactly 100 epochs. The
reason we deviated was again due to limited computing resources.
From cursory inspection, it seemed that 20 epochs of patience was
sufficient and that the models would not have improved even if we had
trained them for longer.
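In Keras, this regimen amounts to an EarlyStopping callback; the sketch below shows roughly how such a setup looks (the monitored quantity, file name and generator variables are illustrative, with train_seq and val_seq standing for Sequence objects such as the ArchSequence of listing 3.7).

from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop when the monitored quantity has not improved for 20 epochs.
    EarlyStopping(monitor = 'val_loss', patience = 20),
    # Keep the weights from the best epoch seen so far (illustrative file name).
    ModelCheckpoint('fold0_best.h5', monitor = 'val_loss', save_best_only = True),
]
model.fit_generator(train_seq, validation_data = val_seq,
                    epochs = 1000, callbacks = callbacks)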

All weights in the network were initialized using Xavier initializa-
tion, except for the recurrent weights in the RNN, which were initial-
ized using orthogonal initialization. This initialization scheme was, at
the time of writing, Keras' default and was also used in the ISMIR 2018
tutorial. Xavier initialization has been shown to bring substantially
faster convergence [19].

3.5 Evaluation
To evaluate the two architectures, we trained three models for each
architecture and fold. In total, 3 × 2 × 8 = 48 models were trained.
For each of the metrics precision, recall and F-score, we computed
the arithmetic mean of the three values. To make the performance
numbers comparable, we did not reshuffle the folds between runs.
An onset was considered detected if there was a ground truth
label within ±25 ms of the predicted peak, as described in section 2.7.
Furthermore, peaks were also combined if they occurred within 30 ms
of each other.
Chapter 4

Results

Our results are presented in table 4.1 and table 4.2. The tables show
the precision, recall and F-score for each fold for each of the three
models trained. They also show these metrics calculated on the whole
dataset.
The values vary quite a bit depending on the fold, which is likely
an artifact of the comparatively small dataset – each fold only contains
40 or 41 files.
A surprising result is that the RNN architecture outperformed the
CNN architecture, with a mean F-score of 86.3% versus 85.6%.


#     ------ R1 ------    ------ R2 ------    ------ R3 ------    --- R (mean) ---
      P     R     F       P     R     F       P     R     F       P     R     F
0     91.4  85.8  88.5    92.7  84.0  88.1    91.5  85.9  88.6    91.9  85.2  88.4
1     92.3  88.5  90.3    91.8  88.4  90.1    92.9  87.9  90.3    92.3  88.3  90.2
2     92.1  73.2  81.6    88.5  77.0  82.3    91.6  74.6  82.2    90.7  74.9  82.0
3     85.2  85.4  85.3    84.3  86.2  85.2    87.3  86.0  86.6    85.6  85.9  85.7
4     91.3  84.5  87.8    91.2  84.3  87.6    86.7  88.0  87.4    89.7  85.6  87.6
5     88.8  84.4  86.6    87.3  86.5  86.9    88.5  84.7  86.5    88.2  85.2  86.7
6     95.1  74.1  83.3    95.3  74.4  83.6    96.3  73.5  83.4    95.6  74.0  83.4
7     88.9  84.7  86.7    90.7  82.6  86.5    88.3  85.3  86.8    89.3  84.2  86.7
Tot   90.7  82.3  86.3    90.2  82.7  86.3    90.3  83.0  86.5    90.4  82.9  86.3

Table 4.1: Precision, recall and F-scores for three fully trained RNN models.

#     ------ C1 ------    ------ C2 ------    ------ C3 ------    --- C (mean) ---
      P     R     F       P     R     F       P     R     F       P     R     F
0     97.7  74.6  84.6    97.3  80.2  87.9    97.7  75.1  84.9    97.6  76.6  85.8
1     96.3  81.7  88.4    97.1  79.3  87.3    95.4  84.5  89.6    96.3  81.8  88.4
2     96.9  68.6  80.3    96.1  73.4  83.3    96.0  70.1  81.0    96.3  70.7  81.5
3     93.5  82.5  87.7    94.0  85.2  89.4    91.0  84.8  87.8    92.8  84.2  88.3
4     94.0  77.3  84.8    92.9  79.3  85.6    92.8  80.9  86.4    93.2  79.2  85.6
5     95.3  79.4  86.6    96.2  76.2  85.0    94.0  83.4  88.4    95.2  79.7  86.7
6     97.2  72.8  83.2    97.0  73.8  83.8    97.1  72.8  83.2    97.1  73.1  83.4
7     94.1  77.4  84.9    93.9  76.7  84.4    94.0  77.2  84.8    94.0  77.1  84.7
Tot   95.6  76.5  85.0    95.5  77.9  85.8    94.7  78.4  85.8    95.3  77.8  85.6

Table 4.2: Precision, recall and F-scores for three fully trained CNN models.
Chapter 5

Discussion

Our final F-score for the CNN is 85.6%, which is an unsatisfactory
result compared to the 90.3% figure reported in the referenced article.
Some caveats apply though:

1. We did not implement the fuzziness feature, which may have caused
the detector to become too conservative. The RNN's mean pre-
cision and recall are reasonably balanced at 90.7% and 82.3%, respec-
tively, but the same cannot be said of the CNN's scores. Its
precision and recall are 95.6% and 76.5%, indicating that the detec-
tion threshold is perhaps set too high.

2. We did not schedule the momentum parameter, as suggested in
section 2.6.4.

3. We trained with 20 epochs of patience instead of fixing the
number of epochs to 100.

4. We did not optimize the peak-picking threshold, for reasons
given in section 3.3.4, and instead used the 0.54 figure found in
Madmom verbatim.

Yet even taking these caveats into account, the difference is hard to
explain. Cursory experiments suggest that, indeed, the CNN's peak-
picking threshold is set too high and that an F-score of perhaps 88%
would be achievable by optimizing it. However, that is still two
percentage points short of 90.3%.
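Such an optimization would be a one-dimensional grid search over the threshold, for example along these lines, where evaluate is a hypothetical helper that runs peak-picking with the given threshold and returns the F-score over the validation data:

import numpy as np

def grid_search_threshold(evaluate, lo = 0.3, hi = 0.7, step = 0.02):
    # Return (best F-score, best threshold) for the given evaluation function.
    thresholds = np.arange(lo, hi + 1e-9, step)
    scores = [evaluate(t) for t in thresholds]
    best = int(np.argmax(scores))
    return scores[best], thresholds[best]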
Another interesting result is the lack of improvement for the RNN.
The architecture we have evaluated has one extra bidirectional layer
and uses 266-dimensional input vectors instead of the 144-dimensional
ones used in the 2013 article. Yet it did not outperform it! This result
suggests that adding more than two layers leads to diminishing returns
for RNN architectures for onset detection.
Also worth noticing is that some folds were more difficult for the
architectures than others. The mean F-scores for the third fold for the
RNN and CNN were 81.6% and 82.9%, but 85.4% and 88.4% for the
fourth one. The difference could be explained by there being more
audio files containing notes with soft onsets in the third fold, which
are known to be harder to predict. The small size of the dataset likely
also played a part.
Chapter 6

Conclusion

In this research, we have tried but failed to replicate the results de-
scribed in Schlüter and Böck [1] and Böck, Schlüter, and Widmer [3].
Due to differences in training regimen, implementation architecture
and quality, we cannot claim that our failure is caused by the original
research being impossible to replicate.
What we can claim, though, is that the original research lacked
important details, which made our replication attempt harder than it
otherwise could have been. For example, for the RNN the articles
did not mention what hyperparameter settings were used for training,
and for the CNN the fuzziness feature was not explained in enough
detail for us to reimplement it.
Even though the omission of these details complicated our repli-
cation attempt, we do not believe that the original authors should be
faulted. Documenting every little detail seems like a nigh impossible
task and there would always be something missing. Researchers also
have to abide by publishing rules that limit the lengths of their articles.
We also note that without Böck’s Madmom library and his IS-
MIR 2018 tutorial, our replication attempt would have been orders of
magnitude harder.
We therefore claim that our replication attempt has shown that the
published research alone is not sufficient to replicate state of the art
results in onset detection. Whether more detailed reporting can help
improve the situation is an open question. In our opinion the answer is
in the negative: to replicate, you need to see executable source code;
explanations written in natural language are not enough.


6.1 Directions for future research


In this thesis, we have highlighted some problems that make replicating
state of the art results in onset detection difficult. It would be very
interesting to know whether it is equally difficult to replicate results in
other domains in which deep learning has been used, such as sentiment
analysis, computer vision and bioinformatics.
For improving onset detection in music, one idea is to forego using
spectrograms and instead use the raw waveform as input to the neural
networks. Spectrograms are very useful for general signal processing
but are perhaps unnecessary for neural network training. Possibly,
simplifying the input in that way could boost performance.
Something also quite obvious is that bigger datasets are needed.
The Böck dataset appears to be too small to train very deep networks
with.
Bibliography

[1] Jan Schlüter and Sebastian Böck. “Improved musical onset de-
tection with Convolutional Neural Networks.” In: ICASSP. 2014,
pp. 6979–6983.
[2] Rong Gong and Xavier Serra. “Towards an efficient deep learning
model for musical onset detection”. In: CoRR abs/1806.06773
(2018). arXiv: 1806.06773. url: http://arxiv.org/abs/1806.06773.
[3] Sebastian Böck, Jan Schlüter, and Gerhard Widmer. “Enhanced
peak picking for onset detection with recurrent neural networks”.
In: Proceedings of the 6th International Workshop on Machine Learn-
ing and Music, Prague, Czech Republic. 2013.
[4] Sebastian Böck et al. “Online real-time onset detection with
recurrent neural networks”. In: Proceedings of the 15th International
Conference on Digital Audio Effects (DAFx-12), York, UK. 2012.
[5] Matthew Hutson. Artificial intelligence faces reproducibility crisis.
2018.
[6] Juan Pablo Bello et al. “A tutorial on onset detection in music
signals”. In: IEEE Transactions on speech and audio processing 13.5
(2005), pp. 1035–1047.
[7] David Martin Powers. “Evaluation: from precision, recall and
F-measure to ROC, informedness, markedness and correlation”.
In: (2011).
[8] Sebastian Böck and Gerhard Widmer. “Maximum filter vibrato
suppression for onset detection”. In: Proc. of the 16th Int. Conf. on
Digital Audio Effects (DAFx). Maynooth, Ireland (Sept 2013). 2013.
[9] Jan Schlüter and Sebastian Böck. “Musical onset detection with
convolutional neural networks”. In: 6th international workshop on
machine learning and music (MML), Prague, Czech Republic. 2013.


[10] Simon Dixon. “Onset detection revisited”. In: Proceedings of the


9th International Conference on Digital Audio Effects. Vol. 120. Cite-
seer. 2006, pp. 133–137.
[11] Nick Collins. “Using a Pitch Detector for Onset Detection.” In:
ISMIR. Citeseer. 2005, pp. 100–106.
[12] Juan Pablo Bello et al. “On the use of phase and energy for
musical onset detection in the complex domain”. In: IEEE Signal
Processing Letters 11.6 (2004), pp. 553–556.
[13] Alexandre Lacoste and Douglas Eck. “A supervised classification
algorithm for note onset detection”. In: EURASIP Journal on
Applied Signal Processing 2007.1 (2007), pp. 153–153.
[14] Florian Eyben et al. “Universal onset detection with bidirectional
long-short term memory neural networks”. In: Proc. 11th Intern.
Soc. for Music Information Retrieval Conference, ISMIR, Utrecht, The
Netherlands. 2010, pp. 589–594.
[15] Emre Cakır et al. “Convolutional recurrent neural networks for
polyphonic sound event detection”. In: IEEE/ACM Transactions
on Audio, Speech, and Language Processing 25.6 (2017), pp. 1291–
1303.
[16] Sebastian Böck, Florian Krebs, and Markus Schedl. “Evaluating
the Online Capabilities of Onset Detection Methods.” In: ISMIR.
2012, pp. 49–54.
[17] Sebastian Böck. “Event Detection in Musical Audio”. PhD thesis.
Johannes Kepler University Linz, Linz, Austria, 2016.
[18] Sebastian Böck et al. “Madmom: A new Python audio and music
signal processing library”. In: Proceedings of the 2016 ACM on
Multimedia Conference. ACM. 2016, pp. 1174–1178.
[19] Xavier Glorot and Yoshua Bengio. “Understanding the difficulty
of training deep feedforward neural networks”. In: Proceedings
of the thirteenth international conference on artificial intelligence and
statistics. 2010, pp. 249–256.
TRITA-EECS-EX-2019:340

www.kth.se
