
Experiment Reproducibility of A Holistic Approach To Polyphonic Music Transcription With Neural Networks

Alec Burge, Isaac Hong, Joanna Lin, and Heesuk Son
Music Technology, Georgia Inst. of Tech., Atlanta, GA, USA
aburge6@email.com, ihong34@gatech.edu, jlin442@gatech.edu, hson40@gatech.edu

ABSTRACT

Through partial recreation of the Audio-to-Score (A2S) task described in A Holistic Approach To Polyphonic Music Transcription With Neural Networks by Miguel A. Román, Antonio Pertusa, and Jorge Calvo-Zaragoza [1], we have determined that its experiment is reproducible, though with difficulty. Our recreation of the experiment included modifying source code and outputting a neural network model based on training and test data. Because of issues with installation and setup, time and resource constraints, and difficulties executing the code, we were only able to run six epochs. While we obtained results that aligned with the original paper, we were not able to evaluate the full reproducibility of the experiment, which ran one hundred epochs. We conclude from our findings that, although inconclusive, our results are a nontrivial step toward replicating an end-to-end polyphonic transcription.

1 Introduction

Reproducibility is essential to science because it proves the validity of an experiment, allowing for more thorough research while confirming its results. The result of an experiment is considered a building block for other studies only after it is proven valid. Therefore, it is important for us to determine the reproducibility of an experiment. Specifically, we would like to reproduce this paper due to its model's potential in automatic music transcription.

To assess the reproducibility of the paper, we have adopted the definition of reproducibility from the National Academies of Sciences, Engineering, and Medicine: reproducibility is obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis [10].

The source code and data from A Holistic Approach To Polyphonic Music Transcription With Neural Networks are available at https://github.com/mangelroman/audio2score.

1.1 Reproducibility

In determining reproducibility, we have used the following set of questions, which are also adopted from the National Academies of Sciences, Engineering, and Medicine:
● Are the data and analysis laid out with sufficient transparency and clarity that the results can be checked?
● If checked, do the data and analysis offered in support of the result in fact support that result?
● If the data and analysis are shown to support the original result, can the result reported be found again in the specific study context investigated?
We will explore this by comparing our execution of the experiment with that of the original paper.

2 Background

2.1 Background From the Paper

The main purpose of the original experiment, as presented by Román, Pertusa, and Calvo-Zaragoza, was to perform automatic notation-level music transcription, a task evolved from frame-, note-, and stream-level transcription [1]. The input data, as described in the paper, was the time-frequency spectral information derived from the Hamming-window STFT of the audio found in the Quartets and Chorales datasets of the humdrum-data repository [4]. Their methodology was to implement a single-step convolutional recurrent neural network (CRNN) that would output **kern files, which could then be translated into a western music score format. Aspects of scores not included in the creation of the model were clefs, time signatures, key signatures, and several music notation symbols, due to the potential for misleading the training of the model.
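
For illustration, a minimal sketch of this kind of input representation is shown below; the frame length, hop size, and input file name are our own assumptions and not the exact settings used in the original experiment.

    # Minimal sketch of a Hamming-window STFT input representation.
    # Frame length, hop size, and the input file name are illustrative
    # assumptions, not the exact settings of the original experiment.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    sample_rate, audio = wavfile.read("chorale.wav")  # hypothetical input file
    audio = audio.astype(np.float32)
    if audio.ndim > 1:                                # mix down to mono if needed
        audio = audio.mean(axis=1)

    # Short-time Fourier transform with a Hamming window.
    freqs, times, spec = stft(audio, fs=sample_rate, window="hamming",
                              nperseg=1024, noverlap=768)
    magnitude = np.abs(spec)   # time-frequency magnitudes, one column per frame
    print(magnitude.shape)     # (n_frequency_bins, n_frames)
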
In the original experiment, preprocessing done on the data included manually editing **kern files for correctness, adding metronome markings, including a random scaling factor to account for tempo variability, and using symbolic notation to condense the number of characters. After this processing, the method for the A2S task included training and decoding a CRNN as well as its outputs. The CRNN training was accomplished using a Connectionist Temporal Classification (CTC) loss function. To reduce overfitting, 10 batch normalization layers and 20 dropout layers were added after all of the recurrent and convolutional layers. However, due to the smaller size of the Chorales dataset, there was a greater risk of overfitting compared to the Quartets dataset. The architecture of the CRNN was as shown in Figure 1.

Figure 1: A model of the original experiment's CRNN architecture. LSTM is the acronym for Long Short-Term Memory. Taken from Figure 3 in [1].
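
For illustration, the following minimal PyTorch sketch shows how convolutional, recurrent, batch normalization, and dropout layers can be combined with a CTC loss; the layer sizes, label alphabet, and dummy data are our own placeholders and do not match the architecture in Figure 1.

    # Minimal CRNN + CTC sketch (illustrative only; sizes and the label
    # alphabet are assumptions, not the architecture from the paper).
    import torch
    import torch.nn as nn

    class TinyCRNN(nn.Module):
        def __init__(self, n_freq_bins=512, n_labels=50):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, stride=(2, 1), padding=1),
                nn.BatchNorm2d(16),           # batch normalization to reduce overfitting
                nn.ReLU(),
                nn.Dropout(0.1),              # dropout, as described in the paper
            )
            self.rnn = nn.LSTM(input_size=16 * (n_freq_bins // 2),
                               hidden_size=128, bidirectional=True, batch_first=True)
            self.fc = nn.Linear(2 * 128, n_labels + 1)   # +1 for the CTC blank symbol

        def forward(self, spec):              # spec: (batch, 1, freq, time)
            x = self.conv(spec)               # (batch, 16, freq/2, time)
            b, c, f, t = x.shape
            x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (batch, time, features)
            x, _ = self.rnn(x)
            return self.fc(x).log_softmax(-1) # (batch, time, n_labels + 1)

    model = TinyCRNN()
    ctc = nn.CTCLoss(blank=50)                # blank index matches the extra output unit
    spec = torch.randn(4, 1, 512, 100)        # dummy batch: 4 spectrograms, 100 frames
    targets = torch.randint(0, 50, (4, 20))   # dummy symbol sequences
    log_probs = model(spec).permute(1, 0, 2)  # CTC expects (time, batch, classes)
    loss = ctc(log_probs, targets,
               input_lengths=torch.full((4,), 100),
               target_lengths=torch.full((4,), 20))
    loss.backward()

Because CTC aligns the predicted symbol sequence to the input frames internally, no frame-level alignment of the ground truth is required, which is the property the original paper relies on.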

The evaluation metrics, called Word Error Rate (WER) and Character Error Rate (CER), were adopted from an ASR task [2]. These metrics are "defined as the number of elementary editing operations (insertion, deletion, or substitution) needed to convert the predicted sequences into the ground-truth sequences, at the word and character level respectively" [1]. These were calculated after each of the 100 epochs that the model trained, and the model with the lowest WER was used as the test model. The batch sizes of the training data were as follows: "The Chorales dataset, whose samples are full-length chorales, is trained with a batch size of 4. The Quartets dataset, whose samples are small excerpts extracted from the full-length quartets, is trained with a batch size of 16" [1].
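
For illustration, these rates can be computed with a standard Levenshtein edit distance, as in the following sketch; this is our own simplified version, not the evaluation code from the repository.

    # Minimal edit-distance sketch for WER/CER (not the repository's exact code).
    def edit_distance(ref, hyp):
        # Classic dynamic-programming Levenshtein distance:
        # insertions, deletions, and substitutions all cost 1.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[len(ref)][len(hyp)]

    def wer(reference, hypothesis):
        ref_words = reference.split()
        return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

    def cer(reference, hypothesis):
        return edit_distance(list(reference), list(hypothesis)) / len(reference)

    print(wer("4.cc 8dd 8ee", "4.cc 8d 8ee"))   # hypothetical **kern-like tokens
    print(cer("4.cc 8dd 8ee", "4.cc 8d 8ee"))

The functions return fractions; multiplying by 100 gives percentages like those reported in the paper.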

The best model after training was determined to be one with "WER of 30.96% and CER of 18.10% for Chorales, and WER of 21.02% and CER of 13.53% for Quartets" [1]. Most errors arose from differentiating between voices and rhythms, leading to syntax errors in the **kern sequences. Figure 2 shows the correlation between error percentage and number of epochs run for both the Chorales and Quartets datasets.

Figure 2: A chart showing the error per epoch trained. Taken from Figure 4 in [1].

Overall, the original experiment was an expansion of multiple previous automatic music transcription attempts. It improved upon previous attempts by not requiring a frame-alignment of the ground truth, including an end-to-end approach that keeps errors within one stage, and using **kern-format output files, which allows for easy translation to the western music score format.

2.2 Background From the Repository

The code, as presented in the original GitHub repository [3], was based on DeepSpeech2, an end-to-end English and Mandarin speech recognition implementation by Sean Naren [5]. The base system requirement for the original experiment's installations was an Ubuntu 18.04 operating system with the miniconda package and environment manager, and the Python version used in the original experiment was Python 3.7. Figure 3 details the flowchart of the required installations.

Figure 3: A flowchart of the installations necessary to run the A2S task in Python.

The purpose of the first step, the miniconda environment, was to activate a virtual environment that could run the experiment. PyTorch Audio, which allows for GPU acceleration and running deep neural networks [6], includes functions for input/output, signal processing, datasets, and forming the neural network model. This is an important second step because of its necessity in creating and running the CRNN mentioned in the paper. The third step, NVIDIA Apex, is an extension for easy mixed-precision and distributed training in PyTorch [7], and the fourth step, FluidSynth, is a software synthesizer that generates audio from MIDI data or plays MIDI files [8]. FFmpeg allows for the processing of audio [9]. The last step, the humdrum path, installs the tools necessary to process humdrum files.
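
For illustration, a short check along the following lines (our own addition, not part of the original repository; the humdrum command name shown is only indicative) can confirm that the main dependencies are visible before attempting to train.

    # Quick dependency check (our own sketch, not part of the original repository).
    import importlib
    import shutil

    # Python packages needed by the training code.
    for package in ["torch", "torchaudio", "apex"]:
        try:
            module = importlib.import_module(package)
            print(f"{package}: found (version {getattr(module, '__version__', 'unknown')})")
        except ImportError:
            print(f"{package}: MISSING")

    # External tools for synthesis, audio processing, and **kern handling.
    # The humdrum command name below is indicative; installations may differ.
    for tool in ["fluidsynth", "ffmpeg", "hum2mid"]:
        location = shutil.which(tool)
        print(f"{tool}: {location or 'MISSING'}")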

After the necessary installations were completed, the downloaded Chorales and Quartets datasets were stored in a folder. In the same folder was a preparation script, which created a training file, a validation file, a testing file, and an output label file. These files, stored in CSV format, defined the data splits and labels used by the CRNN.

After the data preparation was completed, the model was trained, starting with a shell script that trained the model using default parameters. This script's parameters could be modified to be multi-GPU, mixed precision, or a combination of the two. The testing of the model was also done through the shell, and after both testing and training were completed, the model as well as its metrics were output using Python. As stated in Section 2.1, the evaluation metrics were WER and CER, and the output model was the one with the lowest WER. This marked the end of the original experiment code, from which they derived a WER of 30.96% and CER of 18.10% for Chorales, and WER of 21.02% and CER of 13.53% for Quartets.

3 Implementation

Many technical issues arose when attempting to replicate the paper. A major hurdle in this project was the fact that no model was provided. This meant that we needed to complete an entire installation of the repository in order to train a model to validate the results. Therefore, the scope of our project early on was to create a model to see if we could get similar results to the paper. If we did, we could also change parameters in the training to create interesting results. For instance, given a dataset of music of a different era, we could see how a model trained on another genre of music fares with the automated transcription algorithm.

Early in the process, it became apparent that the program was not going to be smoothly compatible with Windows or macOS, so workarounds had to be found in order to get the installation running smoothly. Our first solution was to run a virtual machine with the appropriate Ubuntu environment. This led to easier installation of the required packages, some of which were not available on non-Linux systems.

Once installation was complete, training the model became the next step. First, the dataset had to be prepared. This included trimming the audio files, removing ties from the **kern files, and other manipulations that were performed using the humdrum tools. This preparation also split the training data from the testing data. Before training began, however, a major flaw in the setup was discovered: the virtual machine could not access the host's GPU hardware. This was a huge oversight that required us to change how our environment was set up. Given that our goal was to replicate the model described in the paper, training for 100 epochs with the limited resources partitioned to the virtual machine would have taken far too long. Additionally, due to the usage of NVIDIA Apex AMP, a PyTorch extension that enables automatic mixed precision, the training would have to be done on a GPU or require significant intervention within the code to get a model trained solely on a CPU.
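
For illustration, a check such as the following, which is our own sketch based on the publicly documented Apex AMP interface rather than code from the repository, makes the GPU requirement explicit.

    # Sketch: confirm a CUDA GPU is visible and that Apex AMP can wrap a model.
    # The model and optimizer here are placeholders, not the paper's CRNN.
    import torch

    if not torch.cuda.is_available():
        raise SystemExit("No CUDA GPU visible; Apex AMP mixed-precision training "
                         "cannot run in this environment (e.g., inside our VM).")

    from apex import amp  # NVIDIA Apex, installed separately [7]

    model = torch.nn.Linear(10, 2).cuda()          # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # opt_level "O1" enables automatic mixed precision for most layers.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    loss = model(torch.randn(4, 10).cuda()).sum()
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()                     # loss scaling handled by AMP
    optimizer.step()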

The next solution was to import the entire system into Google Colaboratory (Colab) and use its resources to train our model. Trivial issues with package imports and version conflicts arose, but once those were resolved, training was finally able to begin. However, the switch to Colab introduced an unfortunate issue for the scope of our project. Because Colab only allowed for limited-time GPU resource usage, we had to limit the total number of epochs our model ran. Due to time and Colab constraints, the model was only trained and validated for 6 epochs. Because the model took around 30 minutes per epoch to train, we eventually ran out of Colab GPU resources while attempting to train for the 20 epochs to which we had reduced our scope. This complication was further exacerbated by a failure on our part to properly document our changes in a way that Colab could easily replicate the fixes. Many of the fixes were made during the runtime of the Colab notebook and not committed to a forked repository. This meant that each time the model had to be changed, it would be changed just for that instance of the Colab notebook, and lines of code would have to be changed manually. Ideally, we would have been able to share the model once a Colab session timed out and rotate the training of the model, or create different Google accounts so that the model could be trained at all times.

Regardless of these issues, we were able to get a trained model and evaluate it on a testing set. With our model trained on the quartet dataset, we were able to get a WER of 34.93% and a CER of 26.03%.

Figure 4: Ground truth **kern (right) and the model's hypothesis **kern (left). According to the algorithm in the paper, the WER is 12.30% and the CER is 8.45%.

The final task was to see if the model could produce a transcription from audio. A script was provided that allows a .wav file to be converted to a **kern file using the model, which could then be converted to MusicXML. While the **kern file was able to be created, the conversion between **kern and XML proved difficult, as the exported file required significant manual cleaning and a proper understanding of the **kern format.
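
For illustration, a library such as music21, which is our own choice for this example and may differ from the conversion path used by the repository's scripts, can parse a **kern file and write MusicXML.

    # Sketch: convert a predicted **kern file to MusicXML using music21.
    # music21 is our own choice for illustration; the original scripts may use
    # a different conversion path, and the predicted **kern may need manual
    # clean-up before it parses at all.
    from music21 import converter

    score = converter.parse("prediction.krn")          # hypothetical model output
    score.write("musicxml", fp="prediction.musicxml")  # export a MusicXML score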

4 Conclusion

Throughout the experiment, we questioned the reproducibility of the paper. Trivial issues, like debugging and trouble with the initial installation, along with more pertinent problems, like finding the best operating system and problems within the source code, reduced the chances of reproducing the results.

Through all the challenges we faced, we found that the original experiment is reproducible, but with many caveats. Syntax errors, invalid parameters, missing packages, and other issues prevented us from reproducing the experiment fully. However, through scanning and studying the source code, it was transparent enough for us to diagnose and change lines of code to get the experiment running. In modifying the source code, important parameters had to be manipulated in order to get the training to begin. Despite these manipulations to the experiment, we were able to produce nearly identical WER and CER when comparing our model and the paper's model at 6 epochs. This discovery is nontrivial, and one could argue that it proves a certain degree of reproducibility; there is only a small difference in WER and CER between 6 epochs and 80 epochs in the paper. However, this argument is also abstract and inconclusive because: 1) changes were still present in the paper's WER and CER up to 80 epochs; and 2) the experiment we ran was inconclusive. To replicate the paper's results, we would have needed at least 80 epochs to validate our results accurately. We also acknowledge that we were limited in time and resources, so our evaluation of this experiment's reproducibility may be incomplete. However, the issues in setting up the experiment existed despite our limitations and contributed to the difficulty of its reproducibility.

To make this experiment easier to reproduce, we recommend updating code versions, checking for syntax correctness, and providing details for all source code. This would allow us to smoothly execute all code without errors due to misspelling or version/package/system differences. We would also like to offer one potential extension of the original experiment that we would have explored if we had been able to reproduce this experiment fully: to train the model on datasets with genres of music other than classical. The training of this model on wider ranges of music would expand its application to include different instruments and rhythms, and we predict that this new training would aid the model in the differentiation of different voices. Overall, because of the model's potential in the realm of automatic music transcription, we believe that this experiment is worth revisiting fully and editing so that it becomes an easily reproducible task.

REFERENCES
[1] Miguel A. Román, Antonio Pertusa, and Jorge Calvo-Zaragoza. 2019. A holistic approach to polyphonic music transcription with neural networks. arXiv. DOI: https://doi.org/10.48550/ARXIV.1910.12086.
[2] Miguel A. Román, Antonio Pertusa, and Jorge Calvo-Zaragoza. 2018. An End-to-End Framework for Audio-to-Score Music Transcription on Monophonic Excerpts. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR 2018), Paris, France.
[3] https://github.com/mangelroman/audio2score.
[4] https://github.com/humdrum-tools/humdrum-data.git.
[5] https://github.com/SeanNaren/deepspeech.pytorch.
[6] https://github.com/pytorch/pytorch#installation.
[7] https://github.com/NVIDIA/apex.
[8] https://github.com/FluidSynth/fluidsynth.
[9] https://github.com/FFmpeg/FFmpeg.
[10] National Academies of Sciences, Engineering, and Medicine. 2019. Reproducibility and Replicability in Science. Chapter 3, Understanding Reproducibility and Replicability. Washington, DC: National Academies Press. Available from: https://www.ncbi.nlm.nih.gov/books/NBK547546/.
