Final Paper
ABSTRACT

Through partial recreation of the Audio-to-Score (A2S) task described in A Holistic Approach To Polyphonic Music Transcription With Neural Networks by Miguel A. Román, Antonio Pertusa, and Jorge Calvo-Zaragoza [1], we have determined that its experiment is reproducible, though with difficulty. Our recreation of the experiment included modifying source code and training a neural network model on the training and test data. Owing to issues with installation, setup, execution of the code, and time and resource constraints, we were only able to run six epochs. While the results we obtained aligned with those of the original paper, we were not able to evaluate the full reproducibility of the experiment, which ran one hundred epochs. We can conclude from our findings that, although inconclusive, our results are a nontrivial step toward replicating an end-to-end polyphonic transcription.

1 Introduction

Reproducibility is essential to science because it establishes the validity of an experiment, allowing for more thorough research while confirming its results. The result of an experiment is considered a building block for other studies only after it is proven valid. Therefore, it is important for us to determine the reproducibility of an experiment. Specifically, we chose to reproduce this paper because of its model's potential in automatic music transcription.

To assess the reproducibility of the paper, we have adopted the definition of reproducibility from the National Academies of Sciences, Engineering, and Medicine: reproducibility is obtaining consistent results using the same input data; computational steps, methods, and code; and conditions of analysis [10].

The source code and data from A Holistic Approach To Polyphonic Music Transcription With Neural Networks are available at https://github.com/mangelroman/audio2score.

1.1 Reproducibility

In determining reproducibility, we have used the following set of questions, which are also adopted from the National Academies of Sciences, Engineering, and Medicine:
● Are the data and analysis laid out with sufficient transparency and clarity that the results can be checked?
● If checked, do the data and analysis offered in support of the result in fact support that result?
● If the data and analysis are shown to support the original result, can the reported result be found again in the specific study context investigated?
We will explore this by comparing our execution of the experiment with that of the original paper.

2 Background

2.1 Background From the Paper

The main purpose of the original experiment, as presented by Román, Pertusa, and Calvo-Zaragoza, was to perform automatic notation-level music transcription, a task evolved from frame-, note-, and stream-level transcription [1]. The input data, as described in the paper, was the spectral information derived from the Hamming-window STFT of the audio found in the Quartets and Chorales datasets of the humdrum-data repository [4]. Their methodology was to implement a single-step convolutional recurrent neural network (CRNN) that would output **kern files, which could then be translated into a western music score format. Aspects of scores not included in the creation of the model were clefs, time signatures, key signatures, and several music notation symbols, due to their potential for misleading the training of the model.
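To make the input representation above concrete, a Hamming-window STFT can be sketched in a few lines of NumPy. This is not the authors' preprocessing code; the frame length, hop size, sample rate, and test tone below are illustrative placeholders, not the parameters used in [1]:

```python
import numpy as np

def hamming_stft(signal, frame_len=512, hop=256):
    """STFT with a Hamming analysis window.

    frame_len and hop are placeholder values, not the analysis
    parameters from the original experiment.
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequency bins
    return np.fft.rfft(frames, axis=1)

# one second of a 440 Hz sine sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
spectrogram = np.abs(hamming_stft(np.sin(2 * np.pi * 440 * t)))
```

Each row of `spectrogram` is one analysis frame; with these placeholder values the bin spacing is 16000/512 = 31.25 Hz, so the 440 Hz tone peaks around bin 14.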
In the original experiment, preprocessing done on the data
included manually editing **kern files for correctness, adding
metronome markings, including a random scaling factor to
account for tempo variability, and using symbolic notation to
condense the number of characters. After this processing, the
method for the A2S task included training and decoding a CRNN
as well as its outputs. The CRNN training was accomplished
using a Connectionist Temporal Classification (CTC) loss
function. To reduce overfitting, 10 batch normalization layers
and 20 dropout layers were added after all the recurrent convolutional layers. However, due to the smaller size of the Chorales dataset, there was a greater risk of overfitting compared to the Quartets dataset. The architecture of the CRNN was as shown in [1].

Figure 2: A chart showing the error per epoch trained. Taken from Figure 4 in [1].

Overall, the original experiment was an expansion of multiple previous automatic music transcription attempts. It improved upon previous attempts by not requiring a frame alignment of the ground truth, by taking an end-to-end approach that keeps errors within one stage, and by using **kern-format output files, which allow for easy translation to the western music score format.
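Since the CRNN is trained with CTC and its output must then be decoded into a symbol sequence, a minimal illustration of the decoding half may help: greedy (best-path) CTC decoding collapses repeated per-frame labels and removes blanks. This framework-free sketch is not taken from the authors' code; the label values and blank index are arbitrary:

```python
def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse repeated frame labels, then drop blanks (best-path CTC decoding)."""
    decoded = []
    prev = None
    for label in frame_labels:
        # a symbol is emitted only when it differs from the previous frame's label
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# frames: blank, A, A, blank, A, B, B  ->  A, A, B
symbols = ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2])  # [1, 1, 2]
```

Note how the blank between the two runs of label 1 lets CTC emit the same symbol twice in a row, which is essential for transcribing repeated notes.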
because of its necessity in creating and running the CRNN mentioned in the paper. The third step, NVIDIA Apex, is an extension for easy mixed-precision and distributed training in PyTorch [7], and the fourth step, FluidSynth, is a software synthesizer that generates audio from MIDI data or plays MIDI files [8]. FFmpeg allows for the processing of audio [9]. The last step, the humdrum path, installs the tools necessary to process humdrum files.

After the necessary installations were completed, the downloaded Chorales and Quartets datasets were stored in a folder. In the same folder was a preparation script, which created a training file, a validation file, a testing file, and an output label file. These represented the data for the CRNN, stored in csv files.

After the data preparation was completed, the model was trained, starting with a shell script that trained the model using default parameters. This script's parameters could be modified for multi-GPU training, mixed precision, or a combination of the two. The testing of the model was also done through shell, and after both testing and training were completed, the model as well as its metrics were output using Python. As stated in Section 2.1, the evaluation metrics were WER and CER, and the output model was the one with the lowest WER. This marked the end of the original experiment code, from which the authors derived a WER of 30.96% and CER of 18.10% for Chorales, and a WER of 21.02% and CER of 13.53% for Quartets.

3 Implementation

Many technical issues arose when attempting to replicate the paper. A major hurdle in this project was the fact that no model was provided. This meant that we needed to complete an entire installation of the repository in order to train a model to validate the results. Therefore, the scope of our project early on was to create a model to see if we could get results similar to the paper's. If we did, we could also change parameters in the training to create interesting results. For instance, given a dataset of music of a different era, we could see how a model trained on another genre of music fares with the automated transcription algorithm.

Early in the process, it became apparent that the program was not going to be smoothly compatible with Windows or macOS, so workarounds had to be found to get the installation running smoothly. Our first solution was to run a virtual machine with the appropriate Ubuntu environment. This led to easier installation of the required packages, some of which were not available on non-Linux systems.

…and other manipulations that were performed using the humdrum tools. This preparation also split the training data from the testing data. Before training began, however, a major flaw in the environment was realized: the virtual environment could not access GPU hardware. This was a huge oversight that required us to change how our environment was set up. Given that our goal was to replicate the model shown in the paper, training a model for 100 epochs with the limited resources partitioned to the virtual machine would take far too long. Additionally, due to the usage of NVIDIA Apex AMP, a PyTorch extension that enabled automatic mixed precision, the training would have to be done on a GPU or would require significant intervention within the code to get a model trained solely on a CPU.

The next solution was to import the entire system into Google Colaboratory (Colab) and use its resources to train our model. Trivial issues with package imports and version conflicts arose, but once those were resolved, training was finally able to begin. However, the switch to Colab introduced an unfortunate limitation to the scope of our project. Because Colab only allowed limited-time GPU usage, we had to limit the total number of epochs our model ran. Due to time and Colab constraints, the model was only trained and validated over 6 epochs. Because the model took around 30 minutes per epoch to train, we eventually ran out of Colab GPU resources while attempting to train the 20 epochs to which we had reduced our scope. This complication was further exacerbated by our failure to document our changes in a way that Colab could easily replicate. Many of the fixes were made during the runtime of a Colab session and not committed to a forked repo. This meant that each time the model had to be changed, it would be changed only for that instance of the Colab, and lines of code would have to be edited manually. Ideally, we would share the model once a Colab session timed out and rotate the training of the model, or we would create different Google accounts so the model could train at all times.

Regardless of these issues, we were able to train a model and evaluate it on a testing set. With our model trained on the Quartets dataset, we obtained a WER of 34.93% and a CER of 26.03%.
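The WER and CER metrics reported here are normalized edit distances, computed over word tokens and characters respectively. The following is a minimal sketch, not the repository's own evaluation code, and the token strings are invented examples:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    dp = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution (free when tokens match)
            prev = cur
    return dp[-1]

def error_rate(ref_tokens, hyp_tokens):
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

# WER compares word tokens, CER compares characters
wer = error_rate("4c 4e 4g".split(), "4c 4f 4g".split())  # 1 substitution over 3 words
cer = error_rate(list("4c4e4g"), list("4c4f4g"))          # 1 substitution over 6 chars
```

Dividing by the reference length is why a single wrong symbol moves WER more than CER: the word sequence is shorter than the character sequence.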
80 epochs in the paper. However, this argument is also abstract
and inconclusive because: 1) Changes were still present in WER
and CER in the paper up to 80 epochs; and 2) The experiment
we ran was inconclusive. To replicate the paper’s results, we
would have needed at least 80 epochs to validate our results
accurately. We also acknowledge that we were limited in time
and resources, so our evaluation of this experiment’s
reproducibility may be incomplete. However, the issues in setting up the experiment existed independently of our limitations and contributed to the difficulty of reproducing it.