
Human or Machine? Classical Piano Music Piece Generation with Bidirectional GRU

Marin Beijerbacht (9884270), Rivka Vollebregt (8842507), Dimitris Christodoulou (5141761), Mart Schilthuis (1083147), Nikos Petsios (1987380)
Utrecht University

Abstract—Music plays a crucial role in many forms of entertainment, such as movies, games and television. These fields, and more specifically the music industry, have an extensive need for a system that can create music with ease. Such a system could also be used by non-musicians who want to create their own music without any prior musical knowledge. A rather interesting topic in artificial intelligence is how to generate such content from data alone. In this study, we experimented with a Recurrent Neural Network (RNN) with a sophisticated gating mechanism in order to generate music that is similar to human-made music. More specifically, we tuned a bidirectional GRU network that was trained on MIDI files and made to discover patterns in order to create music. As we trained the model, we observed that the later epochs improved significantly in terms of generating music that is as close as possible to human music.

Index Terms—GRU, music, deep learning, RNN

I. INTRODUCTION

Music generation is a lengthy and difficult process for humans. Combining technicality with emotion is something that only a moderate number of people can do well. It is therefore no surprise that the machine learning community has tried to emulate this process, in particular through deep learning. Multiple models exist that try to capture the feeling of music composition, such as DeepMind's WaveNet or OpenAI's MuseNet [1], [2].

In this paper, a neural network model that generates classical piano music is created using bidirectional Gated Recurrent Unit (GRU) layers. The GRU, much like Long Short-Term Memory (LSTM), is a memory cell for Recurrent Neural Networks (RNNs), with the former providing better performance, especially on small and sparse datasets. The purpose of this model is to create music that sounds as close as possible to human-made music, and to benchmark this type of approach against other models, such as WaveNet and LSTM neural networks.

The data used is comprised of MIDI files of classical piano pieces from the MAESTRO dataset. The MAESTRO dataset is part of the Magenta project, an open-source research project that explores the role of machine learning as a tool in creative processes.

The data section shows how the dataset was parsed and how its size played an important role in the training of the model and in its architecture. In the methods section, the training and hyperparameter tuning processes of the model will be elaborated upon. In the results section, the generated songs are presented and analysed, and a user test is conducted to determine whether the generated songs can be distinguished from actual man-made music. The limitations of the model as well as avenues for future work are discussed as well.

II. RELATED WORK

GRUs belong to the class of recurrent networks. Because the output from a previous step is used as input to the current step, these networks lend themselves to tasks where the output depends on what came before it, such as text classification, speech recognition or music generation [3]–[5]. In 1989, Todd used the first recurrent architecture for the task of music generation [6]. Since then, many papers on music generation with different RNN architectures have been published. The ones using GRUs often compare the model to an LSTM model, the other gated type of recurrent network.

In a paper by Xie, three RNN structures are tested on the task of music generation [7]. The data used are MIDI files encoded using a dictionary created from all used notes. On this encoded data, a simple RNN, a GRU and an LSTM architecture are trained. The GRU came out as the best model due to its faster convergence compared to the other models, even though all models eventually reached 80% accuracy.

In Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, both LSTM and GRU are tested again [8]. Both models are evaluated on speech signal modeling and music modeling to see how they perform relative to each other, since there are many similarities between the LSTM and GRU. On the music modeling task, the GRU outperformed the LSTM: it makes faster progress in terms of the number of updates and of CPU time. LSTMs performed better on the speech task, leading to the conclusion that both models are comparable but have their own advantages and disadvantages depending on the task.

A different approach to generating music is to use the waveforms directly as input to the network, as is usually done for speech signal modeling tasks. This approach was used in a paper by Nayebi and Vitelli, where once again an LSTM and a GRU were compared [9]. In that paper, the LSTM model was found to generate more musically plausible samples than the GRU; the LSTM's better capability of representing the waveforms let it outperform the GRU there.
In this paper, MIDI data is used and processed in the same manner as in the paper by Xie [7]. However, where the papers discussed above focus on the computational performance of the models and on the differences between the LSTM and the GRU, this paper focuses on tuning a GRU model to create music that is as close as possible to human-made music.

III. DATA

When looking into a possible dataset, there were a few requirements in mind. (1) The dataset needed to be manageable and easily parseable: it should be possible to extract the needed data from a datapoint and to use it without the need for many third-party libraries. (2) The dataset needed to be big enough: a large number of datapoints was required because the dataset would be used for both training and validating the model.

There were three possible datasets that seemed useful for this research: the Lakh Pianoroll Dataset (LPD) [10], [11], the Million Song Dataset (MSD) [12] and the MAESTRO Dataset (MD) [13]. The LPD is a big dataset with songs stored as 'pianorolls'; however, these needed an extra parsing step before they could be used by the model, so it was not used. The MSD is also a big dataset but only contains derived features and no audio, so an extra parsing step would be needed to obtain audio from the model output. The MD does not have these issues and was therefore used for our model. The MD contains two hundred hours of piano performances in MIDI and audio form from the years 2000 to 2018, divided into songs of 3 to 40 minutes.

Before feeding the MIDI into the model, the data was preprocessed with Music21, a "Python-based toolkit for computer-aided musicology" [14]. This toolkit was used to convert the MIDI files into 'Scores': containers that store the notes and chords of a piece, together with the timing and length of each note and chord. After this, the preprocessing continues by walking through the entire score and appending the notes, and the notes of chords, to an array. Finally, the notes are divided into sequences of one hundred samples, so the training samples have length one hundred. These are taken from the array and encoded with a dictionary, after which the dictionary can be used to train the network on the encoded samples.
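As a rough illustration of this preprocessing pipeline (not the authors' exact code), the sketch below parses MIDI files with Music21, collects note and chord symbols, and cuts the encoded stream into length-100 training windows; the file list midi_paths and the chord representation are assumptions.

from music21 import converter, note, chord

def extract_notes(midi_paths):
    """Parse MIDI files with Music21 and flatten them to note/chord symbols."""
    notes = []
    for path in midi_paths:              # midi_paths: list of MAESTRO .midi files (assumed)
        score = converter.parse(path)    # convert the MIDI file into a Music21 score
        for element in score.flatten().notes:
            if isinstance(element, note.Note):
                notes.append(str(element.pitch))
            elif isinstance(element, chord.Chord):
                # represent a chord by the pitches it contains
                notes.append('.'.join(str(p) for p in element.pitches))
    return notes

def encode_sequences(notes, seq_len=100):
    """Build the note-to-integer dictionary and cut the stream into length-100 windows."""
    vocab = {n: i for i, n in enumerate(sorted(set(notes)))}
    encoded = [vocab[n] for n in notes]
    inputs, targets = [], []
    for i in range(len(encoded) - seq_len):
        inputs.append(encoded[i:i + seq_len])  # 100-note input window
        targets.append(encoded[i + seq_len])   # the note that follows it
    return inputs, targets, vocab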
The data used for training is a subset of the MAESTRO dataset: only the songs from 2018. These are 93 songs with a duration between 3 and 40 minutes. The entire dataset was not used because training the model on it would take days. The training was validated with cross-validation, in order to avoid having to find a similar dataset and to reduce the workload.

IV. APPROACH

A. Methodology

The main goal of the neural network was to compose compelling music that is indistinguishable from actual music. The architecture of the model, the training dataset, the chosen hyperparameter values and the evaluation methods therefore played a very important role. The approach of this research was, unlike previous research on generating music with neural networks, mostly result-based. The evaluation of the model was therefore based mostly on the quality of the generated music, which was measured in multiple ways: a Turing test (whereby participants listened to a human song and a network-generated song and guessed which one was the 'real' one), a melody evaluation based on the produced sheet music, and an evaluation of the note distribution of the songs. Also, the model was tuned on the training loss instead of the validation loss; it did not matter very much if the model slightly overfitted, as long as the produced music was of good quality. First, the model was designed by tuning the hyperparameters and looking at both the resulting model loss and the (subjectively evaluated) quality of the generated music. Once the model was optimised, it was run on a dataset of 12.6 million bytes (93 songs ranging from 3 to 40 minutes) for 150 epochs with a batch size of 150. The model produced songs with a length of 50 seconds, based on a random starting note from the dataset.
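The paper does not spell out the sampling procedure, but generating a roughly 50-second piece from a random starting window amounts to an autoregressive loop. The sketch below assumes the trained Keras model, the vocab dictionary and the 100-note windows from the earlier sketch; the number of sampled notes is an illustrative choice, not a value from the paper.

import numpy as np

def generate_notes(model, seed, vocab, n_notes=200, seq_len=100):
    """Autoregressively sample notes from a trained model, starting from a seed window."""
    idx_to_note = {i: n for n, i in vocab.items()}
    window = list(seed)                       # seed: one encoded 100-note window from the dataset
    generated = []
    for _ in range(n_notes):                  # n_notes is a stand-in for "about 50 seconds of music"
        x = np.reshape(window, (1, seq_len, 1)) / float(len(vocab))
        probs = model.predict(x, verbose=0)[0]
        next_idx = int(np.argmax(probs))      # greedy choice; sampling from probs is an alternative
        generated.append(idx_to_note[next_idx])
        window = window[1:] + [next_idx]      # slide the window forward by one note
    return generated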
1) Model Architecture: The GRU model was created using the Keras library in Python. The initial choice was a gated recurrent neural network. The choice for a gated RNN was twofold: to prevent vanishing gradients, and because music generation requires a memory unit that can 'remember' more than one previous state, which a gated RNN is very suited for. The first choice was a Long Short-Term Memory (LSTM) network, which was eventually dropped in favour of a Gated Recurrent Unit (GRU) architecture. While both memory cells are better suited to the task of music generation than plain RNNs because of the lengthy nature of song tracks, LSTM networks have the disadvantage of taking longer to train, as they contain an additional output gate and thus require more computation per step.

The final model architecture contained one bidirectional GRU layer and one regular GRU layer with 256 nodes per layer (Figure 1). The dropout rate was 0.1 and the output used a softmax activation function. Loss was measured with cross-entropy and the Adam optimizer was used.

Fig. 1. Design of the final bidirectional GRU model
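A minimal Keras sketch of the architecture described above, assuming the 100-note windows and note dictionary from the data section; the layer types, sizes, dropout rate, activation, loss and optimizer are taken from the text, while the exact placement of the dropout layers is an assumption.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, GRU, Bidirectional, Dropout, Dense

def build_model(vocab_size, seq_len=100):
    """Bidirectional GRU followed by a plain GRU, 256 units per layer, softmax output."""
    model = Sequential([
        Input(shape=(seq_len, 1)),
        Bidirectional(GRU(256, return_sequences=True)),
        Dropout(0.1),                                 # dropout rate reported in the paper
        GRU(256),
        Dropout(0.1),                                 # placement of this second dropout is assumed
        Dense(vocab_size, activation="softmax"),      # one probability per note in the dictionary
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model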

2) Hyperparameter tuning: Optimising the model design by tuning hyperparameters such as the number and size of layers, or the optimal number of learning cycles, was done using the validation dataset. The train/validation ratio was 90/10, with a training set of 94 songs and a validation set of 9 songs. The evaluation was done manually, by listening to the generated tracks and determining whether they sound like human-composed music. When the model produced well-sounding tracks, training stopped and the model was tested with participants. The quality of the generated music was tested after 10, 20, 50, 100 and 135 cycles of learning.

Because of how long RNNs take to train, it was clear from the beginning that GridSearchCV with an exhaustive hyperparameter list would take too long to converge on the optimal set of hyperparameters. Therefore, a small set of hyperparameters was chosen and given values driven by other research in order to find the best performing neural network. The hyperparameters chosen were:
1) Number of layers
2) Size of each layer
3) The neural network optimizer
4) The number of dropout layers and the dropout rate
5) The activation function
6) The learning rate
The hyperparameter setting that gave the lowest loss was chosen for the final model. Table I shows the history of the hyperparameter settings.
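In Keras, the training run described above comes down to a single fit call. The sketch below assumes the inputs, targets, vocab and model objects produced by the earlier sketches, and uses validation_split as a stand-in for the paper's song-level 90/10 split.

import numpy as np
from tensorflow.keras.utils import to_categorical

# inputs: integer-encoded 100-note windows, targets: the indices of the following notes
# (both assumed to come from encode_sequences above).
x = np.reshape(inputs, (len(inputs), 100, 1)) / float(len(vocab))
y = to_categorical(targets, num_classes=len(vocab))

history = model.fit(
    x, y,
    validation_split=0.1,   # stand-in for the 90/10 train/validation split
    epochs=150,             # the final run used 150 epochs
    batch_size=150,         # and a batch size of 150
)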
3) User study: In the user study, six participants judged the GRU-produced songs in a Turing test. Participants were asked to listen to two pieces of music, one from the MAESTRO dataset and the other generated by our GRU network, and to indicate which song they thought was made by a human. They also rated the GRU-produced song on a scale of 1 to 5 for musical quality, where 1 was very poor and 5 was excellent music quality. Each participant completed 5 trials. The songs were generated using a random seed from the model trained for 135 epochs, to have the best sounding music from the model. The control songs from the MAESTRO dataset were chosen randomly. Each song was played for 20 seconds, after which it was manually stopped by the experimenter.

B. Evaluation

The evaluation of the model was done with multiple measures, both subjective and objective. The objective evaluation was done by visualising the generated songs as sheet music and as the notes played. With this, the songs generated by the final model could be compared to songs generated by a previous (less trained) model and to human-produced songs. The model was also evaluated during training based on the loss.

The subjective evaluation of the model was the main part of the evaluation and was done with a Turing test. There were 10 participants who did 5 trials per person. In each trial the model generated a piece of music based on the first few seconds of a track, and the participant listened to both the original (human-made) track and the generated track and had to determine which one was made by the model. The exact question that participants had to answer was: "Is track 1 or track 2 created by a human musician?". The tracks were randomised, so in some trials the first track was model-generated and in others the second track. The model was categorised as a good generative model if the participants' accuracy at identifying the machine-generated track was below chance level (50%). After the Turing test, the participants were told which songs were model-generated and had to rate the musical quality of those songs on a scale of 1 to 5, with 1 being very bad quality and 5 being outstanding quality.

V. RESULTS

A. Model training

The validation and training loss for the final model were recorded during training and plotted in Figure 2. The training loss converges to 0.6 over the 150 epochs the model was trained. Notable is the increase of the validation loss while the training loss decreases, which indicates that the model is likely overfitting to the training data. Also notable is the spike in loss at epoch 62; this, together with the fact that in some training runs NaN values were encountered, indicates that the model is having trouble with exploding and vanishing gradients. The fact that the model is not widely generalisable is not a focus, as the goal is to generate music that sounds plausible.

Fig. 2. Model loss and validation loss per epoch
TABLE I
HYPERPARAMETER TUNING OF THE MODEL

Model Type | Optimizer | Epochs | Batch Size | Dropout | Activation Function | # of Layers | Layer Size | Loss Function | Loss
GRU | RMSprop | 25 | 128 | 0.3 | Softmax | 1 | 256 | Categorical Crossentropy | 4.0069
GRU | RMSprop | 25 | 128 | 0.3 | Softmax | 1 | 256 | Kullback-Leibler divergence | 4.0210
GRU | Adam | 25 | 128 | 0.5 | Hidden - ReLU / Output - Softmax | 2 | 256 | Categorical Crossentropy | 4.9109
GRU | RMSprop | 50 | 128 | 0.3 | Softmax | 1 | 512 | Categorical Crossentropy | 3.6265
Bidirectional GRU | RMSprop | 30 | 128 | 0.5 | Softmax | 2 | 256/64/128 | Categorical Crossentropy | 4.2187
Bidirectional GRU | RMSprop | 50 | 128 | 0.1 | Softmax | 1 | 512 | Categorical Crossentropy | 3.2375
Bidirectional LSTM | RMSprop | 50 | 64 | 0.3 | Softmax | 2 | 512 | Categorical Crossentropy | 0.6
LSTM | RMSprop | 20 | 128 | 0.1 | Softmax | 1 | 256 | Categorical Crossentropy | 5.4286
Bidirectional GRU with Dense Layer | Adam | 25 | 128 | 0.5 | Softmax | 2 | 256/128 | Categorical Crossentropy | 4.5809
2-Layer Bidirectional GRU | Adam | 25 | 128 | 0.5 | Softmax | 2 | 256/256 | Categorical Crossentropy | 4.5981
3-Layer Bidirectional GRU | Adam | 25 | 128 | 0.5 | Softmax | 3 | 256/256/256 | Categorical Crossentropy | 3.8271
2-Layer Bidirectional GRU with single GRU layer | Adam | 25 | 128 | 0.5 | Softmax | 3 | 256/256/256 | Categorical Crossentropy | 3.7823
Single-layer Bidirectional LSTM | Adam | 20 | 128 | 0.3 | Softmax | 1 | 64 | Categorical Crossentropy | 4.0937
Single-layer Bidirectional LSTM | SGD | 20 | 128 | 0.3 | Softmax | 2 | 64 | Categorical Crossentropy | 4.4412
Bidirectional GRU with single GRU layer | SGD | 25 | 128 | 0.5 | Softmax | 2 | 256/64 | Categorical Crossentropy | 4.6531
Bidirectional GRU with single GRU layer | RMSprop | 25 | 128 | 0.5 | Softmax | 2 | 64/64 | Categorical Crossentropy | 4.6517

B. Visualizing Generated Songs

1) Sheet music of Generated Songs: The songs that the model generated after 36 epochs and after 136 epochs (the final model) were very different: the song from 136 epochs showed many musical qualities, such as a melody, different note lengths and both-handedness (both the left and the right hand are used in this song), which the song generated after 36 epochs did not show at all (Figures 3, 4).

Fig. 3. Sheet music of a generated song after 36 epochs

Fig. 4. Sheet music of a generated song after 136 epochs

2) Notes distribution: As with the sheet music of the generated songs, the notes distribution also shows well how the music quality progressed during model training. In the song generated after 36 epochs (Figure 5), there is little variation in the notes and no melody is visible, while in the song generated after 136 epochs (Figure 6, one of the songs used in the user Turing test) there is a variety of notes with different intervals, and it is very similar to the human-generated song in Figure 7. One difference is that the model generates shorter notes, whereas the human-produced song has a mix of longer and shorter notes.

Fig. 5. Notes distribution of a generated song after 36 epochs

Fig. 6. Notes distribution of a generated song after 136 epochs

Fig. 7. Notes distribution of a human produced song
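The paper does not say how the note-distribution figures were produced; one way to reproduce such a view is to draw each note as a short segment at its pitch, so that both pitch variety and note length are visible. The sketch below (Music21 plus matplotlib, with the MIDI path as an assumed input) is an illustration, not the authors' plotting code.

import matplotlib.pyplot as plt
from music21 import converter

def plot_note_distribution(midi_path, title):
    """Draw each note as a horizontal segment (onset to offset) at its MIDI pitch."""
    score = converter.parse(midi_path)
    fig, ax = plt.subplots(figsize=(10, 4))
    for n in score.flatten().notes:
        for p in n.pitches:  # covers both single notes and chords
            ax.hlines(p.midi, float(n.offset), float(n.offset) + float(n.quarterLength))
    ax.set_xlabel("Time (quarter lengths)")
    ax.set_ylabel("MIDI pitch")
    ax.set_title(title)
    plt.show()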
C. User Study Results

The accuracy with which participants correctly identified the RNN-generated song in the Turing test was 93.3%. The musical quality of the generated songs was rated with a mean value of 3.1, slightly above the middle of the scale. A one-sample t-test showed that this rating did not differ significantly from the middle of the scale (p = 0.225).
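This statistical check can be reproduced with a one-sample t-test. The sketch below assumes the 1-5 ratings collected over the trials are available as a list; using the scale midpoint of 3 as the null value is an inference from the discussion, not something stated explicitly in the paper.

from scipy import stats

def rating_significance(ratings, midpoint=3.0):
    """One-sample t-test of the 1-5 quality ratings against the scale midpoint."""
    t_stat, p_value = stats.ttest_1samp(ratings, popmean=midpoint)
    return t_stat, p_value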
VI. DISCUSSION
In this paper the generation of MIDI samples of piano music was discussed. After compiling a bidirectional 2-layer GRU model and training it, the generated samples were evaluated. From this it was found that, although the generated music was promising, as it displayed some structure and a developing melody, it was not able to make people think it was man-made.

The results obtained during the user study indicate that the samples used in the test trials were not indistinguishable from the human-made music from the test set, as 93% of the samples were identified correctly (only 2 of the 30 trials were misidentified). Despite the small sample of users, the results from the Turing test strongly indicate that the generated music sounds worse than human-generated songs. This means that the generated samples are not musically plausible enough. An alternative explanation, however, is that there is a systematic difference between the model-generated and human-generated songs that has little to do with musical likeness: the human-generated music contained dynamics, meaning that the music got softer and louder throughout the song, which the network-generated music did not have. This is supported by the fact that in the first trial one participant did think the generated sample was human, but in the following trials could distinguish the human from the generated music based on this dynamics artifact: the generated samples distinguished themselves from the test samples in that all notes were played at the same loudness and included little to no sustained notes. Because the Turing test might in this case be biased, a better evaluation method is probably the sheet music and note visualisation. From the sheet music it was apparent that the generated song contained a melody, different note lengths and both-handedness, which are all qualities found in human-generated music. The note visualisation showed that while there were some differences between the generated and human-produced music, especially in the length of the notes, there was also a lot of similarity, which could be heard in both songs. The user study judged the generated songs to be of medium quality. The statistical analysis showed that although the user judgement was slightly above medium (3.1), this was not significant, meaning that the perceived music quality was medium. As an improvement on the current methods, it would be better to also ask the users about the perceived musical quality of the human-produced songs, so that a two-sample t-test can establish whether the quality of the songs is similar or not. Another improvement would be a Turing test where the human samples are mixed to the same loudness as the generated samples, which could help eliminate this factor and make it harder for participants to learn the network's artefacts.

From the model results it became apparent that the model overfitted on the training data, as the increase in validation loss and the decrease in training loss showed. Two factors likely contributing to this are a learning rate that is too high, which is known to cause increasing validation loss after just a few epochs, and the small dataset used to train the model. The model was trained on this smaller dataset due to resource limitations, as training time would otherwise be excessive. In future work, tuning the parameters to decrease validation loss might result in better generations; for example, a higher dropout rate could be implemented, combined with a larger training dataset. This tuning of parameters could then also be done using grid search with cross-validation, to ensure a thorough search of the parameter space.
What should be noted is that loss might not be the best metric to judge the performance of the model on an open-ended generation task. When generating a text or music sample, the product might still be correct (in this case, sound like plausible music) and have learned correct patterns; however, if it applies those patterns in a different manner, the loss when compared to the test data will still be high. Thus, the high validation loss was not a very important result in this study, and perhaps this indicates that for future research a different, more result-based evaluation metric might be better suited to the field of artificial music generation.
There are some technical issues with the current model that inhibited optimal training and might have prevented further improvement of the generated songs. First, the issue with the exploding and vanishing gradients encountered during training could be addressed in future work. Currently the training time is impacted by this, and in some training runs NaN values were encountered when training for more than 130 epochs. To handle the vanishing gradients, implementing a ReLU activation layer in the model design might be beneficial. For the exploding gradients, implementing gradient clipping should help prevent them. Another interesting feature that might really benefit the music generation is to incorporate reinforcement learning and/or attention in the model, as it would allow the model to learn more of the overall structure in the data.
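Gradient clipping of the kind suggested here is available directly on the Keras optimizers; a minimal sketch, assuming the model from the architecture sketch and an illustrative clipping threshold:

from tensorflow.keras.optimizers import Adam

# Clip gradient norms so a single unstable update cannot blow up the weights;
# the clipnorm value of 1.0 is an illustrative choice, not one taken from the paper.
optimizer = Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(loss="categorical_crossentropy", optimizer=optimizer)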
VII. CONCLUSION

We have built a model to generate music using the MAESTRO dataset, which contains piano music in the form of MIDI files. The generated music is, to the subjective ear, of moderately good musical quality, but it is still not on the same level as human-generated music. After 136 epochs of training the model started to create some melody that looks similar to human-generated music, but its output could still be differentiated from human-generated music in a Turing test.

For future work, this model could be run with more computational power, as training was time-consuming. This would also allow the model to be more complex, which might help the quality of the generated songs. In addition, we could create a bidirectional LSTM neural network for further experiments and compare it with the model tested in this research.
In this study we experimented with generating music for a single instrument, which makes further experimentation with adding multiple instruments a great challenge. Adding multiple instruments creates the need for a better architecture and would probably end up with better results. All in all, the bidirectional GRU network is reasonably good at generating music, but it still has a long way to go to match human-generated music.
REFERENCES

[1] DeepMind, "A generative model for raw audio."
[2] C. M. L. Payne, "MuseNet," OpenAI.
[3] A. Shewalkar, D. Nyavanandi, and S. A. Ludwig, "Performance evaluation of deep neural networks applied to speech recognition: RNN, LSTM and GRU," Journal of Artificial Intelligence and Soft Computing Research, vol. 9, no. 4, pp. 235–245, 2019.
[4] A. N. Shewalkar, "Comparison of RNN, LSTM and GRU on speech recognition data," 2018.
[5] J.-P. Briot, "From artificial neural networks to deep learning for music generation: history, concepts and trends," Neural Computing and Applications, vol. 33, no. 1, pp. 39–65, 2021.
[6] P. M. Todd, "A connectionist approach to algorithmic composition," Computer Music Journal, vol. 13, no. 4, pp. 27–43, 1989.
[7] J. Xie, "A novel method of music generation based on three different recurrent neural networks," in Journal of Physics: Conference Series, vol. 1549, p. 042034, IOP Publishing, 2020.
[8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," arXiv preprint arXiv:1412.3555, 2014.
[9] A. Nayebi and M. Vitelli, "GRUV: Algorithmic music generation using recurrent neural networks," Course CS224D: Deep Learning for Natural Language Processing (Stanford), 2015.
[10] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, "MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment," in Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.
[11] C. Raffel, "Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching," PhD thesis, 2016.
[12] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere, "The Million Song Dataset," in Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.
[13] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, "Enabling factorized piano music modeling and generation with the MAESTRO dataset," in International Conference on Learning Representations, 2019.
[14] M. S. Cuthbert, "What is music21?," Sep 2021.
