
Report on

Technical Seminar

Music Generation with


RNN

Ankur Narula

1218003

School of Engineering &

IT

Manipal University- Dubai


Campus
November 27, 2017
Contents

Chapter 1
What are Artificial Neural Networks
Analogy to the Brain
Artificial Neurons and How They Work
Electronic Implementation of Artificial Neurons
Artificial Network Operations
WHAT ARE RNNS?
WHAT CAN RNNS DO?
LANGUAGE MODELING AND GENERATING TEXT
Chapter 2
Music Generation with RNN
Feedforward Neural Networks
Recurrent Neural Networks
Training Neural Networks
Can They Make Music?
Input and Output Details
Using the Model
TensorFlow
Chapter 3
The Experiment
Results
A GanneG Maman
Ad 197
Thacrack
Lea Oxlee
References

What are Artificial Neural Networks
Artificial Neural Networks are relatively crude electronic models based on the neural
structure of the brain. The brain basically learns from experience. It is natural proof
that some problems beyond the scope of current computers are indeed solvable by
small, energy-efficient packages. This brain modeling also promises a less
technical way to develop machine solutions. This new approach to computing also
provides a more graceful degradation during system overload than its more traditional
counterparts.

These biologically inspired methods of computing are thought to be the next major
advancement in the computing industry. Even simple animal brains are capable of
functions that are currently impossible for computers. Computers do rote things well,
like keeping ledgers or performing complex math. But computers have trouble
recognizing even simple patterns, much less generalizing those patterns of the past into
actions of the future.

Now, advances in biological research promise an initial understanding of the natural
thinking mechanism. This research shows that brains store information as patterns.
Some of these patterns are very complicated and allow us the ability to recognize
individual faces from many different angles. This process of storing information as
patterns, utilizing those patterns, and then solving problems encompasses a new field in
computing. This field, as mentioned before, does not utilize traditional programming but
involves the creation of massively parallel networks and the training of those networks
to solve specific problems. This field also utilizes words very different from traditional
computing, words like behave, react, self-organize, learn, generalize, and forget.

Analogy to the Brain
The exact workings of the human brain are still a mystery. Yet, some aspects of this
amazing processor are known. In particular, the most basic element of the human brain
is a specific type of cell which, unlike the rest of the body, doesn't appear to
regenerate. Because this type of cell is the only part of the body that isn't slowly
replaced, it is assumed that these cells are what provides us with our abilities to
remember, think, and apply previous experiences to our every action. These cells, all
100 billion of them, are known as neurons. Each of these neurons can connect with up
to 200,000 other neurons, although 1,000 to 10,000 is typical.

The power of the human mind comes from the sheer numbers of these basic
components and the multiple connections between them. It also comes from genetic
programming and learning.

The individual neurons are complicated. They have a myriad of parts, sub-systems, and
control mechanisms. They convey information via a host of electrochemical pathways.
There are over one hundred different classes of neurons, depending on the
classification method used. Together these neurons and their connections form a
process which is not binary, not stable, and not synchronous. In short, it is nothing like
the currently available electronic computers, or even artificial neural networks.

These artificial neural networks try to replicate only the most basic elements of this
complicated, versatile, and powerful organism. They do it in a primitive way. But for the
software engineer who is trying to solve problems, neural computing was never about
replicating human brains. It is about machines and a new way to solve problems.

Artificial Neurons and How They Work
The fundamental processing element of a neural network is a neuron. This building
block of human awareness encompasses a few general capabilities. Basically, a
biological neuron receives inputs from other sources, combines them in some way,
performs a generally nonlinear operation on the result, and then outputs the final
result. Figure 1 shows the relationship of these four parts.

Figure 1

Within humans there are many variations on this basic type of neuron, further
complicating man's attempts at electrically replicating the process of thinking. Yet, all
natural neurons have the same four basic components. These components are known
by their biological names - dendrites, soma, axon, and synapses. Dendrites are hair-like
extensions of the soma which act like input channels. These input channels receive their
input through the synapses of other neurons. The soma then processes these incoming
signals over time. The soma then turns that processed value into an output which is
sent out to other neurons through the axon and the synapses.

Recent experimental data has provided further evidence that biological neurons are
structurally more complex than the simplistic explanation above. They are significantly
more complex than the existing artificial neurons that are built into today's artificial
neural networks. As biology provides a better understanding of neurons, and as
technology advances, network designers can continue to improve their systems by
building upon man's understanding of the biological brain.

But currently, the goal of artificial neural networks is not the grandiose recreation of the
brain. On the contrary, neural network researchers are seeking an understanding of

nature's capabilities for which people can engineer solutions to problems that have not
been solved by traditional computing.

To do this, the basic unit of neural networks, the artificial neurons, simulate the four
basic functions of natural neurons. Figure 2 shows a fundamental representation of
an artificial neuron.

Figure 2

In Figure 2, the various inputs to the network are represented by the mathematical
symbol x(n). Each of these inputs is multiplied by a connection weight; these weights
are represented by w(n). In the simplest case, these products are simply summed, fed
through a transfer function to generate a result, and then output. This process lends
itself to physical implementation on a large scale in a small package. This electronic
implementation is still possible with other network structures which utilize different
summing functions as well as different transfer functions.
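For illustration, a single artificial neuron of this kind could be sketched in Python roughly as follows (a minimal sketch with made-up inputs and weights, not code from any particular package):

import math

def sigmoid(x):
    # Squash any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights):
    # Weighted sum of the inputs: sum of x(n) * w(n).
    total = sum(x * w for x, w in zip(inputs, weights))
    # The transfer function turns that sum into the neuron's output.
    return sigmoid(total)

# Example: three inputs with hypothetical weights.
print(neuron([0.5, -1.0, 2.0], [0.8, 0.2, -0.4]))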

Some applications require "black and white," or binary, answers. These applications
include the recognition of text, the identification of speech, and the image deciphering
of scenes. These applications are required to turn real-world inputs into discrete values.
These potential values are limited to some known set, like the ASCII characters or the
most common 50,000 English words. Because of this limitation of output options, these
applications don't always utilize networks composed of neurons that simply sum up, and
thereby smooth, inputs. These networks may utilize the binary properties of ORing and
ANDing of inputs. These functions, and many others, can be built into the summation
and transfer functions of a network.

Other networks work on problems where the resolutions are not just one of several
known values. These networks need to be capable of an infinite number of responses.
Applications of this type include the "intelligence" behind robotic movements. This
"intelligence" processes inputs and then creates outputs which actually cause some
device to move. That movement can span an infinite number of very precise motions.
These networks do indeed want to smooth their inputs which, due to limitations of
sensors, come in non-continuous bursts, say thirty times a second. To do that, they
might accept these inputs, sum that data, and then produce an output by, for example,
applying a hyperbolic tangent as a transfer function. In this manner, output values from
the network are continuous and satisfy more real world interfaces.

Other applications might simply sum and compare to a threshold, thereby producing
one of two possible outputs, a zero or a one. Other functions scale the outputs to
match the application, such as the values minus one and one. Some functions even
integrate the input data over time, creating time-dependent networks.

Electronic Implementation of Artificial Neurons
In currently available software packages these artificial neurons are called "processing
elements" and have many more capabilities than the simple artificial neuron described
above. Those capabilities will be discussed later in this report. Figure 3 is a more
detailed schematic of this still simplistic artificial neuron.

Figure 3

In Figure 3, inputs enter the processing element from the upper left. The first
step is for each of these inputs to be multiplied by their respective weighting factor
(w(n)). Then these modified inputs are fed into the summing function, which usually
just sums these products. Yet, many different types of operations can be selected.
These operations could produce a number of different values which are then
propagated forward; values such as the average, the largest, the smallest, the ORed
values, the ANDed values, etc. Furthermore, most commercial development products
allow software engineers to create their own summing functions via routines coded in a
higher level language (C is commonly supported). Sometimes the summing function is
further complicated by the addition of an activation function which enables the
summing function to operate in a time sensitive way.

Either way, the output of the summing function is then sent into a transfer function.
This function then turns this number into a real output via some algorithm. It is this

algorithm that takes the input and turns it into a zero or a one, a minus one or a one,
or some other number. The transfer functions that are commonly supported are
sigmoid, sine, hyperbolic tangent, etc. This transfer function also can scale the output
or control its value via thresholds. The result of the transfer function is usually the
direct output of the processing element. An example of how a transfer function works is
shown in Figure 4.

This sigmoid transfer function takes the value from the summation function, called sum
in Figure 4, and turns it into a value between zero and one.

Figure 4

Finally, the processing element is ready to output the result of its transfer function. This
output is then input into other processing elements, or to an outside connection, as
dictated by the structure of the network.

All artificial neural networks are constructed from this basic building block - the
processing element or the artificial neuron. It is the variety and the fundamental differences
in these building blocks that partially make the implementation of neural networks
an "art."

Artificial Network Operations

The other part of the "art" of using neural networks revolve around the myriad of ways
these individual neurons can be clustered together. This clustering occurs in the human
mind in such a way that information can be processed in a dynamic, interactive, and
self-organizing way. Biologically, neural networks are constructed in a three-dimensional
world from microscopic components. These neurons seem capable of nearly
unrestricted interconnections. That is not true of any proposed, or existing, man-made
network. Integrated circuits, using current technology, are two-dimensional devices
with a limited number of layers for interconnection. This physical reality restrains the
types, and scope, of artificial neural networks that can be implemented in silicon.

Currently, neural networks are the simple clustering of the primitive artificial neurons.
This clustering occurs by creating layers which are then connected to one another. How
these layers connect is the other part of the "art" of engineering networks to resolve
real world problems.

Figure 5

Basically, all artificial neural networks have a similar structure or topology, as shown in
Figure 5. In that structure some of the neurons interface with the real world to
receive their inputs. Other neurons provide the real world with the network's outputs. This
output might be the particular character that the network thinks that it has scanned or
the particular image it thinks is being viewed. All the rest of the neurons are hidden
from view.

But a neural network is more than a bunch of neurons. Some early researchers tried to
simply connect neurons in a random manner, without much success. Now, it is known
that even the brains of snails are structured devices. One of the easiest ways to design
a structure is to create layers of elements. It is the grouping of these neurons into
layers, the connections between these layers, and the summation and transfer functions
that comprises a functioning neural network. The general terms used to describe these
characteristics are common to all networks.

Although there are useful networks which contain only one layer, or even one element,
most applications require networks that contain at least the three normal types of layers
- input, hidden, and output. The layer of input neurons receives the data either from
input files or directly from electronic sensors in real-time applications. The output layer
sends information directly to the outside world, to a secondary computer process, or to
other devices such as a mechanical control system. Between these two layers can be

many hidden layers. These internal layers contain many of the neurons in various
interconnected structures. The inputs and outputs of each of these hidden neurons
simply go to other neurons.

In most networks each neuron in a hidden layer receives the signals from all of the
neurons in a layer above it, typically an input layer. After a neuron performs its function
it passes its output to all of the neurons in the layer below it, providing a feedforward
path to the output. (Note: in some diagrams the drawing is reversed, with inputs coming
into the bottom and outputs coming out of the top.)

These lines of communication from one neuron to another are important aspects of
neural networks. They are the glue of the system. They are the connections which
provide a variable strength to an input. There are two types of these connections. One
causes the summing mechanism of the next neuron to add while the other causes it to
subtract. In more human terms one excites while the other inhibits.

Some networks want a neuron to inhibit the other neurons in the same layer. This is
called lateral inhibition. The most common use of this is in the output layer. For
example, in text recognition, if the probability of a character being a "P" is .85 and the
probability of the character being an "F" is .65, the network wants to choose the
highest probability and inhibit all the others. It can do that with lateral inhibition. This
concept is also called competition.

Another type of connection is feedback. This is where the output of one layer routes
back to a previous layer. An example of this is shown in Figure 6.

Figure 6

The way that the neurons are connected to each other has a significant impact on the
operation of the network. In the larger, more professional software development
packages the user is allowed to add, delete, and control these connections at will. By
"tweaking" parameters these connections can be made to either excite or inhibit.

WHAT ARE RNNS?
The idea behind RNNs is to make use of sequential information. In a traditional neural
network we assume that all inputs (and outputs) are independent of each other. But for
many tasks that's a very bad idea. If you want to predict the next word in a sentence,
you had better know which words came before it. RNNs are called recurrent because they
perform the same task for every element of a sequence, with the output being
dependent on the previous computations. Another way to think about RNNs is that they
have a "memory" which captures information about what has been calculated so far. In
theory RNNs can make use of information in arbitrarily long sequences, but in practice
they are limited to looking back only a few steps (more on this later). Here is what a
typical RNN looks like:

Figure 7

A recurrent neural network and the unfolding in time of the computation involved in its
forward computation. Source: Nature

The above diagram shows an RNN being unrolled (or unfolded) into a full network. By
unrolling we simply mean that we write out the network for the complete sequence. For
example, if the sequence we care about is a sentence of 5 words, the network would be
unrolled into a 5-layer neural network, one layer for each word. The formulas that
govern the computation happening in a RNN are as follows:

x_t is the input at time step t. For example, x_1 could be a one-hot vector corresponding
to the second word of a sentence.

s_t is the hidden state at time step t. It's the memory of the network. s_t is calculated
based on the previous hidden state and the input at the current
step: s_t = f(U x_t + W s_{t-1}). The function f usually is a nonlinearity such
as tanh or ReLU. s_{-1}, which is required to calculate the first hidden state, is typically
initialized to all zeroes.

o_t is the output at step t. For example, if we wanted to predict the next word in a
sentence it would be a vector of probabilities across our vocabulary: o_t = softmax(V s_t).
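As an illustration of these equations, a single forward step could be sketched in Python with NumPy roughly as follows (toy sizes and random weights, purely hypothetical):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

vocab_size, hidden_size = 8, 4
U = np.random.randn(hidden_size, vocab_size) * 0.01   # input -> hidden
W = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
V = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden -> output

def rnn_step(x_t, s_prev):
    # s_t = tanh(U x_t + W s_{t-1});  o_t = softmax(V s_t)
    s_t = np.tanh(U @ x_t + W @ s_prev)
    o_t = softmax(V @ s_t)
    return s_t, o_t

s = np.zeros(hidden_size)     # s_{-1} initialized to all zeroes
x = np.eye(vocab_size)[2]     # one-hot vector for "word 2"
s, o = rnn_step(x, s)
print(o)                      # probabilities across the vocabulary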

There are a few things to note here:

You can think of the hidden state s_t as the memory of the network. s_t captures
information about what happened in all the previous time steps. The output at step t
is calculated solely based on the memory at time t. As briefly mentioned above, it's a
bit more complicated in practice because s_t typically can't capture information from too
many time steps ago.

Unlike a traditional deep neural network, which uses different parameters at each layer,
an RNN shares the same parameters (U, V, W above) across all steps. This reflects the
fact that we are performing the same task at each step, just with different inputs. This
greatly reduces the total number of parameters we need to learn.

The above diagram has outputs at each time step, but depending on the task this may
not be necessary. For example, when predicting the sentiment of a sentence we may
only care about the final output, not the sentiment after each word. Similarly, we may
not need inputs at each time step. The main feature of an RNN is its hidden state,
which captures some information about a sequence.

WHAT CAN RNNS DO?
RNNs have shown great success in many NLP tasks. At this point I should mention that
the most commonly used type of RNN is the LSTM, which is much better at capturing
long-term dependencies than a vanilla RNN. But don't worry, LSTMs are essentially
the same thing as the basic RNN described here; they just have a different
way of computing the hidden state. LSTMs are covered in more detail in Chapter 2.
Here are some example applications of RNNs in NLP (by no means an exhaustive list).

LANGUAGE MODELING AND GENERATING TEXT

Given a sequence of words we want to predict the probability of each word given the
previous words. Language Models allow us to measure how likely a sentence is, which
is an important input for Machine Translation (since high-probability sentences are
typically correct). A side-effect of being able to predict the next word is that we get
a generative model, which allows us to generate new text by sampling from the output
probabilities. And depending on what our training data is we can generate all kinds of
stuff. In Language Modeling our input is typically a sequence of words (encoded as
one-hot vectors for example), and our output is the sequence of predicted words. When
training the network we set o_t = x_{t+1} since we want the output at step t to be the
actual next word.

Music Generation with RNN

Feedforward Neural Networks:


A single node in a simple neural network takes some number of inputs, and then
performs a weighted sum of those inputs, multiplying them each by some weight before
adding them all together. Then, some constant (called bias) is added, and the overall
sum is then squashed into a range (usually -1 to 1 or 0 to 1) using a nonlinear
activation function, such as a sigmoid function.

Figure 8

We can visualize this node by drawing its inputs and single output as arrows, and
denoting the weighted sum and activation by a circle:
We can then take multiple nodes and feed them all the same inputs, but allow them to
have different weights and biases. This is known as a layer.

(Note: Because each node in the layer performs a weighted sum, but they all share the
same inputs, we can calculate the outputs using a matrix multiplication, followed by
element-wise activation! This is one reason why neural networks can be trained so
effectively.)
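As a quick sketch of that matrix view (hypothetical sizes, not code from the report's model), a whole layer can be computed as one matrix product followed by an element-wise activation:

import numpy as np

def layer_forward(x, W, b):
    # One row of W per node in the layer; each row holds that node's weights.
    # The matrix product computes every node's weighted sum at once.
    return np.tanh(W @ x + b)

x = np.array([0.2, -0.5, 0.9])    # shared inputs to the layer
W = np.random.randn(4, 3) * 0.1   # 4 nodes, 3 inputs each (made-up sizes)
b = np.zeros(4)                   # one bias per node
print(layer_forward(x, W, b))     # 4 output values, one per node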
Then we can connect multiple layers together:
and voila, we have a neural network. (Quick note on terminology: The set of inputs is
called the input layer, the last layer of nodes is called the output layer, and all
intermediate node layers are called hidden layers. Also, in case it isn't clear, all the
arrows from each node carry the same value, since each node has a single output
value.)
For simplicity, we can visualize the layers as single objects, since that's how they are
implemented most of the time:
From this point on, when you see a single circle, that represents an entire layer of the
network, and the arrows represent vectors of values.

Recurrent Neural Networks
Notice that in the basic feedforward network, there is a single direction in which the
information flows: from input to output. But in a recurrent neural network, this direction
constraint does not exist. There are a lot of possible networks that can be classified as
recurrent, but we will focus on one of the simplest and most practical.
Basically, what we can do is take the output of each hidden layer, and feed it back to
itself as an additional input. Each node of the hidden layer receives both the list of
inputs from the previous layer and the list of outputs of the current layer in the last
time step. (So if the input layer has 5 values, and the hidden layer has 3 nodes, each
hidden node receives as input a total of 5+3=8 values.)
We can show this more clearly by unwrapping the network along the time axis:
In this representation, each horizontal line of layers is the network running at a single
time step. Each hidden layer receives both input from the previous layer and input from
itself one time step in the past.
The power of this is that it enables the network to have a simple version of memory,
with very minimal overhead. This opens up the possibility of variable-length input and
output: we can feed in inputs one-at-a-time, and let the network combine them using
the state passed from each time step.
One problem with this is that the memory is very short-term. Any value that is output in
one time step becomes input in the next, but unless that same value is output again, it
is lost at the next tick. To solve this, we can use a Long Short-Term Memory (LSTM)
node instead of a normal node. This introduces a memory cell value that is passed
down for multiple time steps, and which can be added to or subtracted from at each
tick.
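A rough Python sketch of the idea behind an LSTM node, a cell value that is carried across ticks and only added to or scaled at each step, might look like this (the gate structure shown is the standard LSTM formulation; all sizes and weights here are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    # Concatenate the current input with the previous output, as in the text.
    z = np.concatenate([x, h_prev])
    f = sigmoid(params["Wf"] @ z + params["bf"])   # forget gate: how much old cell value to keep
    i = sigmoid(params["Wi"] @ z + params["bi"])   # input gate: how much new information to add
    o = sigmoid(params["Wo"] @ z + params["bo"])   # output gate
    c_tilde = np.tanh(params["Wc"] @ z + params["bc"])
    c = f * c_prev + i * c_tilde                   # memory cell passed to the next tick
    h = o * np.tanh(c)                             # activation output, sent in parallel with the cell
    return h, c

n_in, n_hid = 5, 3
params = {k: np.random.randn(n_hid, n_in + n_hid) * 0.1 for k in ("Wf", "Wi", "Wo", "Wc")}
params.update({k: np.zeros(n_hid) for k in ("bf", "bi", "bo", "bc")})
h, c = lstm_step(np.ones(n_in), np.zeros(n_hid), np.zeros(n_hid), params)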
We can think of the memory cell data being sent in parallel with the activation output.
Above, I have shown it using a blue arrow, and in my following diagrams I will omit it
for simplicity.

Training Neural Networks
All those pretty pictures are nice, but how do we actually get the networks to output
what we want? Well, the neural network behavior is determined by the set of weights
and biases that each node has, so we need to adjust those to some correct value.
First, we need to define how good or bad any given output is, given the input. This
value is called the cost. For example, if we were trying to use a neural network to
model a mathematical function, the cost might be the difference between the function
answer and the network output, squared. Or if we were trying to model the likelihood
of letters appearing in a particular order, the cost might be one minus the probability of
predicting the correct letter each time.
Once we have this cost value, we can use backpropagation. This boils down to
calculating the gradient of the cost with respect to the weights (i.e., the derivative of
cost with respect to each weight for each node in each layer), and then using some
optimization method to adjust the weights to reduce the cost. The bad news is that
these optimization methods are often very complex. But the good news is that many of
them are already implemented in libraries, so we can just feed our gradient to the right
function and let it adjust our weights correctly. (If you're curious and don't mind some
math, some optimization methods include stochastic gradient descent, Hessian-free
optimization, AdaGrad, and AdaDelta.)
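As a toy illustration of this cost-gradient-update loop, here is plain gradient descent on a single weight (made-up numbers; real training uses the fancier optimizers listed above):

# Fit y = w * x to one data point by gradient descent on a squared-error cost.
x, y_target = 2.0, 6.0
w = 0.0                      # initial weight
learning_rate = 0.05

for step in range(100):
    y = w * x                        # network output
    cost = (y - y_target) ** 2       # squared-error cost
    grad = 2 * (y - y_target) * x    # d(cost)/d(w) by the chain rule
    w -= learning_rate * grad        # adjust the weight to reduce the cost

print(w)  # approaches 3.0, the weight that makes the cost zero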

Figure 9

Can They Make Music?
All right, enough background. From here on, I'll be talking mostly about my own
thought process and the design of my network architecture.
When I was starting to design my network's architecture, I naturally looked up how
other people have approached this problem. A few existing methods are collected
below:

Bob Sturm uses a character-based model with an LSTM to generate a textual
representation of a song (in abc notation). The network seems to only be able to
play one note at a time, but achieves interesting temporal patterns.
Doug Eck, in A First Look at Music Composition using LSTM Recurrent Neural
Networks, uses LSTMs to do blues improvisation. The sequences chosen all have
the same set of chords, and the network has a single output node for each note,
outputting the probability of that note being played at each time step. Results
are promising in that it learns temporal structure, but is pretty restricted as to
what it can output. Also, there is no distinction between playing a note and
holding it, so the network cannot rearticulate held notes.

Nicolas Boulanger-Lewandowski, in Modeling Temporal Dependencies in High-Dimensional
Sequences: Application to Polyphonic Music Generation and Transcription,
uses a network with two parts. There is an RNN to handle time dependency, which
produces a set of outputs that are then used as the parameters for a restricted
Boltzmann machine, which in turn models the conditional distribution of which notes
should be played with which other notes. This model actually produces quite nice-
sounding music, but does not seem to have a real sense of time, and only plays a
couple of chords.

For my network design, there were a few properties I wanted it to have:

Have some understanding of time signature: I wanted to give the neural network
its current time in reference to a time signature, since most music is composed
with a fixed time signature.
Be time-invariant: I wanted the network to be able to compose indefinitely, so it
needed to be identical for each time step.
Be (mostly) note-invariant: Music can be freely transposed up and down, and it
stays fundamentally the same. Thus, I wanted the structure of the neural
network to be almost identical for each note.
Allow multiple notes to be played simultaneously, and allow selection of coherent
chords.
Allow the same note to be repeated: playing C twice should be different than
holding a single C for two beats.

Most existing RNN-based music composition approaches are invariant in time, since
each time step is a single iteration of the network. But they are in general not invariant
in note. There is usually some specific output node that represents each note. So
transposing everything up by, say, one whole step, would produce a completely
different output. For most sequences, this is something you would want: "hello" is
completely different from "ifmmp", which is just "hello" transposed up one letter. But for music,
you want to emphasize the relative relationships over the absolute positions: a C major
chord sounds more like a D major chord than like a C minor chord, even though the C
minor chord is closer with regard to absolute note positions.
There is one kind of neural network that is widely in use today that has this invariant
property along multiple directions: convolutional neural networks for image recognition.
These work by basically learning a convolution kernel and then applying that same
convolution kernel across every pixel of the input image.

Figure 10

How convolution works. Each pixel is replaced by a weighted sum of the surrounding
pixels. The neural network has to learn the weights. Picture from developer.apple.com.
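The key property, that the same small set of learned weights is reused at every position, can be sketched in one dimension (a toy example, not image-processing code):

import numpy as np

def convolve1d(signal, kernel):
    # Slide the same kernel across every position of the input.
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([0.25, 0.5, 0.25])   # hypothetical learned weights
print(convolve1d(signal, kernel))      # weighted sum of each neighborhood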

Figure 11

Hypothetically, what would happen if we replaced the convolution kernel with
something else? Say, a recurrent neural network? Then each pixel would have its own
neural network, which would take input from an area around the pixel. Each neural
network would in turn have its own memory cells and recurrent connections across
time.
Now replace pixels with notes, and we have an idea for what we can do. If we make a
stack of identical recurrent neural networks, one for each output note, and give each
one a local neighborhood (for example, one octave above and below) around the note
as its input, then we have a system that is invariant in both time and notes: the
network can work with relative inputs in both directions.
Note: I have rotated the time axis here! Notice that time steps now come out of the
page, as do the recurrent connections. You can think of each of the flat slices as a
copy of the basic RNN picture from above. Also, I am showing each layer getting input
from one note above and below. This is a simplification: the real network gets input
from 12 notes (the number of half steps in an octave) in each direction.
However, there is still a problem with this network. The recurrent connections allow
patterns in time, but we have no mechanism to attain nice chords: each note's output is
completely independent of every other note's output. Here we can draw inspiration

from the RNN-RBM combination above: let the first part of our network deal with time,
and let the second part create the nice chords. But an RBM gives a single conditional
distribution of a bunch of outputs, which is incompatible with using one network per
note.
The solution I decided to go with is something I am calling a biaxial RNN. The idea is
that we have two axes (and one pseudo-axis): there is the time axis and the note axis
(and the direction-of-computation pseudo-axis). Each recurrent layer transforms inputs
to outputs, and also sends recurrent connections along one of these axes. But there is
no reason why they all have to send connections along the same axis!

Figure 12

Notice that the first two layers have connections across time steps, but are independent
across notes. The last two layers, on the other hand, have connections between notes,
but are independent between time steps. Together, this allows us to have patterns both
in time and in note-space without sacrificing invariance!

It's a bit easier to see if we collapse one of the dimensions:

Figure 13

Now the time connections are shown as loops. It's important to remember that the
loops are always delayed by one time step: the output at time t is part of the input at
time t+1.

Input and Output Details


My network is based on this architectural idea, but of course the actual implementation
is a bit more complex. First, we have the input to the first time-axis layer at each time
step: (the number in brackets is the number of elements in the input vector that
correspond to each part)

Position [1]: The MIDI note value of the current note. Used to get a vague idea
of how high or low a given note is, to allow for differences (like the concept that
lower notes are typically chords, upper notes are typically melody).
Pitchclass [12]: Will be 1 at the position of the current note, starting at A for 0
and increasing by 1 per half-step, and 0 for all the others. Used to allow
selection of more common chords (i.e. it's more common to have a C major
chord than an E-flat major chord)
Previous Vicinity [50]: Gives context for surrounding notes in the last
timestep, one octave in each direction. The value at index 2(i+12) is 1 if the note
at offset i from current note was played last timestep, and 0 if it was not. The

value at 2(i+12) + 1 is 1 if that note was articulated last timestep, and 0 if it was
not. (So if you play a note and hold it, first timestep has 1 in both, second has it
only in first. If you repeat a note, second will have 1 both times.)
Previous Context [12]: Value at index i will be the number of times any note x
where (x-i-pitchclass) mod 12 was played last timestep. Thus if current note is C
and there were 2 E's last timestep, the value at index 4 (since E is 4 half steps
above C) would be 2.
Beat [4]: Essentially a binary representation of position within the measure,
assuming 4/4 time. With each row being one of the beat inputs, and each
column being a time step, it basically just repeats the following pattern:

0101010101010101
0011001100110011
0000111100001111
0000000011111111

However, it is scaled to [-1, 1] instead of [0, 1].
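A small sketch of how such a beat vector could be computed for a given time step, assuming 16 time steps per measure as in the pattern above (an illustration, not the report's implementation):

def beat_vector(t, steps_per_measure=16):
    # Binary representation of the position within the measure,
    # scaled from {0, 1} to {-1, 1}.
    pos = t % steps_per_measure
    bits = [(pos >> k) & 1 for k in range(4)]   # the 4 beat inputs
    return [2 * b - 1 for b in bits]

for t in range(4):
    print(t, beat_vector(t))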

Then there is the first hidden LSTM stack, which consists of LSTMs that have recurrent
connections along the time-axis. The last time-axis layer outputs some note state that
represents any time patterns. The second LSTM stack, which is recurrent along the note
axis, then scans up from low notes to high notes. At each note-step (equivalent of time-
steps) it gets as input

the corresponding note-state vector from the previous LSTM stack


a value (0 or 1) for whether the previous (half-step lower) note was chosen to
be played (based on previous note-step, starts 0)
a value (0 or 1) for whether the previous (half-step lower) note was chosen to
be articulated (based on previous note-step, starts 0)

After the last LSTM, there is a simple, non-recurrent output layer that outputs 2 values:

Play Probability, which is the probability that this note should be chosen to be
played
Articulate Probability, which is the probability the note is articulated, given
that it is played. (This is only used to determine rearticulation for held notes.)
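Conceptually, a note's state at each time step could then be sampled from these two probabilities roughly like this (a hypothetical sketch, not the actual model code):

import random

def sample_note(play_prob, artic_prob):
    # Decide whether the note sounds at all, then whether it is (re)articulated.
    played = random.random() < play_prob
    articulated = played and (random.random() < artic_prob)
    return played, articulated

print(sample_note(0.9, 0.3))   # e.g. (True, False): the note is held, not restruck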

The model is implemented in Theano, a Python library that makes it easy to generate
fast neural networks by compiling the network to GPU-optimized code and by
automatically calculating gradients for you. The error messages can be a bit confusing
(since exceptions tend to get thrown while it is running its generated code, not yours),
but it's well worth the hassle.

Using the Model

During training, we can feed in a randomly-selected batch of short music segments. We
then take all of the output probabilities, and calculate the cross-entropy, which is a
fancy way of saying we find the likelihood of generating the correct output given the
output probabilities. After some manipulation using logarithms to make the probabilities
not ridiculously tiny, followed by negating it so that it becomes a minimization problem,
we plug that in as the cost into the AdaDelta optimizer and let it optimize our weights.
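In other words, the cost is the negative log of the probabilities the model assigned to the correct outputs, along the lines of (toy numbers, not the report's Theano code):

import math

def cross_entropy(probabilities):
    # probabilities: the model's probability for the *correct* choice at each step.
    # Summing logs instead of multiplying probabilities keeps the number manageable;
    # negating turns "maximize likelihood" into "minimize cost".
    return -sum(math.log(p) for p in probabilities)

print(cross_entropy([0.9, 0.6, 0.75]))   # lower is better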
We can make training faster by taking advantage of the fact that we already know
exactly which output we will choose at each time step. Basically, we can first batch all
of the notes together and train the time-axis layers, and then we can reorder the output
to batch all of the times together and train all the note-axis layers. This allows us to
more effectively utilize the GPU, which is good at multiplying huge matrices.
To prevent our model from being overfit (which would mean learning specific parts of
specific pieces instead of overall patterns and features), we can use something
called dropout. Applying dropout essentially means randomly removing half of the
hidden nodes from each layer during each training step. This prevents the nodes from
gravitating toward fragile dependencies on each other and instead promotes
specialization. (We can implement this by multiplying a mask with the outputs of each
layer. Nodes are "removed" by zeroing their output in the given time step.)
During composition, we unfortunately cannot batch everything as effectively. At each
time step, we have to first run the time-axis layers by one tick, and then run an entire
recurrent sequence of the note-axis layers to determine what input to give to the time-
axis layers at the next tick. This makes composition slower. In addition, we have to add
a correction factor to account for the dropout during training. Practically, this means
multiplying the output of each node by 0.5. This prevents the network from becoming
overexcited due to the higher number of active nodes.
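A sketch of the dropout mask and the composition-time correction described above (hypothetical shapes; the real implementation would be a Theano expression):

import numpy as np

def dropout(layer_output, keep_prob=0.5, training=True):
    if training:
        # Randomly zero out roughly half of the nodes during training.
        mask = (np.random.rand(*layer_output.shape) < keep_prob).astype(float)
        return layer_output * mask
    # At composition time, scale instead, so expected activity matches training.
    return layer_output * keep_prob

h = np.random.rand(6)
print(dropout(h))                  # training: roughly half the nodes zeroed
print(dropout(h, training=False))  # composing: every output multiplied by 0.5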
I trained the model using a g2.2xlarge Amazon Web Services instance. I was able
to save money by using "spot instances", which are cheaper, ephemeral instances
that can be shut down by Amazon and which are priced based on supply and demand.
Prices fluctuated between $0.10 and $0.15 an hour for me, as opposed to $0.70 for a
dedicated on-demand instance. My model used two hidden time-axis layers, each with
300 nodes, and two note-axis layers, with 100 and 50 nodes, respectively. I trained it
using a dump of all of the midi files on the Classical Piano Midi Page, in batches of ten
randomly-chosen 8-measure chunks at a time.

TensorFlow
TensorFlow is an open source software library for machine learning in various kinds of
perceptual and language understanding tasks. It is currently used for both research and
production by 50 different teams in dozens of commercial Google products, such as
speech recognition, Gmail, Google Photos, and search, many of which had previously
used its predecessor DistBelief. TensorFlow was originally developed by the Google
Brain team for Google's research and production purposes and later released under the
Apache 2.0 open source license on November 9, 2015.

TensorFlow is Google Brain's second generation machine learning system, with a
reference implementation released as open source software on November 9, 2015.
While the reference implementation runs on single devices, TensorFlow can run on
multiple CPUs and GPUs [9] (with optional CUDA extensions for general-purpose
computing on graphics processing units). It runs on 64-bit Linux or Mac OS X desktop
or server systems, as well as on mobile computing platforms, including Android and
Apple's iOS. TensorFlow computations are expressed as stateful dataflow graphs. Many
teams at Google have migrated from DistBelief to TensorFlow for research and
production uses. This library of algorithms originated from Google's need to instruct
computer systems, known as neural networks, to learn and reason similarly to how
humans do, so that new applications can be derived which are able to assume roles and
functions previously reserved only for capable humans; the name TensorFlow itself
derives from the operations which such neural networks perform on multidimensional
data arrays. These multidimensional arrays are referred to as "tensors" but this concept
is not identical to the mathematical concept of tensors. The purpose is to train neural
networks to detect and decipher patterns and correlations.

In June 2016, Google's Jeff Dean said there were 1500 repositories on GitHub which
mentioned TensorFlow, of which only 5 were from Google.

Tensor processing unit (TPU)

In May 2016 Google announced its tensor processing unit (TPU), a custom ASIC built
specifically for machine learning and tailored for TensorFlow. The TPU is a
programmable AI accelerator designed to provide high throughput of low-precision
arithmetic (e.g., 8-bit), and oriented toward using or running models rather than
training them. Google announced they had been running TPUs inside their data centers
for more than a year, and have found them to deliver an order of magnitude better-
optimized performance per watt for machine learning.

The Experiment
The first thing I did was look for a good representation for the music to be composed. I
eventually found the abc notation, which was particularly good for my purposes
because it includes concepts of chords and melodies, which makes it easier to
procedurally create more meaningful sounds.

Here's a simple example I just wrote:

X: 1

T:"Hello world in abc notation"

M:4/4

K:C

"Am" C, D, E, F,|"F" G, A, B, C|"C"D E F G|"G" A B e c

I won't go over the details of the format, but if you know some musical notation it will
look familiar. If you're curious about how this sounds, you can listen to this in Ubuntu
by saving the above snippet into a hello.abc file and running:

$ sudo apt-get install abcmidi timidity

$ abc2midi hello.abc -o hello.mid && timidity hello.mid

Given that abc is a text format, I decided to give Karpathy's char-rnn a spin. I actually
ended up using Sherjil Ozair's TensorFlow version of char-rnn, because TensorFlow
(Google's new Machine Learning framework) is way easier to play with than Torch. You
should know that hardcore Machine Learning researchers don't often use TensorFlow
because it's 3x slower.

We need data to train the RNN on. After googling a bit I came across
the Nottingham Music Database, which has over 1000 folk tunes in abc notation. This is
a pretty small dataset compared to the >1M data points typically used to
train neural nets for real.

I concatenated all the abc files together and started training the network on an AWS
g2.2xlarge instance:

$ python train.py --data_dir data/music/

Figure 14

Output of the RNN on the folk music dataset, without much training.

After only 500 batches of training, the network produces mostly noise, but you could
begin to guess a trace of the abc notation:

Figure 15

After 500 batches of training the RNN produced invalid abc notation.

Wait a couple more minutes, and with 1000 trained batches the outcome changes
completely:

Figure 16

While still not strictly valid abc format, this looks much more like the correct notation.

As you can see, the network even starts generating titles for its creations (found in
the T: field).

After 7200 batches, the network produces mostly fully correct abc notation pieces.
Here's an example:

Figure 17

A valid abc notation piece produced by the trained RNN after 7200 batches.

OK, so the RNN can learn to produce valid abc notation files, but that doesn't say
anything about their actual musical quality. I'll let you judge by yourself (I think they're
quite shitty but at least non-random. I don't think I could have programmed an
algorithm to generate better non-trivial music).

Results
I'll list 4 more non-hand-picked pieces I just generated from the fully-trained network,
along with their abc version and music sheet. The names are hilarious.

A GanneG Maman

X: 1

T:A GanneG Maman

P:A/1

M:4/4

L:1/4

K:D

z/2|:"D"FD/2F/2 FA|"G"GB/2d/2 BG|"D"F/2D/2A/2B/2 A3/2F/2|"Em"GE


GG|"A7"E3/2E/2 "D"ED

:::

"D"a/2f/2d/2f/2 "Bm"fd/2f/2|"A"ea fe|"D"d2 A:|

Ad 197

X: 1

T:Ad 197, via PR

M:4/4

L:1/4

K:D

G|:"A"B/2c/2A/2G/2 A/2G/2E/2|"D"FF/2E/2 Fd|"D"FD "A"FF/2E/2|\

"A"EA/2G/2 c3/2B/2|\

"A"c/2c/2A A/2G/2E/2G/2|"A"Ac2-||

"A"\

P:3

f'/2f/2e/2 f/2a/2g/2f/2|ef'3/2b/2 a/2g/2a/2c/2|"A"ae/2a/2 a/2g/2f/2a/2|"D"d#"fe
"E"c2::

"A"ef/2e/2 "Bm"d3/2f/2|"Em"g/2f/2e/2f/2 "A"e/2d/2c/2B/2|"A"AB/2c/2


"D"d/2e/2f/2e/2|\

"Em"de "A7"d/2c/2B/2A/2|

"D"BA/2G/2 A/2F/2D/2F/2|"D"D/2F/2A/2F/2 DA|"D"df/2e/2


a/2g/2f/2e/2|"G"d/2c/2B/2G/2 "E7"BG/2A/2||

"Bm"df/2d/2 "E7"e/2d/2c/2B/2|"A"AA/2B/2 eA

Thacrack

X: 1

T:Thacrack, via EF

M:6/8

K:D

fg|fed e2d|"D"AFD F2A|"D"fgf "A"c2A|"D"ABf "C"ege|"D"d3 d2:|

P:B

f/2e/2|"D"ddf f2f|"Bm"edc ABc|BdB BdB|"A"A3 "D"Ace|"A"agf fde|"E7"gfe ddB|

"A7"ABA B2A|"D"FGF "A7"EDE|"D"d,2D "A7"AGF|

"G"GF G2B|GBd gfg|"D"FdA A2d|"D"fed "A7"edc|"D"d2A AFG|

"G"GFG "E7/f+"G2B|"A"A2c c2e|"D"dfa d2f|

"Bm"fag B=cd|"Em"gfe "A7"A^GA|"D"d3 -d2:|

Lea Oxlee

X: :2

T:Lea Oxlee

% Nottingham Music Database

S:Kevin Briggs, via EF

M:4/4

L:1/4

K:D

P:A

A/2G/2|"D"F/2A/2F/2A/2 FA|"G"Bd/2c/2 B/2d/2B/2d/2|\

"A"c3c/2B/2|

"D"Af f/2e/2f|"Em"e/2d/2c/2B/2 "A7"c/2=c/2B/2A/2|"D"G/2A/2G/2F/2


"Em"E/2F/2D/2E/2:|

"F#"df fe/2f/2|"E7"gf z"A7"A/2B/2d/2^c/2|\

"A"ce "E7"e/2d/2c/2B/2|"A"AA A:|

P:B

e/2f/2|"A"ea ee/2f/2|"D"gf "A"ec|"D"dd/2f/2 "A"ec|"D#m"d3/2e/2 dA|"A7"GE


E2|"D"FD D:|

References
http://www.psych.utoronto.ca/users/reingold/courses/ai/cache/neural2.html

https://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html#What is a Neural Network

http://www.cs.bham.ac.uk/~jxb/INC/l12.pdf

http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/

http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/

https://maraoz.com/2016/02/02/abc-rnn/

https://www.tensorflow.org/versions/r0.11/get_started/index.html

http://iceeot.org/papers/OR0240.pdf

https://en.wikipedia.org/wiki/TensorFlow

http://www.netinstructions.com/how-to-install-and-run-tensorflow-on-a-windows-pc/

