MusicVAE - Explanation
When a painter creates a work of art, she first blends and explores color options on
an artist’s palette before applying them to the canvas. This process is a creative act in
its own right and has a profound effect on the final work.
Musicians and composers have mostly lacked a similar device for exploring and
mixing musical ideas, but we are hoping to change that. Below we
introduce MusicVAE, a machine learning model that lets us create palettes for
blending and exploring musical scores.
The Variational Autoencoder (VAE) has proven to be an effective model for producing
semantically meaningful latent representations of data. However, it has thus far seen
limited application to sequential data, because existing recurrent VAE models have
difficulty modeling sequences with long-term structure.
Solution:
Hierarchical Decoders:
• Sampled latent vectors are passed through multiple levels of decoders rather than
a single flat decoder.
• The scope of the core bottom-level decoder is reduced by propagating state only
within each subsequence/bar, as shown in the diagram below:
• The MusicVAE model takes in MIDI files, which is a widely used format for music.
• In music, 1 whole note represents 4 beats in time, which equals 1 bar, i.e. 1 whole
note = 4 beats = 1 bar.
• An eighth note is played for one eighth of the duration of a whole note, i.e. 1
eighth note = 1/8 of 4 beats = ½ beat.
• Similarly, a 16th note is played for half the duration of an eighth note, i.e. 1
sixteenth note = 1/16 of 4 beats = ¼ beat.
• Now, since 1 bar has 4 beats, it takes 16 such 16th notes to fill 1 bar of
music.
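The note arithmetic above can be checked with a short sketch (assuming 4/4 time, as the bullets do):

```python
# Note durations expressed in beats, following the bullets above.
from fractions import Fraction

BEATS_PER_BAR = 4  # 1 whole note = 4 beats = 1 bar (4/4 time assumed)

def beats(note_fraction):
    """Duration in beats of a note lasting `note_fraction` of a whole note."""
    return Fraction(BEATS_PER_BAR) * note_fraction

whole = beats(Fraction(1, 1))       # 4 beats
eighth = beats(Fraction(1, 8))      # 1/2 beat
sixteenth = beats(Fraction(1, 16))  # 1/4 beat

# 16 sixteenth notes fill one bar:
steps_per_bar = Fraction(BEATS_PER_BAR) / sixteenth
print(whole, eighth, sixteenth, steps_per_bar)  # 4 1/2 1/4 16
```

So a 16-bar melody quantised at 16th-note resolution has 16 * 16 = 256 time steps, which is where the sequence length used later comes from.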
Input to Encoder:
Here,
Input_depth: the dimension of each note that is played. For example, in a monophonic
piano music sequence there can be 90 types of events at each time step (88 key
presses, 1 release, 1 rest).
In the case of MIDI files, the input depth per note is 4. For example, each note
consists of the following parameters:
Here, the pitch and velocity encode the note played on an instrument, say a piano.
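The 90-event monophonic vocabulary mentioned above can be sketched as follows. The index layout here (note-off and rest first, then the 88 key presses starting at MIDI pitch 21) is an illustrative assumption, not Magenta's exact encoding:

```python
# A minimal sketch of a 90-event one-hot vocabulary for monophonic piano:
# 88 key presses, 1 release (note-off) and 1 rest.
NOTE_OFF, REST = 0, 1
FIRST_PITCH = 21  # MIDI pitch of the lowest piano key (A0); keys span 21..108

def event_index(event, pitch=None):
    """Map a monophonic event to an index in [0, 90)."""
    if event == "note_off":
        return NOTE_OFF
    if event == "rest":
        return REST
    return 2 + (pitch - FIRST_PITCH)  # one index per piano key

def one_hot(index, depth=90):
    """One-hot vector of length `depth` with a 1.0 at `index`."""
    vec = [0.0] * depth
    vec[index] = 1.0
    return vec
```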
The encoder RNN is a bidirectional LSTM with 2 layers. Each layer has a state size
of 2048, which means there are 2048 hidden units in the fully connected layers that
form each LSTM cell.
The input tensor of dimensions [512 X 256 X 4] is first converted to time-major form
(converting to time-major is important before feeding an LSTM), giving dimensions
[256 X 512 X 4], and is then fed to the first layer of the encoder RNN. The output
from the 1st layer is passed to the second layer, which gives us the hidden-state
outputs in both directions, (Ht-fwd) and (Ht-bkwd).
Each cell in the encoder receives one note of the sequence per time step. Hence
there are 256 LSTM cells for a 16-bar melody input, so the dimension of the input
to the 1st LSTM cell is [1 X 512 X 4], and so on.
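The batch-major to time-major conversion can be illustrated with plain lists (tiny sizes in place of [512 X 256 X 4]):

```python
# Batch-major [batch, time, depth] -> time-major [time, batch, depth],
# mirroring the [512 X 256 X 4] -> [256 X 512 X 4] transpose described above.
batch, time, depth = 2, 3, 4
x = [[[(b, t, d) for d in range(depth)] for t in range(time)]
     for b in range(batch)]          # shape [2, 3, 4]

time_major = [list(step) for step in zip(*x)]  # shape [3, 2, 4]

# Each time_major[t] holds time step t of every batch element:
assert len(time_major) == time
assert len(time_major[0]) == batch
assert time_major[1][0] == x[0][1]
```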
The dimensions of (HT-fwd) and (HT-bkwd) are [1 X 512 X 4] each. These concatenated
states, which form the encoder output, are then passed through 2 different fully
connected layers (softplus activation is used for the sigma layer). This gives us
the latent distribution parameters mu and sigma of the multivariate Gaussian that
represents the posterior distribution of each sequence. Mu and sigma can be
represented as:
where the W's are weight matrices and the b's are bias vectors. This is a
512-dimensional multivariate Gaussian distribution.
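The mu and sigma computation can be sketched as below. The weights and the tiny dimensions are hypothetical stand-ins (in the model, h is the concatenated encoder state and the output is 512-dimensional); note that only sigma passes through the softplus:

```python
import math

def softplus(x):
    """softplus(x) = log(1 + exp(x)); keeps sigma strictly positive."""
    return math.log1p(math.exp(x))

def latent_params(h, W_mu, b_mu, W_sigma, b_sigma):
    """mu = W_mu h + b_mu;  sigma = softplus(W_sigma h + b_sigma)."""
    mu = [sum(w * hi for w, hi in zip(row, h)) + b
          for row, b in zip(W_mu, b_mu)]
    sigma = [softplus(sum(w * hi for w, hi in zip(row, h)) + b)
             for row, b in zip(W_sigma, b_sigma)]
    return mu, sigma
```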
First we look at how an LSTM cell works in a unidirectional LSTM network; then
we can expand to a bidirectional LSTM network.
Generally an LSTM cell has 4 gates:
1. Forget Gate
2. Input Gate
3. Selection Gate
4. Attention Gate (optional)
Forget Gate: The very first gate in an LSTM cell; it decides which part of the
accumulated previous information held in the cell state to forget. The equation is
given by:
Input Gate: This gate decides how much of the current input should be added to
update the cell state.
Selection Gate: This gate selects which part of the current cell state is to be
output:
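The three gates can be put together in a minimal scalar LSTM step. This is a sketch, not the model's implementation: the weights dictionary is hypothetical, and "selection" is the gate more commonly called the output gate:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step. w maps gate name -> (w_x, w_h, b)."""
    def gate(name, act):
        wx, wh, b = w[name]
        return act(wx * x + wh * h_prev + b)
    f = gate("forget", sigmoid)       # which part of the cell state to forget
    i = gate("input", sigmoid)        # how much of the current input to add
    g = gate("candidate", math.tanh)  # candidate update to the cell state
    o = gate("selection", sigmoid)    # which part of the state to output
    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # hidden state / output
    return h, c
```

A bidirectional LSTM simply runs two such chains, one over the sequence forwards and one over it reversed, and concatenates their hidden states.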
Bidirectional LSTM cell:
Bidirectional LSTM
Encoder Flow Explained:
Lstm_utils.py:
This script has the method cudnn_lstm_layer, which builds an LSTM layer as per the
parameters passed.
layer_sizes: a list of the number of units at each LSTM layer. In our case there are
2 LSTM layers, each with 2048 units. However, if you look at the
BidirectionalLstmEncoder method in lstm_models.py (explained in the next section),
we call cudnn_lstm_layer once for each layer.
Another important method worth mentioning is get_final:
This method returns only the final hidden state at each sequence's true end index.
Since for interpolation we are interested in the latent codes of the endpoints, we
extract only the first and last note states of each input. These are then
concatenated to form the encoder output.
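A pure-Python stand-in for what get_final does (the real method is a TensorFlow gather in lstm_utils.py): sequences in a batch may have different true lengths, so the final state must be picked per sequence.

```python
# Pick the hidden state at each sequence's true final index.
def get_final(sequences, lengths):
    """sequences: [batch][time][depth]; lengths: true length per sequence."""
    return [seq[length - 1] for seq, length in zip(sequences, lengths)]

batch = [
    [[1, 1], [2, 2], [0, 0]],  # length 2 (last step is padding) -> [2, 2]
    [[3, 3], [4, 4], [5, 5]],  # length 3 -> [5, 5]
]
assert get_final(batch, [2, 3]) == [[2, 2], [5, 5]]
```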
Lstm_models.py:
The first method in this class is build, which creates the bidirectional LSTM
network:
cells_fw: a list containing the 2 unidirectional LSTM layers of the forward network.
Next we discuss the encode method, which does the actual encoding:
Encode :
Encode method
1. As per the code, the input tensor is first transposed to time-major format. It
has dimensions [256 X 512 X 4] and is fed to the forward LSTM cells. For the
backward cells, the sequence is first reversed and then fed in.
2. We then take the final hidden-state output from the forward LSTM cells and the
first hidden state from the backward LSTM cells (since their input was reversed
earlier).
3. Finally, both hidden states of dimension [512 X 4] are concatenated to form the
encoder output [1024 X 4], which is fed to the two different fully connected
layers. This gives us the parameters of the normal distribution, as shown in the
code below.
4. Softplus is used as the activation function for the dense layer that gives us
the sigma values of the distribution.
Hierarchical Decoder:
Overview :
The paper specifies the use of an additional layer, called the conductor, to learn
longer sequences from the latent space. This conductor is simply an LSTM layer. The
number of LSTM blocks in the layer is specified by a hyperparameter, depending on
the type of music used. The latent vector obtained from the encoder is passed
through a fully connected dense layer with a tanh activation. Its output is then
packed as two LSTM state tuples, which are used to initialise the first level of
decoder LSTM inputs. Depth-wise decoding is then done to get the final outputs.
Architecture
LSTM Layer:
The length of an LSTM layer specifies the number of LSTM blocks present in that
layer. This is set as a hyperparameter and differs from model to model. A single
LSTM block in one layer has stacked LSTMs, each with 1024 units. Consider the given
figure for level_length = [16, 16]. There are two levels: in the first level there
are 16 LSTM blocks, and in the second level there are 16 LSTM blocks for each block
of level 1. Total blocks in the second layer = number of blocks in layer 1 * number
of blocks in layer 2 per block of layer 1 = 16 * 16 = 256. The number of cells
inside one LSTM block depends on the hyperparameter dec_rnn_size. In the figure
shown below there are two stacked 1024-unit cells inside one block; in other words,
a MultiRNNCell composed sequentially of 2 LSTM block cells of size 1024 each. The
second layer is then connected to the output LSTM layer.
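The block counts above can be reproduced with a one-liner, for any level_length:

```python
# Cumulative block count per level: each block at level i fans out into
# level_length[i+1] blocks at the next level.
from math import prod

level_length = [16, 16]
blocks_per_level = [prod(level_length[:i + 1]) for i in range(len(level_length))]
print(blocks_per_level)  # [16, 256]
```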
Code Snippet :
Conductor :
The first level of unidirectional LSTM layer is what is called as conductor. Only the
first block cell in the conductor is initialized with the embeddings from latent space.
This sets the hidden and cell state. The input is initialised to zeros based on batch
size. The batch size is set as hyper parameter. When batch size is 512 , a tensor of
shape 512 * 1 set all as zeros is given as input to the first block cell. Given input, cell
state and hidden state it provides the next cell state, hidden state and the output. The
state values are passed on to the next LSTM
Code Snippet:
Decoder:
Each of the conductor’s output becomes the input to the next layer decoder which is
again an unidirectional LSTM blocks as discussed above. Based on the layer length
specified (as hyperparameter) number of blocks to each conductor is created. States
of first block cell becomes the initial state of the next block within the conductor only.
There is no communication happening between the LSTM blocks of different
conductors.
Code snippet :
Core decoder:
The outputs from the last level are given as input to the core decoder. The number
of core decoder units depends on the number of units in the level above, as each
LSTM block in the previous level is connected to one core decoder. The type of core
decoder varies from model to model. The decoded outputs are then merged together to
give the final output of interpolated music.
In the last level, the core decoder is called with the output from the previous
level as its initial input (lstm_models.py).
Working:
1. Z is passed to a dense layer with tanh activation. The output shape is the
flattened state size of the first LSTM block in the first level (the first
conductor). If the block has 1024 + 1024 LSTM units as mentioned previously, the
state size is c = 1024, h = 1024 for each of the two stacked LSTMs. Thus the
flattened state size is [1024, 1024, 1024, 1024].
2. The output width of the dense layer is the sum of the flattened state sizes
(4096). Thus the dense layer output has shape 512 * 4096.
3. This is then split into a sequence, giving four splits (len(flattened state
size)), each of shape 512 * 1024.
4. The four splits are then packed into two LSTM state tuples, each with shape
c = 512 * 1024, h = 512 * 1024, where c is the cell state and h is the hidden
state of the LSTM.
5. Based on this initial state, depth-wise decoding is done, passing outputs down
to the layers below and finally to the core decoder.
6. The output from conductor 1 becomes the input to the next-level decoder (the
core decoder).
7. The states from decoder 1 are passed as initial states to decoder 2 to produce
outputs and states, which are passed on to the next, and so on.
8. Once conductor 1 has completed its decoding, the same procedure starts for
conductors 2, 3 and so on, up to the level length (16 in this case). Thus each
conductor decodes recursively, depth first, and the conductors are independent
of the core decoder outputs, forcing the model to utilise the latent vectors
and thereby helping it to learn longer sequences.
9. The outputs from all the decoders are then merged to give the final output. The
figure below shows the decoding process.
10. The reconstruction loss is calculated by comparing the final output with the
actual output, using cross entropy.
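Steps 1-4 above can be sketched as a small helper. Lists stand in for the [512 * 1024] tensors, and pack_initial_states is a hypothetical name, not the Magenta function:

```python
# Split the 4096-wide dense output into four 1024-wide pieces and pack
# them into two (c, h) state tuples, one per stacked LSTM layer.
def pack_initial_states(dense_out, unit_size=1024):
    splits = [dense_out[i:i + unit_size]
              for i in range(0, len(dense_out), unit_size)]
    # pair consecutive splits as (c, h) for each stacked LSTM
    return [(splits[i], splits[i + 1]) for i in range(0, len(splits), 2)]

states = pack_initial_states(list(range(4096)))
assert len(states) == 2           # two LSTM state tuples
assert len(states[0][0]) == 1024  # c of the first stacked LSTM
```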
Given the same hierarchical VAE model, we were interested in how it would perform
for text generation. We create an architecture similar to the above, where instead
of song sequences the inputs are texts.
References:
https://arxiv.org/pdf/1803.05428.pdf