
MUSIC VAE — Understanding of Google's work for interpolating long music sequences

May 8, 2019

Motivation behind Music VAE:

When a painter creates a work of art, she first blends and explores color options on
an artist’s palette before applying them to the canvas. This process is a creative act in
its own right and has a profound effect on the final work.

Musicians and composers have mostly lacked a similar device for exploring and
mixing musical ideas, but we are hoping to change that. Below we
introduce MusicVAE, a machine learning model that lets us create palettes for
blending and exploring musical scores.

The Variational Autoencoder (VAE) has proven to be an effective model for producing
semantically meaningful latent representations of data. However, it has so far seen
limited application to sequential data, and existing recurrent VAE models have been
found to have difficulty modeling sequences with long-term structure.

1. Traditional LSTM decoders are unable to model long sequences such as music due to the posterior collapse problem.

2. Posterior collapse: the vanishing influence of the latent state as the output sequence is generated.

Solution:

Hierarchical Decoders:

• Sampled latent vectors are passed through multiple levels of decoders rather than a single flat decoder.

• The scope of the core bottom-level decoder is reduced by propagating state only within each subsequence/bar, as shown in the diagram below:

Dataset and Data Preprocessing:

• The MusicVAE model takes MIDI files as input, which is a widely used format for music.

• Each musical sample is quantized to 16 notes per bar (sixteenth notes). To present some background on this:

• In music, 1 whole note spans 4 beats in time, which equals 1 bar, i.e. 1 whole note = 4 beats = 1 bar.

• An eighth note is played for one eighth of the duration of a whole note, i.e. 1 eighth note = 1/8 of 4 beats = 1/2 beat.

• Similarly, a sixteenth note is played for half the duration of an eighth note, i.e. 1 sixteenth note = 1/16 of 4 beats = 1/4 beat.

• Since 1 bar has 4 beats, it takes 16 sixteenth notes to fill 1 bar of music.

• For a 16-bar melody, MusicVAE uses a bidirectional LSTM encoder and hierarchical unidirectional LSTM decoders.

Encoder:
Encoder:

Input to Encoder:

The 16-bar music samples fed to the encoder can be represented as a 3-dimensional tensor:

[ batch_size, max_sequence_length, input_depth ]

Here,

batch_size: the number of samples per training batch, which is 512.

max_sequence_length: the maximum possible sequence length, which is 16 × 16 = 256.

input_depth: the dimension of each note event that is played. For example, in a monophonic piano sequence there can be 90 types of events at each time step (88 key presses, 1 release, 1 rest).

So a 16-bar melody can be any one of 90²⁵⁶ possible sequences.

In the case of MIDI files, the input depth per note is 4. For example, each note consists of the following parameters:

notes {pitch: 69, velocity: 80, start_time: 1.25, end_time: 1.5}

notes {pitch: 66, velocity: 80, start_time: 1.5, end_time: 1.75}

Here, the pitch and velocity encode the note played on an instrument, say a piano.

This [512 X 256 X 4] input is fed into a Bidirectional LSTM encoder.


Encoder Configurations and Details:

The encoder RNN is a bidirectional LSTM with 2 layers. Each layer has a state size of 2048, which means there are 2048 hidden units in the fully connected layers that form each LSTM cell.

The input tensor of dimensions [512 × 256 × 4] is first converted to time-major form (conversion to time-major is important before feeding the LSTM), giving dimensions [256 × 512 × 4], and is then fed to the first layer of the encoder RNN. The output from the first layer is passed to the second layer, which gives us the hidden state outputs in both directions, (Ht-fwd) and (Ht-bkwd).

Each step of the encoder receives one note of the sequence in time order. Hence the LSTM is unrolled over 256 steps for a 16-bar melody input, so the dimension of the input at the first step is [1 × 512 × 4], and so on.

Below is the code snippet:
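For illustration, here is a minimal Keras-style sketch of this encoder; it is not the exact Magenta code (which builds cuDNN LSTM layers via lstm_utils.cudnn_lstm_layer and works in time-major form), and the shapes are the ones given above.

import tensorflow as tf

# A 2-layer bidirectional LSTM encoder with a state size of 2048 per direction.
state_size = 2048

encoder = tf.keras.Sequential([
    # First bidirectional layer returns the full sequence of hidden states.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(state_size, return_sequences=True)),
    # Second bidirectional layer returns only the final state of each direction,
    # concatenated to shape [batch, 2 * state_size].
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(state_size)),
])

# The input has shape [batch_size, max_sequence_length, input_depth],
# e.g. [512, 256, 4]; the encoder output then has shape [512, 4096].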


Now we take only the final state vectors (HT-fwd) and (HT-bkwd) from the forward and backward direction cells and concatenate them. The reason for concatenating the two hidden states is that the latent feature should describe the sequence in both directions; if only one of them were used, information from only one side would be captured, which is not what we want for music interpolation.

The dimensions of (HT-fwd) and (HT-bkwd) are [512 × 2048] each (batch size × state size). These concatenated states, which form the encoder output, are then passed through 2 different fully connected layers (the layer producing sigma uses a softplus activation). This gives us the latent distribution parameters mu and sigma of a multivariate Gaussian distribution, which represents the posterior distribution of each sequence. The mu and sigma can be represented as:
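Following the paper, with h_T the concatenated final hidden state:

\mu = W_{h\mu} h_T + b_\mu

\sigma = \log\left( \exp(W_{h\sigma} h_T + b_\sigma) + 1 \right)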

where the W's are weight matrices and the b's are bias vectors. This defines a 512-dimensional multivariate Gaussian distribution.

Unidirectional and Bidirectional LSTM Explained:

Unidirectional LSTM cell.

First we understand how an LSTM cell looks in a unidirectional LSTM network. Then
we can expand to a Bidirectional LSTM network.
Generally an LSTM cell has 4 gates:

1. Forget gate

2. Input gate

3. Selection (output) gate

4. Attention gate (optional)

Forget Gate: The very first gate in an LSTM cell, it decides which part of the accumulated previous information held in the cell state to forget. The equation is given by:
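Using the standard LSTM formulation, with \sigma the logistic sigmoid and [h_{t-1}, x_t] the concatenation of the previous hidden state and the current input:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)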

Input Gate: This gate decides how much of the current input should be added to update the cell state.

The cell state after the input and forget gates is given by:
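In the same standard notation, with i_t the input gate and \tilde{C}_t the candidate values:

i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)

\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t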

Selection Gate: This gate selects which part of the current cell state is output:
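In the same notation, with o_t the output (selection) gate:

o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)

h_t = o_t \odot \tanh(C_t)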
Bidirectional LSTM cell:

Bidirectional LSTMs are simply two unidirectional LSTMs that process the input sequence in opposite directions.

Bidirectional LSTM
Encoder Flow Explained:

Encoder Flow Diagram


Encoder Code Explanation:
The user can run the command ‘music_vae_generate’ as described on the Magenta MusicVAE page
(https://github.com/tensorflow/magenta/tree/master/magenta/models/music_vae):

music_vae_generate.py is a Python script that defines various parameters such as:

1. mode: sample or interpolate.

2. num_outputs: the number of samples, or the number of interpolation steps including the endpoints.

3. checkpoint_file: the location of the pre-trained model.

4. output_dir: the location where the actual output is saved.

5. input_midi_1, input_midi_2: the input files to interpolate between (only for mode=interpolate).
The Magenta team has built a ‘music’ library that converts the input MIDI files supplied by the user into NoteSequences, as in the code snippet below, which can be found in the run method of configs.py:

Convert MIDI file to Note Sequence
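The sketch below shows roughly what this step does, using the note_seq library (the package into which the Magenta music utilities were later split); the file paths here are hypothetical.

import note_seq

# Hypothetical paths; in practice these come from --input_midi_1 / --input_midi_2.
input_midi_1 = "melody_a.mid"
input_midi_2 = "melody_b.mid"

# Parse each MIDI file into a NoteSequence proto, whose notes carry the
# pitch, velocity, start_time and end_time fields shown earlier.
sequences = [note_seq.midi_file_to_note_sequence(path)
             for path in (input_midi_1, input_midi_2)]

print(sequences[0].notes[0])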

lstm_utils.py:

This script has a method cudnn_lstm_layer, which builds an LSTM layer according to the parameters passed.

Build LSTM Network for Encoder.

layer_sizes: a list of the number of units in each LSTM layer. In our case there are 2 LSTM layers, each with 2048 units. Note that in lstm_models.py, the BidirectionalLstmEncoder (explained in the next section) calls cudnn_lstm_layer once for each layer.
Another important method worth mentioning is get_final:

Gets the final hidden states to be concatenated

This method returns only the output at the final valid index of each sequence. For the forward LSTM this is the last time step of the input, while for the backward LSTM (which reads the reversed sequence) it corresponds to the first time step of the original input. These two final states are concatenated to form the encoder output.
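A minimal sketch of what a get_final-style helper computes is shown below; this is an illustrative reimplementation, not the Magenta code (which also handles the time-major layout).

import tensorflow as tf

def get_final(sequence, sequence_length):
    """Return the output at the last valid time step of each sequence.

    sequence: [batch, time, depth] tensor of per-step LSTM outputs.
    sequence_length: [batch] int tensor of true sequence lengths.
    """
    batch_size = tf.shape(sequence)[0]
    # Index of the last valid step for every example in the batch.
    indices = tf.stack(
        [tf.range(batch_size), tf.cast(sequence_length, tf.int32) - 1], axis=1)
    return tf.gather_nd(sequence, indices)  # [batch, depth]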

lstm_models.py:

The main class of interest is BidirectionalLstmEncoder.

The first method in this class is build, which creates the bidirectional LSTM network:

cells_fw: list containing the 2 unidirectional LSTM layers of the forward network.

cells_bw: list containing the 2 unidirectional LSTM layers of the backward network.

Note that this calls cudnn_lstm_layer in lstm_utils.py, as described above.

Next we discuss the encode method, which does the actual encoding:

Encode:

Encode method

1. As the code shows, the input tensor is first transposed to time-major format, giving dimensions [256 × 512 × 4], and is fed to the forward LSTM cells. For the backward cells, the sequence is first reversed and then fed to the backward LSTM cells.

2. We then take the final hidden state output from the forward LSTM cells and, since the input was reversed, the hidden state corresponding to the first time step of the original sequence from the backward LSTM cells.

3. Finally, both hidden states, of dimension [512 × 2048] each, are concatenated to form the encoder output of dimension [512 × 4096], which is passed to the two different fully connected layers. This gives us the parameters of the normal distribution, as shown in the code below.

Generating mu and sigma for the Normal Distribution

4. Softplus is used as the activation function for the dense layer that gives us the sigma values of the distribution.

5. This gives us the latent distribution parameters mu and sigma of a multivariate normal distribution, which represents the posterior distribution of each sequence.
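For illustration, here is a minimal sketch of the two fully connected layers described in points 4 and 5, using Keras dense layers rather than the exact Magenta code; the latent size of 512 comes from the text above.

import tensorflow as tf

latent_size = 512  # dimension of the latent distribution

mu_layer = tf.keras.layers.Dense(latent_size)
# Softplus keeps sigma strictly positive.
sigma_layer = tf.keras.layers.Dense(latent_size, activation=tf.nn.softplus)

def latent_params(encoder_output):
    """encoder_output: [batch, 4096] concatenated final hidden states."""
    mu = mu_layer(encoder_output)
    sigma = sigma_layer(encoder_output)
    return mu, sigma  # parameters of the posterior Gaussian for each sequence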

Hierarchical Decoder:

Overview :

The paper specifies the use of an additional layer, called the conductor, to learn longer sequences from the latent space. This conductor is simply an LSTM layer. The number of LSTM blocks in the layer is specified by a hyperparameter depending on the type of music used. The latent vector obtained from the encoder is passed through a fully connected dense layer with a tanh activation function. The result is then packed into two LSTM block-cell states, which are used to initialize the first level of the decoder LSTMs. Depth-wise decoding is then done to get the final outputs.

Architecture

LSTM Layer :

Length of LSTM layer specifies the number of LSTM Blocks present in each layer.
This is set as an hyperparameter and differs from model to model. A single LSTM
block in one layer has stacked LSTM each with 1024 units. Consider the given figure.
For level_length = [16,16]. There are two levels. In the first level there are 16 LSTM
blocks and in the second level there are 16 LSTM blocks for each block of level 1. Total
blocks in second layer = No of blocks in layer 1 * number of blocks in layer 2 for each
block in layer 1 = 16*16 = 256. Number of cells inside one LSTM block depends on the
hyperparameter dec_rnn_size. In the figure shown below there are two stacked
1024 cells inside one block. In other words a multirnncell composed sequentially of 2
LSTM block cells of size 1024 each. The second layer is then connected to the output
LSTM layer
Code Snippet:

Hierarchical LSTM creation in model build (lstm_models.py)

LSTM block cell creation (lstm_utils.py)
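A rough sketch of what one such block looks like, using Keras cells in place of the original LSTMBlockCell/MultiRNNCell helpers; dec_rnn_size is taken from the text, everything else is illustrative.

import tensorflow as tf

dec_rnn_size = [1024, 1024]  # two stacked LSTM cells of 1024 units each

def build_block_cell():
    # One "block": a stack of LSTM cells applied sequentially at each time step.
    cells = [tf.keras.layers.LSTMCell(units) for units in dec_rnn_size]
    return tf.keras.layers.StackedRNNCells(cells)

block_cell = build_block_cell()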

Conductor :

The first level of unidirectional LSTM layer is what is called as conductor. Only the
first block cell in the conductor is initialized with the embeddings from latent space.
This sets the hidden and cell state. The input is initialised to zeros based on batch
size. The batch size is set as hyper parameter. When batch size is 512 , a tensor of
shape 512 * 1 set all as zeros is given as input to the first block cell. Given input, cell
state and hidden state it provides the next cell state, hidden state and the output. The
state values are passed on to the next LSTM
Code Snippet:

Initial state of the conductor initialized from z (lstm_models.py)

Getting the initial state from z (lstm_utils.py)
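A minimal sketch, with assumed shapes, of how the conductor's initial state is derived from z; it mirrors the split-and-pack procedure described in the Working section below and is not the exact Magenta code.

import tensorflow as tf

batch_size, z_size = 512, 512
# Flattened (c, h) sizes for the two stacked 1024-unit LSTM cells.
flat_state_sizes = [1024, 1024, 1024, 1024]

initial_state_layer = tf.keras.layers.Dense(
    sum(flat_state_sizes), activation=tf.nn.tanh)

def conductor_initial_state(z):
    """z: [batch, z_size] latent vectors sampled from the encoder."""
    flat = initial_state_layer(z)                        # [batch, 4096]
    c1, h1, c2, h2 = tf.split(flat, flat_state_sizes, axis=1)
    # One (c, h) state tuple per stacked LSTM cell in the first block.
    return [(c1, h1), (c2, h2)]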

Decoder:

Each of the conductor's outputs becomes the input to the next-level decoder, which again consists of unidirectional LSTM blocks as discussed above. The number of blocks attached to each conductor block is determined by the layer length specified as a hyperparameter. The states of the first block cell become the initial state of the next block within the same conductor only; there is no communication between the LSTM blocks of different conductors.

Code snippet:

Recursive decoding is done depth first. Inside the recursive decode function, num_steps = level_lengths[level]; e.g. num_steps = 16 for level 1. The bottom levels are called recursively, and once they return, the loop repeats for num_steps (lstm_models.py).
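A toy, self-contained sketch of this depth-first recursion is given below; lstm_step and core_decode are placeholders for illustration, not Magenta functions.

level_lengths = [16, 16]

def lstm_step(level, x):
    # Stand-in for running one LSTM block at this level and returning its output.
    return x

def core_decode(x):
    # Stand-in for the bottom-level core decoder producing output notes.
    return [x]

def recursive_decode(x, level=0):
    if level == len(level_lengths):
        return core_decode(x)
    outputs = []
    for _ in range(level_lengths[level]):    # num_steps = level_lengths[level]
        x = lstm_step(level, x)
        # Depth first: fully decode the subtree below before taking the next step.
        outputs.extend(recursive_decode(x, level + 1))
    return outputs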

Core decoder:

The outputs from the last level are given as input to the core decoder. The number of core decoder units depends on the number of units in the level above, as each LSTM block in the previous level is connected to one core decoder. The type of core decoder varies from model to model. The decoded outputs are then merged together to give the final output of interpolated music.

In the last level, the core decoder is called with the output from the previous level as the initial input (lstm_models.py)

Working:

This is based on the “hierarchical trio” model.

1. z is passed to a dense layer with tanh as the activation function. The output size is the flattened state size of the first LSTM block in the first level (the first conductor). If the block has 1024 + 1024 LSTM units as mentioned previously, the state size is c = 1024, h = 1024 for each of the two stacked LSTMs, so the flattened state size is [1024, 1024, 1024, 1024].

2. The output size of the dense layer is the sum of the flattened state sizes (4096). Thus the dense layer output has shape 512 × 4096.

3. This is then split into four tensors (len(flattened state size) splits), each of shape 512 × 1024.

4. The four splits are then packed into two LSTM state tuples, each with c of shape 512 × 1024 and h of shape 512 × 1024, where c is the cell state and h is the hidden state of the LSTM.

5. This becomes the initial state of the first conductor.

6. Based on the initial state, depth-wise decoding is done, passing outputs down to the layers below and finally to the core decoder.

7. The output from conductor 1 becomes the input to the next-level decoder (the core decoder).

8. The decoder is again a unidirectional LSTM, like the conductor.

9. The states from decoder 1 are passed as initial states to decoder 2, which produces outputs and states that are passed on to the next decoder, and so on.

10. Once conductor 1 has completed its decoding, the same procedure starts for conductors 2, 3, and so on up to the level length (16 in this case). Thus each conductor decodes recursively, depth first, and the conductors are independent of the core decoder outputs, which forces the model to utilize the latent vectors and thereby helps it learn longer sequences.

11. The outputs from all the decoders are then merged to give the final output. The figure below shows the decoding process.

12. The reconstruction loss is calculated by comparing the final output with the actual output; cross-entropy is used.

13. The Adam optimizer is used for training.

Conductor and Decoder Architecture
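A minimal sketch of the training step described in points 12 and 13; the names and shapes here are assumptions, and the full VAE objective also includes a KL term against the prior, which is not shown in the steps above.

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
cross_entropy = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

def train_step(model, inputs, targets):
    """targets: one-hot note events; model(inputs): decoder logits of the same shape."""
    with tf.GradientTape() as tape:
        logits = model(inputs)
        loss = cross_entropy(targets, logits)  # reconstruction loss (step 12)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))  # Adam (step 13)
    return loss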

HVAE for Text Generation:

Given the same HVAE model, we were interested in how it would perform for text generation. We create an architecture similar to the one above, where the inputs are texts instead of song sequences.
References:

A. Roberts, J. Engel, C. Raffel, C. Hawthorne, D. Eck, “A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music” (MusicVAE): https://arxiv.org/pdf/1803.05428.pdf
