Convolutional Generative Adversarial Networks With Binary Neurons For Polyphonic Music Generation

arXiv:1804.09399v3 [cs.LG] 6 Oct 2018
ABSTRACT

It has been shown recently that deep convolutional generative adversarial networks (GANs) can learn to generate music in the form of piano-rolls, which represent music by binary-valued time-pitch matrices. However, existing models can only generate real-valued piano-rolls and require further post-processing, such as hard thresholding (HT) or Bernoulli sampling (BS), to obtain the final binary-valued results. In this paper, we study whether we can have a convolutional GAN model that directly creates binary-valued piano-rolls by using binary neurons. Specifically, we propose to append to the generator an additional refiner network, which uses binary neurons at the output layer. The whole network is trained in two stages. Firstly, the generator and the discriminator are pretrained. Then, the refiner network is trained along with the discriminator to learn to binarize the real-valued piano-rolls the pretrained generator creates. Experimental results show that using binary neurons instead of HT or BS indeed leads to better results in a number of objective measures. Moreover, deterministic binary neurons perform better than stochastic ones in both objective measures and a subjective test. The source code, training data and audio examples of the generated results can be found at https://salu133445.github.io/bmusegan/.

Figure 1. Six examples of the four-bar-long, eight-track piano-rolls (each block represents a bar) seen in our training data. The vertical and horizontal axes represent note pitch and time, respectively. The eight tracks are Drums, Piano, Guitar, Bass, Ensemble, Reed, Synth Lead and Synth Pad.
We consider two types of BNs: deterministic binary neurons (DBNs) and stochastic binary neurons (SBNs). DBNs act like neurons with hard thresholding functions as their activation functions. We define the output of a DBN for a real-valued input x as

    DBN(x) = u(σ(x) − 0.5),

where σ(·) is the logistic sigmoid function and u(·) is the unit step function. SBNs instead binarize stochastically:

    SBN(x) = u(σ(x) − v),  v ∼ U[0, 1].
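As a concrete illustration, below is a minimal sketch of DBNs and SBNs with the sigmoid-adjusted straight-through (ST) estimator used for backpropagation. It is written in PyTorch for brevity (the released code is in TensorFlow); the function name and the `slope` argument are our own, the latter anticipating the slope annealing trick discussed in Section 4.1.

```python
import torch

def binary_neuron(x, stochastic=False, slope=1.0):
    """Binarize pre-activations x; the forward pass outputs 0/1 values.

    DBN(x) = u(sigmoid(x) - 0.5); SBN(x) = u(sigmoid(x) - v), v ~ U[0, 1].
    Gradients flow through the (slope-scaled) sigmoid, i.e. the
    sigmoid-adjusted straight-through estimator.
    """
    probs = torch.sigmoid(slope * x)
    threshold = torch.rand_like(probs) if stochastic else 0.5
    hard = (probs > threshold).float()
    # Forward value is `hard`; backward treats d(hard)/dx as d(probs)/dx.
    return hard + probs - probs.detach()
```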
Figure 6. Residual unit used in the refiner network. The values denote the kernel size and the number of output channels of the two convolutional layers.

Figure 7. Comparison of binarization strategies. (a): the probabilistic, real-valued (raw) predictions of the pretrained G. (b), (c): the results of applying the post-processing algorithms directly to the raw predictions in (a). (d), (e): the results of the proposed models, which use an additional refiner R to binarize the real-valued predictions of G. Empty tracks are not shown. (We note that in (d), a few noisy pixels (33 in total) occur in the Reed and Synth Lead tracks.)
In this way, the refiner needs only to binarize the generator's real-valued output rather than to learn a new mapping from G(z) to the data space. Hence, we draw inspiration from residual learning and propose to construct the refiner with a number of residual units [16], as shown in Figure 4. The output layer (i.e. the final layer) of the refiner is made up of either DBNs or SBNs.
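For illustration, here is a sketch of how such a refiner might be assembled. The PyTorch phrasing is ours, the channel count and kernel size are placeholders for the actual values given in Figure 6, and `binary_neuron` is the DBN/SBN sketch shown earlier.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Two conv layers with an identity skip connection [16].

    Channel count and kernel size are placeholders; the actual values
    are the ones shown in Figure 6.
    """
    def __init__(self, channels=1, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size, padding=pad),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)  # residual (identity) addition

class Refiner(nn.Module):
    """Residual units followed by an output layer of DBNs or SBNs."""
    def __init__(self, num_units=2, stochastic=False):
        super().__init__()
        self.units = nn.Sequential(*[ResidualUnit() for _ in range(num_units)])
        self.stochastic = stochastic

    def forward(self, x, slope=1.0):
        # binary_neuron is the straight-through DBN/SBN sketched above.
        return binary_neuron(self.units(x), self.stochastic, slope)
```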
3.4 Discriminator

Similar to the generator, the discriminator D consists of M private networks Dp_i, i = 1, . . . , M, one for each track, followed by a shared network Ds, as shown in Figure 5. Each private network Dp_i first extracts low-level features from the corresponding track of the input piano-roll. Their outputs are concatenated and sent to the shared network Ds to extract higher-level abstractions shared by all the tracks. This design differs from [10], where only one (shared) discriminator was used to evaluate all the tracks collectively. We evaluate this new shared/private design in Section 4.5.
As a minor contribution, to help the discriminator extract musically-relevant features, we propose to add two more streams to the discriminator, shown in the lower half of Figure 5. In the first, the onset/offset stream, the differences between adjacent elements in the piano-roll along the time axis are first computed, and the resulting matrix is then summed along the pitch axis and fed to Do. In the second, the chroma stream, the piano-roll is viewed as a sequence of one-beat-long frames. A chroma vector is then computed for each frame; these vectors jointly form a matrix, which is fed to Dc. Note that all the operations involved in computing the chroma and onset/offset features are differentiable, so we can still train the whole network by backpropagation.

Finally, the features extracted from the three streams are concatenated and fed to Dm to make the final prediction.
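Both streams reduce to simple differentiable tensor operations. Below is a sketch on the 4 (bar) × 96 (time step) × 84 (pitch) × 8 (track) tensor described in Section 4.1; the batch-first layout, the function names and the zero-padding that keeps the time length at 96 (matching the input size of Do in Table 3) are our assumptions.

```python
import torch
import torch.nn.functional as F

def onset_offset_stream(x):
    """x: (batch, 4 bars, 96 time steps, 84 pitches, 8 tracks) in [0, 1].

    Differences between adjacent time steps (padded back to length 96),
    then summed along the pitch axis; every op is differentiable.
    """
    diff = x[:, :, 1:] - x[:, :, :-1]          # adjacent differences in time
    diff = F.pad(diff, (0, 0, 0, 0, 1, 0))     # pad the time axis back to 96
    return diff.sum(dim=3, keepdim=True)       # (batch, 4, 96, 1, 8)

def chroma_stream(x):
    """View the piano-roll as one-beat frames (24 steps/beat -> 4 beats/bar)
    and fold the 84 pitches (C1-B7, 7 octaves) into 12 pitch classes."""
    b = x.shape[0]
    frames = x.reshape(b, 4, 4, 24, 7, 12, 8)  # (bar, beat, step, octave, class, track)
    return frames.sum(dim=(3, 4))              # (batch, 4, 4, 12, 8)
```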
3.5 Training

We propose to train the model in a two-stage manner: G and D are pretrained in the first stage; R is then trained along with D (fixing G) in the second stage. Other training strategies are discussed and compared in Section 4.4.
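The following sketch summarizes this schedule. The helper names and the simplified non-saturating GAN loss are ours; the actual implementation follows the WGAN-GP training of [13] with the Adam optimizer [19] (see Section 4.1).

```python
import itertools
import torch
import torch.nn.functional as F

def two_stage_train(G, D, R, loader, z_dim=128,
                    pretrain_steps=10_000, refine_epochs=10):
    """Stage 1: pretrain G and D. Stage 2: fix G, train R along with D."""
    opt_g = torch.optim.Adam(G.parameters())
    opt_d = torch.optim.Adam(D.parameters())
    opt_r = torch.optim.Adam(R.parameters())

    def update_d(real, fake):
        r, f = D(real), D(fake.detach())
        loss = (F.binary_cross_entropy_with_logits(r, torch.ones_like(r))
                + F.binary_cross_entropy_with_logits(f, torch.zeros_like(f)))
        opt_d.zero_grad(); loss.backward(); opt_d.step()

    def update_gen(fake, opt):
        f = D(fake)
        loss = F.binary_cross_entropy_with_logits(f, torch.ones_like(f))
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 1: pretrain G and D as an ordinary GAN.
    for real in itertools.islice(itertools.cycle(loader), pretrain_steps):
        update_d(real, G(torch.randn(len(real), z_dim)))
        update_gen(G(torch.randn(len(real), z_dim)), opt_g)

    # Stage 2: train R (with binary neurons) and D, keeping G fixed.
    slope = 1.0
    for _ in range(refine_epochs):
        for real in loader:
            with torch.no_grad():
                raw = G(torch.randn(len(real), z_dim))  # G is frozen
            update_d(real, R(raw, slope))
            update_gen(R(raw, slope), opt_r)
        slope *= 1.1  # slope annealing trick [9]
```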
4. ANALYSIS OF THE GENERATED RESULTS

4.1 Training Data & Implementation Details

The Lakh Pianoroll Dataset (LPD) [10]³ contains 174,154 multi-track piano-rolls derived from the MIDI files in the Lakh MIDI Dataset (LMD) [23].⁴ In this paper, we use a cleansed subset (LPD-cleansed) as the training data, which contains 21,425 multi-track piano-rolls that are in 4/4 time and have been matched to distinct entries in the Million Song Dataset (MSD) [5]. To make the training data cleaner, we consider only songs with an alternative tag. We randomly pick six four-bar phrases from each song, which leads to a final training set of 13,746 phrases from 2,291 songs.

We set the temporal resolution to 24 time steps per beat to cover common temporal patterns such as triplets and 32nd notes. An additional one-time-step-long pause is added between two consecutive notes of the same pitch (i.e. notes without a pause in between) to distinguish them from a single long note. The note pitch has 84 possibilities, from C1 to B7.
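As an illustration of this encoding, here is a sketch (with our own function and argument names) of writing quantized note events into a binary piano-roll while leaving the one-time-step pause between consecutive same-pitch notes; carving the pause out of the earlier note's tail is our assumption.

```python
import numpy as np

def encode_notes(notes, num_steps, num_pitches=84):
    """notes: iterable of (pitch_index, onset_step, offset_step) tuples,
    with times already quantized to 24 time steps per beat."""
    roll = np.zeros((num_steps, num_pitches), dtype=bool)
    onsets = {(pitch, onset) for pitch, onset, _ in notes}
    for pitch, onset, offset in notes:
        # If another note of the same pitch starts exactly at this note's
        # offset, trim one time step to leave the separating pause.
        if (pitch, offset) in onsets:
            offset -= 1
        roll[onset:offset, pitch] = True
    return roll
```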
We categorize all instruments into drums and sixteen instrument families according to the specification of General MIDI Level 1.⁵ We discard the less popular instrument families in LPD and use the following eight tracks: Drums, Piano, Guitar, Bass, Ensemble, Reed, Synth Lead and Synth Pad. Hence, the size of the target output tensor is 4 (bar) × 96 (time step) × 84 (pitch) × 8 (track).

Both G and D are implemented as deep CNNs (see Appendix A for the detailed network architectures). The length of the input random vector is 128. R consists of the two residual units [16] shown in Figure 6. Following [13], we use the Adam optimizer [19] and apply batch normalization only to G and R. We apply the slope annealing trick [9] to networks with BNs, where the slope of the sigmoid function in the sigmoid-adjusted ST estimator is multiplied by 1.1 after each epoch. The batch size is 16, except for the first stage in the two-stage training setting, where the batch size is 32.

³ https://salu133445.github.io/lakh-pianoroll-dataset/
⁴ http://colinraffel.com/projects/lmd/
⁵ https://www.midi.org/specifications/item/gm-level-1-sound-set
      | training | pretrained  | proposed    | joint       | end-to-end  | ablated-I   | ablated-II
      | data     | BS     HT   | SBNs   DBNs | SBNs   DBNs | SBNs   DBNs | BS     HT   | BS     HT
  QN  | 0.88     | 0.67   0.72 | 0.42   0.78 | 0.18   0.55 | 0.67   0.28 | 0.61   0.64 | 0.35   0.37
  PP  | 0.48     | 0.20   0.22 | 0.26   0.45 | 0.19   0.19 | 0.16   0.29 | 0.19   0.20 | 0.14   0.14
  TD  | 0.96     | 0.98   1.00 | 0.99   0.87 | 0.95   1.00 | 1.40   1.10 | 1.00   1.00 | 1.30   1.40

(Underlined and bold font indicate respectively the top and top-three entries with values closest to those shown in the 'training data' column.)

Table 1. Evaluation results for different models (QN: qualified note rate; PP: polyphonicity; TD: tonal distance). Values closer to those reported in the 'training data' column are better.
[10] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. MuseGAN: Symbolic-domain music generation and accompaniment with multi-track sequential generative adversarial networks. In Proc. AAAI, 2018.

[11] Douglas Eck and Jürgen Schmidhuber. Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In Proc. IEEE Workshop on Neural Networks for Signal Processing, 2002.

[12] Ian J. Goodfellow et al. Generative adversarial nets. In Proc. NIPS, 2014.

[13] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Proc. NIPS, 2017.

[14] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. DeepBach: A steerable model for Bach chorales generation. In Proc. ICML, 2017.

[24] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. In Proc. ICML, 2018.

[25] Bob L. Sturm, João Felipe Santos, Oded Ben-Tal, and Iryna Korshunova. Music transcription modelling and composition using deep learning. In Proc. CSMS, 2016.

[26] Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation. In Proc. ISMIR, 2017.

[27] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In Proc. AAAI, 2017.
APPENDIX
A. NETWORK ARCHITECTURES
We show in Table 3 the network architectures for the generator G, the discriminator D, the onset/offset feature extractor Do, the chroma feature extractor Dc and the discriminator for the ablated-II model.
(b) discriminator D:

  Input: ℝ^(4×96×84×8)
  split along the track axis (one private network per track, ×8):
      substream I:  conv 32 1×1×12 (1,1,12), then conv 64 1×6×1 (1,6,1)
      substream II: conv 32 1×6×1 (1,6,1), then conv 64 1×1×12 (1,1,12)
      concatenate along the channel axis
      conv 64 1×1×1 (1,1,1)
  concatenate along the channel axis
  conv 128 1×4×3 (1,4,2)
  conv 256 1×4×3 (1,4,3)
  concatenate along the channel axis (with the outputs of the onset/offset stream Do and the chroma stream Dc)
  conv 512 2×1×1 (1,1,1)
  dense 1536
  dense 1
  Output: ℝ

(c) onset/offset feature extractor Do:

  Input: ℝ^(4×96×1×8)
  conv 32 1×6×1 (1,6,1)
  conv 64 1×4×1 (1,4,1)
  conv 128 1×4×1 (1,4,1)
  Output: ℝ^(4×1×1×128)

(d) chroma feature extractor Dc:

  Input: ℝ^(4×4×12×8)
  conv 64 1×1×12 (1,1,12)
  conv 128 1×4×1 (1,4,1)
  Output: ℝ^(4×1×1×128)

(e) discriminator for the ablated-II model:

  Input: ℝ^(4×96×84×8)
  conv 128 1×1×12 (1,1,12)
  conv 128 1×1×3 (1,1,2)
  conv 256 1×6×1 (1,6,1)
  conv 256 1×4×1 (1,4,1)
  conv 512 1×1×3 (1,1,3)
  conv 512 1×4×1 (1,4,1)
  conv 1024 2×1×1 (1,1,1)
  flatten to a vector
  dense 1
  Output: ℝ
Table 3. Network architectures for (a) the generator G, (b) the discriminator D, (c) the onset/offset feature extractor Do , (d)
the chroma feature extractor Dc and (e) the discriminator for the ablated-II model. For the convolutional layers (conv) and
the transposed convolutional layers (transconv), the values represent (from left to right): the number of filters, the kernel
size and the strides. For the dense layers (dense), the value represents the number of nodes. Each transposed convolutional
layer in G is followed by a batch normalization layer and then activated by ReLUs except for the last layer, which is
activated by sigmoid functions. The convolutional layers in D are activated by LeakyReLUs except for the last layer, which
has no activation function.
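As a concrete reading of the table, panel (d) can be transcribed as follows. The PyTorch phrasing and the channels-first layout (the eight tracks as input channels) are our assumptions; the released implementation is in TensorFlow.

```python
import torch.nn as nn

# Chroma feature extractor Dc from Table 3(d).
# Input:  (batch, 8 tracks, 4 bars, 4 beats, 12 pitch classes)
# Output: (batch, 128, 4, 1, 1), i.e. R^(4x1x1x128) per sample.
chroma_extractor = nn.Sequential(
    nn.Conv3d(8, 64, kernel_size=(1, 1, 12), stride=(1, 1, 12)),
    nn.LeakyReLU(0.2),
    nn.Conv3d(64, 128, kernel_size=(1, 4, 1), stride=(1, 4, 1)),
    nn.LeakyReLU(0.2),
)
```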
Figure 12. Sample eight-track piano-rolls seen in the training data. Each block represents a bar for a certain track. The
eight tracks are (from top to bottom) Drums, Piano, Guitar, Bass, Ensemble, Reed, Synth Lead and Synth Pad.
Figure 13. Randomly-chosen eight-track piano-rolls generated by the proposed model with DBNs. Each block represents
a bar for a certain track. The eight tracks are (from top to bottom) Drums, Piano, Guitar, Bass, Ensemble, Reed, Synth Lead
and Synth Pad.
Figure 14. Randomly-chosen eight-track piano-rolls generated by the proposed model with SBNs. Each block represents a
bar for a certain track. The eight tracks are (from top to bottom) Drums, Piano, Guitar, Bass, Ensemble, Reed, Synth Lead
and Synth Pad.
Figure 15. Randomly-chosen five-track piano-rolls generated by the modified end-to-end models (see Appendix D) with
(a) DBNs and (b) SBNs. Each block represents a bar for a certain track. The five tracks are (from top to bottom) Drums,
Piano, Guitar, Bass and Ensemble.