Weakly Supervised Deep Recurrent Neural Networks For Basic Dance Step Generation
○ Nelson Yalta (Waseda University), Shinji Watanabe (JHU), Kazuhiro Nakadai (HRI-JP), Tetsuya Ogata (Waseda University)
Abstract: A deep recurrent neural network (DRNN) with audio input is applied to model basic dance steps. The proposed model employs multilayered Long Short-Term Memory (LSTM) layers and convolutional layers to process the audio power spectrum. Then, another deep LSTM layer decodes the target dance sequence. This end-to-end approach has an auto-conditioned decoder configuration that reduces the accumulation of feedback error. Experimental results demonstrate that, after training on a small dataset, the model generates basic dance steps with low cross entropy and maintains a motion beat F-score similar to that of a baseline dancer. In addition, we investigate the use of a contrastive cost function for music-motion regulation. This cost function targets motion direction and maps similarities between music frames. Experimental results demonstrate that the cost function improves the motion beat F-score.
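The contrastive cost function itself is not reproduced in this excerpt; the sketch below shows one plausible reading, assuming a standard margin-based contrastive loss in which the motion direction between consecutive frames is kept small when the corresponding music frames are labeled as similar and pushed beyond a margin otherwise. The function name, label convention, and margin value are illustrative assumptions, not the paper's definition.

import torch
import torch.nn.functional as F

def music_motion_contrastive_loss(motion, similar, margin=1.0):
    # motion:  (T, J) tensor of generated joint positions over T frames.
    # similar: (T-1,) float tensor; 1 where consecutive music frames are similar.
    direction = motion[1:] - motion[:-1]        # motion direction per frame step
    d = direction.norm(dim=1)                   # magnitude of the direction change
    # Similar music frames -> small motion change; dissimilar -> at least `margin`.
    loss = similar * d.pow(2) + (1.0 - similar) * F.relu(margin - d).pow(2)
    return 0.5 * loss.mean()

Under this reading, the loss regulates motion against music: frames the music treats as alike should produce steady motion, while dissimilar frames should produce a visible change of direction.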
2. Proposed Framework
An overview of the proposed system is shown in Figure 1.

2.1 Deep Recurrent Neural Network

Mapping high-dimensional sequences such as motion is a challenging task for deep neural networks (DNNs) [5] because such sequences are not constrained to a fixed size. In addition, to generate motion from music, the proposed model must map highly non-linear representations between music and motion [9]. In time-signal modeling [14], DRNNs implemented with LSTM layers have shown remarkable performance and stable training when deeper networks are employed. Furthermore, using stacked convolutional layers in a DRNN has demonstrated promising results for speech recognition tasks [10]. This unified framework is referred to as a CLDNN. In this framework, the convolutional layers reduce the spectral variation of the input sound, and the LSTM layers perform time-signal modeling. To construct the proposed model, we consider a DRNN with LSTM layers separated into two blocks [14]: one to reduce the music input sequence (encoder) and another for the motion output sequence (decoder). This configuration can handle non-fixed-dimensional signals, such as motion, and avoids the performance degradation caused by the long-term dependencies of RNNs.

The input to the network is the audio power spectrum, represented as $x_{1:n} = (x_t \in \mathbb{R}^b \mid t = 1, \ldots, n)$ with $n$ frames and $b$ frequency bins, and the ground truth motion sequence is represented as $y_{1:n} = (y_t \in \mathbb{R}^j \mid t = 1, \ldots, n)$ with $n$ frames and $j$ joint axes. The following equations show the relations of the motion modeling:

However, with time-series data (such as dance data), the model may freeze, or the output may diverge from the target due to accumulated feedback errors. To address these issues, the decoder accounts for autoregressive noise accumulation by including the previously generated step in the motion generation process.

2.2 Auto-conditioned Decoder

A conventional method for training sequences with RNN models uses the ground truth of the given sequence as the input. During evaluation, a model accustomed to the ground truth in the training process may freeze or diverge from the target due to the accumulation of slight differences between the trained and the self-generated sequences. The auto-conditioned LSTM layer handles the errors accumulated during sequence generation by conditioning the network on its own output during training. Thus, the network can handle long sequences from a single input, maintain accuracy, and mitigate error accumulation.

In a previous study [?] that employed an auto-conditioned LSTM layer for complex motion synthesis, the conditioned LSTM layer was trained by switching the input between the generated motion and the ground truth motion after a fixed number of repeated steps. In the proposed method, we employ the ground truth motion as a target only at the beginning of the training sequence (Fig. 2). By modifying Eq. 2, the generated output is expressed as follows:

$$m_{t+1} = \mathrm{LSTM}_d^l([g(x_t), y'_t]) \qquad (3)$$

and at $t + 1$:
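As a concrete illustration of Eq. 3 and the encoder-decoder configuration above, the following PyTorch sketch passes the encoded audio feature $g(x_t)$ and the previous pose into the decoder LSTM, using the ground truth pose only as the first decoder input and the network's own output thereafter. The layer sizes, the convolutional front end, and all names are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class DanceDRNN(nn.Module):
    # Hypothetical sizes: 129 frequency bins, 69 joint axes, 3 LSTM layers.
    def __init__(self, bins=129, joints=69, hidden=512, layers=3):
        super().__init__()
        # Convolutional front end: reduces spectral variation of the power spectrum.
        self.cnn = nn.Sequential(nn.Conv1d(bins, hidden, kernel_size=3, padding=1),
                                 nn.ReLU())
        # Encoder LSTM block g(.): reduces the music input sequence.
        self.encoder = nn.LSTM(hidden, hidden, layers, batch_first=True)
        # Decoder LSTM block: generates motion from [g(x_t), y'_t] (Eq. 3).
        self.decoder = nn.LSTM(hidden + joints, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, joints)

    def forward(self, spectrum, first_pose):
        # spectrum: (B, T, bins) audio power spectrum; first_pose: (B, joints).
        feats = self.cnn(spectrum.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        g, _ = self.encoder(feats)
        poses, prev, state = [], first_pose, None
        for t in range(g.size(1)):
            # Auto-conditioning: y'_t is ground truth only at t = 0; afterwards the
            # network is conditioned on its own previous output.
            step = torch.cat([g[:, t], prev], dim=-1).unsqueeze(1)
            h, state = self.decoder(step, state)
            prev = self.out(h[:, 0])
            poses.append(prev)
        return torch.stack(poses, dim=1)  # (B, T, joints) generated motion

Feeding the self-generated pose back in during training, rather than only at test time, is what lets the network absorb its own accumulated error instead of freezing or drifting away from the target.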
4.2 Music vs. Motion Beat

We compared the results to two music beat retrieval systems, i.e., MADMOM and the Music Analysis, Retrieval and Synthesis for Audio Signals (MARSYAS) system. MADMOM [16] is a Python audio and music signal processing library that employs deep learning to estimate the music beat, and MARSYAS is an open-source framework that obtains the music beat using an agent-based tempo induction method that processes a continuous audio signal [15]. The beat annotations of each system were compared to manually annotated ground truth beat data. We then compared the ground truth beat data to the motion beat of the dancer for each dance. Here, the average F-score was used as a baseline.

Table 1 compares the results of the music beat retrieval frameworks, the motion beat of the dancer (baseline), and the motion beat of the generated dances. As can be seen, the proposed models demonstrate better performance than MARSYAS, and S2SMC outperformed S2S in the evaluations that used clean and noisy data for training. However, neither model performed well when a different music track or music from a different genre was used as the input. In addition, the proposed models did not outperform a model trained to process only the music beat (i.e., MADMOM). S2SMC demonstrated lower cross entropy than S2S, which means that the dance generated by S2SMC was similar to the trained dance.

Table 2 shows the F-scores for the music tracks trained using the salsa dataset. Both models show better performance than the dancer when tested under the same training conditions, and S2SMC shows better performance than S2S under all conditions. Note that the size of the dataset influences the performance of the models; here we employed a larger dataset than in the previous experiment.

Table 3 shows the results for the mixed-genre dataset. As can be seen, diverse results were obtained for each model. The proposed methods did not outperform the baseline, and S2S outperformed S2SMC for most genres. The main reason for this difference is the complexity of the dataset and the variety of dance steps relative to the number of music samples; thus, the model could not group the beats correctly (Fig. 5).

Table 3: F-score of mixed genres

Method             Bachata*  Ballad*  Bossa Nova*  Rock*  Hip Hop*  Salsa*
Dancer (baseline)    62.55     52.07        45.02  62.32     55.84   53.96
S2S                  60.72     49.92        46.19  60.06     64.30   52.31
S2SMC                56.63     48.48        40.91  64.87     63.85   53.71

*Trained data for S2S and S2SMC