
GENERATING DIVERSE AND NATURAL TEXT-TO-SPEECH SAMPLES USING A QUANTIZED FINE-GRAINED VAE AND AUTOREGRESSIVE PROSODY PRIOR

Guangzhi Sun1∗, Yu Zhang2, Ron J. Weiss2, Yuan Cao2, Heiga Zen2, Andrew Rosenberg2, Bhuvana Ramabhadran2, Yonghui Wu2

1 University of Cambridge    2 Google Inc.
gs534@eng.cam.ac.uk, {ngyugh,ronw,yuancao,heigazen,rosenberg,bhuv,yonghui}@google.com

arXiv:2002.03788v1 [eess.AS] 6 Feb 2020

ABSTRACT

Recent neural text-to-speech (TTS) models with fine-grained latent features enable precise control of the prosody of synthesized speech. Such models typically incorporate a fine-grained variational autoencoder (VAE) structure, extracting latent features at each input token (e.g., phonemes). However, generating samples with the standard VAE prior often results in unnatural and discontinuous speech, with dramatic prosodic variation between tokens. This paper proposes a sequential prior in a discrete latent space which can generate more natural-sounding samples. This is accomplished by discretizing the latent features using vector quantization (VQ), and separately training an autoregressive (AR) prior model over the result. We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes. Experimental results show that the proposed model significantly improves the naturalness of random sample generation. Furthermore, initial experiments demonstrate that random samples from the proposed model can be used as data augmentation to improve ASR performance.

Index Terms— text-to-speech, Tacotron 2, fine-grained VAE

1. INTRODUCTION

The fast-paced development of neural end-to-end TTS synthesis has enabled the generation of speech approaching human levels of naturalness [1–4]. Such models directly map an input text to a sequence of acoustic features using an encoder-decoder network architecture [5]. In addition to the input text, there are many other sources of variation in speech, including speaker, background noise, channel properties (e.g., reverberation), and prosody. These attributes can be accounted for in the synthesis model by learning a latent representation as an additional input to the decoder [6–9].

Prosody [10] collectively refers to stress, intonation, and rhythm in speech. Annotations for such factors are rarely available for training. Many recent end-to-end TTS models aiming to capture these factors extract a latent representation from the target speech and factorize the observed attributes, such as speaker and text information, out of the prosody latent space [6, 8, 11, 12]. These approaches extract a single latent variable for an entire utterance, requiring a single global representation to capture the full space of variation across speech signals of arbitrary length. In contrast, the model proposed in [13] uses a fine-grained structure to encode the prosody associated with each phoneme in the input sequence from the aligned target spectrogram. This system can synthesize speech which closely resembles the prosody of a provided reference speech, and can control local prosody by varying the values of the corresponding latent features.

There exist applications where generating samples of natural speech corresponding to the same text with different prosody is desirable. For example, samples with diverse prosodic variation could be useful for data augmentation in ASR when only a limited amount of real speech is available. Using a generative framework, such as a VAE [14], to represent the fine-grained latent variable naturally enables sampling of different prosody features for each phoneme. The prior over each latent variable is commonly modeled using a standard Gaussian distribution N(0, I). Since the prior is independent at each phoneme, the generated audio often exhibits discontinuous and unnatural artifacts such as long pauses between syllables or sudden increases in energy or fundamental frequency (F0).

A simple way to ameliorate these unnatural samples is to scale down the standard deviation of the prior distribution during generation, which decreases the likelihood of sampling outlier values. However, this also suppresses the diversity of the generated audio samples and does not eliminate discontinuities, since consecutive samples are still independent. Attempts to introduce temporal correlation to sequential latent representations [15, 16] often adopt an autoregressive decomposition of the prior and posterior distributions, parameterizing both using neural networks. More recently, [17] introduced the vector-quantized VAE (VQ-VAE) and a two-stage training approach for generating high-fidelity speech samples, in which the posterior is trained for reconstruction and an autoregressive (AR) prior is trained separately to fit the posteriors extracted from the training set.

This paper utilizes a two-stage training approach similar to [17]. We extend Tacotron 2 [4] to incorporate a quantized fine-grained VAE (QFVAE) where the latent representation is quantized into a fixed number of classes. We find that using a quantized representation improves naturalness over audio samples generated from the continuous latent space, while still ensuring reasonable diversity across samples. In the first stage, the TTS model is trained in a teacher-forced setting to maximize the likelihood of the training set. In the second stage, an autoregressive (AR) prior network is trained to fit the VAE posterior distribution over the training data learned in the first stage. This network learns to model the temporal dynamics across latent features. Samples characterized by latent features can be drawn from the AR prior by providing an initial state. We compare two AR prior fitting schemes, one in the continuous space and another in the quantized latent space.

We evaluate the proposed model from a few different perspectives. Sample naturalness and completeness are measured using an ASR system trained on real speech, in addition to subjective listening tests evaluating the naturalness of the generated speech. Sample diversity is evaluated by the average standard deviation per phoneme of three measurable prosody attributes. Lastly, its benefit as a data augmentation method is demonstrated by training an ASR system on audio samples generated from the TTS models.

∗ Work performed while interning at Google Brain.

Fig. 1. Fine-grained VAE encoder structure. Location-sensitive attention is used to align a reference spectrogram with the encoded phoneme sequence. The latent features z and encoded phonemes are concatenated and passed to the decoder.

Fig. 2. Illustration of the vector quantization process. The embedding codebook contains K different embeddings e_k. Continuous latents z are quantized to their nearest neighbors in the codebook by Euclidean distance. The red line shows the straight-through gradient path.

2. QUANTIZED FINE-GRAINED VAE TTS MODEL

The fine-grained VAE structure used to model the prosody at phoneme level, similar to that in [13], is shown in Fig. 1. The VAE component is integrated with the encoder of the Tacotron 2 model [4], and the target spectrogram is provided as an extra input to the encoder in order to extract latent prosody features.

The spectrogram is first aligned with the phoneme sequence using attention [18] according to the phoneme encodings from the output of the encoder. The aligned spectrogram is then sent to the VAE to extract a sequence of latent representations for the prosody which is also aligned with the phoneme sequence. Finally, the phoneme encodings are concatenated with the latent representations and sent to the decoder. The system is trained by optimizing the fine-grained evidence lower bound (ELBO) loss:

    L(p, q) = E_{q(z|X)}[ log p(X | Y, z) ] − β ∑_{n=1}^{N} D_{KL}( q(z_n | X, Y_n) || p(z_n) ),    (1)

where the first term is the reconstruction loss and the second term is the KL divergence between the prior and the posterior. The prior is chosen to be N(0, I). z represents the sequence of latent features and z_n corresponds to the latent representation for the n-th phoneme. X is the aligned spectrogram and Y represents the phoneme encoding.
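To make the phoneme-level ELBO in Eq. (1) concrete, the following PyTorch-style sketch shows one possible implementation. It is not the authors' code: the class name FineGrainedPosterior, the simple linear posterior network, the layer sizes, and the tensor shapes are assumptions made only for illustration.

# A minimal sketch of the phoneme-level ELBO in Eq. (1); not the authors' implementation.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class FineGrainedPosterior(nn.Module):
    def __init__(self, n_mels=128, d_enc=512, d_latent=3):
        super().__init__()
        # q(z_n | X, Y_n): one diagonal Gaussian per phoneme, computed from the
        # aligned spectrogram and the phoneme encoding (illustrative linear layer).
        self.net = nn.Linear(n_mels + d_enc, 2 * d_latent)

    def forward(self, aligned_spec, phoneme_enc):      # [B, N, n_mels], [B, N, d_enc]
        mu, log_std = self.net(torch.cat([aligned_spec, phoneme_enc], -1)).chunk(2, -1)
        q = Normal(mu, log_std.exp())
        z = q.rsample()                                # reparameterized sample, one z_n per phoneme
        prior = Normal(torch.zeros_like(mu), torch.ones_like(mu))   # p(z_n) = N(0, I)
        kl = kl_divergence(q, prior).sum(-1).sum(-1)   # sum over latent dims and phonemes
        return z, kl

def negative_elbo(recon_nll, kl, beta=1.0):
    # -L(p, q) = -E_q[log p(X | Y, z)] + beta * sum_n KL(q(z_n | X, Y_n) || p(z_n))
    return recon_nll + beta * kl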

2.1. Vector Quantization

Vector quantization, illustrated in Fig. 2, is performed after the latents are drawn from the posterior distribution, by assigning to each latent vector z_n the quantized embedding e_k that minimizes the Euclidean distance:

    k = argmin_j || z_n − e_j ||_2^2.    (2)

Unlike the original VQ-VAE [17], which used a one-hot posterior distribution, we maintain the Gaussian form of the posterior from the standard VAE to make it possible to experiment with continuous or discrete representations within the AR prior. Therefore the phoneme-level ELBO loss is used to train the VQ-VAE as before. The gradient of the reconstruction loss term is back-propagated to the latent encoder by directly copying the gradients at the quantized embeddings at each step to the corresponding latent vectors in the continuous space. Furthermore, to update the quantized embeddings, the following quantization and commitment losses are optimized together with the ELBO:

    L_VQ = ∑_{n=1}^{N} ( || sg[z_n] − e_n ||_2^2 + γ || z_n − sg[e_n] ||_2^2 ),    (3)

where sg[·] is the stop-gradient operator, which is the identity for the forward path and has zero partial derivatives for the backward path. The first term is the quantization loss, which moves the embeddings towards the latent vectors computed by the encoder, in order to minimize the error introduced by quantization. The second term is the commitment loss, which encourages the continuous latent vector to remain close to the quantized embedding, preventing the embedding space from expanding too fast. The total loss optimized by the model is the sum of the VAE loss L(p, q) and the VQ loss L_VQ.
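The quantization step and the losses in Eqs. (2) and (3) can be sketched as follows. This is not the authors' implementation: the codebook size, the latent dimensionality, and the use of a mean rather than the sum in Eq. (3) (a constant factor) are assumptions made for illustration.

# Illustrative sketch of the vector quantization step (Eqs. (2)-(3)); not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_embeddings=256, d_latent=3, gamma=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_embeddings, d_latent)   # {e_1, ..., e_K}
        self.gamma = gamma

    def forward(self, z):                                  # z: [B, N, d_latent] continuous latents
        # Eq. (2): nearest codebook entry for each phoneme-level latent.
        dist = ((z.unsqueeze(-2) - self.codebook.weight) ** 2).sum(-1)   # [B, N, K]
        k = dist.argmin(dim=-1)                            # codebook indices, [B, N]
        e = self.codebook(k)                               # quantized latents, [B, N, d_latent]
        # Eq. (3): quantization loss moves the codebook toward the encoder outputs;
        # the commitment loss keeps the encoder outputs close to the codebook
        # (mean reduction here instead of the sum, a constant factor).
        vq_loss = F.mse_loss(e, z.detach()) + self.gamma * F.mse_loss(z, e.detach())
        # Straight-through estimator: copy gradients from e back to z (red path in Fig. 2).
        e = z + (e - z).detach()
        return e, k, vq_loss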
3. AUTOREGRESSIVE PROSODY PRIOR

Once the TTS model is trained, the fine-grained VAE encoder shown in Fig. 1 can be used to compute parameters of the posterior distribution over latents from a reference spectrogram. Since the posterior is derived from a real speech spectrogram, samples from it will be natural and coherent across phonemes. However, without such a reference spectrogram, the model does not expose a method to generate natural samples. We therefore train an AR prior to model this temporal coherency in the latent feature sequence from the posterior. The AR prior is trained separately without affecting the training of the posterior. Because the encoder network uses vector quantization after sampling from an underlying posterior distribution in the continuous latent space, we compare two different prior fitting schemes.

The first scheme fits a Gaussian AR prior in the continuous latent space, so that the prior and the posterior at each time step come from the same family of distributions. A single-layer LSTM is used to model the prior, and is trained using teacher forcing on the latent feature sequence from the posterior. The output at each step of the sequence is a diagonal Gaussian distribution whose mean and standard deviation are functions of the previous latent features. This prior is also conditioned on the phoneme encoding Y:

    p(z_n | z_<n, Y) = N( z_n ; μ(z_<n, Y), σ(z_<n, Y) ),    (4)

where μ(z_<n, Y) and σ(z_<n, Y) are outputs of the prior LSTM. The LSTM is trained by additionally minimizing the same KL divergence in the continuous latent space as in Eq. (1), except that the original prior p(z_n) is replaced with p(z_n | z_<n, Y). During generation, the network is only provided with the phoneme encoding and an all-zero initial state. Samples drawn from the continuous AR prior for each phoneme are quantized using the same embedding space.
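One possible realization of the continuous AR prior in Eq. (4) is sketched below. It is not the authors' code: the LSTM width, conditioning by concatenation, and the handling of the first step are assumptions. Training would minimize the KL divergence between the posterior and this prior under teacher forcing, as described above.

# Illustrative sketch of the continuous autoregressive prior in Eq. (4); not the authors' code.
import torch
import torch.nn as nn
from torch.distributions import Normal

class ContinuousARPrior(nn.Module):
    def __init__(self, d_latent=3, d_enc=512, d_hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(d_latent + d_enc, d_hidden, batch_first=True)
        self.proj = nn.Linear(d_hidden, 2 * d_latent)      # outputs mu and log-sigma

    def forward(self, z, phoneme_enc):
        # Teacher forcing: predict p(z_n | z_<n, Y) from the shifted posterior samples.
        z_prev = torch.cat([torch.zeros_like(z[:, :1]), z[:, :-1]], dim=1)
        h, _ = self.lstm(torch.cat([z_prev, phoneme_enc], dim=-1))
        mu, log_sigma = self.proj(h).chunk(2, dim=-1)
        return Normal(mu, log_sigma.exp())                 # per-step diagonal Gaussian

    @torch.no_grad()
    def sample(self, phoneme_enc):
        # Generation: all-zero initial state, then draw one latent per phoneme.
        B, N, _ = phoneme_enc.shape
        z_n = phoneme_enc.new_zeros(B, 1, self.proj.out_features // 2)
        state, samples = None, []
        for n in range(N):
            h, state = self.lstm(torch.cat([z_n, phoneme_enc[:, n:n + 1]], dim=-1), state)
            mu, log_sigma = self.proj(h).chunk(2, dim=-1)
            z_n = Normal(mu, log_sigma.exp()).sample()
            samples.append(z_n)
        return torch.cat(samples, dim=1)                   # [B, N, d_latent]; quantized as in Sec. 2.1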

Alternatively, the AR prior can be fit directly in the discrete latent space, in which case the prior at each step is a categorical distribution over the quantized embedding space {e_k}. This is similar to training a neural language model [19]. To enable training the prior with a KL divergence in the discrete space, a single-sample estimate of the posterior distribution is used, hence the training objective becomes the cross-entropy loss:

    D_KL( Q(e) || P(e) ) = E_{Q(e)}[ log( Q(e) / P(e) ) ] ≈ − log P(e_k),    (5)

where Q(e) and P(e) are the discrete posterior and prior multinomial distributions, respectively, and e_k is a single sample drawn from the posterior distribution. Hence the estimated posterior distribution is one-hot at embedding e_k (i.e., Q̃(e_j) = 1 only when j = k, and 0 otherwise), which yields the cross-entropy loss form. As before, a single-layer LSTM is used to model the prior, and the output at each step is a categorical distribution after a softmax activation function. The phoneme encoding Y is also used during training and generation.
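The discrete prior trained with Eq. (5) can be sketched as a small categorical language model over codebook indices. The embedding layer, start token, and network sizes below are illustrative assumptions rather than the authors' configuration.

# Illustrative sketch of the discrete autoregressive prior trained with Eq. (5); not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteARPrior(nn.Module):
    def __init__(self, num_embeddings=256, d_enc=512, d_hidden=256):
        super().__init__()
        self.embed = nn.Embedding(num_embeddings + 1, d_hidden)   # +1 for an assumed start token
        self.lstm = nn.LSTM(d_hidden + d_enc, d_hidden, batch_first=True)
        self.proj = nn.Linear(d_hidden, num_embeddings)            # logits over the codebook {e_k}

    def forward(self, k, phoneme_enc):
        # k: [B, N] codebook indices sampled from the posterior (one per phoneme).
        start = torch.full_like(k[:, :1], self.embed.num_embeddings - 1)
        k_prev = torch.cat([start, k[:, :-1]], dim=1)
        h, _ = self.lstm(torch.cat([self.embed(k_prev), phoneme_enc], dim=-1))
        logits = self.proj(h)                                      # P(e | e_<n, Y), before softmax
        # The single-sample estimate of the KL in Eq. (5) reduces to cross-entropy
        # against the index drawn from the posterior.
        return F.cross_entropy(logits.transpose(1, 2), k)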
4. EXPERIMENTS

We evaluated multispeaker TTS models on the LibriTTS dataset [20], a multispeaker English corpus of approximately 585 hours of read audiobooks sampled at 24 kHz. It covers a wide range of speakers, recording conditions, and speaking styles. We used a model following [8], replacing the Gaussian mixture VAE (GMVAE) with a QFVAE. Output audio is synthesized using a WaveRNN vocoder [21].

To evaluate how much information was lost by quantizing the latent prosody representation, we measured reconstruction performance by encoding a reference signal and using the resulting posterior to reconstruct the same signal. In addition, we measured the naturalness of the synthesized speech using subjective listening tests, and speech intelligibility using speech recognition performance with a pretrained ASR model. We also evaluated the diversity of the prosody in samples from different models. A good system should be able to generate natural audio samples with a relatively large prosody diversity.

Finally, we demonstrated that samples from the proposed model could be used to augment real speech training data to improve performance of a speech recognition system. We encourage readers to listen to audio examples on the accompanying web page¹.

¹ https://google.github.io/tacotron/publications/prosody_prior

4.1. Reconstruction Performance

Reconstruction, or copy synthesis, performance is measured using F0 frame error (FFE) [22] and mel-cepstral distortion (MCD) [23] computed from the first 13 MFCCs, which reflect how well a model captures the pitch and timbre of the original speech, respectively. Lower values are better for both metrics.
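As a rough illustration of how MCD could be computed, the sketch below applies the standard dB-scaled mel-cepstral distortion to 13 time-aligned mel-cepstral coefficients per frame. It is not the authors' implementation; details such as frame alignment and whether the 0-th coefficient is included are assumptions left open here.

# Illustrative sketch of frame-averaged mel-cepstral distortion (MCD); not the authors' implementation.
# Assumes c_ref and c_syn are [T, 13] arrays of time-aligned mel-cepstral coefficients.
import numpy as np

def mel_cepstral_distortion(c_ref, c_syn):
    # Standard dB-scaled form: (10 / ln 10) * sqrt(2 * sum_d (c_ref_d - c_syn_d)^2), averaged over frames.
    diff = c_ref - c_syn
    return float(np.mean(10.0 / np.log(10.0) * np.sqrt(2.0 * np.sum(diff ** 2, axis=-1))))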
Results are shown in Table 1, illustrating that quantizing the prosody latents in the QFVAE degrades reconstruction compared to the baseline, since some information is discarded as part of the quantization process.

    Model                       FFE    MCD
    Global VAE                  0.52   16.0
    Baseline fine-grained VAE   0.18    8.6
    QFVAE 32 classes            0.32   11.3
    QFVAE 256 classes           0.26   10.2
    QFVAE 1024 classes          0.22    9.5

Table 1. Reconstruction performance for different TTS models. The number of classes refers to the number of latent embeddings in the VQ codebook. The global VAE baseline uses a single latent vector to represent the prosody of an utterance.

Compared to the global VAE model, which used a single 32-dimensional latent vector to represent the prosody across a full utterance, the baseline fine-grained VAE used a 3-dimensional latent for each phoneme and achieves significantly better reconstruction performance. The QFVAE models have higher FFE and MCD than the baseline; however, their performance improves as the number of classes increases. With 1024 classes, QFVAE performance approaches that of the baseline.

4.2. Sample Naturalness and Diversity

We conducted subjective listening tests covering 10 speakers, each synthesizing 100 utterances, with native English speakers asked to rate the naturalness of the speech samples on a 5-point scale in increments of 0.5. Results are reported in terms of mean opinion score (MOS).

Complementary to naturalness, we also report the word error rate (WER) from an ASR model trained on real speech from the LibriTTS training set and evaluated on speech synthesized from transcripts in the LibriTTS test set². We used the sequence-to-sequence ASR model from [24]. This metric verifies that the synthesized speech contains the full content of the input text. However, even if the full text is synthesized, we expect the WER to increase for speech samples with prosody that is very inconsistent with that of real speech.

² WER on LibriTTS is different from WER on LibriSpeech.

Finally, we evaluated the diversity of samples from the proposed prior by measuring the standard deviation of three prosody attributes computed for each phoneme: relative energy, fundamental frequency (F0), and duration. The duration of a phoneme is represented by the number of frames in which the decoder attention assigns its maximum value to that phoneme, F0 is computed by the YIN pitch tracker [25], and the relative energy is the ratio of the average signal magnitude within a phoneme to the average magnitude of the entire utterance. The standard deviation of each of these metrics is computed within each phoneme, and the average standard deviation across all phonemes is reported. To estimate these statistics, we synthesized 100 random samples for each of 3 randomly selected utterances from the test set, each using 3 different speaker IDs.
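The per-phoneme prosody statistics described above could be computed along the following lines. This sketch is not the authors' code: the attention-based phoneme assignment and the array layouts are assumptions for illustration, and F0 extraction with YIN is assumed to be available separately.

# Illustrative sketch of the per-phoneme prosody statistics in Sec. 4.2; not the authors' code.
import numpy as np

def phoneme_durations(attention):
    # attention: [T_frames, N_phonemes] decoder alignment for one utterance.
    # Duration = number of frames whose attention maximum falls on each phoneme.
    assignment = attention.argmax(axis=1)
    return np.array([np.sum(assignment == n) for n in range(attention.shape[1])])

def relative_energy(magnitudes, attention):
    # Ratio of the average signal magnitude within each phoneme to the utterance average.
    assignment = attention.argmax(axis=1)
    utt_avg = magnitudes.mean()
    energies = []
    for n in range(attention.shape[1]):
        frames = magnitudes[assignment == n]
        energies.append(frames.mean() / utt_avg if frames.size else 0.0)
    return np.array(energies)

def average_per_phoneme_std(samples):
    # samples: [n_samples, N_phonemes] values of one attribute (duration, F0, or energy)
    # across repeated synthesis of the same utterance; the per-phoneme standard deviation
    # is averaged over phonemes (and, in the paper, over utterances and speaker IDs).
    return float(samples.std(axis=0).mean())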

    Model        Prior              MOS          WER     Prosody stddev
                                                         E      F0    Dur.
    Real speech                     –            7.2%    –      –     –
    Baseline     Indep. scale=1.0   –            94.7%   0.50   58    35
    Baseline     Indep. scale=0.2   2.80 ± 0.08  26.8%   0.35   38    16
    Baseline     Indep. scale=0.0   4.04 ± 0.06  10.6%   0.09    8     8
    Baseline     AR continuous      2.37 ± 0.07  32.0%   0.48   32    21
    QFVAE        Indep.             3.43 ± 0.06  11.9%   0.38   32    24
    QFVAE        AR discrete        3.45 ± 0.07  11.5%   0.39   32    16
    QFVAE        AR continuous      3.98 ± 0.06   8.4%   0.29   24    12

Table 2. Naturalness and diversity metrics computed across samples from different models. All models use a 3-dimensional fine-grained VAE and sample independently from a continuous latent at each step (Indep.) or from an autoregressive (AR) prior over continuous or discrete latents. The scale factor for the baseline scales the standard deviation of the prior when sampling. Prosody metrics include relative energy within each phoneme (E), fundamental frequency in Hertz (F0), and duration in ms (Dur.).

Results are shown in Table 2. Comparing independent sampling strategies from the baseline model, decreasing the scale reduced the diversity in each attribute, while increasing the naturalness MOS and improving the WER. This indicates a trade-off between diversity and naturalness that results from naive sampling from the baseline system. A scaling factor of 0.2 is the largest under which the system still generated somewhat natural audio samples. Samples using a scale of 1.0 were so poor (as reflected in the very high WER) that we did not conduct listening tests with them. Fitting an AR prior over the baseline's continuous latent space results in worse naturalness than sampling independently for each phoneme with a moderate scale of 0.2, although the diversity becomes higher.

QFVAE samples always result in reasonable MOS and WER regardless of the type of prior. This indicates the benefit of the regularization imposed by quantizing the latent space during training, even if the discrete representation is not used directly by the prior. The most natural results, with the highest MOS and lowest WER, came from using an AR prior fit in the continuous space. Samples generated using this prior have MOS close to the baseline with neutral prosody (scale = 0.0), but with lower WER. However, independent samples had better diversity metrics, once again reflecting a similar, but less pronounced, diversity-quality trade-off to the baseline model. Finally, the AR prior in the discrete space gives similar diversity and naturalness to the independent prior. We conjecture that using a single sample to estimate the discrete KL divergence brings uncertainty to the model, resulting in a prior with large variance at each step.

4.3. Data Augmentation

One potential application of the proposed TTS model is to sample synthetic speech to help train ASR models, in a data augmentation procedure. We trained ASR models on synthesized audio from different TTS models and evaluated how well the resulting recognizers generalized when evaluated on real speech.

For each TTS model, we synthesized speech for the full set of training transcripts in LibriTTS with randomized speaker IDs. The synthesized speech was downsampled to 16 kHz and then used to train the ASR model. We used an end-to-end encoder-decoder ASR model with additive attention [26]. The model, training strategy, and associated hyperparameters followed LAS-4-1024 in [27].
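The assembly of the augmented training set could look roughly like the sketch below. synthesize and downsample are hypothetical helpers standing in for the TTS model and the resampler, and the data layout and function names are assumptions made only for illustration.

# Illustrative sketch of assembling an augmented ASR training set; not the authors' pipeline.
import random

def build_augmented_set(real_utterances, transcripts, speaker_ids, synthesize, downsample,
                        copies_per_transcript=10):
    # real_utterances: list of (audio, text) pairs of real training speech (optional).
    augmented = list(real_utterances)
    for text in transcripts:
        for _ in range(copies_per_transcript):        # prosody varies across random samples
            speaker = random.choice(speaker_ids)      # randomized speaker ID per copy
            audio_24k = synthesize(text, speaker)     # sample from the QFVAE prior
            augmented.append((downsample(audio_24k, 16000), text))
    return augmented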
    Model                       Prior              WER
    Real speech                                    7.2%
    Multispeaker Tacotron 2                        20.0%
    GMVAE-Tacotron [8]          copy synthesis     16.4%
    Baseline                    copy synthesis     11.8%
    Baseline                    Indep. scale=1.0   25.6%
    Baseline                    Indep. scale=0.2   19.9%
    Baseline                    Indep. scale=0.0   31.3%
    Baseline                    AR continuous      32.0%
    QFVAE                       Indep.             16.9%
    QFVAE                       AR discrete        17.6%
    QFVAE                       AR continuous      22.3%
    10 samples per transcript
    Tacotron 2                                     19.8%
    Baseline                    Indep. scale=0.2   13.8%
    QFVAE                       Indep.             12.5%
    Real speech + QFVAE         Indep.             6.0%

Table 3. WER on real speech from the LibriTTS test set for ASR models trained only on synthesized speech sampled from TTS models.

As shown in Table 3, the WER of the ASR model when trained on real LibriTTS speech is 7.2%. Similar to [28], synthesizing speech using a baseline multispeaker Tacotron 2 model results in a significantly degraded WER of 20.0%, indicating that samples from this model did not capture the full space of variation of real speech.

In a copy synthesis setting where the ground truth reference speech is provided, the global VAE [8] improves on the Tacotron baseline, and the fine-grained VAE (Baseline) improves even further. These results indicate the importance of explicitly modeling prosody variation (VAE models) as well as speaker variation (Tacotron 2).

Randomly sampling independently at each phoneme using the baseline fine-grained VAE performs significantly worse than copy synthesis. The lowest WER is obtained when the scale is set to an intermediate value, reflecting a reasonable trade-off between naturalness and diversity. As in Table 2, the best QFVAE performance results from sampling independently at each phoneme.

To explore the improvement from diversity, the best-performing QFVAE was used to generate 10 copies of the training data with varying prosody to train the ASR system. This gives 12.5% WER, which is close to the copy synthesis result. Negligible improvement is found when adding more copies for the Tacotron 2 model, since they do not contain much variation in prosody. Finally, we found that training on synthetic speech (oversampled ten times) together with real speech in a data augmentation configuration resulted in a 16% relative WER reduction compared to the model trained on real speech alone.

We repeated the evaluation of the data augmentation experiment on the LibriSpeech corpus [29] in Table 4. The real speech used in this comparison was the union of the LibriTTS and LibriSpeech training sets; the TTS model was trained on LibriTTS, which contained material that was not in LibriSpeech. Using the QFVAE for data augmentation resulted in relative WER reductions of 14% and 8% on the LibriSpeech test-clean and test-other sets, respectively.

    Training data                                test    test-other
    Real speech                                  4.4%    12.4%
    Real speech + QFVAE (Indep.) 10 samples      3.8%    11.4%

Table 4. WER on real speech LibriSpeech test sets for ASR models trained on the combination of real and synthesized speech.

5. CONCLUSIONS

This paper proposed a quantized fine-grained VAE TTS model, and compared different prosody priors for synthesizing natural and diverse audio samples. A set of evaluations for naturalness and diversity was provided. Results showed that the quantization improved sample naturalness while retaining similar diversity. Sampling from an AR prior further improved the naturalness. When generated samples were used to train ASR systems, we demonstrated a potential application that uses prosody variations for data augmentation.

6. ACKNOWLEDGEMENTS

The authors thank Daisy Stanton, Eric Battenberg, and the Google Brain and Perception teams for their helpful feedback and discussions.

7. REFERENCES

[1] J. Sotelo, S. Mehri, K. Kumar, J. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2Wav: End-to-end speech synthesis," in Proc. International Conference on Learning Representations (ICLR), 2017.
[2] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Proc. Interspeech, 2017, pp. 4006–4010.
[3] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: 2000-speaker neural text-to-speech," in Proc. International Conference on Learning Representations (ICLR), 2018.
[4] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, and R. J. Skerry-Ryan, "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779–4783.
[5] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014.
[6] R. J. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. Weiss, R. Clark, and R. A. Saurous, "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," in Proc. International Conference on Machine Learning (ICML), 2018.
[7] Y. Wang, D. Stanton, Y. Zhang, R. J. S. Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in Proc. International Conference on Machine Learning (ICML), 2018, pp. 5167–5176.
[8] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, "Hierarchical generative modeling for controllable speech synthesis," in Proc. International Conference on Learning Representations (ICLR), 2019.
[9] S. Ma, D. Mcduff, and Y. Song, "A generative adversarial network for style modeling in a text-to-speech system," in Proc. International Conference on Learning Representations (ICLR), 2019.
[10] M. Wagner and D. G. Watson, "Experimental and theoretical advances in prosody: A review," in Language and Cognitive Processes, 2010, pp. 905–945.
[11] W.-N. Hsu, Y. Zhang, R. J. Weiss, Y.-A. Chung, Y. Wang, Y. Wu, and J. Glass, "Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization," in Proc. ICASSP, 2019.
[12] E. Battenberg, S. Mariooryad, D. Stanton, R. J. Skerry-Ryan, M. Shannon, D. Kao, and T. Bagby, "Effective use of variational embedding capacity in expressive end-to-end speech synthesis," arXiv preprint arXiv:1906.03402, 2019.
[13] Y. Lee and T. Kim, "Robust and fine-grained prosody control of end-to-end speech synthesis," in Proc. ICASSP, 2019, pp. 5911–5915.
[14] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. International Conference on Learning Representations (ICLR), 2014.
[15] J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, and Y. Bengio, "A recurrent latent variable model for sequential data," in Advances in Neural Information Processing Systems, 2016.
[16] O. Fabius and J. R. van Amersfoort, "Variational recurrent auto-encoders," in Proc. International Conference on Learning Representations (ICLR), 2015.
[17] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in Advances in Neural Information Processing Systems, 2017.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017.
[19] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur, "Recurrent neural network based language model," in Proc. Interspeech, 2010.
[20] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in Proc. Interspeech, 2019.
[21] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis," in Proc. International Conference on Machine Learning (ICML), 2018, pp. 2415–2424.
[22] W. Chu and A. Alwan, "Reducing F0 frame error of F0 tracking algorithms under noisy conditions with an unvoiced/voiced classification frontend," in Proc. ICASSP, 2009.
[23] R. Kubichek, "Mel-cepstral distance measure for objective speech quality assessment," in Proc. IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 1993, vol. 1, pp. 125–128.
[24] K. Irie, R. Prabhavalkar, A. Kannan, A. Bruguier, D. Rybach, and P. Nguyen, "On the choice of modeling unit for sequence-to-sequence speech recognition," in Proc. Interspeech, 2019, pp. 3800–3804.
[25] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[26] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al., "State-of-the-art speech recognition with sequence-to-sequence models," in Proc. ICASSP, 2018, pp. 4774–4778.
[27] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. Interspeech, 2019.
[28] J. Li, R. Gadde, B. Ginsburg, and V. Lavrukhin, "Training neural speech recognition systems with synthetic speech augmentation," arXiv preprint arXiv:1811.00707, 2018.
[29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in Proc. ICASSP, 2015, pp. 5206–5210.
