Lecture 10 - Text To Speech
Text-to-Speech Synthesis
Andrew Senior (DeepMind, London); many thanks to Heiga Zen
February 23rd, 2017, Oxford
Outline
Generative TTS
[Figure: human speech production — speech information is conveyed by modulating a carrier: the sound source (voiced: pulses at the fundamental frequency; unvoiced: noise from air flow) is shaped by the vocal tract's frequency transfer characteristics, which determine the magnitude spectrum from the start to the end of the utterance.]
TEXT
  ⇓ Text analysis (NLP, frontend): discrete ⇒ discrete
    • Sentence segmentation
    • Word segmentation
    • Text normalization
    • Part-of-speech tagging
    • Pronunciation
    • Prosody prediction
  ⇓ Speech synthesis (backend): discrete ⇒ continuous
    • Waveform generation
SYNTHESIZED SPEECH
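A minimal sketch of the frontend stages above — the toy lexicon and all helper names are invented for illustration, not a real library API:

```python
# Toy TTS frontend (discrete -> discrete): text normalization,
# word segmentation, and lexicon-based pronunciation.
# TOY_LEXICON and all function names are hypothetical.

TOY_LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def normalize(text: str) -> list[str]:
    """Text normalization + word segmentation (toy: lowercase, strip punctuation)."""
    return text.lower().replace(",", " ").replace(".", " ").split()

def pronounce(words: list[str]) -> list[str]:
    """Pronunciation: look each word up in a lexicon (toy fallback: spell it out)."""
    phones: list[str] = []
    for w in words:
        phones += TOY_LEXICON.get(w, list(w.upper()))
    return phones

def frontend(text: str) -> list[str]:
    """Text -> linguistic features (here reduced to a phone sequence)."""
    return pronounce(normalize(text))

print(frontend("Hello, world."))  # ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```

A real frontend would also attach POS tags, prosodic labels, and positional context to each phone.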
Random variables
Synthesis

• Estimate the posterior predictive distribution
      p(x | w, X, W)
• Sample x̄ from the posterior distribution

where
  X, W: speech waveforms & texts in the training data
  x, w: waveform to synthesize & given text
  O, o: acoustic features
  L, l: linguistic features
  λ: model

Approximate {sum & integral} by best point estimates (like MAP) [3]
Duration model
• Typically run a parametric synthesizer on frames (e.g. 5 ms windows).
• Need to know how many frames each phonetic unit lasts.
• Model this separately, e.g. an FFNN mapping linguistic features → duration (see the sketch below).
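A minimal sketch of such a duration model, assuming a single hidden layer and a log-duration target; all dimensions and names are illustrative:

```python
import numpy as np

# FFNN duration model sketch: one phone's linguistic feature vector ->
# predicted number of 5 ms frames. The single hidden layer, log-duration
# target, and all dimensions are illustrative assumptions.

rng = np.random.default_rng(0)
D_LING, D_HID = 300, 64                     # linguistic feature dim, hidden units
W1 = rng.normal(0, 0.01, (D_HID, D_LING)); b1 = np.zeros(D_HID)
w2 = rng.normal(0, 0.01, D_HID);           b2 = 0.0

def predict_frames(l):
    """Linguistic features l -> frame count (at least one 5 ms frame)."""
    h = np.tanh(W1 @ l + b1)                # hidden layer
    log_dur = w2 @ h + b2                   # scalar log-duration
    return max(1, round(float(np.exp(log_dur))))

l_phone = rng.normal(size=D_LING)           # stand-in linguistic features
print(predict_frames(l_phone))
```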
[Figure: source-filter model, shown for voiced and unvoiced excitation — an excitation signal e(n) (pulse train for voiced speech, white noise for unvoiced) is filtered by the vocal-tract impulse response h(n) to give speech x(n) = h(n) * e(n); magnitude spectra shown over 0–8 kHz.]
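A minimal sketch of this source-filter synthesis; the decaying-exponential h(n) is an arbitrary stand-in for a real vocal-tract filter (e.g. LPC):

```python
import numpy as np

# Source-filter sketch of x(n) = h(n) * e(n). The decaying-exponential
# h(n) is an arbitrary stand-in for a real vocal-tract filter.

fs = 16000                                  # sample rate (Hz)
N = fs // 10                                # 100 ms of samples

# Sound source e(n): pulse train at the fundamental frequency (voiced),
# or white noise (unvoiced).
f0 = 120.0                                  # fundamental frequency (Hz)
e_voiced = np.zeros(N)
e_voiced[:: int(fs / f0)] = 1.0
e_unvoiced = np.random.default_rng(0).normal(size=N)

h = 0.9 ** np.arange(50)                    # toy vocal-tract impulse response h(n)

x_voiced = np.convolve(e_voiced, h)[:N]     # x(n) = h(n) * e(n)
x_unvoiced = np.convolve(e_unvoiced, h)[:N]
print(x_voiced[:3], x_unvoiced[:3])
```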
Generative TTS
Linguistic features l = (l1, …, lN); hidden state sequence q = (q1, q2, q3, q4, …): discrete

  p(o | l, λ) = Σ_{∀q} Π_{t=1}^{T} p(o_t | q_t, λ) P(q | l, λ)      q_t: hidden state at time t
              = Σ_{∀q} Π_{t=1}^{T} N(o_t; μ_{q_t}, Σ_{q_t}) P(q | l, λ)
• Data fragmentation → inefficient, local representation
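The sum over all state sequences ∀q can be computed efficiently with the forward algorithm; a minimal sketch with 1-D Gaussian emissions and a toy left-to-right transition matrix (all numbers illustrative):

```python
import numpy as np

# Forward algorithm: computes Σ_{∀q} Π_t N(o_t; μ_{q_t}, Σ_{q_t}) P(q | l, λ)
# without enumerating state sequences. The 1-D Gaussian emissions and
# left-to-right transition matrix below are toy, illustrative numbers.

def gauss(o, mu, var):
    return np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

mu  = np.array([0.0, 2.0, -1.0])       # per-state means μ_q
var = np.array([1.0, 0.5,  1.5])       # per-state variances Σ_q
A = np.array([[0.7, 0.3, 0.0],         # left-to-right transitions P(q_t | q_{t-1})
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])         # always start in state 0

def likelihood(o):
    """p(o | l, λ) via the forward recursion α_t = (α_{t-1} A) ⊙ b(o_t)."""
    alpha = pi * gauss(o[0], mu, var)
    for t in range(1, len(o)):
        alpha = (alpha @ A) * gauss(o[t], mu, var)
    return float(alpha.sum())

o = np.array([0.1, 0.3, 1.8, 2.2, -0.9])   # toy acoustic feature sequence
print(likelihood(o))
```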
Generative TTS
[Figure: binary decision tree — a cascade of yes/no questions on linguistic context routes each state to a leaf node storing the statistics of acoustic features o.]
Target: frame-level acoustic features …, o_{t−1}, o_t, o_{t+1}, …

  h_t = g(W_hl l_t + b_h)
  ô_t = W_oh h_t + b_o                λ = {W_hl, W_oh, b_h, b_o}
  λ̂ = arg min_λ Σ_t ‖o_t − ô_t‖²
Target: frame-level acoustic features …, o_{t−1}, o_t, o_{t+1}, … (with recurrent connections)

  h_t = g(W_hl l_t + W_hh h_{t−1} + b_h)
  ô_t = W_oh h_t + b_o                λ = {W_hl, W_hh, W_oh, b_h, b_o}
  λ̂ = arg min_λ Σ_t ‖o_t − ô_t‖²
• No data fragmentation → distributed representation [10] (see the sketch below)
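A minimal NumPy sketch of both frame-level models above, with g = tanh and illustrative dimensions; in practice λ̂ = arg min_λ Σ_t ‖o_t − ô_t‖² is found by back-propagation:

```python
import numpy as np

# Sketch of the two frame-level acoustic models above, with g = tanh.
# Feedforward: h_t = g(W_hl l_t + b_h); the recurrent variant adds W_hh h_{t-1}.
# In both, ô_t = W_oh h_t + b_o. Dimensions are illustrative.

rng = np.random.default_rng(0)
D_L, D_H, D_O = 300, 128, 40               # linguistic, hidden, acoustic dims
Whl = rng.normal(0, 0.01, (D_H, D_L)); bh = np.zeros(D_H)
Whh = rng.normal(0, 0.01, (D_H, D_H))
Woh = rng.normal(0, 0.01, (D_O, D_H)); bo = np.zeros(D_O)

def synthesize(L_seq, recurrent=False):
    """Map linguistic features l_t (rows of L_seq) to acoustic features ô_t."""
    h = np.zeros(D_H)
    outputs = []
    for l_t in L_seq:
        pre = Whl @ l_t + bh
        if recurrent:
            pre += Whh @ h                 # recurrent connection
        h = np.tanh(pre)                   # h_t
        outputs.append(Woh @ h + bo)       # ô_t
    return np.stack(outputs)

L_seq = rng.normal(size=(10, D_L))         # 10 frames of linguistic features
print(synthesize(L_seq).shape)             # (10, 40)
print(synthesize(L_seq, recurrent=True).shape)
```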
Generative TTS
[Figure: representations — raw waveform x(t) ⇒ acoustic features o(t) (raw FFT spectrum); word sequence w(n) = "hello", "world" (1-hot vectors such as 0…010…0) ⇒ linguistic features l(n−1), l(n).]
Step-by-step maximization:
  Ô = arg max_O p(X | O)
  λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)
    ⇓
Joint maximization:
  {λ̂, Ô} = arg max_{λ,O} p(X | O) p(O | L̂, λ) p(λ)

Joint source-filter & acoustic model optimization
• HMM [18, 19, 20]
• NN [21, 22]: predict ô from linguistic features …, l_t, … and train by back-propagation against the observed o
Generative TTS

Direct waveform modelling:
• WaveNet [24]
• SampleRNN [25]

WaveNet [24]
→ p(x_n | x_0, …, x_{n−1}, λ) is modelled by a convolutional NN
Key components
• Causal dilated convolution: captures long-term dependencies
• Gated convolution + residual + skip: powerful non-linearity
• Softmax at output: classification rather than regression
[Figure: dilated causal convolutions — input samples x_{n−16}, …, x_{n−2}, x_{n−1} pass through hidden layers 1–3 of causal dilated convolutions, so each output sees a long input window.]

[Figure: WaveNet architecture — a deep stack of residual blocks (512 channels) over the input samples; their skip connections (256 channels) are summed, then ReLU → 1x1 conv → ReLU → 1x1 conv → softmax over 256 values yields p(x_n | x_0, …, x_{n−1}).
Legend — 1x1: 1x1 conv; Residual block: 2x1 dilated conv → gated activation → 1x1 conv + residual connection.]
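A NumPy sketch of one such residual block, following the legend above; the weights are random stand-ins, and the optional h is a conditioning input (cf. the conditional WaveNet figure below):

```python
import numpy as np

# Sketch of one WaveNet residual block: 2x1 dilated causal conv ->
# gated activation -> 1x1 conv, with residual and skip outputs.
# Channel sizes follow the figure; weights are random stand-ins.

rng = np.random.default_rng(0)
C, S = 512, 256                        # residual / skip channels

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def residual_block(x, dilation, h=None):
    """x: (C, T) activations; h: optional (C, 1) conditioning. Returns (residual, skip)."""
    Wf0, Wf1, Wg0, Wg1 = rng.normal(0, 0.02, (4, C, C))   # filter/gate conv taps
    W1x1 = rng.normal(0, 0.02, (C, C))
    Wskip = rng.normal(0, 0.02, (S, C))
    T = x.shape[1]
    past = np.pad(x, ((0, 0), (dilation, 0)))[:, :T]      # x_{t-dilation}, zero-padded
    f = Wf0 @ past + Wf1 @ x                              # 2x1 dilated causal conv
    g = Wg0 @ past + Wg1 @ x
    if h is not None:                                     # conditioning (e.g. embedded
        f, g = f + h, g + h                               # linguistic features h_n)
    z = np.tanh(f) * sigmoid(g)                           # gated activation
    return x + W1x1 @ z, Wskip @ z                        # residual out, skip out

x = rng.normal(size=(C, 100))                             # 100 time steps
res, skip = residual_block(x, dilation=4)
print(res.shape, skip.shape)                              # (512, 100) (256, 100)
```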
[Figure: conditional WaveNet — linguistic features are embedded into a conditioning vector h_n that is fed to every residual block at time n.]

WaveNet
• Sample-by-sample, non-linear, and can take additional (conditioning) inputs
• Arbitrary-shaped signal distribution
Generative TTS

Integrated end-to-end:
  L̂ = arg max_L p(L | W)
  λ̂ = arg max_λ p(X | L̂, λ) p(λ)
    ⇓
  λ̂ = arg max_λ p(X | W, λ) p(λ)
Bayesian end-to-end:
  λ̂ = arg max_λ p(X | W, λ) p(λ)
  x̄ ∼ f(w, λ̂) = p(x | w, λ̂)
    ⇓
  x̄ ∼ f_x(w, X, W) = p(x | w, X, W)
                    = ∫ p(x | w, λ) p(λ | X, W) dλ
                    ≈ (1/K) Σ_{k=1}^{K} p(x | w, λ̂_k)    ← Ensemble
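A toy sketch of the ensemble approximation: K stand-in "trained models" each define a predictive density, and the Bayesian predictive is approximated by their average (every number and functional form here is invented for illustration):

```python
import numpy as np

# Toy sketch of the ensemble approximation
#   p(x | w, X, W) ≈ (1/K) Σ_k p(x | w, λ̂_k).
# Each "trained model" λ̂_k is a stand-in Gaussian predictive density.

rng = np.random.default_rng(0)
K = 5
models = [(rng.normal(0.0, 0.1), 1.0 + 0.1 * k) for k in range(K)]  # (mean shift, variance)

def p_x_given_w(x, w, model):
    """Predictive density p(x | w, λ̂_k) of one ensemble member."""
    mu_shift, var = model
    mu = w + mu_shift                       # toy dependence of x on w
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def predictive(x, w):
    """p(x | w, X, W) approximated by the ensemble average."""
    return float(np.mean([p_x_given_w(x, w, m) for m in models]))

print(predictive(x=0.2, w=0.0))
```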
• Fewer approximations
  − Joint training, direct waveform modelling
  − Moving towards integrated & Bayesian end-to-end TTS
Missing “listener”
→ “listener” in training / model itself?
[1] D. Klatt. Real-time speech synthesis by rule. Journal of the Acoustical Society of America, 68(S1):S18, 1980.
[8] H. Zen. Acoustic modeling for speech synthesis: from HMM to RNN. Invited talk at ASRU 2015. http://research.google.com/pubs/pub44630.html
[9] S. Takaki and J. Yamagishi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In Proc. ICASSP, pages 5535–5539, 2016.
[10] G. Hinton, J. McClelland, and D. Rumelhart. Distributed representation. In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.
[12] X. Wang, S. Takaki, and J. Yamagishi. Investigating very deep highway networks for parametric speech synthesis. In Proc. ISCA SSW9, 2016.
[13] Y. Saito, S. Takamichi, and H. Saruwatari. Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In Proc. ICASSP, 2017.
[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech, 2016.
[Figure: graphical models of the TTS formulations discussed, over the variables X, W (training speech & text), w (given text), x (synthesized waveform), O/o and L/l (acoustic & linguistic features), and λ (model) —
(1) Bayesian; (2) auxiliary variables; (3) joint maximization; (4) step-by-step maximization + factorization, e.g. statistical parametric TTS; (5) joint acoustic feature extraction + model training; (6) conditional WaveNet-based TTS; (7) integrated end-to-end; (8) Bayesian end-to-end.]