Professional Documents
Culture Documents
2211 09496
2211 09496
2211 09496
SOFT-LABEL GUIDANCE
ABSTRACT performance of these models, resulting in the need for careful extra
constraints. Intensity control for emotion conversion is investigated
Although current neural text-to-speech (TTS) models are able to in [17, 18], with similar methods. Some of the mentioned works also
generate high-quality speech, intensity controllable emotional TTS have degraded speech quality. As an example, [14] (which we refer
is still a challenging task. Most existing methods need external op- to as “MixedEmotion” later) is an autoregressive model with inten-
timizations for intensity calculation, leading to suboptimal results or sity values from RAR to weight the emotion embeddings. It adopts
degraded quality. In this paper, we propose EmoDiff, a diffusion- pretraining to improve synthetic quality, but still with obvious qual-
based TTS model where emotion intensity can be manipulated by a ity degradation.
proposed soft-label guidance technique derived from classifier guid- To overcome these issues, we need a conditional sampling
ance. Specifically, instead of being guided with a one-hot vector for method that can directly control emotions weighted with intensity.
the specified emotion, EmoDiff is guided with a soft label where the In this work, we propose a soft-label guidance technique, based
value of the specified emotion and Neutral is set to α and 1 − α on the classifier guidance technique [19, 20] in denoising diffusion
respectively. The α here represents the emotion intensity and can models [21, 22]. Classifier guidance is an efficient sampling tech-
be chosen from 0 to 1. Our experiments show that EmoDiff can nique that uses the gradient of a classifier to guide the sampling
precisely control the emotion intensity while maintaining high voice trajectory given a one-hot class label.
quality. Moreover, diverse speech with specified emotion intensity In this paper, based on the extended soft-label guidance, we
can be generated by sampling in the reverse denoising process. propose EmoDiff which is an emotional TTS model with sufficient
Index Terms— Emotional TTS, emotion intensity control, de- intensity controllability. Specifically, we first train an emotion-
noising diffusion models, classifier guidance unconditional acoustic model. Then an emotion classifier is trained
on any xt on the diffusion process trajectory where t is the diffusion
timestamp. In inference, we guide the reverse denoising process
1. INTRODUCTION with the classifier and a soft emotion label where the value of the
specified emotion and Neutral is set to α and 1 − α respectively,
Although current neural text-to-speech (TTS) models are able to instead of a one-hot distribution where only the specified emotion
generate high-quality speech, such as Grad-TTS [1], VITS [2] and is 1 while all others are 0. α ∈ [0, 1] here represents the emotion
VQTTS [3], intensity controllable emotional TTS is still a challeng- intensity. Our experiments show that EmoDiff can precisely control
ing task. Unlike prosody modelling in recent literatures [4–6] that no the emotion intensity while maintaining high voice quality. More-
specific label is provided in advance, emotional TTS typically uti- over, it also generates diverse speech samples even with the same
lizes dataset with categorical emotion labels. Mainstream emotional emotion as a strength of diffusion models [19].
TTS models [7, 8] can only synthesize emotional speech given the In short words, the main advantages of EmoDiff are:
emotion label without intensity controllability.
In intensity controllable TTS models, efforts have been made to 1. We define the emotion intensity as the weight for classifier
properly define and calculate emotion intensity values for training. guidance when using soft-labels. This achieves precise in-
The most preferred method to define and obtain emotion intensity tensity control in terms of classifier probability, needless for
is the relative attributes rank (RAR)[9], which is used in [10–14]. extra optimizations. Thus it enables us to generate speech
RAR seeks a ranking matrix by a max-margin optimization prob- with arbitrary specified emotion intensity effectively.
lem, which is solved by support vector machines. The solution is 2. It poses no harm to the synthesized speech. The generated
then fed to the model for training. As this is a manually constructed samples have good quality and naturalness.
and separated stage, it might result in suboptimal results that bring
bias into training. In addition to RAR, the operation on emotion em- 3. It also generates diverse samples even in the same emotion.
bedding space is also explored. [15] designs an algorithm to max-
imize distance between emotion embeddings, and interpolates the 2. DIFFUSION MODELS WITH CLASSIFIER GUIDANCE
embedding space to control emotion intensity. [16] quantizes the
distance of emotion embeddings to obtain emotion intensities. How- 2.1. Denoising Diffusion Models and TTS Applications
ever, the structure of the embedding space also greatly influences the
Denoising diffusion probabilistic models [21, 22] have proven suc-
∗ Corresponding author cessful in many generative tasks. In the score-based interpretation
[22, 23], diffusion models construct a forward stochastic differen- While Eq.(6) can effectively control sampling on class label c, it
tial equation (SDE) to transform the data distribution p0 (x0 ) into a cannot be directly applied to soft-labels, i.e. labels weighted with in-
known distribution pT (xT ), and use a corresponding reverse-time tensity, as the guidance p(c | x) is not well-defined now. Therefore,
SDE to generate realistic samples starting from noises. Thus, the we extend this technique for emotion intensity control in Section 3.2.
reverse process is also called “denoising” process. Neural networks
are then to estimate the score function ∇x log pt (xt ) for any t ∈ 3. EMODIFF
[0, T ] on the SDE trajectory, with score-matching objectives [22,
23]. In applications, diffusion models bypass the training instabil- 3.1. Unconditional Acoustic Model and Classifier Training
ity and mode collapse problem in GANs, and outperform previous The training of EmoDiff mainly includes the training of the un-
methods on sample quality and diversity [19]. conditional acoustic model and emotion classifier. We first train a
Denoising diffusion models have also been used in TTS [1, 24– diffusion-based acoustic model on emotional data, but don’t pro-
27] and vocoding [28, 29] tasks, with remarkable results. In this vide it with emotion conditions. This is referred to as “unconditional
paper, we build EmoDiff on the design of GradTTS [1]. Denote acoustic model training” as in Figure 1(a). This model is based on
x ∈ Rd a frame of mel-spectrogram, it constructs a forward SDE: GradTTS [1], except that we provide explicit duration sequence by
1 forced aligners to ease duration modeling. In this stage, the train-
dxt = Σ−1 (µ − xt )βt dt + βt dBt
p
(1) ing objective is Ldur + Ldiff , where Ldur is the `2 loss of logarithmic
2
duration, and Ldiff is the diffusion loss as Eq.(3). In practice, follow-
where Bt is a standard Brownian motion and t ∈ [0, 1] is the SDE ing GradTTS, we also adopt prior loss Lprior = − log N (x0 ; µ, I)
time index. βt is nreferred to as
o noise schedule such that βt is in- to encourage converging. For notation simplicity, we use Ldiff to
R1
creasing and exp − 0 βs ds ≈ 0. Then we have p1 (x1 ) ≈ denote diffusion and prior loss together in Figure 1(a).
N (x; µ, Σ). This SDE also indicates the conditional distribution After training, the acoustic model can estimate score function
xt | x0 ∼ N (ρ(x0 , Σ, µ, t), λ(Σ, t)), where ρ(·), λ(·) both has of noisy mel-spectrogram xt given input phoneme sequence y, i.e.
closed forms. Thus we can directly sample xt from x0 . In prac- ∇ log p(xt | y), which is unconditonal of emotion labels. Follow-
tice, we set Σ to identity matrix and λ(Σ, t) therefore becomes λt I ing Section 2.2, we then need an emotion classifier to distinguish
where λt is a scalar with known closed form. Meanwhile, we con- emotion categories e from noisy mel-spectrograms xt . Meanwhile,
dition the terminal distribution p1 (x1 ) on text, i.e. let µ = µθ (y), as we always have a text condition y, the classifier is formulated as
where y is the aligned phoneme representation of that frame. p(e | xt , y, t). As is shown in Figure 1(b), the input to the clas-
The SDE of Eq.(1) has a reverse-time counterpart: sifier consists of three components: SDE timestamp t, noisy mel-
spectrogram xt and phoneme-dependent Gaussian mean µ. This
1 −1 p classifier is trained with the standard cross-entropy loss LCE . Note
dxt = Σ (µ − xt ) − ∇x log pt (xt ) βt dt+ βt dB e t (2)
2 that we freeze the acoustic model parameters in this stage, and only
update the weights in emotion classifier.
where ∇ log pt (xt ) is the score function that is to be estimated, and
As we always need text y as condition along through the paper,
Be t is a reverse-time Brownian motion. It shares the trajectory of
we omit it and denote this classifier as p(e | x) in later sections to
distribution pt (xt ) with forward SDE in Eq.(1). So, solving it from simplify the notation, if no ambiguity is caused.
x1 ∼ N (µ, Σ), we can end up with a realistic sample x0 ∼ p(x0 |
y). A neural network sθ (xt , y, t) is trained to estimate the score
function, in the following score-matching [23] objective: 3.2. Intensity Controllable Sampling with Soft-Label Guidance
min L = Ex0 ,y,t [λt ksθ (xt , y, t) − ∇xt log p(xt | x0 )k2 ]. (3) In this section, we extend the classifier guidance to soft-label guid-
θ ance which can control emotion weighted with intensity. Suppose
2.2. Conditional Sampling Based on Classifier Guidance the number of basic emotions is m, and every basic emotion ei has
a one-hot vector form ei ∈ Rm , i ∈ {0, 1, ..., m − 1}. For each
Denoising diffusion models provide a new way of modeling condi- ei , only the i-th dimension is 1. We specially use e0 to denote Neu-
tional probabilities p(x | c) where c is a class label. Suppose we tral. For an emotion weighted with intensity α on ei , we define it to
now have an unconditional generative model p(x), and a classifier be d = αei + (1 − α)e0 . Then the gradient of log-probability of
p(c | x). By Bayes formula, we have clasifier p(d | x) w.r.t x can be defined as
∇x log p(x | c) = ∇x log p(c | x) + ∇x log p(x). (4)
∇x log p(d | x) , α∇x log p(ei | x) + (1 − α)∇x log p(e0 | x).
In the diffusion framework, to sample from conditional distribution (5)
p(x | c), we need to estimate score function ∇x log p(xt | c). By The intuition of this definition is that, intensity α stands for the con-
Eq.(4), we only need to add the gradient from a classifier to the un- tribution of emotion ei on the sampling trajectory of x. Larger α
conditional model. This conditional sampling method is named clas- means we sample x along a trajectory with large “force” towards
sifier guidance [19, 20], and is also used in unsupervised TTS [30]. emotion ei , otherwise e0 . Thus we can extend Eq.(4) to
In practice, classifier gradients are often scaled [19, 30] to con-
trol the strength of guidance. Instead of original ∇x log p(c | x) in ∇x log p(x | d) = α∇x log p(ei | x) + (1 − α)∇x log p(e0 | x)
Eq.(4), we now use γ∇x log p(c | x), where γ ≥ 0 is called guid- + ∇x log p(x). (6)
ance level. Larger γ will result in highly class-correlated samples
while smaller one will encourage sample variability [19]. When the intensity α is 1.0 (100% emotion ei ) or 0.0 (100% Neu-
Different from ordinary classifiers, the input to the classifier tral), the above operation reduces to the standard classifier guidance
used here is all the xt along the trajectory of SDE in Eq.(1), in- form Eq.(4). Hence we can use the soft-label guidance Eq.(5) in
stead of clean x0 only. The time index t can be anything in [0, 1]. the sampling process, and generate a realistic sample with specified
Thus, the classifier can also be denoted as p(c | xt , t). emotion d = αei + (1 − α)e0 with intensity α.
Fig. 1: Training and sampling diagrams of EmoDiff. In training, xt is directly sampled from known distribution p(xt | x0 ). When sampling
with a certain emotion intensity, the score function ∇x log pt (xt ) is estimated by score estimator. “SG” means stop gradient operation.
Figure 1(c) illustrates the intensity controllable sampling pro- Table 1: MOS and MCD comparisons. MOS is presented with 95%
cess. After feeding the acoustic model and obtaining phoneme- confidence interval. Note that “GradTTS w/ emo label” cannot con-
dependent µ sequence, we sample x1 ∼ N (µ, I) and simulate trol emotion intensity.
reverse-time SDE from t = 1 to t = 0 through a numerical sim-
ulator. In each simulator update, we feed the classifier with current MOS MCD25
xt and get the output probabilities pt (· | xt ). Eq.(6) is then used
GT 4.73±0.09 -
to calculate the guidance term. Similar as Section 2.2, we also scale
GT (voc.) 4.69±0.10 2.96
the guidance term with guidance level γ. At the end, we obtain x̂0
MixedEmotion [14] 3.43±0.12 6.62
which is not only intelligible with input text, but also correspond-
GradTTS w/ emo label 4.16±0.10 5.75
ing to the target emotion d with intensity α. This lead to precise
intensity that correlates well to classifier probability. EmoDiff (ours) 4.13±0.10 5.98
Generally, in addition to intensity control, our soft-label guid-
ance is capable for Pmore complicated control on mixed emotions 4 emotional categories Angry, Happy, Sad, Surprise together with a
[14]. Denote d = m−1 i=0 wi ei a combination of all emotions where
Neutral category. There are 350 parallel utterances per speaker and
wi ∈ [0, 1], m−1
P
i=0 wi = 1, Eq.(5) can be generalized to emotion category, amounting to about 1.2 hours each speaker. Mel-
m−1
X spectrogram and forced alignments were extracted by Kaldi [32] in
∇x log p(d | x) , wi ∇x log p(ei | x). (7) 12.5ms frame shift and 50ms frame length, followed by cepstral nor-
i=0 malization. Audio samples in these experiments are available 1 .
Then Eq.(6) can also be expressed in such generalized form. This In this paper, we only consider single-speaker emotional TTS
extension can also be interpreted from the probabilistic view. As the problem. Throughout the following sections, we trained an uncondi-
combination weights {wi } can be viewed as a categorical distribu- tional GradTTS acoustic model on all 10 English speakers for a rea-
tion pe (·) over basic emotions {ei }, Eq.(7) is equivalent to sonable data coverage, and a classifier on a female speaker (ID:0015)
∇x log p(d | x) , Ee∼pe ∇x log p(e | x) (8) only. The unconditional GradTTS model was trained with Adam op-
timizer at 10−4 learning rate for 11M steps. We used exponential
= −∇x CE [pe (·), p(· | x)] (9)
moving average on model weights as it is reported to improve dif-
where CE is the cross-entropy function. Eq.(9) implies the fact that fusion model’s performance [22]. The structure of the classifier is
we are actually decreasing the cross-entropy of target emotion dis- a 4-layer 1D CNN, with BatchNorm and Dropout in each block. In
tribution pe and classifier output p(· | x), when sampling along the inference stage, guidance level γ was fixed to 100.
the gradient ∇x log p(d | x). The gradient of cross-entropy w.r.t We chose HifiGAN [33] trained on all the English speakers here
x can guide the sampling process. Hence, this soft-label guidance as a vocoder for all the following experiments.
technique can generally be used to control any arbitrary complex
emotion as a weighted combination of several basic emotions. 4.2. Emotional TTS Quality
In Figure 1(c), we use cross-entropy as a concise notation for
soft-label guidance term. In our intensity control scenario, it reduces We first measure the speech quality, which contains audio quality
to Eq.(5) mentioned before. and speech naturalness. We did comparisons of the proposed EmoD-
iff with the following systems:
4. EXPERIMENTS AND RESULTS
1. GT and GT (voc.): ground truth recording and analysis syn-
4.1. Experimental Setup thesis result (vocoded with GT mel-spectrogram).
We used the English part of the Emotional Speech Dataset (ESD)
[31] to perform all the experiments. It has 10 speakers, each with 1 https://cantabile-kwok.github.io/EmoDiff-intensity-ctrl/
Fig. 2: Classification probabilities when controlling on intensity α ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. Errorbars represent standard deviation.