Case Studies


VITS

Definition

VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is an end-to-end text-to-speech model that combines several techniques from GANs, VAEs, and normalizing flows. It is a feed-forward model: data moves through the network in only one direction.

Automatic human voice generation has been substantially advanced by deep learning. Aiming to imitate human speech, text-to-speech (TTS) has made enormous progress, reaching near human-parity performance. In parallel, singing voice synthesis (SVS), which aims to generate singing voice from lyrics and musical scores, has been advanced with similar neural modeling techniques borrowed from speech generation. Compared with speech synthesis, synthetic singing must not only be pronounced correctly according to the lyrics, but must also conform to the notes of the musical score.

Performance

Although these two-stage human voice generation systems have made tremendous progress, the independent training of the two stages, i.e., the neural acoustic model and the neural vocoder, also leads to a mismatch between the training and inference stages, resulting in degraded performance. Specifically, the neural vocoder is trained on the ground-truth intermediate acoustic representation, e.g., the mel spectrogram, but at inference time it consumes the representation predicted by the acoustic model, resulting in a distributional difference between the real and predicted intermediate representations. There are several tricks to ease this mismatch, including fine-tuning the neural vocoder on predicted acoustic features and adversarial training. A more direct solution is to join the two stages into a unified model trained in an end-to-end manner. In text-to-speech synthesis, this kind of solution has recently been explored, including FastSpeech 2s, EATS, Glow-WaveGAN, and VITS. In general, these works merge the acoustic model and the neural vocoder into one model that enables end-to-end learning, or adopt a new latent representation instead of the mel spectrogram so that the two parts operate more easily on the same distribution. In principle, end-to-end training can achieve better audio quality and a simpler training and inference process. Among the end-to-end models, VITS uses a variational autoencoder (VAE) to connect the acoustic model and the vocoder; it adopts variational inference augmented with normalizing flows and an adversarial training process, producing more natural-sounding speech than existing two-stage models (Zhang et al., 2022, pp. 7237-7241).
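
To make this combination of components concrete, here is a minimal Python/PyTorch sketch of how a VITS-style generator objective might combine a reconstruction term, a KL term, and an adversarial term. The function name, the equal loss weights, and the simplified standard-normal KL are assumptions for illustration only; the actual VITS objective uses a flow-enhanced, text-conditioned prior and additional feature-matching and duration losses.

import torch
import torch.nn.functional as F

def vits_style_generator_loss(mel_pred, mel_target, z_mean, z_logvar, disc_fake_logits):
    # Reconstruction term: the VAE decoder (acting as the vocoder) should reproduce the target.
    recon = F.l1_loss(mel_pred, mel_target)
    # KL term of the VAE; a standard-normal prior is assumed here for simplicity,
    # whereas VITS uses a normalizing-flow prior conditioned on the text.
    kl = -0.5 * torch.mean(1 + z_logvar - z_mean.pow(2) - z_logvar.exp())
    # Adversarial term (least-squares GAN form assumed): fool the discriminator.
    adv = torch.mean((disc_fake_logits - 1.0) ** 2)
    # Equal weights are an assumption; real implementations weight these terms.
    return recon + kl + adv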

VISinger appears to be the first end-to-end solution that addresses the two-stage mismatch problem for singing voice generation. Adopting VITS for singing voice synthesis is non-trivial, since singing differs substantially from speech even though both originate from the same human vocal system. First, the flow-based prior encoder of VITS models phoneme-level means and variances of the acoustic features. For the singing task, the authors introduce a length regulator and a frame prior network to obtain frame-level means and variances instead, modeling the rich acoustic variation in singing and leading to more natural singing performance. As an ablation study, they find that merely increasing the number of flow layers without adding the frame prior network cannot achieve the same performance gain. Second, because expressive pitch rendering is essential in singing, they specifically model the intonational aspects by introducing an F0 predictor to further guide the frame prior network, leading to more stable singing with natural sound. Finally, to improve the rhythm of the singing, they modify the duration predictor to explicitly predict the phoneme-to-note duration ratio, assisted by singing note normalization. Experiments on a professional Mandarin singing corpus show that the proposed VISinger significantly outperforms the FastSpeech + neural vocoder two-stage approach and the oracle VITS.
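
As an illustration of the frame-level prior idea, the sketch below (in PyTorch, not the VISinger implementation) expands phoneme-level hidden states to frame level with a length-regulator step and refines them into a frame-level mean and log-variance. The class name, layer choices, and dimensions are assumptions.

import torch
import torch.nn as nn

class FramePriorSketch(nn.Module):
    # Hypothetical frame prior network: refines length-regulated phoneme states
    # into frame-level prior statistics (mean, log-variance).
    def __init__(self, hidden=192):
        super().__init__()
        self.refine = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.proj = nn.Conv1d(hidden, 2 * hidden, kernel_size=1)

    def forward(self, phoneme_hidden, durations):
        # phoneme_hidden: (1, hidden, n_phonemes) for a single utterance;
        # durations: LongTensor of n_phonemes frame counts (the length-regulator input).
        expanded = torch.repeat_interleave(phoneme_hidden, durations, dim=2)
        refined = torch.relu(self.refine(expanded))
        mean, logvar = self.proj(refined).chunk(2, dim=1)
        return mean, logvar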

Tacotron 2 Case Study: Es-Network

Introduction

Text-to-speech is the process of converting input text to speech on a computer. Traditional methods such as statistical parametric speech synthesis (SPSS) consist of independent components, which makes the system difficult to adjust. Because of this, researchers have developed end-to-end systems based on neural networks that are far more capable of producing emotional and natural speech. Compared to SPSS, end-to-end TTS systems are controllable in the sense that they allow more conditioning, and they are also easier to design. However, these systems tend to produce only average synthesis quality due to over-smoothness: they capture the general features of speech well but neglect particular features, and the omission of particular features causes the detailed speech structure to disappear.

Over-smoothness can be mitigated by using HMM-based speech synthesis or generative adversarial networks (GANs). The HMM approach introduces global variance into the model, thereby preserving speech features. GANs treat over-smoothed speech like a low-resolution image and upsample it to yield a higher-resolution spectrogram. However, these methods have limitations too: GANs may suffer from mode collapse because their training is unstable, while the HMM approach leads to noisy speech because the spectrograms it predicts suffer from redundant distortion. These drawbacks motivated Liu and Zheng to propose a new neural network, known as the estimated network (Es-Network). The primary objective of Es-Network is to produce more natural speech and reduce over-smoothness.

Es-Network Performance.

Unlike end-to-end models that tend to focus on the general features of speech, Es-Network focuses on the individual features. To achieve this, Liu and Zheng employed a new mechanism called estimated tokens, a group of randomly initialized, learnable prototype vectors. The tokens can be written as [q_1, q_2, …, q_n], where n is the total number of estimated tokens and q_i is an estimated token. The estimated output that abstracts the general features is computed as follows:

q_t = Σ_{i=1}^{n} e_{t,i} q_i

The weight e_{t,i} for each token q_i, given a query vector y_t, is computed as follows:

e_{t,i} = v^T tanh(W y_t + V q_i + b)

The estimated tokens are optimized by reducing the estimate loss L_es, which measures the disparity between the query vector and the estimated output. Experiments with this method showed that the token-based estimated network has less capacity than an autoencoder-based network but generates more detailed features, and it is therefore well suited for producing an accurate mel spectrogram residual.
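
The attention over estimated tokens described above can be sketched in PyTorch as follows. The token count, dimensions, and the softmax normalization of the weights are assumptions made for this illustration.

import torch
import torch.nn as nn

class EstimatedTokens(nn.Module):
    # Sketch of the estimated-token mechanism: n learnable prototype vectors q_i are
    # combined with additive-attention weights e_{t,i} into the estimated output q_t.
    def __init__(self, n_tokens=10, dim=80):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim))   # [q_1, ..., q_n]
        self.W = nn.Linear(dim, dim, bias=False)
        self.V = nn.Linear(dim, dim, bias=False)
        self.b = nn.Parameter(torch.zeros(dim))
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, query):
        # query: (T, dim), e.g. the target mel frames y_t acting as query vectors.
        scores = self.v(torch.tanh(self.W(query).unsqueeze(1)
                                   + self.V(self.tokens).unsqueeze(0)
                                   + self.b)).squeeze(-1)         # e_{t,i}: (T, n_tokens)
        weights = torch.softmax(scores, dim=-1)                   # normalization assumed here
        return weights @ self.tokens                              # q_t = sum_i e_{t,i} q_i

The estimate loss L_es can then be computed as the squared error between the query frames and the returned estimated output.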

Es-Tacotron 2 is an improvement of Tacotron 2 that incorporates the Es-Network. Es-Tacotron 2 is implemented as follows:
 The network is first pre-trained by reducing the estimate loss, i.e., the squared error between the query vectors and the estimated outputs:

L_es = Σ_t ||y_t − q_t||²

 Calculating the estimated residual Y_re = (y_re,1, …, y_re,n), which is done as:

Y_re = Y − Y_es, with Y = (y_1, y_2, …, y_n)

Y refers to the target mel spectrogram, while Y_es stands for the estimated mel spectrogram.

 Training the Tacotron 2 model. Tacotron 2 consists of an encoder and a decoder. The encoder converts a character sequence C = [c_1, c_2, …, c_n] into a hidden feature representation h = (h_1, h_2, …, h_L): h = Encoder(C). h is then consumed by location-sensitive attention, which summarizes the encoded sequence as a fixed-length context vector for every decoder output step. The decoder predicts a mel spectrogram Y′ = (y′_1, y′_2, …, y′_T) from the encoded sequence, one frame at a time:

c_t = Attention(l_t, h, α_{t−1})
l_t = LSTM(ψ_pre(y′_{t−1}), c_{t−1})

where α_{t−1} is the attention vector and l_t is the output of the LSTM layers. The mel spectrogram frame y′_t is predicted by combining the context vector c_t and l_t: y′_t = f_main(c_t, l_t). The second task is predicting the stop token sequence S = (s_1, …, s_T): s′_t = f_sec(c_t, l_t). The third and last task is predicting the estimated residual of the target mel spectrogram: y′_re,t = f_thi(y′_t).

 The output of the third task is then passed through a five-layer convolutional post-net, whose output is an improved predicted residual. The model is optimized by minimizing the summed squared error over these tasks (a rough sketch of combining the task losses follows this list). Finally, a modified WaveNet is adopted as the neural vocoder.
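
The three decoder tasks above can be tied together in a single training objective. The following is a rough sketch of such a multi-task loss; the equal weighting and the exact loss forms are assumptions rather than the paper's settings.

import torch
import torch.nn.functional as F

def es_tacotron2_losses(mel_pred, mel_target, stop_pred, stop_target,
                        residual_pred, estimated_mel):
    # Target for the third task: the estimated residual Y_re = Y - Y_es.
    residual_target = mel_target - estimated_mel
    mel_loss = F.mse_loss(mel_pred, mel_target)                                      # main task
    stop_loss = F.binary_cross_entropy_with_logits(stop_pred, stop_target.float())  # stop-token task
    residual_loss = F.mse_loss(residual_pred, residual_target)                       # residual task
    # Equal weights assumed for illustration.
    return mel_loss + stop_loss + residual_loss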

When Es-Tacotron 2 and Tacotron 2 were evaluated using the Griffin-Lim algorithm, the results were striking: Es-Tacotron 2 produced better synthesized speech (67.5%) compared to Tacotron 2 (14.0%).
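
For reference, converting a predicted mel spectrogram to a waveform with the Griffin-Lim algorithm, as used for the comparison above, can be sketched as follows; the sample rate and iteration count are assumptions, not the values used in the study.

import numpy as np
import librosa

def mel_to_waveform(mel: np.ndarray, sr: int = 22050) -> np.ndarray:
    # mel: (n_mels, n_frames) power mel spectrogram; Griffin-Lim iteratively
    # estimates the phase information that the mel representation discards.
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_iter=60)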

Festival Case Study: Frisian_TTS.

The Frisian language is one of the languages spoken in the Netherlands, in the province of Fryslan. It consists of three main dialects, Klaaifrysk, Súd-Westhoeks, and Wâldfrysk, as well as mixed varieties composed of Frisian and Dutch. Frisian_TTS is based on NeXTeNS, a Dutch TTS system built on Festival; the NeXTeNS architecture is derived from the standard Festival system architecture. Many advanced features of NeXTeNS, such as NP chunking, ToDI labeling, and POS tagging, were not available for the Frisian prototype TTS. The main aim of producing this prototype was to see whether acceptable utterances could already be produced.

Implementation of Frisian TTS.

Phoneme set. A computer-readable phoneme set was created using Worldbet, which allows nasalized diphthongs and triphthongs. This was done by inserting the Frisian phonemes among the Dutch ones in the NeXTeNS phoneme files. The Frisian phonemes were mapped onto their Dutch counterparts, and where a Dutch counterpart was absent, Festival reported an error. With reference to NeXTeNS, letter-to-sound rules and an empty lexicon were used to create a basic synthesizer.
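
The mapping step can be pictured with a toy sketch like the one below; the mapping entries are invented placeholders, not the actual Frisian or Dutch Worldbet inventories.

# Placeholder pairs for illustration only.
FRISIAN_TO_DUTCH = {"i:": "i", "o:": "o"}

def map_phoneme(frisian_phone: str) -> str:
    # Festival reported an error when no Dutch counterpart existed;
    # a lookup failure plays that role here.
    if frisian_phone not in FRISIAN_TO_DUTCH:
        raise KeyError(f"no Dutch counterpart for {frisian_phone!r}")
    return FRISIAN_TO_DUTCH[frisian_phone]
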
Token Module. Tokenization is the process of changing unknown tokens such as numbers, abbreviations, acronyms, symbols, and dates into words. A standard file with the specific details was created in NeXTeNS; the file implemented only number-to-word conversion and abbreviations. For example, 31 becomes ‘ienentritich’ (lit. “one-and-thirty”).
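
A toy Python sketch of this number-to-word expansion is shown below. Apart from ‘ienentritich’ (31), which is given in the text, the Frisian spellings and the helper name are assumptions for illustration.

# Assumed spellings except for the documented example 'ienentritich' (31).
FRISIAN_WORDS = {1: "ien", 30: "tritich"}

def frisian_number_to_words(n: int) -> str:
    # Two-digit numbers follow the "one-and-thirty" pattern described above.
    tens, units = (n // 10) * 10, n % 10
    if units == 0:
        return FRISIAN_WORDS[tens]
    return f"{FRISIAN_WORDS[units]}en{FRISIAN_WORDS[tens]}"

print(frisian_number_to_words(31))  # -> "ienentritich"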

POS Module. POS stands for part-of-speech, and POS information is used for break and accent placement. Because there is no POS tagger for Frisian, a simple guess_pos function for content words was created.

Syntactic Module. Due to the lack of a syntax parser for Frisian, the default option from

NeXTeNS was used.

Phrasing Module. The phrasing module predicts breaks at punctuation marks, either with a POS-based CART tree or with a punctuation CART tree. Since Frisian has no POS tagging, the default method, the punctuation CART tree, was chosen.
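
A minimal illustration of punctuation-driven break prediction (not the actual CART tree) could look like this:

def predict_breaks(tokens):
    # Mark a break after any token that ends with phrase or sentence punctuation.
    return [tok != tok.rstrip(",.;:!?") for tok in tokens]

print(predict_breaks(["a", "b,", "c."]))  # -> [False, True, True]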

Intonation Module. In NeXTeNS, adjectives, verbs, and nouns are used to obtain the sentence accent. For Frisian, a simple content-word division is used instead: the NeXTeNS accent rule was replaced with a rule that gives an accent to every word that is not a member of the function word list (the guess_pos list). Within a group of accents, second accents were removed.
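
The content-word accent rule can be sketched as below; the function-word entries are placeholders, not the actual guess_pos list, and the adjacency check is one reading of the "remove second accents" rule.

# Placeholder function-word list standing in for the guess_pos list.
FUNCTION_WORDS = {"de", "it", "en", "yn", "fan"}

def assign_accents(words):
    # Accent every word that is not in the function-word list.
    accents = [w.lower() not in FUNCTION_WORDS for w in words]
    # Drop the second of two adjacent accents.
    for i in range(1, len(accents)):
        if accents[i] and accents[i - 1]:
            accents[i] = False
    return list(zip(words, accents))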

Word Module. This module transforms a graphemic word into a phonemic one by means of a pronunciation lexicon. If a particular word does not occur in the lexicon, its pronunciation is built with the help of letter-to-sound (LtS) rules, after which the prosodic structure of the word is built.
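
The lexicon-first lookup with an LtS fallback can be sketched as follows; the lexicon contents and the stand-in rule function are hypothetical.

# word -> phoneme string, normally filled from the pronunciation lexicon.
LEXICON = {}

def letter_to_sound(word: str) -> str:
    # Naive stand-in for the LtS rules: one pseudo-phoneme per letter.
    return " ".join(word)

def to_phonemes(word: str) -> str:
    # Use the lexicon entry when present, otherwise fall back to the LtS rules.
    return LEXICON.get(word, letter_to_sound(word))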


Postlexical Module. This module takes care of assimilation within words and across word boundaries. It was implemented by mapping phones to their Dutch counterparts, since the assimilation processes are the same as for Frisian.

Duration Module. This module defines duration rules for every phoneme. The default duration module in Festival assigns every phoneme the same length; in the Frisian case, the duration module from NeXTeNS was used instead.

Fundamental frequency control. Most intonation patterns are similar between Dutch and Frisian; therefore, ToDI was implemented.

Frisian TTS Evaluation and Performance.

To evaluate the model, 20 judges were involved. Sentences for evaluation were obtained from magazines, newspapers, and other publications. The evaluation was done in terms of intelligibility, acceptability, and quality, with scores given on a 7-point scale. The scores were average, as shown in Table 1 below. This indicates that the TTS has potential for improvement, which is the main objective of the text-to-speech system.

Table 1: Evaluation results.


References

Dijkstra, J., Pols, L.C. and Son, R.J.V., 2004. Frisian TTS, an example of bootstrapping TTS for

minority languages. In Fifth ISCA Workshop on Speech Synthesis.

Liu, Y. and Zheng, J., 2019. Es-Tacotron2: Multi-task Tacotron 2 with pre-trained estimated

network for reducing the over-smoothness problem. Information, 10(4), p.131.

Zhang, Y., Cong, J., Xue, H., Xie, L., Zhu, P. and Bi, M., 2022, May. Visinger: Variational

inference with adversarial learning for end-to-end singing voice synthesis. In ICASSP

2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing

(ICASSP) (pp. 7237-7241). IEEE.
