Case Studies


VITS

Definition

VITS (Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech) is an end-to-end text-to-speech model that combines several techniques from GANs, VAEs, and normalizing flows. It is a feed-forward model: data moves through the network in only one direction.

Automatic human voice generation has been substantially advanced by deep learning. Aiming to imitate human speech, text-to-speech (TTS) has made enormous progress, reaching near human-parity performance. In parallel, singing voice synthesis (SVS), which aims to generate singing voice from lyrics and musical scores, has been advanced with similar neural modeling techniques borrowed from speech generation. Compared with speech synthesis, synthetic singing must not only be pronounced correctly according to the lyrics, but must also conform to the notes of the musical score.

Performance

Although these two-stage human voice generation systems have made tremendous progress, the independent training of the two stages, i.e., the neural acoustic model and the neural vocoder, also leads to a mismatch between the training and inference stages, resulting in degraded performance. Specifically, the neural vocoder is trained on the ground-truth intermediate acoustic representation, e.g., the mel spectrogram, but at inference time it consumes the representation predicted by the acoustic model, resulting in a distributional difference between the real and predicted intermediate representations. There are several tricks to ease this mismatch, including fine-tuning the neural vocoder on predicted acoustic features and adversarial training. A more direct solution is to join the two stages into a unified model trained in an end-to-end manner. In text-to-speech synthesis, this kind of solution has recently been explored, including FastSpeech 2s, EATS, Glow-WaveGAN, and VITS. In general, these works merge the acoustic model and the neural vocoder into one model that enables end-to-end learning, or adopt a new latent representation instead of the mel spectrogram so that the two parts operate more easily on the same distribution. In principle, end-to-end training can achieve better audio quality and a simpler training and inference process. Among the end-to-end models, VITS uses a variational autoencoder (VAE) to connect the acoustic model and the vocoder; it adopts variational inference augmented with normalizing flows and an adversarial training process, producing more natural-sounding speech than existing two-stage models (Zhang et al., 2022, pp. 7237-7241).
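
To make this combination of components concrete, here is a minimal Python/PyTorch sketch of how a VITS-style generator objective might combine a reconstruction term, a KL term, and an adversarial term. The function name, the equal loss weights, and the simplified standard-normal KL are assumptions for illustration only; the actual VITS objective uses a flow-enhanced, text-conditioned prior and additional feature-matching and duration losses.

import torch
import torch.nn.functional as F

def vits_style_generator_loss(mel_pred, mel_target, z_mean, z_logvar, disc_fake_logits):
    # Reconstruction term: the VAE decoder (acting as the vocoder) should reproduce the target.
    recon = F.l1_loss(mel_pred, mel_target)
    # KL term of the VAE; a standard-normal prior is assumed here for simplicity,
    # whereas VITS uses a normalizing-flow prior conditioned on the text.
    kl = -0.5 * torch.mean(1 + z_logvar - z_mean.pow(2) - z_logvar.exp())
    # Adversarial term (least-squares GAN form assumed): fool the discriminator.
    adv = torch.mean((disc_fake_logits - 1.0) ** 2)
    # Equal weights are an assumption; real implementations weight these terms.
    return recon + kl + adv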

VISinger appears to be the first end-to-end solution that addresses the two-stage mismatch problem for singing voice generation. Adopting VITS for singing voice synthesis is non-trivial, since singing differs substantially from speech even though both originate from the same human vocal system. First, the flow-based prior encoder of VITS models phoneme-level means and variances of the acoustic features. For the singing task, the authors introduce a length regulator and a frame prior network to obtain frame-level means and variances instead, modeling the rich acoustic variation in singing and leading to more natural singing performance. As an ablation study, they find that merely increasing the number of flow layers without adding the frame prior network cannot achieve the same performance gain. Second, because expressive pitch rendering is essential in singing, they specifically model the intonational aspects by introducing an F0 predictor to further guide the frame prior network, leading to more stable singing with natural sound. Finally, to improve the rhythm of the singing, they modify the duration predictor to explicitly predict the phoneme-to-note duration ratio, assisted by singing note normalization. Experiments on a professional Mandarin singing corpus show that the proposed VISinger significantly outperforms the FastSpeech + neural vocoder two-stage approach and the oracle VITS.
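
As an illustration of the frame-level prior idea, the sketch below (in PyTorch, not the VISinger implementation) expands phoneme-level hidden states to frame level with a length-regulator step and refines them into a frame-level mean and log-variance. The class name, layer choices, and dimensions are assumptions.

import torch
import torch.nn as nn

class FramePriorSketch(nn.Module):
    # Hypothetical frame prior network: refines length-regulated phoneme states
    # into frame-level prior statistics (mean, log-variance).
    def __init__(self, hidden=192):
        super().__init__()
        self.refine = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.proj = nn.Conv1d(hidden, 2 * hidden, kernel_size=1)

    def forward(self, phoneme_hidden, durations):
        # phoneme_hidden: (1, hidden, n_phonemes) for a single utterance;
        # durations: LongTensor of n_phonemes frame counts (the length-regulator input).
        expanded = torch.repeat_interleave(phoneme_hidden, durations, dim=2)
        refined = torch.relu(self.refine(expanded))
        mean, logvar = self.proj(refined).chunk(2, dim=1)
        return mean, logvar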

Tacotron 2 Case Study: Es-Network

Introduction

Text-to-speech is the process of converting input text to speech on a computer. Traditional methods such as statistical parametric speech synthesis (SPSS) consist of independent components, which makes the system difficult to adjust. Because of this, researchers have developed end-to-end systems based on neural networks that are far more capable of producing emotional and natural speech. Compared to SPSS, end-to-end TTS systems are controllable in the sense that they allow more conditioning, and they are also easier to design. However, these systems tend to produce only average synthesis quality due to over-smoothness: they capture the general features of speech well but neglect particular features, and the omission of particular features causes the detailed speech structure to disappear.

Over-smoothness can be mitigated by using HMM-based speech synthesis or generative adversarial networks (GANs). The HMM approach introduces global variance into the model, thereby preserving speech features. GANs treat over-smoothed speech like a low-resolution image and upsample it to yield a higher-resolution spectrogram. However, these methods have limitations too: GANs may suffer from mode collapse because their training is unstable, while the HMM approach leads to noisy speech because the spectrograms it predicts suffer from redundant distortion. These drawbacks motivated Liu and Zheng to propose a new neural network, known as the estimated network (Es-Network). The primary objective of Es-Network is to produce more natural speech and reduce over-smoothness.

Es-Network Performance.

Unlike end-to-end models that tend to focus on the general features of speech, Es-Network focuses on the individual features. To achieve this, Liu and Zheng employed a new mechanism called estimated tokens, a group of randomly initialized, learnable prototype vectors. The tokens can be written as [q_1, q_2, …, q_n], where n is the total number of estimated tokens and q_i is an estimated token. The estimated output that abstracts the general features is computed as follows:

q_t = Σ_{i=1}^{n} e_{t,i} q_i

The weight e_{t,i} for each token q_i, given a query vector y_t, is computed as follows:

e_{t,i} = v^T tanh(W y_t + V q_i + b)

The estimated tokens are optimized by reducing the estimate loss L_es, which measures the disparity between the query vector and the estimated output. Experiments with this method showed that the token-based estimated network has less capacity than an autoencoder-based network but generates more detailed features, and it is therefore well suited for producing an accurate mel spectrogram residual.
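
The attention over estimated tokens described above can be sketched in PyTorch as follows. The token count, dimensions, and the softmax normalization of the weights are assumptions made for this illustration.

import torch
import torch.nn as nn

class EstimatedTokens(nn.Module):
    # Sketch of the estimated-token mechanism: n learnable prototype vectors q_i are
    # combined with additive-attention weights e_{t,i} into the estimated output q_t.
    def __init__(self, n_tokens=10, dim=80):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim))   # [q_1, ..., q_n]
        self.W = nn.Linear(dim, dim, bias=False)
        self.V = nn.Linear(dim, dim, bias=False)
        self.b = nn.Parameter(torch.zeros(dim))
        self.v = nn.Linear(dim, 1, bias=False)

    def forward(self, query):
        # query: (T, dim), e.g. the target mel frames y_t acting as query vectors.
        scores = self.v(torch.tanh(self.W(query).unsqueeze(1)
                                   + self.V(self.tokens).unsqueeze(0)
                                   + self.b)).squeeze(-1)         # e_{t,i}: (T, n_tokens)
        weights = torch.softmax(scores, dim=-1)                   # normalization assumed here
        return weights @ self.tokens                              # q_t = sum_i e_{t,i} q_i

The estimate loss L_es can then be computed as the squared error between the query frames and the returned estimated output.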

Es-Tacotron 2 is an improvement of Tacotron 2 that incorporates the Es-Network. Es-Tacotron 2 is implemented as follows:
 The network is first pre-trained by reducing the estimate loss, i.e., the squared error between the query vectors and the estimated outputs:

L_es = Σ_t ||y_t − q_t||²

 Calculating the estimated residual Y_re = (y_re,1, …, y_re,n), which is done as:

Y_re = Y − Y_es, with Y = (y_1, y_2, …, y_n)

Y refers to the target mel spectrogram, while Y_es stands for the estimated mel spectrogram.

 Training the Tacotron 2 model. Tacotron 2 consists of an encoder and a decoder. The encoder converts a character sequence C = [c_1, c_2, …, c_n] into a hidden feature representation h = (h_1, h_2, …, h_L): h = Encoder(C). h is then consumed by location-sensitive attention, which summarizes the encoded sequence as a fixed-length context vector for every decoder output step. The decoder predicts a mel spectrogram Y′ = (y′_1, y′_2, …, y′_T) from the encoded sequence, one frame at a time:

c_t = Attention(l_t, h, α_{t−1})
l_t = LSTM(ψ_pre(y′_{t−1}), c_{t−1})

where α_{t−1} is the attention vector and l_t is the output of the LSTM layers. The mel spectrogram frame y′_t is predicted by combining the context vector c_t and l_t: y′_t = f_main(c_t, l_t). The second task is predicting the stop token sequence S = (s_1, …, s_T): s′_t = f_sec(c_t, l_t). The third and last task is predicting the estimated residual of the target mel spectrogram: y′_re,t = f_thi(y′_t).

 The output of the third task is then passed through a five-layer convolutional post-net, whose output is an improved predicted residual. The model is optimized by minimizing the summed squared error over these tasks (a rough sketch of combining the task losses follows this list). Finally, a modified WaveNet is adopted as the neural vocoder.
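
The three decoder tasks above can be tied together in a single training objective. The following is a rough sketch of such a multi-task loss; the equal weighting and the exact loss forms are assumptions rather than the paper's settings.

import torch
import torch.nn.functional as F

def es_tacotron2_losses(mel_pred, mel_target, stop_pred, stop_target,
                        residual_pred, estimated_mel):
    # Target for the third task: the estimated residual Y_re = Y - Y_es.
    residual_target = mel_target - estimated_mel
    mel_loss = F.mse_loss(mel_pred, mel_target)                                      # main task
    stop_loss = F.binary_cross_entropy_with_logits(stop_pred, stop_target.float())  # stop-token task
    residual_loss = F.mse_loss(residual_pred, residual_target)                       # residual task
    # Equal weights assumed for illustration.
    return mel_loss + stop_loss + residual_loss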

When Es-Tacotron 2 and Tacotron 2 were evaluated using the Griffin-Lim algorithm, the results were striking: Es-Tacotron 2 produced better synthesized speech (67.5%) compared to Tacotron 2 (14.0%).
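
For reference, converting a predicted mel spectrogram to a waveform with the Griffin-Lim algorithm, as used for the comparison above, can be sketched as follows; the sample rate and iteration count are assumptions, not the values used in the study.

import numpy as np
import librosa

def mel_to_waveform(mel: np.ndarray, sr: int = 22050) -> np.ndarray:
    # mel: (n_mels, n_frames) power mel spectrogram; Griffin-Lim iteratively
    # estimates the phase information that the mel representation discards.
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_iter=60)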

Festival Case Study: Frisian_TTS.

The Frisian language is one of the languages spoken in the Netherlands, in the province of Fryslan. It consists of three main dialects, Klaaifrysk, Súd-Westhoeks, and Wâldfrysk, as well as mixed varieties composed of Frisian and Dutch. Frisian_TTS is based on NeXTeNS, a Dutch TTS system built on Festival; the NeXTeNS architecture is derived from the standard Festival system architecture. Many advanced features of NeXTeNS, such as NP chunking, ToDI labeling, and POS tagging, were not available for the Frisian prototype TTS. The main aim of producing this prototype was to see whether acceptable utterances could already be produced.

Implementation of Frisian TTS.

Phoneme set. A computer-readable phoneme set was created using Worldbet, which allows nasalized diphthongs and triphthongs. This was done by inserting the Frisian phonemes among the Dutch ones in the NeXTeNS phoneme files. The Frisian phonemes were mapped onto their Dutch counterparts, and where a Dutch counterpart was absent, Festival reported an error. With reference to NeXTeNS, letter-to-sound rules and an empty lexicon were used to create a basic synthesizer.
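
The mapping step can be pictured with a toy sketch like the one below; the mapping entries are invented placeholders, not the actual Frisian or Dutch Worldbet inventories.

# Placeholder pairs for illustration only.
FRISIAN_TO_DUTCH = {"i:": "i", "o:": "o"}

def map_phoneme(frisian_phone: str) -> str:
    # Festival reported an error when no Dutch counterpart existed;
    # a lookup failure plays that role here.
    if frisian_phone not in FRISIAN_TO_DUTCH:
        raise KeyError(f"no Dutch counterpart for {frisian_phone!r}")
    return FRISIAN_TO_DUTCH[frisian_phone]
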
Token Module. Tokenization is the process of changing unknown tokens such as numbers, abbreviations, acronyms, symbols, and dates into words. A standard file with the specific details was created in NeXTeNS; the file implemented only number-to-word conversion and abbreviations. For example, 31 becomes ‘ienentritich’ (lit. “one-and-thirty”).
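
A toy Python sketch of this number-to-word expansion is shown below. Apart from ‘ienentritich’ (31), which is given in the text, the Frisian spellings and the helper name are assumptions for illustration.

# Assumed spellings except for the documented example 'ienentritich' (31).
FRISIAN_WORDS = {1: "ien", 30: "tritich"}

def frisian_number_to_words(n: int) -> str:
    # Two-digit numbers follow the "one-and-thirty" pattern described above.
    tens, units = (n // 10) * 10, n % 10
    if units == 0:
        return FRISIAN_WORDS[tens]
    return f"{FRISIAN_WORDS[units]}en{FRISIAN_WORDS[tens]}"

print(frisian_number_to_words(31))  # -> "ienentritich"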

POS Module. POS stands for part-of-speech, and POS information is used for break and accent placement. Because there is no POS tagger for Frisian, a simple guess_pos function for content words was created.

Syntactic Module. Due to the lack of a syntax parser for Frisian, the default option from

NeXTeNS was used.

Phrasing Module. The phrasing module predicts breaks at punctuation marks, either with a POS-based CART tree or with a punctuation CART tree. Since Frisian has no POS tagging, the default method, the punctuation CART tree, was chosen.
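
A minimal illustration of punctuation-driven break prediction (not the actual CART tree) could look like this:

def predict_breaks(tokens):
    # Mark a break after any token that ends with phrase or sentence punctuation.
    return [tok != tok.rstrip(",.;:!?") for tok in tokens]

print(predict_breaks(["a", "b,", "c."]))  # -> [False, True, True]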

Intonation Module. In NeXTeNS, adjectives, verbs, and nouns are used to obtain the sentence accent. For Frisian, a simple content-word division is used instead: the NeXTeNS accent rule was replaced with a rule that gives an accent to every word that is not a member of the function word list (the guess_pos list). Within a group of accents, second accents were removed.
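
The content-word accent rule can be sketched as below; the function-word entries are placeholders, not the actual guess_pos list, and the adjacency check is one reading of the "remove second accents" rule.

# Placeholder function-word list standing in for the guess_pos list.
FUNCTION_WORDS = {"de", "it", "en", "yn", "fan"}

def assign_accents(words):
    # Accent every word that is not in the function-word list.
    accents = [w.lower() not in FUNCTION_WORDS for w in words]
    # Drop the second of two adjacent accents.
    for i in range(1, len(accents)):
        if accents[i] and accents[i - 1]:
            accents[i] = False
    return list(zip(words, accents))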

Word Module. This module transforms a graphemic word into a phonemic one by means of a pronunciation lexicon. If a particular word does not occur in the lexicon, its pronunciation is built with the help of letter-to-sound (LtS) rules, after which the prosodic structure of the word is built.
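
The lexicon-first lookup with an LtS fallback can be sketched as follows; the lexicon contents and the stand-in rule function are hypothetical.

# word -> phoneme string, normally filled from the pronunciation lexicon.
LEXICON = {}

def letter_to_sound(word: str) -> str:
    # Naive stand-in for the LtS rules: one pseudo-phoneme per letter.
    return " ".join(word)

def to_phonemes(word: str) -> str:
    # Use the lexicon entry when present, otherwise fall back to the LtS rules.
    return LEXICON.get(word, letter_to_sound(word))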


Postlexical Module. This module takes care of assimilation within words and across word boundaries. It was implemented by mapping phones to their Dutch counterparts, since the assimilation processes are the same as for Frisian.

Duration Module. This module defines duration rules for every phoneme. The default duration module in Festival assigns every phoneme the same length; in the Frisian case, the duration module from NeXTeNS was used instead.

Fundamental frequency control. Most intonation patterns are similar between Dutch and Frisian; therefore, ToDI was implemented.

Frisian TTS Evaluation and Performance.

To evaluate the model, 20 judges were involved. Sentences for evaluation were obtained from magazines, newspapers, and other publications. The evaluation was done in terms of intelligibility, acceptability, and quality, with scores given on a 7-point scale. The scores were average, as shown in Table 1 below. This indicates that the TTS has potential for improvement, which is the main objective of the text-to-speech system.

Table 1: Evaluation results.


References

Dijkstra, J., Pols, L.C. and Son, R.J.V., 2004. Frisian TTS, an example of bootstrapping TTS for

minority languages. In Fifth ISCA Workshop on Speech Synthesis.

Liu, Y. and Zheng, J., 2019. Es-Tacotron2: Multi-task Tacotron 2 with pre-trained estimated

network for reducing the over-smoothness problem. Information, 10(4), p.131.

Zhang, Y., Cong, J., Xue, H., Xie, L., Zhu, P. and Bi, M., 2022, May. Visinger: Variational

inference with adversarial learning for end-to-end singing voice synthesis. In ICASSP

2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing

(ICASSP) (pp. 7237-7241). IEEE.
