Asoke Kumar Datta
Time Domain Representation of Speech Sounds
A Case Study in Bangla
Asoke Kumar Datta (emeritus)
Indian Statistical Institute
Kolkata, West Bengal, India
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
This book is dedicated to my revered father,
the late Maheshwar Datta
Acknowledgements
I gratefully acknowledge the free and full cooperation of colleagues from my
parent department, the Electronics and Communication Sciences Unit of the Indian
Statistical Institute (ISI); CDAC, Kolkata; and the Sir C. V. Raman Centre for
Physics and Music (CVRCPM) of Jadavpur University, Kolkata. My special thanks
go to Ex-Prof. Nihar Ranjan Ganguly, the late Bijon Mukherjee, and the late
Krishna Mohan Pattanaik of my parent department. I also gratefully acknowledge
the helping hand extended by Sri Amiya Saha, Ex-Executive Director of CDAC,
Kolkata, as well as the fullest cooperation of Dr. Shyamal Das Mondal and Arup
Saha of the same institution. I have been lucky to get cooperation from many
students and co-workers from ISI, CDAC, Kolkata, and the Sir C. V. Raman Centre
for Physics and Music during the long period of investigation in this field.
I also wish to thank Dr. Ranjan Sengupta of CVRCPM for constantly encouraging
me to publish this book.
Contents
1 Introduction
  1.1 General
  1.2 Spectral Domain Representation
  1.3 Time-Domain Representation
  1.4 Automatic Speech Recognition (ASR)
  1.5 Speech Synthesis
  References
2 Spectral Domain
  2.1 Introduction
  2.2 Spectral Structure of Bangla Phones
  2.3 Spectra of Oral Vowels
  2.4 Spectra of Nasal Vowels
  2.5 Spectra of Aspirated Vowels
  2.6 Dynamical Spectral Structures
  References
3 Cognition of Phones
  3.1 Place of Articulation of Plosives and Vowels
    3.1.1 Introduction
    3.1.2 Machine Identification of Place of Articulation of Plosives
    3.1.3 Experimental Procedure
    3.1.4 Results
  3.2 Cognition of Place of Articulation
    3.2.1 Manipulation of the Signals
    3.2.2 Preparation of the Listening Set
    3.2.3 Results and Discussions
  3.3 Conclusion
Prof. Asoke Kumar Datta obtained his M.Sc. in pure mathematics and worked at
the Indian Statistical Institute from 1955 to 1994. He retired from the Electronics
and Communication Sciences Department as its Head and is a Visiting Professor at
ISI. He is President of BOM-BOM, Kolkata; Senior Guest Researcher at the Sir
C. V. Raman Centre for Physics and Music, Jadavpur University; Executive Member
of the Society for Natural Language Technology Research, Kolkata; and a Life
Member of the Acoustical Society of India. He received the J. C. Bose Memorial
Award (1969); the Sir C. V. Raman Award (1982–1983 and 1998–1999); the S. K.
Mitra Memorial Award (1984); and the Sri C. Achyut Menon Prize (2001). His areas
of academic interest include pattern recognition, AI, speech, music, and
consciousness.
Prologue
periodic. The speech signal generally is a non-stationary signal. However, for
practical applications, it is assumed that short-term (about 10–20 ms) segments of
speech can be taken as stationary. The short-term speech representation is
historically inherited from speech coding applications. The use of formant
frequencies, the resonance structures in the speech spectrum, to recognize vowels
can be traced back to 1950 at AT&T's Bell Labs. A detailed spectral-domain study
of phones in Bangla, a standard major dialect of India and Bangladesh, has
recently been published (Datta, Asoke Kumar, Acoustics of Bangla Speech Sounds,
Springer, 2018).
The first evidence of doubt crept in early 1993 at the Indian Statistical
Institute (ISI) in Kolkata, India, when some signals were produced with the same
spectral structures but sounding like different vowels. Continued efforts also
produced VCV syllables in which there was no formant transition, yet the places
of articulation of different plosives could be clearly distinguished. These
experiments showed that formants are neither necessary nor sufficient for the
cognition of different phones. Moreover, they further indicated that time-domain
features (shape features) may be a reliable alternative in speech research. These
developments led the group in ISI to start working on time-domain features for
ASR, TTS, and singing synthesis, with encouraging successes. The results were
demonstrated at an ESCA conference in 1993 and later published in a book (Datta,
Asoke Kumar, Epoch Synchronous Overlap Add (ESOLA): A Concatenative Synthesis
Procedure for Speech, Springer, 2018).
Slowly, a viable time-domain representation of speech signals for both objective
and subjective analyses, an alternative to the well-known spectral representation,
evolved. This book presents its history and the extent of the development, along
with that of spectral-domain representation, in the cognitive domain as well as
the technology domain. All the cognitive experiments related to this development,
along with details of technology development related to both ASR and TTS, are
given. A new model using cohorts formed through manner-based labeling has been
successfully experimented with in relation to the use of lexical knowledge in
ASR, and it merits inclusion in the book.
India has many official dialects, and spoken language technology development is
a burgeoning area. In fact, TTS and ASR taken together form the most powerful
technology to empower people in a developing country where functional literacy is
low. This book endeavors to present the time-domain representation in such a way
that research and development in ASR or TTS in all these dialects may be done
easily and seamlessly using the information in this book. In short, this book may
simply be a guidebook for the development of ASR and TTS in all the Indian
Standard Dialects in a novel, indigenous way using signal-domain parameters.
Chapter 1
Introduction
1.1 General
Speech is one of the most important attributes that helped in the evolution of
man, making him distinctively different from the other primates to such an extent
that he appears to rule over the whole of the animate world. In a sense, one may
say that for the common man, the prime vehicle for developing individuality is
speech. Even in thinking, most of us use this verbal medium, internalized for
this purpose since humans started using speech. It is generally believed that man
started speaking between 100,000 and 200,000 years ago. Interestingly, this
almost coincides with the time of appearance of Homo sapiens. The basic ability
of vocalization is said to be inherited from apes. In its simple form,
vocalization is used by primates and other animals primarily for out-of-sight
communication with others and is normally referred to as “calls” to distinguish
it from speech, a sophisticated method of messaging. Neanderthals had the same
DNA-coding region of the FOXP2 gene, generally known to be responsible for
speech, as modern man. The earliest members of Homo sapiens may or may not have
had fully developed language. (One may note that in the time we are talking
about, writing had not been developed, so language here means only spoken
language, i.e., speech.) Scholars agree that a proto-linguistic stage may have
lasted for a considerably long period. The seed of modern speech may have been
sown in the Upper Paleolithic period, roughly 50,000 years ago. It is generally
believed that the acquisition of vocal language originated from the so-called
sing-song speech, “baby talk” or “motherese,” used by parents to talk to their
infants. Motherese, a medium all infants perceive and eventually use to process
their respective languages, was preceded by the prelinguistic foundations of the
proto-language(s) evolved by early hominins. Gradually, the developing
difficulties in foraging circumstances (Lyons et al. 1998) together with
increasing
airwave would not be transmitted if the air molecules were not interconnected by
akasa. The first sound gives rise to a second, the second to a third, and so on,
expanding in akasa in the same way as waves propagate in water
(bichitaranganyaya, ripple-like). Uddyotakara said that the first sound gives
rise not to one sound in a circle but to an infinite number in all directions,
forming a spherical shell (kadambakorakanyaya, blooming like a kadamba bud).
Nyaya thinkers also held that each sound wave is destroyed by its successor,
corresponding to the cancellation of back-propagation. The similarity of the
Nyaya thinkers' view, though arrived at through the holistic approach of
philosophy, to modern scientific theory is remarkable.
Scientific investigations related to the acoustics of speech, both objective and
subjective, have historically been conducted in the spectral/timbral domain. Once
the signal is captured, which is of course a time series of pressure pulses, the
rest traditionally becomes a study of its spectral structure. This has been so
since the beginning of speech research in the nineteenth century or even earlier
(Flanagan 1972; Helmholtz 1954; Bell 1906). During this early period, science and
engineering for speech were closely coupled, and the important milestones thereof
are summarized in Flanagan's classic book (Flanagan 1972). The field got a fillip
in the 1960s when Gunnar Fant published his source-filter model of speech
production (Fant 1970).
The conversion from the time to the frequency domain is based on three basic
methods: Fourier transforms, digital filter banks, and linear prediction. In
speech processing, the Fourier transform takes a time series, or a function of
continuous time, and maps it into a frequency spectrum. The theoretical support
for the first method came through the seminal paper by Joseph Fourier in 1807
(Fourier 1808). This transform can be rigorously used only when the series is
purely periodic. The speech signal, however, is generally non-stationary and not
exactly periodic: it is known as quasi-periodic, and some segments are even
quasi-random. For practical application of the Fourier transform, it is therefore
assumed that short-term (about 10–20 ms) segments of speech are stationary and
periodic. The short-term speech representation is historically inherited from
speech coding applications (Hermansky 1997). Despite this discrepancy, the
Fourier transform has been profitably and most widely used since its beginning.
The use of formant frequencies, the resonance structures in the speech spectrum,
to recognize vowels can be traced back to 1950 at AT&T's Bell Labs (Davis et al.
1952).
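The short-term analysis described above can be sketched as follows. The frame
and hop lengths, the Hamming window, and the 200 Hz test tone are illustrative
choices for this sketch, not details prescribed by the text:

```python
import numpy as np

def short_term_spectra(signal, fs, frame_ms=20.0, hop_ms=10.0):
    """Split a signal into short frames (assumed quasi-stationary),
    apply a Hamming window, and return each frame's magnitude spectrum."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    window = np.hamming(frame)
    n_frames = 1 + (len(signal) - frame) // hop
    spectra = np.empty((n_frames, frame // 2 + 1))
    for i in range(n_frames):
        chunk = signal[i * hop : i * hop + frame] * window
        spectra[i] = np.abs(np.fft.rfft(chunk))
    return spectra

# A 200 Hz tone sampled at 8 kHz: each 20 ms frame's spectrum peaks near 200 Hz.
fs = 8000
t = np.arange(fs) / fs
spectra = short_term_spectra(np.sin(2 * np.pi * 200 * t), fs)
peak_bin = int(spectra[0].argmax())
peak_hz = peak_bin * fs / int(fs * 0.02)  # bin width = fs / frame length
```

For a truly periodic tone the peak sits exactly on a bin; for real speech the
frame is only approximately periodic, which is precisely the assumption the
text describes.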
Another approach of estimating the spectral envelope uses a filter bank, where
the signal is broken down into a number of frequency bands with characteristic
bandwidths in which the signal energy is measured.
Homer Dudley presented such a bank, breaking speech down into its acoustic
components using 10 bandpass filters, in 1939–40 at the Bell Laboratories
exhibits at both the New York World's Fair and the Golden Gate International
Exposition. Liljencrants later developed a speech spectrum analyzer using a
51-channel filter bank. In India, in 2012, V. Ujjwal and R. Amekar developed a
gammatone filter bank for representing speech. The most interesting living
example is the human cochlea, which uses about 30,000 filters in the basilar
membrane spanning a frequency range of about 20 Hz to 20 kHz.
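A minimal filter-bank analysis along these lines might look as follows. The band
edges, the Butterworth filters, and the 500 Hz test tone are illustrative
assumptions, not details taken from the systems mentioned above:

```python
import numpy as np
from scipy.signal import butter, lfilter

def band_energies(signal, fs, edges):
    """Measure signal energy in each band of a simple bandpass filter bank.
    `edges` lists band boundaries in Hz (Dudley's vocoder used 10 bands)."""
    energies = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        energies.append(float(np.sum(lfilter(b, a, signal) ** 2)))
    return energies

fs = 8000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 500 * t)      # energy concentrated near 500 Hz
edges = [100, 300, 700, 1500, 3000]    # four illustrative bands
e = band_energies(sig, fs, edges)      # the 300-700 Hz band dominates
```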
Another useful method for speech analysis is cepstrum analysis. Speech is
modeled by a time-varying filter for the vocal tract, excited by an appropriate
source. In the frequency domain, the log power spectrum of the output signal is
the sum of the log power spectra of the source and the filter. For the purpose of
speech recognition, speech sounds are characterized by the size and shape of the
filter, which is represented by the spectrum of the filter. The composite log
power spectrum is passed through a low-pass filter to retain only the
characteristics of this filter. This can be realized by taking the inverse
Fourier transform of the log power spectrum and retaining only the first few
coefficients. The result is called the cepstrum, and the coefficients are called
cepstral coefficients (Hermansky 1990).
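The cepstrum computation just described can be sketched directly. The 20 ms
frame, the 13-coefficient cutoff, and the small log floor are illustrative
assumptions:

```python
import numpy as np

def cepstral_coefficients(frame, n_coeffs=13):
    """Real cepstrum: inverse FFT of the log power spectrum. Keeping only
    the first few coefficients acts as the low-pass step that retains the
    slowly varying vocal-tract (filter) envelope."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    log_power = np.log(power + 1e-12)   # small floor avoids log(0)
    cepstrum = np.fft.irfft(log_power)
    return cepstrum[:n_coeffs]

fs = 8000
t = np.arange(int(0.02 * fs)) / fs      # one 20 ms frame
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
c = cepstral_coefficients(frame)
```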
Mel-Frequency Cepstral Coefficients (MFCC), which use the cosine transform of
the real logarithm of the short-term energy spectrum expressed on a mel-frequency
scale (Dautrich et al. 1983), are widely used in ASR.
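A rough sketch of such an MFCC computation follows, assuming a common
triangular mel filter bank and the usual 2595·log10(1 + f/700) mel mapping;
neither detail is specified in the text:

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):       # Hz -> mel, a common analytic form of the scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):   # mel -> Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=20, n_ceps=13):
    """Cosine transform of the log energies of a triangular mel filter bank."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    centers = inv_mel(np.linspace(mel(0), mel(fs / 2), n_filters + 2))
    fbank = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = centers[i], centers[i + 1], centers[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0, 1)     # rising slope
        down = np.clip((hi - freqs) / (hi - mid), 0, 1)   # falling slope
        fbank[i] = np.sum(power * np.minimum(up, down))   # triangular weight
    return dct(np.log(fbank + 1e-12), type=2, norm="ortho")[:n_ceps]

fs = 8000
t = np.arange(int(0.02 * fs)) / fs
coeffs = mfcc(np.sin(2 * np.pi * 300 * t), fs)
```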
Linear Predictive Coding (LPC) is based on the idea that voiced speech is almost
periodic and therefore predictable. The number of previous samples used to
linearly predict the present sample defines the number of coefficients (weights).
This is equivalent to the number of poles in the system, which is treated as
linear. The linear prediction characterizes the speech spectrum (Atal 1974). The
coefficients (weighting factors) are called Linear Predictive Coefficients, and
their number the LPC order. LPC was used as early as 1983 in speech recognition.
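The autocorrelation method of linear prediction can be sketched with the
standard Levinson-Durbin recursion; the synthetic second-order test signal below
is an illustrative assumption, not data from the text:

```python
import numpy as np

def lpc(frame, order=10):
    """Linear Predictive Coefficients via the autocorrelation method and the
    Levinson-Durbin recursion. `order` is the number of previous samples used
    to predict the current one (the LPC order)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                 # reflection coefficient
        a_prev = a[:i].copy()
        a[1:i + 1] += k * a_prev[::-1]  # Levinson update of the predictor
        err *= (1.0 - k * k)            # residual prediction error
    return a, err

# Synthesize a 2nd-order autoregressive signal; LPC of order 2 should
# recover predictor coefficients close to [1, -0.6, 0.4].
rng = np.random.default_rng(0)
noise = rng.standard_normal(4000)
x = np.zeros(4000)
for n in range(2, 4000):
    x[n] = 0.6 * x[n - 1] - 0.4 * x[n - 2] + noise[n]
a, err = lpc(x, order=2)
```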
In Perceptual Linear Prediction (PLP), the LPC and filter bank approaches are
combined by fitting an all-pole model to the set of energies produced by a
perceptually motivated filter bank. The cepstrum is then computed from the model
parameters (Dautrich et al. 1983). This is an efficient speech representation and
has been used extensively in DARPA evaluations of large-vocabulary ASR
technology.
Even for cognition, human beings are supposed to use a spectral-domain
representation. As mentioned earlier, the inner ear, which analyzes the sound
signal, contains the primary analyzer, the cochlea. This contains a large array
of resonators (approximately 30,000 fibers), whose characteristic frequencies
(CF) range from approximately 20 Hz to 20 kHz. These are used to break a signal
down into its spectral components. Experiments have shown that firings from the
associated nerve fibers can give a conforming description of the formant
structure of the input sound. High firing rates of auditory nerves have been
found in neurons whose CFs correspond to formant frequencies. It is reported that
the excitation pattern of the auditory nerves over the cochlea produces patterns
which may be called “auditory spectra” of the signal. These have an uncanny
similarity with the spectral components produced by the LEA of Fant (1970). It
was universally held that formants (their stationary states, for vowel cognition)
and their movements (for the place of articulation of most consonants) account
for the perception of place of articulation of all phonemes.
The first report on speech research in India was published from the Indian
Statistical Institute (ISI) in Kolkata in 1968 (Dutta Majumdar and Datta 1968a,
b, 1969). The technique of digital filtering for spectral estimation was used
there in 1973. The first spectral analyzer of Kay Elemetrics came to ISI in 1972,
making spectral analysis much easier. This prompted the group to take up spectral
analysis of speech sounds in right earnest, particularly of vowels in different
Indian languages. These were successively reported for Hindi in 1973, Telugu in
1978 (Dutta Majumdar et al. 1978), and Bangla in 1988 (Datta et al. 1988). The
study on consonantal sounds (plosives) revealed the importance of transitory
movements (Datta et al. 1981).
It may be of interest to note that the Fourier transform gives two complementary
pieces of information, namely the amplitude spectrum and the phase spectrum. It
is known that only the two together represent the signal in its totality; for the
inverse transformation, both are necessary. Yet, with very rare exceptions, only
the amplitude spectrum is used in acoustic representation.
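The point can be illustrated directly: keeping amplitude and phase together
allows exact inversion. The random test frame is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
frame = rng.standard_normal(160)      # any 20 ms frame at 8 kHz

spectrum = np.fft.rfft(frame)
amplitude = np.abs(spectrum)          # what acoustic analyses usually keep
phase = np.angle(spectrum)            # what they usually discard

# Only both together represent the signal in its totality: the inverse
# transform from amplitude and phase recovers the original frame exactly.
reconstructed = np.fft.irfft(amplitude * np.exp(1j * phase), n=len(frame))
max_error = float(np.max(np.abs(reconstructed - frame)))
```

Discarding the phase, by contrast, leaves a whole family of different waveforms
sharing the same amplitude spectrum.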
In the early 1990s at ISI, the first evidence of doubt about the necessity of
spectral representation of sound in the cognition of vowels crept in when some
signals were produced with the same spectral structures but sounding like
different vowels (Datta 1993). Continued efforts in the same direction also
produced VCV syllables in which there was no formant transition, yet the
different plosives could be clearly distinguished. These experiments showed that
formants are neither necessary nor sufficient for the cognition of different
phones (Datta et al. 2011). Moreover, they further indicated the possibility of
time-domain features (shape features) as a reliable alternative in speech
research. This aspect is discussed elaborately in Chap. 4. These developments led
the group in ISI to start working on time-domain features for ASR, TTS, and
singing synthesis, with encouraging successes. The results were demonstrated at
an ESCA conference in 1993 (Datta et al. 1990).
One interesting characteristic of quasi-periodic sounds is the recently studied
phenomenon generally known as random perturbations, whose cognitive influence
lies in the quality of the sound. These are manifested as small random
differences in fundamental frequency (jitter), amplitude (shimmer), and
complexity (CP) between two consecutive periods of a speech signal. Obviously, a
spectral-domain approach cannot detect these. An exhaustive study of these
perturbations for different quasi-periodic signals in different contexts has been
conducted at ISI and is included in one chapter.
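Assuming period boundaries (epochs) have already been detected, jitter and
shimmer can be sketched as relative period-to-period differences. The exact
measures used in the ISI study are not specified here, so these common local
definitions are an assumption:

```python
import numpy as np

def jitter_shimmer(period_lengths, peak_amplitudes):
    """Local jitter: mean absolute difference between consecutive period
    durations, relative to the mean period. Local shimmer: the same measure
    applied to per-period peak amplitudes. Inputs are assumed to come from
    some epoch (period-boundary) detector."""
    T = np.asarray(period_lengths, dtype=float)
    A = np.asarray(peak_amplitudes, dtype=float)
    jitter = np.mean(np.abs(np.diff(T))) / np.mean(T)
    shimmer = np.mean(np.abs(np.diff(A))) / np.mean(A)
    return jitter, shimmer

# Perfectly periodic: both perturbation measures vanish.
j0, s0 = jitter_shimmer([100, 100, 100, 100], [1.0, 1.0, 1.0, 1.0])

# Small random period/amplitude differences yield nonzero jitter/shimmer.
j1, s1 = jitter_shimmer([100, 103, 98, 101], [1.0, 0.95, 1.05, 0.98])
```

Note that both measures are defined on consecutive periods of the waveform, so
they require a time-domain, period-by-period view of the signal.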
Slowly, a viable time-domain representation of speech for both objective and
subjective analysis, an alternative to the well-known spectral representation,
evolved at ISI. The book presents its history and the extent of this development
in the technology domain, as well as a comparison with the spectral-domain
approach. The deficiency of the spectral-domain representation in the cognitive
domain is presented. All the cognitive experiments related to this development,
along with details of technology development related to both ASR and TTS, are
given. The later-stage technology developments were done at CDAC, a Govt. of
India sponsored all-India institution.
It is generally believed that although phoneme recognition plays a major role in
human cognition of speech, its accuracy depends on the lexical knowledge of the
listener. However, how the brain surmises the word without knowing the phonemes
is not yet clear. Many theories abound, including higher linguistic analysis
involving, inter alia, syntax, pragmatics, and semantics. One interesting and
novel development in automatic recognition of spoken words, which exploits
lexical knowledge through a presumption of the possible words on the basis of the
manner of production of phones, needs specific mention here. It is described in a
later chapter.
India has many official dialects, and spoken language technology development is
a burgeoning area. In fact, TTS and ASR taken together form the most powerful
technology to empower people in a country like India. The book endeavors to
present the related issues in such a way that research and development in ASR or
TTS in all these languages may be done seamlessly using the information in this
book. In short, this book may simply be a guidebook for the development of ASR
and TTS in all the Indian Standard Dialects.
1.4 Automatic Speech Recognition (ASR)
The technology of Automatic Speech Recognition (ASR) has progressed greatly over
the last seven decades. The study of automatic speech recognition and
transcription can be traced back to 1950 at AT&T's Bell Labs. In 1952, at Bell
Laboratories, Davis, Biddulph, and Balashek built a system for isolated digit
recognition for a single speaker (Davis et al. 1952), using the formant
frequencies measured/estimated during vowel regions of each digit. Olson and
Belar of RCA Laboratories recognized 10 distinct syllables for a single speaker
in 1956 (Olson and Belar 1956). Fry and Denes tried to build a phoneme recognizer
for four vowels and nine consonants in 1959 at University College, England (Fry
1959), and used the first statistical syntax at the phoneme level. In the late
1960s, Reddy at Carnegie Mellon University conducted pioneering research in
continuous speech recognition by dynamic tracking of phonemes (Reddy 1966). As
early as 1968, Dutta Majumder and Datta of ISI, Kolkata proposed a model for
spoken word recognition in Indian languages (Dutta Majumdar and Datta 1968a, b).
In 1975, the DRAGON system was developed, capable of recognizing one thousand
English words (Baker 1975). In the 1980s, a big shift in speech recognition
methodology took place when the conventional template-based approach (a
straightforward pattern recognition paradigm) was replaced by rigorous
statistical modeling such as the Hidden Markov Model (HMM) (Rabiner 1989). The
SPHINX system was developed at Carnegie Mellon University (CMU) based on the HMM
method for a 1000-word database to achieve high word accuracy (Lee et al. 1990).
Major techniques include Maximum Likelihood Linear Regression (MLLR) (Leggetter
and Woodland 1995), Model Decomposition (Varga and Moore 1990), Parallel Model
Combination (PMC) (Gales and Young 1993), and the Structural Maximum A Posteriori
(SMAP) method (Shinoda and Lee 2001).
Although read speech and similar types of speech, e.g., news broadcasts or read
text, can be recognized with accuracy higher than 85% using state-of-the-art
speech recognition technology for English and other European languages,
recognition accuracy decreases drastically for spontaneous speech. Broadening the
application of speech recognition depends crucially on raising recognition
performance for spontaneous speech. Research on spontaneous speech recognition
started in the twenty-first century. For this purpose, it is necessary to build a
large spontaneous speech corpus for constructing the acoustic and language
models.
Research on automatic speech recognition in India began in 1963. While
continuous speech recognition has not been attempted, phone recognition in
different Indian languages has been undertaken, and in the later period isolated
word recognition has also been attempted. The timeline has been presented in an
earlier paragraph.
describes a procedure which uses HMMs with formants as the acoustic
observations. This helps to fix the problems of traditional formant synthesizers.
Formants are indeed a good spectral representation for HMMs since we can assume,
as with MFCCs, that each formant is statistically independent of the others.
The development of concatenative synthesis, a fully time-domain approach, began
in India in the early 1990s, and the Indian Statistical Institute (ISI) played
the seminal role in it. The interesting story behind this development is that
1993 was earmarked for the birth centenary celebration of the late Professor
Prasanta Chandra Mahalanobis, the Founder Director of ISI, also known as the
“Father of Statistics in India.” The group in the Electronics and Communication
Sciences Unit of ISI resolved to contribute to this centenary celebration.
Intensive efforts over about 8 months produced the Epoch Synchronous Non-Overlap
Add (ESNOLA) algorithm for concatenative synthesis (Dan and Datta 1993). We had
the satisfaction that the centenary celebration was inaugurated with a welcoming
speech and a Rabindra Sangeet rendered by the ESNOLA synthesis system, which was
appreciated by the audience. This was the first TTS in an Indian dialect. The
work resurfaced around 2005 at CDAC, Kolkata, where the new overlap-add version,
ESOLA, was developed with the inclusion of a rudimentary prosodic structure. The
corresponding TTS system produced almost natural-sounding Bangla speech. It was
used by the Election Commission (EC) of India for automated announcement of the
State Assembly election results in 2005. Even at this point of time, this is the
only indigenous TTS system in India, available only for Bangla, awaiting societal
use for the empowerment of the functionally illiterate masses and of visually
disabled persons (ESOLA Book). Bengal has a rich and really large literary
treasure, and a good TTS would be a boon to visually challenged people, allowing
them to have a taste of this treasure at will.
Concatenative synthesis was felt to be potentially a more natural, simpler, and
better approach in terms of sound quality than the parametric approaches. The
most important research interest in this area is the modification, and sometimes
even regeneration, of short segments of sound to take care of the pitch
modification and complexity manipulation required to obtain natural continuity
and to meet prosody requirements. Special methodology had to be developed for
these purposes. This led to a microscopic examination of single waveforms from
segments representing speech events, to ascertain the role of different parts of
the waveform in the perception of phonetic quality as well as in the manipulation
of loudness, pitch, and timbre. In fact, this study actually led to the
development of the “time-domain representation,” an alternative to the
spectral-domain representation of speech sound. In India, the first concatenative
speech synthesis, the Epoch Synchronous Non-Overlap Add (ESNOLA) algorithm (Datta
et al. 1990), appeared in 1993. Along with speech, ESNOLA also demonstrated
synthesis of singing by producing a Bangla Rabindra Sangeet at the same
conference. This was the first TTS in an Indian dialect. Later, ESOLA (the Epoch
Synchronous Overlap Add algorithm) was developed around 2002.
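The flavor of epoch-synchronous overlap-add can be conveyed by a toy
pitch-modification sketch. This is not the authors' ESNOLA/ESOLA algorithm; the
idealized epoch marks, Hanning window, and scaling rule are illustrative
assumptions:

```python
import numpy as np

def epoch_ola_pitch_shift(signal, epochs, factor):
    """Toy epoch-synchronous overlap-add: windowed two-period grains centred
    on each epoch are re-added at epoch spacings scaled by `factor`, changing
    pitch while reusing the original waveform shapes."""
    epochs = np.asarray(epochs)
    periods = np.diff(epochs)
    new_epochs = [int(epochs[0])]
    out = np.zeros(int(len(signal) * 1.2) + 1)
    for i in range(1, len(epochs) - 1):
        half = int(min(periods[i - 1], periods[i]))
        grain = signal[epochs[i] - half : epochs[i] + half] * np.hanning(2 * half)
        center = int(new_epochs[-1] + periods[i - 1] / factor)
        new_epochs.append(center)
        lo, hi = center - half, center + half
        if lo >= 0 and hi <= len(out):
            out[lo:hi] += grain      # overlap-add at the new epoch position
    return out

fs = 8000
t = np.arange(int(0.1 * fs)) / fs
sig = np.sin(2 * np.pi * 100 * t)              # 100 Hz -> 80-sample period
epochs = np.arange(0, len(sig), 80)            # idealized epoch marks
shifted = epoch_ola_pitch_shift(sig, epochs, 1.25)  # raise pitch toward 125 Hz
```

A real system must also detect the epochs from the waveform and manage grain
amplitudes across overlaps, which is where the microscopic waveform study
described above comes in.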
References
Ainsworth, W. A. (1973). A system for converting English text into speech. IEEE Transactions on
Audio and Electroacoustics, 23, 288–290.
Allen, J. (1976). Synthesis of speech from unrestricted text. Proceedings of the IEEE, 64, 422–433.
Allen, J., Hunnicutt, S., Carlson, R., & Granstrom, B. (1979). MITalk-79: The 1979 MIT text-
to-speech system. In J. J. Wolf & D. H. Klatt (Eds.), ASA-50 speech communication papers
(pp. 507–510). New York: Acoustical Society of America.
Atal, B. S. (1974). Effectiveness of linear prediction characteristics of the speech wave for automatic
speaker identification and verification. The Journal of the Acoustical Society of America, 55,
1304–1312.
Baker, J. K. (1975). The DRAGON system—An overview. IEEE Transactions on Acoustics, Speech,
and Signal Processing, ASSP-23, 24–29.
Bell, A. G. (1906). The mechanism of speech. New York: Funk & Wagnalls. Reprinted from the
proceedings of the first summer meeting of the American Association to Promote the Teaching
of Speech to the Deaf.
Choudhury, L., & Datta, A. K. (1988). Consonance between physics and philosophy regarding
nature and propagation of sound. Journal of Acoustical Society of India, 26(3–4), 508–513.
Dan, T., & Datta, A. K. (1993). PSNOLA approach to synthesis of singing. In Proceedings of
P. C. Mahalanobis Birth Centenary, Volume IAPRDT3 (pp. 388–394). Calcutta: Indian Statistical
Institute.
Datta, A. K. (1993). Do ears perceive vowels through formants? In Proceedings of the 3rd
European Conference on Speech Communication and Technology, Genova, Italy, September 21–23,
1993 (also in Proceedings of P. C. Mahalanobis Birth Centenary, Volume IAPRDT3, Indian
Statistical Institute, Calcutta, pp. 434–441).
Datta, A. K. Epoch synchronous concatenative synthesis of speech and singing: A study on Indian
context. Springer (in press).
Datta, A. K., Ganguly, N. R., & Dutta Majumdar, D. (1981). Acoustic features of consonants: A
study based on Telugu speech sounds. Acustica, 47, 72–82.
Datta, A. K., Ganguly, N. R., & Mukherjee, B. (1988). Acoustic phonetics of non-nasal standard
Bengali vowels. A Spectrographic Study, JIETE, 34, 50–56.
Datta, A. K., Ganguly, N. R., & Mukherjee, B. (1990). Intonation in segment-concatenated speech.
In Proceedings of ESCA Workshop on Speech Synthesis (pp 153–156). Autrans, France.
Datta, A. K., & Mukherjee, B. (2011). On the role of formants in cognition of vowels and place
of articulation of plosives. In S. Ystad, M. Aramaki, R. Kronland-Martinet, K. Jensen, &
S. Mohanty (Eds.), Speech, sound and music processing: Embracing research in India. Springer.
Dautrich, B. A., Rabiner, L. R., & Martin, T. B. (1983). On the effects of varying filter bank
parameters on isolated word recognition. IEEE Transactions on Acoustics, Speech, and Signal
Processing, 31(4), 793–807.
Davis, K. H., Biddulph, R., & Balashek, S. (1952). Automatic recognition of spoken digits. The
Journal of the Acoustical Society of America, 24(6), 637–642.
Dutoit, T. (1994). High quality text-to-speech synthesis: A comparison of four candidate algorithms.
In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP) (pp. 565–568).
Dutta Majumdar, D., & Datta, A. K. (1968a). Some studies in automatic speech coding and recog-
nition procedure. Indian Journal of Physics, 12, 425–443.
Dutta Majumdar, D., & Datta, A. K. (1968b). A model for spoken word recognition. In International
Conference on Instrumentation and Automation, Milan, Italy.
Dutta Majumdar, D., & Datta, A. K. (1969). An analyzer coder for machine recognition of speech.
JITE, 15, 233–243.
Dutta Majumdar, D., Datta, A. K., & Ganguly, N. R. (1978). Some studies on acoustic phonetic
features of human speech in relation to Hindi speech sounds. Acustica, 1, 55–64.
Falk, D. (2004). Prelinguistic evolution in early hominins: Whence motherese? Behavioral and
Brain Sciences, 27(4), 535.
Fant, G. (1970). Acoustic theory of speech production. Mouton De Gruyter.
Flanagan, J. L. (1972). Speech analysis synthesis and perception (2nd ed.). Berlin, Heidelberg, New
York: Springer.
Flanagan, J. L., & Ishizaka, K. (1978). Computer model to characterize the air volume displaced
by the vibrating vocal cords. Journal of the Acoustical Society of America, 63, 1558–1563.
Fourier, J. (1808). Mémoire sur la propagation de la chaleur dans les corps solides, présenté le 21
Décembre 1807 à l’Institut national—Nouveau Bulletin des sciences par la Société philomatique
de Paris. I. Paris: Bernard. March 1808.
Fry, D. B. (1959). Theoretical aspects of mechanical speech recognition. Journal of the British
Institution of Radio Engineers, 19(4), 211–229.
Galaburda, A. M., & Pandya, D. N. (1982). Roles of architectonics and connections in the study
of primate evolution. In E. Armstrong & D. Falk (Eds.), Primate brain evolution: Methods and
concepts (pp. 203–216). New York: Plenum Press.
Gales, M. J. F., & Young, S. J. (1993). Parallel model combination for speech recognition in noise.
Technical Report, CUED/F-INFENG/TR 135.
Helmholtz, H. L. F. (1954). On the sensations of tone as a physiological basis for the theory of music
(2nd English ed., translated from the fourth and last German edition of 1877). New York:
Dover Publications.
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis for speech. The Journal of The
Acoustical Society of America, 87, 1738–1752.
Hermansky, H. (1997). Auditory modeling in automatic recognition of speech. In Proceedings of
the First European Conference on Signal Analysis and Prediction (pp. 17–21). Prague, Czech
Republic.
Klatt, D. H. (1982). The KLATTalk text-to-speech conversion system. In Proceedings of IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1589–1592).
Lee, K. F., Hon, H. W., & Reddy, R. (1990). An overview of the SPHINX speech recognition system.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(1), 35–45.
Leggetter, C. J., & Woodland, P. C. (1995). Maximum likelihood linear regression for speaker
adaptation of continuous density hidden Markov models. Computer Speech and Language, 9,
171–185.
Lyons, D. M., Kim, S., Schatzberg, A. F., & Levine, S. (1998). Postnatal foraging demands
alter adrenocortical activity and psychosocial development. Developmental Psychobiology, 32,
285–291.
Mathews, M. V., & Moore, F. R. (1970). GROOVE—A program to compose, store, and edit functions
of time. Communications of the ACM, 13(12), 715.
National Conference on Innovative Paradigms in Engineering & Technology (NCIPET-2012).
Proceedings published by International Journal of Computer Applications (IJCA), 20.
Olson, H. F., & Belar, H. (1956). Phonetic typewriter. The Journal of the Acoustical Society of
America, 28(6), 1072–1081.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, 77(2), 257–286.
Reddy, D. R. (1966). An approach to computer speech recognition by direct analysis of the speech
wave. Tech. Report No. C549, Computer Science Dept., Stanford University.
Schroeder, M. (1993). A brief history of synthetic speech. Speech Communication, 13, 231–237.
Shinoda, K., & Lee, C. H. (2001). A structural Bayes approach to speaker adaptation. IEEE
Transactions on Speech and Audio Processing, 9(3), 276–287.
Tokuda, K., Zen, H., & Black, A. W. (2002). An HMM-based speech synthesis system applied to
English. In IEEE Speech Synthesis Workshop, Santa Monica, California, September 11–13, 2002.
Tomasello, M., & Camaioni, L. (1997). A comparison of the gestural communication of apes and
human infants. Human Development, 40, 7–24.
Varga, A. P., & Moore, R. K. (1990). Hidden Markov model decomposition of speech and noise. In
Proceedings on ICASSP (pp. 845–848).
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (1999). Simultaneous
modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proceedings of
Eurospeech 99 (pp. 2347–2350).
Chapter 2
Spectral Domain
2.1 Introduction
ically inherited from speech coding applications (Hermansky 1997). The use of the formant
frequencies, the resonance structures in the speech spectrum, to recognize vowels can be traced
back to the early 1950s at AT&T's Bell Labs.
The second method for estimating the spectral envelope is via a filter bank, which divides
the signal bandwidth into a number of frequency bands and measures the signal energy in each.
As early as 1939, Homer Dudley represented speech by breaking it down into its acoustic
components using 10 bandpass filters, demonstrated at the 1939–40 New York World's Fair.
Liljencrants developed a speech spectrum analyzer using a 51-channel filter bank (Liljencrants
1965). In India, Ujjwal and Amekar (2012) developed a gammatone filter bank for representing
speech.
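The band-energy measurement of a filter bank can be sketched numerically. The following Python fragment is an illustration only: it approximates a bank of bandpass filters by summing FFT power over band intervals, and the band edges and count are arbitrary choices, not those of any system described above.

```python
import numpy as np

def filterbank_energies(signal, sample_rate, n_bands=10, f_lo=100.0, f_hi=4000.0):
    """Measure signal energy in a set of equal-width frequency bands
    (a minimal FFT-based stand-in for an analog bandpass filter bank)."""
    power = np.abs(np.fft.rfft(signal)) ** 2                 # power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    edges = np.linspace(f_lo, f_hi, n_bands + 1)             # band boundaries
    return np.array([
        power[(freqs >= lo) & (freqs < hi)].sum()
        for lo, hi in zip(edges[:-1], edges[1:])
    ])

# Usage: one second of a pure 500 Hz tone at an 8 kHz sampling rate.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 500 * t)
e = filterbank_energies(tone, sr)
```

As a real filter bank would show, virtually all of the tone's energy lands in the single band containing 500 Hz.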
Another useful method for speech analysis is cepstrum analysis. Here the speech is modeled
by a time-varying filter representing the vocal tract, which is excited by an appropriate
source. In the frequency domain, the log power spectrum of the output signal is the sum of
the log power spectra of the source and the filter. The composite log power spectrum is
passed through a low-pass filter to retain only the characteristics of the vocal-tract
filter. The resultant representation is called the cepstrum, and its coefficients are
called cepstral coefficients (Hermansky 1990).
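A minimal numerical sketch of this procedure follows. It is an illustration under stated assumptions: a single analysis frame, and an arbitrary cutoff `n_keep` for the low-pass "liftering" step that keeps only the slowly varying (vocal-tract) part of the log spectrum.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log power spectrum."""
    spectrum = np.abs(np.fft.fft(frame)) + 1e-12    # avoid log(0)
    return np.fft.ifft(np.log(spectrum ** 2)).real

def spectral_envelope(frame, n_keep=30):
    """Low-pass 'lifter' the cepstrum: keep only the first n_keep
    coefficients (and their symmetric mirror), which carry the slowly
    varying filter characteristics, then go back to the log spectrum."""
    c = real_cepstrum(frame)
    liftered = np.zeros_like(c)
    liftered[:n_keep] = c[:n_keep]
    liftered[-(n_keep - 1):] = c[-(n_keep - 1):]    # symmetric counterpart
    return np.fft.fft(liftered).real                # smoothed log power spectrum

# Usage: envelope of a 512-sample frame of a 100 Hz sinusoid (illustrative).
sr = 8000
frame = np.sin(2 * np.pi * 100 * np.arange(512) / sr)
env = spectral_envelope(frame)
```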
Mel-frequency cepstral coefficients (MFCCs), a variant of the cepstral coefficients, are
widely used in speech recognition to represent different speech sounds. MFCCs are the
result of a cosine transform of the real logarithm of the short-term energy spectrum
expressed on a mel-frequency scale (Dautrich et al. 1983).
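The chain power spectrum → mel filter bank → log → cosine transform can be illustrated with a bare-bones sketch. This is not a production front end (practical MFCC pipelines add pre-emphasis, windowing, and particular filter counts); every parameter here is an illustrative assumption.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sample_rate, n_filters=20, n_ceps=13):
    """Minimal MFCC: power spectrum -> triangular mel filter bank ->
    log energies -> cosine transform (DCT-II)."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_pts = mel_to_hz(mel_pts)
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)
        down = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)
        energies[i] = np.sum(power * np.minimum(up, down))
    log_e = np.log(energies + 1e-12)
    # DCT-II basis applied to the log filter-bank energies
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return basis @ log_e

# Usage: 13 coefficients from a 512-sample frame of a 440 Hz tone.
sr = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(512) / sr)
c = mfcc(frame, sr)
```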
Linear predictive coding is based on the idea that voiced speech is almost periodic and is
therefore predictable. The number of previous samples used for linearly predicting the
present sample defines the number of coefficients (weights) or codes, and is equal to the
number of poles in the linear system. Linear prediction therefore theoretically allows us
to characterize the speech spectrum (Atal 1974). The coefficients (weighting factors) are
called linear predictive coefficients (LPC), and the number of coefficients is called the
LPC order. LPC was used in speech recognition as early as 1983.
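The computation of the coefficients can be sketched with the standard autocorrelation method and the Levinson-Durbin recursion. This is a minimal illustration, not a production implementation, and the second-order synthetic signal in the usage is hypothetical.

```python
import numpy as np

def lpc(frame, order):
    """LPC via the autocorrelation method and Levinson-Durbin recursion.
    Returns the prediction polynomial a with a[0] = 1."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                        # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]   # update earlier coefficients
        a[i] = k
        err *= 1.0 - k * k                    # remaining prediction error
    return a

# Usage: recover the coefficients of a synthetic 2nd-order autoregressive
# signal x[n] = 0.9 x[n-1] - 0.5 x[n-2] + noise, so a[1], a[2] should be
# close to -0.9 and 0.5.
rng = np.random.default_rng(0)
n = 4096
x = np.zeros(n)
noise = rng.standard_normal(n)
for i in range(2, n):
    x[i] = 0.9 * x[i - 1] - 0.5 * x[i - 2] + noise[i]
a = lpc(x, 2)
```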
Perceptual linear prediction (PLP) combines the LPC and filter-bank approaches by fitting
an all-pole model to the set of energies produced by a perceptually motivated filter bank
and then computing the cepstrum from the model parameters (Hermansky 1990). It was also
found to be one of the most efficient speech representations in the extensive DARPA
evaluations of large-vocabulary continuous-speech ASR technology (Cook et al. 1996;
Woodland et al. 1996).
We have already noted the three methods for spectral analysis, namely, the Fourier
transform, digital filter banks, and linear prediction. Of these three, the most commonly
used is the Fourier transform, while our ear uses the filter-bank method. In fact, almost
all speech research in reality uses the harmonic analysis of Helmholtz's era, the
nineteenth century, under the name of frequency-domain analysis. This may be for two
reasons: one is legacy, and the other is the very strong and substantiated belief that the
ear does likewise. Even when we use the Fourier transform, we look only at the amplitude
spectra; the phase spectra are neglected. Let us peruse Fig. 2.1a and b. Both are composed
of the same fundamental and one harmonic; only the phase of the harmonic differs between
the two figures. The
Fig. 2.1 Two different waveforms generated from the same two harmonics
result is two different waveforms, as expected. The point is that the harmonics alone do
not really represent the signal itself. If we want to represent the signal fully, we have
to take heed of the phase spectra.
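The observation can be verified numerically: two signals built from the same fundamental and the same harmonic amplitude, differing only in the phase of the harmonic, have identical amplitude spectra yet clearly different waveforms. The frequencies and amplitudes below are arbitrary illustrative choices.

```python
import numpy as np

# Two signals from the same fundamental (f0) and one harmonic (2*f0),
# identical amplitudes; only the phase of the harmonic differs.
sr, f0 = 8000, 200
t = np.arange(sr) / sr
x1 = np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
x2 = np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t + np.pi / 2)

# Amplitude spectra are identical; the waveforms are not.
amp1 = np.abs(np.fft.rfft(x1))
amp2 = np.abs(np.fft.rfft(x2))
```

The amplitude spectra agree to numerical precision, while the peak sample-by-sample difference between the two waveforms is large, just as Fig. 2.1a and b show.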
Be that as it may, we have been using harmonic analysis as spectral analysis for over two
centuries, with quite satisfactory results: it has objectively defined phones and
speech-related events across many languages and has created extremely useful technology
for humanity. In the next section, we shall briefly describe the spectral characteristics
of phones in one language, Bangla, to see how beautifully the spectrum works in their
objective representation and how it correlates with our perception of phones, with an
example in the case of Bangla vowels.
Fig. 2.2 Illustration of formants with respect to the vowel /æ/ (Datta 2018)
In general, the first formant is associated with tongue height, and the second formant
frequency with the back-to-front position of the tongue hump. It is now common to use the
first two formants for a reliable, objective estimate of the articulatory position of a
vowel. Figure 2.8 presents one example each for the seven Bangla vowels.
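As a toy illustration of such objective determination, a vowel token can be assigned to the nearest vowel mean in the F1–F2 plane. The mean formant values below are hypothetical round numbers for three cardinal-like vowels, not the measured Bangla values discussed in this chapter.

```python
import numpy as np

# Illustrative (not measured) mean F1/F2 values in Hz.
vowel_means = {
    "i": (300, 2300),   # high front: low F1, high F2
    "a": (750, 1300),   # low: high F1
    "u": (350, 800),    # high back: low F1, low F2
}

def classify_vowel(f1, f2):
    """Nearest-mean classification in the F1-F2 plane (Euclidean distance)."""
    point = np.array([f1, f2], dtype=float)
    return min(vowel_means,
               key=lambda v: np.linalg.norm(point - np.array(vowel_means[v])))
```

For example, a token at (310 Hz, 2200 Hz) falls nearest the /i/ mean, reflecting how the first two formants locate a vowel articulatorily.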
Figure 2.3 presents the black-and-white spectrogram of a steady state of the vowel /æ/,
followed by the normal spectra at the right. The x-axis of the spectrogram represents
time, the y-axis frequency, and the grayness gives a comparative idea of the strength of
energy at a particular time and frequency of the harmonics. Spectrograms are very useful
in understanding the dynamic movement of timbral quality.
2.3 Spectra of Oral Vowels
An exhaustive study of the formants of Bangla vowels has been done at the Indian
Statistical Institute, Kolkata, and CDAC, Kolkata. It may be pertinent to briefly
introduce the results. Figure 2.4 represents the mean position and an estimate of the
spread of Bangla oral vowels in the F1–F2 plane, for data of both sexes pooled together.
The dots represent the mean positions of the vowels. The ovals give an idea of the spread:
the widths and heights of the ovals are the standard deviations of F2 and F1 values,
respectively. Assuming a normal distribution, the ovals cover only about 68% of the data.
That the formant frequencies F1, F2, and F3 for a vowel closely follow a normal
distribution was reported as early as 1978. Though the ovals appear to be disjoint, this
is actually because they contain only a part of the data.
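The 68% figure is the one-dimensional normal rule: about 68% of the values of each formant, taken separately, fall within one standard deviation of its mean. A quick numerical check on synthetic samples (the mean and standard deviation below are hypothetical, not the measured Bangla values):

```python
import numpy as np

# Synthetic, normally distributed "F1" samples (illustrative values in Hz).
rng = np.random.default_rng(42)
f1 = rng.normal(600.0, 80.0, size=100_000)

# Fraction of samples within one standard deviation of the mean.
inside = np.abs(f1 - f1.mean()) <= f1.std()
coverage = inside.mean()
```

With normally distributed data, `coverage` comes out close to 0.68, matching the stated spread of the ovals along each formant axis.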
As an example of correlating spectral data with perception, one may cite the technique
that maps formant data, together with F0 values, onto the traditional perceptual
evaluation of the category of a vowel utterance in terms of the height and backness of the
tongue. This technique transforms Fig. 2.4 into Fig. 2.5, which represents Bangla vowels
in this perceptual frame.
Nasal vowels are produced when the velum is open and the nasopharynx is coupled with the
oropharynx. Nasals are said to be characterized by nasal formants and anti-formants. In
general, these studies reveal the following acoustic cues for the oral/nasal distinction:
1. Strengthening of F0,
2. Weakening of F1,
3. Strengthening of F2,
4. Raising of F1 and F2, and
5. Presence of nasal formants and anti-formants.
Fig. 2.5 Perceptual vowel diagram for Bangla vowels drawn from objective data (Datta 2018)
Fig. 2.6 Formants of Bangla oral and nasal vowels (Datta 2018)