Asoke Kumar Datta

Time Domain Representation of Speech Sounds
A Case Study in Bangla
Asoke Kumar Datta (emeritus)
Indian Statistical Institute
Kolkata, West Bengal, India

ISBN 978-981-13-2302-7
ISBN 978-981-13-2303-4 (eBook)
https://doi.org/10.1007/978-981-13-2303-4

Library of Congress Control Number: 2018952609

© Springer Nature Singapore Pte Ltd. 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
The book is dedicated to my revered father,
the late Maheshwar Datta
Acknowledgements

The author gratefully acknowledges the free and full cooperation of colleagues from my parent department, the Electronics and Communication Sciences Unit of the Indian Statistical Institute (ISI); from CDAC, Kolkata; and from the Sir C. V. Raman Centre for Physics and Music (CVRCPM) of Jadavpur University, Kolkata. My special thanks must go to former Prof. Nihar Ranjan Ganguly, the late Bijon Mukherjee, and the late Krishna Mohan Pattanaik of my parent department. I also gratefully acknowledge the helping hand extended by Sri Amiya Saha, former Executive Director of CDAC, Kolkata, as well as the fullest cooperation of Dr. Shyamal Das Mondal and Arup Saha of the same institution. I have been lucky to get cooperation from many students and co-workers from ISI, CDAC, Kolkata, and the Sir C. V. Raman Centre for Physics and Music during the long period of investigation in the field. I also wish to thank Dr. Ranjan Sengupta of CVRCPM for constantly encouraging me to publish this book.

Contents

1 Introduction
  1.1 General
  1.2 Spectral Domain Representation
  1.3 Time-Domain Representation
  1.4 Automatic Speech Recognition (ASR)
  1.5 Speech Synthesis
  References
2 Spectral Domain
  2.1 Introduction
  2.2 Spectral Structure of Bangla Phones
  2.3 Spectra of Oral Vowels
  2.4 Spectra of Nasal Vowels
  2.5 Spectra of Aspirated Vowels
  2.6 Dynamical Spectral Structures
  References
3 Cognition of Phones
  3.1 Place of Articulation of Plosives and Vowels
    3.1.1 Introduction
    3.1.2 Machine Identification of Place of Articulation of Plosives
    3.1.3 Experimental Procedure
    3.1.4 Results
  3.2 Cognition of Place of Articulation
    3.2.1 Manipulation of the Signals
    3.2.2 Preparation of the Listening Set
    3.2.3 Results and Discussions
  3.3 Conclusion
    3.3.1 Spectral Cues of Nasal/Oral Distinction
    3.3.2 Cognitive Cues of Nasal/Oral Distinction
  References
4 Time-Domain Signal Processing
  4.1 State Phase Analysis
    4.1.1 Introduction
    4.1.2 State Phase Analysis
    4.1.3 Analysis–Resynthesis Using State Phase
    4.1.4 Coding for Data Packet
    4.1.5 Error Detection and Correction
    4.1.6 Resynthesis Using Linear Interpolation
    4.1.7 Decoding and Regeneration
    4.1.8 Discussion
  4.2 Morphological Operations
    4.2.1 Introduction
    4.2.2 Spectral Changes
    4.2.3 F0 Detection
    4.2.4 Estimation of GI
    4.2.5 Lx Signal
    4.2.6 Methodology
    4.2.7 Experimental Procedure
    4.2.8 Results
  References
5 Time-Domain Representation of Phones
  5.1 Introduction
  5.2 Manner of Articulation
  5.3 Parameters
    5.3.1 Vowels
    5.3.2 Consonants
  5.4 Labeling
    5.4.1 Labeling Results
    5.4.2 Manner-Based Labeling
    5.4.3 Lexical Knowledge for Phone Disambiguation
  References
6 Random Perturbations
  6.1 Introduction
  6.2 Perturbation Measures
  6.3 Perturbation Results
  6.4 Cognitive Aspects of Random Perturbations
  6.5 Discussion
  References
7 Nonlinearity in Speech Signal
  7.1 Introduction
  7.2 Chaos in Speech
  7.3 Fractals in Speech
  7.4 Summary
  References
About the Author

Prof. Asoke Kumar Datta obtained his M.Sc. in pure mathematics and worked at the Indian Statistical Institute from 1955 to 1994, retiring from the Electronics and Communication Sciences Department as its Head; he has since been a Visiting Professor at ISI. He is President, BOM-BOM, Kolkata; Senior Guest Researcher, Sir C. V. Raman Centre for Physics and Music, JU; Executive Member, Society for Natural Language Technology Research, Kolkata; and Life Member, Acoustical Society of India. He received the J. C. Bose Memorial Award, 1969; the Sir C. V. Raman Award, 1982–1983 and 1998–1999; the S. K. Mitra Memorial Award, 1984; and the Sri C. Achyut Menon Prize, 2001. His areas of academic interest include pattern recognition, AI, speech, music, and consciousness.

Prologue

Scientific investigations related to the acoustics of speech, both objective and subjective, are traditionally done in the spectral domain. Once the signal is captured, which is of course a time series of displacements, the rest usually becomes an investigation of its spectral structure. This has been so since the beginning of speech research in the nineteenth century or even earlier (Hermann von Helmholtz). It got a fillip in the 1960s, when Gunnar Fant, also known as the father of modern speech research, published his source-filter model of speech production. In fact, the development of related technologies like automatic speech recognition (ASR), text-to-speech synthesis (TTS), automatic speaker verification, and automatic spoken language identification (ASLID) may be said to have been done primarily and traditionally using the spectral domain representation of speech sound.
Human beings are also traditionally believed to use primarily some spectral domain representation for cognition. The inner ear, which analyzes the sound signal, contains the primary analyzer, the cochlea. This contains a large array of resonators (approximately 30,000 fibers), whose characteristic frequencies (CF) range from approximately 20 Hz to 20 kHz. These are used to break down a signal into its spectral components. There have been experiments showing that firings from the associated nerve fibers can give a conforming description of the formant structure of the input sound. High firing rates of auditory nerves have been found in neurons whose characteristic frequencies (CF) correspond to formant frequencies. It is reported that the excitation pattern of the auditory nerves over the cochlea produces patterns which may be called "auditory spectra" of the signal. These have significant similarities with the spectral components produced by the LEA of Fant. It was universally held that formants (a spectral term commonly used in speech research) and their movements account for the perception of the place of articulation of all phonemes.
Strong theoretical support for the spectral domain approach came through the seminal paper by Joseph Fourier in 1807. Though in a rigorous sense this transform can be used only when the series is purely periodic, it has found favor with speech scientists even though it is well known that speech signals are never fully periodic. The speech signal is generally non-stationary. However, for practical applications, it is assumed that short-term (about 10–20 ms) segments of speech can be taken as stationary. The short-term speech representation is historically inherited from speech coding applications. The use of formant frequencies, the resonance structures in speech spectra, to recognize vowels can be traced back to around 1950 at AT&T's Bell Labs. A detailed spectral domain study of the phones of 'Bangla', a major standard dialect of India and Bangladesh, has recently been published (Datta, Asoke Kumar, Acoustics of Bangla Speech Sounds, Springer, 2018).
The first evidence of doubt crept in early in 1993 at the Indian Statistical Institute (ISI) in Kolkata, India, when some signals were produced with the same spectral structures but sounding like different vowels. Continued efforts also produced VCV syllables in which there was no formant transition, yet the places of articulation of different plosives could be clearly distinguished. These experiments showed that formants are neither necessary nor sufficient for the cognition of different phones. Moreover, they further indicated that time-domain features (shape features) may be a reliable alternative in speech research. These developments led the group at ISI to start working on time-domain features for ASR, TTS, and singing synthesis, with encouraging successes. The results were demonstrated at an ESCA conference in 1993 and later published in a book (Datta, Asoke Kumar, Epoch Synchronous Overlap Add (ESOLA): A Concatenative Synthesis Procedure for Speech, Springer, 2018).
Slowly, a viable time-domain representation of speech signals for both objective and subjective analyses, an alternative to the well-known spectral representation, evolved. This book presents its history and the extent of the development, along with that of spectral domain representation, in the cognitive domain as well as the technology domain. All the cognitive experiments related to this development, along with details of technology development related to both ASR and TTS, are given. A new model, using cohorts formed through manner-based labeling to exploit lexical knowledge in ASR, has been successfully experimented with and merits inclusion in the book.
India has many official dialects, and spoken language technology development is a burgeoning area. In fact, TTS and ASR taken together form the most powerful technology to empower people in a developing country where functional literacy is low. This book endeavors to present time-domain representation in such a way that research and development in ASR or TTS in all these dialects may be done easily and seamlessly using the information in this book. In short, this book may simply serve as a guidebook for the development of ASR and TTS in all the Indian Standard Dialects in an indigenous, novel way of using signal domain parameters.
Chapter 1
Introduction

1.1 General

Speech is the most important basic attribute that helped in the evolution of man, making him distinctively different from the other primates to such an extent that he appears to rule over the whole of the animate world. In a sense, one may say that for the common man the prime vehicle for developing individuality is speech. Even in thinking, most of us use this verbal medium, internalized for this purpose ever since humans started using speech. It is generally believed that man started speaking between 100,000 and 200,000 years ago. Interestingly, this almost coincides with the time of the appearance of Homo sapiens. The basic ability of vocalization is said to be inherited from apes. In its simple form, vocalization is used by primates and other animals primarily for out-of-sight communication with others and is normally referred to as "calls" to distinguish it from speech, a sophisticated method of messaging. Neanderthals had the same DNA-coding region of the FOXP2 gene, generally known to be responsible for speech, as modern man. The earliest members of Homo sapiens may or may not have had fully developed language. (One may note that the time we are talking about is when writing had not developed, so language here means only spoken language, or speech.) Scholars agree that a proto-linguistic stage may have lasted for a considerably long period. The seed of modern speech may have been sown in the Upper Paleolithic period, roughly 50,000 years ago. It is generally believed that the acquisition of vocal language originated from the so-called sing-song speech, "baby talk" or "motherese", used by parents to talk to their infants. Motherese, a medium all infants perceive and eventually use to process their respective languages, was preceded by the prelinguistic foundations of the proto-language(s) evolved by early hominins. Gradually, developing difficulties in foraging circumstances (Lyons et al. 1998), together with increasing
bipedalism, demanded foraging-related changes in maternal care. Postnatal mothers had to put their babies down for other work. This resulted in an increase in distal mother–infant communication (Falk 2004). This notion of "putting the baby down" led to the emergence of proto-speech. The biological capacity for speech production, coping with the needed development of language, evolved incrementally but consistently over time within the hominin line (Tomasello and Camaioni 1997). Even today, language acquisition intrinsically includes motherese. It also means that prelinguistic substrates for proto-language(s) began to evolve from infant-directed vocalizations as brain size started to increase in bipedal hominins (Galaburda and Panda 1982). Thus, though the development of brain size may not be dependent on the development of spoken language, the reverse seems to be generally true.
This subjective phenomenon, being so important for human evolution, also demands an objective insight. For this reason, even before the emergence of modern science, man began to investigate speech in this objective paradigm. In the sixth century BC, the ancient Greek philosopher Pythagoras wanted to know why some combinations of musical sounds seemed more beautiful than others, and he found answers in terms of numerical ratios representing the harmonic overtone series on a string. Aristotle (384–322 BC) understood that sound consisted of compressions and rarefactions of air which "falls upon and strikes the air which is next to it…", a very good expression of the nature of wave motion. Deeper research on sound began only after the technology of recording sound came into being. The first reported recording of sound was by Édouard-Léon Scott de Martinville (delivered to the French Academy on 26 January 1857). Electrical recording, along with the use of sensitive microphones to capture the sound, was introduced in 1925. This greatly improved the audio quality of records. The real breakthrough appeared with the introduction of digital recorders. The British scientist Alec Reeves filed the first patent describing pulse-code modulation (PCM) in 1938. In 1957, Max Mathews (Mathews and Moore 1970) developed a process to digitally record sound through a computer. Thus began the era of modern research in acoustics in general and speech in particular.
It may not be out of order to speak a word or two about investigations on sound by early Indian thinkers. The quest for knowledge may broadly be divided into two: (a) the objective (modern science, which is reductionist, is an example) and (b) the subjective (philosophy). In early India, philosophers exercised their minds over what they considered of primary importance to human development, namely sound (shabda brahma). As early as the fourth century BC, Sabar Swami said that the sound created by the first impact sets up vibratory motion in the air particles, resulting in rarefaction and condensation (pracaya) (Choudhury and Datta 1988). Regarding the propagation of sound, the early Nyaya-Vaisesika thinkers held that the first sound, thus produced in Akasa by the impact of the vibrating molecules against the contiguous molecules of air, produces a second in the contiguous Akasa, the second sound a third, and so on, in a way analogous to the generation of waves in water, until the last sound sets up vibrations in the eardrum (karnasaskuli). Since the Akasa is motionless, the airwave would not be transmitted if the air molecules were not interconnected by Akasa. The first sound gives rise to the second, the second to a third, and so on, expanding in Akasa in the same way as waves propagate in water (bichitaranganyaya: ripple-like). Uddyotakara said the first sound gives rise not to one sound in a circle but to an infinite number in all directions, a spherical shell (kadambakorakanyaya: blooming like a Kadamba bud). Nyaya thinkers also held that each sound wave is destroyed by its successor, corresponding to the cancelation of back-propagation. The similarity of the Nyaya thinkers' views, though arrived at through the holistic approach of philosophy, to modern scientific theory is remarkable.
Scientific investigations related to the acoustics of speech, both objective and subjective, have historically been conducted in the spectral/timbral domain. Once the signal is captured, which is of course a time series of pressure values, the rest traditionally becomes research into its spectral structure. This has been so since the beginning of speech research in the nineteenth century or even earlier (Flanagan 1972; Helmholtz 1954; Bell 1906). During this early period, science and engineering for speech were closely coupled, and the important milestones thereof are summarized in Flanagan's classical book (Flanagan 1972). The field got a fillip in the 60s of the last century when Gunnar Fant published his "source-filter model of speech production" (Fant 1970).

1.2 Spectral Domain Representation

The conversion from the time domain to the frequency domain is based on three basic methods: Fourier transforms, digital filter banks, and linear prediction. In speech processing, the Fourier transform takes a time series or a function of continuous time and maps it into a frequency spectrum. The theoretical support for the first method came through the seminal paper by Joseph Fourier in 1807 (Fourier 1808). This transform can be rigorously used only when the series is purely periodic. The speech signal is generally non-stationary and not exactly periodic; it is known as quasi-periodic, with some segments even quasi-random. However, for practical application of the Fourier transform, it is assumed that short-term (about 10–20 ms) segments of speech are stationary and periodic. The short-term speech representation is historically inherited from speech coding applications (Hermansky 1997). Despite this discrepancy, the Fourier transform has been used profitably and most widely ever since. The use of formant frequencies, the resonance structures in speech spectra, to recognize vowels can be traced back to around 1950 at AT&T's Bell Labs (Davis et al. 1952).
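For illustration, short-term analysis of this kind can be sketched in a few lines of Python; the 20 ms Hamming-windowed frame, FFT size, and dB floor below are common illustrative choices, not prescriptions from this book:

```python
import numpy as np

def frame_spectrum(signal, sr, start=0, frame_ms=20):
    """Magnitude spectrum (in dB) of one short-term frame assumed stationary."""
    n = int(sr * frame_ms / 1000)                    # samples in the frame
    frame = signal[start:start + n] * np.hamming(n)  # taper to reduce spectral leakage
    mag = np.abs(np.fft.rfft(frame))                 # amplitude spectrum of the frame
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    return freqs, 20 * np.log10(mag + 1e-12)         # small floor avoids log(0)
```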
Another approach to estimating the spectral envelope uses a filter bank: the signal is broken down into a number of frequency bands with characteristic bandwidths, and the signal energy in each band is measured.
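A minimal sketch of such a measurement, assuming a simple Butterworth band-pass design (any band edges and filter order could be substituted):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_energies(signal, sr, edges):
    """Energy of `signal` in each (low, high) band of a simple filter bank."""
    energies = []
    for lo, hi in edges:
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        energies.append(np.sum(sosfiltfilt(sos, signal) ** 2))
    return energies

# e.g. band_energies(x, 16000, [(100, 300), (300, 600), (600, 1200)])
```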
Homer Dudley presented such a bank, breaking speech down into its acoustic components using 10 bandpass filters, at the Bell Laboratory exhibits at both the 1939 New York World's Fair and the 1939 Golden Gate International Exposition. Liljencrants developed a speech spectrum analyzer using a 51-channel filter bank. In India, V. Ujjwal and R. Amekar developed a gammatone filter bank for representing speech in 2012. The most interesting and living example is the human cochlea, which uses about 30,000 filters in the basilar membrane spanning a frequency range of about 20 Hz to 20 kHz.
The other useful method for speech analysis is cepstrum analysis. The speech is modeled by a time-varying filter for the vocal tract, excited by an appropriate source. In the frequency domain, the log power spectrum of the output signal is the sum of the log power spectra of the source and the filter. For the purpose of speech recognition, speech sounds are characterized by the size and shape of the vocal tract filter, which is represented by the spectrum of that filter. The composite log power spectrum is passed through a low-pass filter to retain only the characteristics of this filter. This can be realized by taking the inverse Fourier transform of the log power spectrum and retaining only the first few coefficients. The result is called the cepstrum, and the coefficients are called cepstral coefficients (Hermansky 1990).
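The computation reduces to a couple of FFT calls; a minimal sketch (the frame is assumed to be already windowed, and 13 is an illustrative number of coefficients):

```python
import numpy as np

def real_cepstrum(frame, n_keep=13):
    """First few real-cepstrum coefficients of a windowed frame.
    The low-quefrency part approximates the vocal-tract (filter) envelope."""
    log_power = np.log(np.abs(np.fft.rfft(frame)) ** 2 + 1e-12)
    return np.fft.irfft(log_power)[:n_keep]   # inverse transform, keep first coefficients
```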
Mel-Frequency Cepstral Coefficients (MFCC), which use the cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale (Dautrich et al. 1983), are widely used in ASR.
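In practice, MFCCs are rarely coded by hand; assuming the widely used librosa library (the file name below is a placeholder), a typical extraction looks like this:

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=None)       # placeholder file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
```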
Linear Predictive Coding is based on the idea that voiced speech is almost periodic and predictable. The number of previous samples used for linearly predicting the present sample defines the number of coefficients (weights), which is equivalent to the number of poles in the all-pole linear system modeling speech production. This linear prediction characterizes the speech spectrum (Atal 1974). The coefficients (weighting factors) are called Linear Predictive Coefficients (LPC), and the number of coefficients is the LPC order. LPC was used in speech recognition as early as 1983.
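A minimal sketch of the autocorrelation method: the predictor weights solve a Toeplitz system built from the frame's autocorrelation (order 12 is an illustrative choice):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=12):
    """LPC predictor coefficients of a frame by the autocorrelation method."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation, lags >= 0
    # Normal equations: Toeplitz(r[0..order-1]) a = r[1..order]
    return solve_toeplitz(r[:order], r[1:order + 1])
```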
In Perceptual Linear Prediction (PLP), the LPC and filter bank approaches are combined by fitting an all-pole model to the set of energies produced by a perceptually motivated filter bank. The cepstrum is then computed from the model parameters (Hermansky 1990). This is an efficient speech representation and has been used extensively in DARPA evaluations of large-vocabulary ASR technology.
Even for cognition, the human being is supposed to use a spectral domain representation. As mentioned earlier, the inner ear, which analyzes the sound signal, contains the primary analyzer, the cochlea. This contains a large array of resonators (approximately 30,000 fibers), whose characteristic frequencies (CF) range from approximately 20 Hz to 20 kHz. These are used to break down a signal into its spectral components. There have been experiments showing that firings from the associated nerve fibers can give a conforming description of the formant structure of the input sound. High firing rates of auditory nerves have been found in neurons whose CF correspond to formant frequencies. It is reported that the excitation pattern of the auditory nerves over the cochlea produces patterns which may be called "auditory spectra" of the signal. These have an uncanny similarity with the spectral components produced by the LEA of Fant (1970). It was universally held that formants (their stationary states for vowel cognition) and their movements (for the place of articulation of most consonants) account for the perception of the place of articulation of all phonemes.
The first report on speech research in India was published from the Indian Statistical Institute (ISI) in Kolkata in 1968 (Dutta Majumdar and Datta 1968a, b, 1969). The technique of digital filtering for spectral estimation was used here in 1973. The first spectral analyzer of Kay Elemetrics came to ISI in 1972, making spectral analysis much easier. This prompted the group to take up spectral analysis of speech sounds in right earnest, particularly of vowels in different Indian languages. These were successively reported for Hindi in 1973, Telugu in 1978 (Dutta Majumdar et al. 1978), and Bangla in 1988 (Datta et al. 1988). The study of consonantal sounds (plosives) revealed the importance of transitory movements (Datta et al. 1981).

It may be of interest to note that the Fourier transform gives two complementary pieces of information, namely the amplitude spectrum and the phase spectrum. Only the two together represent the signal in its totality; for the inverse transformation, both are necessary. Yet, with very rare exceptions, only the amplitude spectrum is used in acoustic representation.

1.3 Time-Domain Representation

In the early 90s of the last century at ISI, the first evidence of doubt about the necessity of spectral representation of sound in the cognition of vowels crept in when some signals were produced with the same spectral structures but sounding like different vowels (Datta 1993). Continued efforts in the same direction also produced VCV syllables in which there was no formant transition, yet the different plosives could be clearly distinguished. These experiments showed that formants are neither necessary nor sufficient for the cognition of different phones (Datta and Mukherjee 2011). Moreover, they further indicated the possibility of time-domain features (shape features) as a reliable alternative in speech research. This aspect is discussed elaborately in Chap. 4. These developments led the group at ISI to start working on time-domain features for ASR, TTS, and singing synthesis, with encouraging successes. The results were demonstrated at an ESCA conference in 1993 (Datta et al. 1990).
One of the interesting characteristics of quasi-periodic sound is the relatively recent finding of what is generally known as random perturbations, whose cognitive influence lies in the quality of the sound. These are manifested as small random differences in fundamental frequency (jitter), amplitude (shimmer), and complexity (CP) between two consecutive periods of a speech signal. Obviously, a spectral approach cannot detect these. An exhaustive study of these for different quasi-periodic signals in different contexts has been conducted at ISI and is included in a later chapter.
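In one common formulation, local jitter and shimmer are the mean absolute differences between consecutive pitch periods or peak amplitudes, normalized by their means (the measures defined in Chap. 6 may differ in detail); a minimal sketch:

```python
import numpy as np

def jitter_shimmer(periods, amplitudes):
    """Local jitter and shimmer of a voiced stretch.
    `periods`: successive pitch-period lengths; `amplitudes`: successive peak amplitudes."""
    p, a = np.asarray(periods, float), np.asarray(amplitudes, float)
    jitter = np.mean(np.abs(np.diff(p))) / np.mean(p)
    shimmer = np.mean(np.abs(np.diff(a))) / np.mean(a)
    return jitter, shimmer
```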
Slowly, a viable time-domain representation of speech for both objective and subjective analysis, an alternative to the well-known spectral representation, evolved at ISI. The book presents its history and the extent of this development in the technology domain, as well as a comparison with the spectral domain approach. The deficiency of spectral domain representation in the cognitive domain is presented. All the cognitive experiments related to this development, along with details of technology development related to both ASR and TTS, are given. The later-stage technology developments were done at CDAC, a Govt. of India sponsored all-India institution.
It is generally believed that though phoneme recognition plays a major role in human cognition, its accuracy depends on the lexical knowledge of the listener. However, how the brain surmises the word without knowing the phonemes is not yet clear. Many theories abound, including higher linguistic analysis involving, inter alia, syntax, pragmatics, and semantics. One interesting and novel development in the automatic recognition of spoken words, which exploits lexical knowledge through a presumption of the possible words on the basis of the manner of production of phones, needs specific mention here. This is described in a later chapter.
India has many official dialects. The spoken language technology development
is a burgeoning area. In fact, TTS and ASR, taken together, form the most powerful
technology to empower people in a country like India. The book endeavors to present
the related issues in such a way that research and development in ASR or TTS in
all these languages may be done seamlessly using the information in this book. In
short, this book simply may be a guidebook for the development of ASR and TTS
in all the Indian Standard Dialects.

1.4 Automatic Speech Recognition (ASR)

The technology of Automatic Speech Recognition (ASR) has progressed greatly over the last seven decades. The study of automatic speech recognition and transcription can be traced back to 1950 at AT&T's Bell Labs. In 1952, at Bell Laboratories, Davis, Biddulph, and Balashek built a system for isolated digit recognition for a single speaker (Davis et al. 1952), using the formant frequencies measured/estimated during vowel regions of each digit. Olson and Belar of RCA Laboratories recognized 10 distinct syllables for a single speaker in 1956 (Olson and Belar 1956). Fry and Denes tried to build a phoneme recognizer to recognize four vowels and nine consonants in 1959 at University College in England (Fry 1959) and used the first statistical syntax at the phoneme level. In the late 1960s, Reddy at Carnegie Mellon University conducted pioneering research in the field of continuous speech recognition by dynamic tracking of phonemes (Reddy 1966). As early as 1968, Dutta Majumdar and Datta of ISI, Kolkata, proposed a model for spoken word recognition in Indian languages (Dutta Majumdar and Datta 1968a, b). In 1975, the DRAGON system was developed, capable of recognizing a thousand English words (Baker 1975). In the 1980s, a big shift in speech recognition methodology took place when the conventional template-based approach (a straightforward pattern recognition paradigm) was replaced by rigorous statistical modeling like the Hidden Markov Model (HMM) (Rabiner 1989). The SPHINX system was developed at Carnegie Mellon University (CMU) based on the HMM method for a 1000-word database to achieve high word accuracy (Lee et al. 1990). Major techniques include Maximum Likelihood Linear Regression (MLLR) (Leggetter and Woodland 1995), Model Decomposition (Varga and Moore 1990), Parallel Model Combination (PMC) (Gales and Young 1993), and the Structural Maximum A Posteriori (SMAP) method (Shinoda and Lee 2001). Although read speech and similar types of speech, e.g., news broadcasts, reading a text, etc., can be recognized with accuracy higher than 85% using state-of-the-art speech recognition technology for English and other European languages, recognition accuracy decreases drastically for spontaneous speech. Broadening the application of speech recognition depends crucially on raising recognition performance for spontaneous speech. Research on spontaneous speech recognition started in the twenty-first century. For this purpose, it is necessary to build a large spontaneous speech corpus for constructing the acoustic and language models.

Research on automatic speech recognition in India began in 1963. While continuous speech recognition has not been attempted, phone recognition in different Indian languages has been undertaken. Later, isolated word recognition was also attempted. The timeline has been presented in an earlier paragraph.

1.5 Speech Synthesis

Internationally, the development of speech synthesis systems in various languages has been continuing for several decades. It is expected that a TTS should be able to synthesize any sentence, including arbitrary word sequences, with proper intelligibility and naturalness (Allen 1976; Allen et al. 1979; Dutoit 1994). The relevance of spectral domain parameters in speech synthesis may be said to begin with the development by Wagner (Flanagan and Ishizaka 1978). Obata and Teshima introduced the third formant of the vowel in 1932 (Schroeder 1993), a remarkable development. The beginning of parametric synthesizers may be traced back to the VOCODER (Voice Coder) developed at Bell Laboratories; Homer Dudley made the VODER (Voice Operating Demonstrator) in 1939. Gunnar Fant developed the first cascade formant synthesizer, Orator Verbis Electris I (OVE I), around the same time. OVE II came out 10 years after OVE I and separately modeled the transfer function of the vocal tract for vowels, nasals, and obstruent consonants. Systematic development of text-to-speech synthesis by Klatt et al. (Klatt 1982) may be said to have begun in the late 70s.
The late twentieth and early twenty-first centuries saw a new approach, known as Hidden Markov Model (HMM) synthesis, evolve. An HMM is a finite state machine generating a sequence of discrete-time observations, one at each time t. It changes state as a Markov process in accordance with a state transition probability and generates data in accordance with a known output probability distribution for the current state. Yoshimura et al. in 1999 and Tokuda et al. in 2002 described some of the early such systems for generating synthesis parameters. They used five streams, namely MFCCs, log F0, delta log F0, delta delta log F0, and F0. Acero (Ainsworth 1973) describes a procedure which uses HMMs with formants as the acoustic observations. This helps to fix the problems of traditional formant synthesizers. Formants are indeed a good spectral representation for HMMs, as we can assume, like MFCCs, that each formant is statistically independent of the others.
The development of concatenative synthesis, a fully time-domain approach, began in India in the early 90s of the last century, with the Indian Statistical Institute (ISI) playing the seminal role. The interesting story behind this development is that 1993 was earmarked for the birth centenary celebration of the late Professor Prasanta Chandra Mahalanobis, the Founder Director of ISI, also known as the "Father of Statistics in India". The group in the Electronics and Communication Sciences Unit of ISI wished to contribute to this centenary celebration. Intensive efforts over about 8 months produced the Epoch Synchronous Non-Overlap Add (ESNOLA) algorithm for concatenative synthesis (Dan and Datta 1993). We had the satisfaction that the centenary celebration was inaugurated with a welcoming speech and a Rabindra Sangeet synthesized by the ESNOLA system, which was appreciated by the audience. This was the first TTS in an Indian dialect. The work resurfaced around 2005 at CDAC, Kolkata, where the new overlap-add version, ESOLA, was developed with the inclusion of rudimentary prosodic structure. The corresponding TTS system produced almost natural-sounding Bangla speech. It was used by the Election Commission (EC) of India for automated announcement of the State Assembly election results in 2005. Even at this point of time in India, this is the only indigenous TTS system, available only for Bangla, awaiting societal use for the empowerment of the functionally illiterate masses and of visually disabled persons (Datta, in press). Bengal has a rich and really large literary treasure, and a good TTS would be a boon to visually challenged people, allowing them to have a taste of this treasure at will.
Concatenative synthesis was felt to be potentially a more natural, simpler, and better approach in terms of quality of sound than the parametric approaches. The most important research interest in this area is the modification, and sometimes even regeneration, of short segments of sounds to take care of the pitch modification and complexity manipulation required to obtain natural continuity and meet prosody requirements. Special methodology had to be developed for these purposes. This led to a microscopic examination of single waveforms from segments representing a speech event, to ascertain the role of different parts of the waveform in the perception of phonetic quality as well as in the manipulation of loudness, pitch, and timbre. In fact, this study actually led to the development of the "time-domain representation", an alternative to the spectral domain representation of speech sound. In India, the first concatenative speech synthesis algorithm, Epoch Synchronous Non-Overlap Add (ESNOLA) (Datta et al. 1990), appeared in 1993. Along with speech, ESNOLA also demonstrated the synthesis of singing by producing one Bangla Rabindra Sangeet at the same conference. Later on, ESOLA (Epoch Synchronous Overlap Add algorithm) was developed around 2002.
References

Ainsworth, W. A. (1973). A system for converting English text into speech. IEEE Transactions on
Audio and Electroacoustics, 23, 288–290.
Allen, J. (1976). Synthesis of speech from unrestricted text. Proceedings of the IEEE, 64, 422–433.
Allen, J., Hunnicutt, S., Carlson, R., & Granstrom, B. (1979). MITalk-79: The 1979 MIT text-
to-speech system. In J. J. Wolf & D. H. Klatt (Eds.), ASA-50 speech communication papers
(pp. 507–510). New York: Acoustical Society of America.
Atal, B. S. (1974). Effectiveness of linear prediction characteristics of the speech wave for automatic
speaker identification and verification. The Journal of the Acoustical Society of America, 55,
1304–1312.
Baker, J. K. (1975). The DRAGON system—An overview. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-23, 24–29.
Bell, A. G. (1906). The mechanism of speech, Funk & Wagnalls, New York, Reprinted from the
proceedings of the first summer meeting of the American association to promote the teaching of
speech to the deaf.
Choudhury, L., & Datta, A. K. (1988). Consonance between physics and philosophy regarding nature and propagation of sound. Journal of the Acoustical Society of India, 26(3–4), 508–513.
Dan, T., & Datta, A. K. (1993). PSNOLA approach to synthesis of singing. In Proceedings of P. C. Mahalanobis Birth Centenary, Volume IAPRDT3 (pp. 388–394). Calcutta: Indian Statistical Institute.
Datta, A. K. (1993). Do ears perceive vowels through formants? In Proceedings of the 3rd European Conference on Speech Communication and Technology, Genova, Italy, September 21–23, 1993 (also in Proceedings of P. C. Mahalanobis Birth Centenary, Volume IAPRDT3, Indian Statistical Institute, Calcutta, pp. 434–441).
Datta, A. K. Epoch synchronous concatenative synthesis of speech and singing: A study on Indian
context. Springer (in press).
Datta, A. K., Ganguly, N. R., & Dutta Majumdar, D. (1981). Acoustic features of consonants: A
study based on Telugu speech sounds. Acustica, 47, 72–82.
Datta, A. K., Ganguly, N. R., & Mukherjee, B. (1988). Acoustic phonetics of non-nasal standard Bengali vowels: A spectrographic study. JIETE, 34, 50–56.
Datta, A. K., Ganguly, N. R., & Mukherjee, B. (1990). Intonation in segment-concatenated speech.
In Proceedings of ESCA Workshop on Speech Synthesis (pp 153–156). Autrans, France.
Datta, A. K., & Mukherjee, B. (2011). On the role of formants in cognition of vowels and place of articulation of plosives. In S. Ystad, M. Aramaki, R. Kronland-Martinet, K. Jensen, & S. Mohanty (Eds.), Speech, sound and music processing: Embracing research in India. Berlin: Springer.
Dautrich, B. A., Rabiner, L. R., & Martin, T. B. (1983). On the effects of varying filter bank
parameters on isolated word recognition. IEEE Transactions on Acoustics, Speech, and Signal
Processing, 31(4), 793–807.
Davis, K. H., Biddulph, R., & Balashek, S. (1952). Automatic recognition of spoken digits. The
Journal of the Acoustical Society of America, 24(6), 637–642.
Dutoit, T. (1994). High quality text-to-speech synthesis: A comparison of four candidate algorithms.
In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP) (pp. 565–568).
Dutta Majumdar, D., & Datta, A. K. (1968a). Some studies in automatic speech coding and recog-
nition procedure. Indian Journal of Physics, 12, 425–443.
Dutta Majumdar, D., & Datta, A. K. (1968b). A model for spoken word recognition. In International Conference on Instrumentation and Automation, Milan, Italy.
Dutta Majumdar, D., & Datta, A. K. (1969). An analyzer coder for machine recognition of speech.
JITE, 15, 233–243.
Dutta Majumdar, D., Datta, A. K., & Ganguly, N. R. (1978). Some studies on acoustic phonetic
features of human speech in relation to Hindi speech sounds. Acustica, 1, 55–64.
Falk, D. (2004). Prelinguistic evolution in early hominins: Whence motherese? Behavioral and
Brain Sciences, 27(4), 535.
Fant, G. (1970). Acoustic theory of speech production. Mouton De Gruyter.
Flanagan, J. L. (1972). Speech analysis synthesis and perception (2nd ed.). Berlin, Heidelberg, New
York: Springer.
Flanagan, J. L., & Ishizaka, K. (1978). Computer model to characterize the air volume displaced
by the vibrating vocal cords. Journal of the Acoustical Society of America, 63, 1558–1563.
Fourier, J. (1808). Mémoire sur la propagation de la chaleur dans les corps solides, présenté le 21
Décembre 1807 à l’Institut national—Nouveau Bulletin des sciences par la Société philomatique
de Paris. I. Paris: Bernard. March 1808.
Fry, D. B. (1959). Theoretical aspects of the mechanical speech recognition. Journal of the British
Institution of Radio Engineers, 19(4), 211–229.
Galaburda, A. M., & Panda, D. N. (1982). Roles of architectonics and connections in the study
of primate evolution. In E. Armstrong & D. Falk (Eds.), Primate brain evolution: Methods and
concepts (pp. 203–216). New York: Plenum Press.
Gales, M. J. F., & Young, S. J. (1993). Parallel model combination for speech recognition in noise.
Technical Report, CUED/F-INFENG/TR 135.
Helmholtz, H. L. F. (1954). On the sensations of tone as a physiological basis for the theory of music
(2nd ed.) Dover Publications, New York, translated from the fourth (and last) German edition of
1877.
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis for speech. The Journal of The
Acoustical Society of America, 87, 1738–1752.
Hermansky, H. (1997). Auditory modeling in automatic recognition of speech. In Proceedings of
the First European Conference on Signal Analysis and Prediction (pp. 17–21). Prague, Czech
Republic.
Klatt, D. H. (1982). The KLATTalk text-to-speech conversion system. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1589–1592).
Lee, K. F., Hon, H. W., & Reddy, R. (1990). An overview of the SPHINX speech recognition system.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(1), 35–45.
Leggetter, C. J., & Woodland, P. C. (1995). Maximum likelihood linear regression for speaker
adaptation of continuous density hidden Markov models. Computer Speech and Language, 9,
171–185.
Lyons, D. M., Kim, S., Schatzberg, A. F., & Levine, S. (1998). Postnatal foraging demands
alter adrenocortical activity and psychosocial development. Developmental Psychobiology, 32,
285–291.
Mathews, M. V., & Moore, F. R. (1970). GROOVE—A program to compose, store, and edit functions
of time. Communications of the ACM, 13(12), 715.
National Conference on Innovative Paradigms in Engineering & Technology (NCIPET-2012). Proceedings published by International Journal of Computer Applications (IJCA).
Olson, H. F., & Belar, H. (1956). Phonetic typewriter. The Journal of the Acoustical Society of
America, 28(6), 1072–1081.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, 77(2), 257–286.
Reddy, D. R. (1966). An approach to computer speech recognition by direct analysis of the speech
wave. Tech. Report No. C549, Computer Science Dept., Stanford University.
Schroeder, M. (1993). A brief history of synthetic speech. Speech Communication, 13, 231–237.
Shinoda, K., & Lee, C. H. (2001). A structural Bayes approach to speaker adaptation. IEEE Transactions on Speech and Audio Processing, 9(3), 276–287.
Tokuda, K., Zen, H., & Black, A. W. (2002). An HMM-based speech synthesis system applied to English. In IEEE Speech Synthesis Workshop, Santa Monica, California, September 11–13, 2002.
Tomasello, M., & Camaioni, L. (1997). A comparison of the gestural communication of apes and
human infants. Human Development, 40, 7–24.
Varga, A. P., & Moore, R. K. (1990). Hidden Markov model decomposition of speech and noise. In
Proceedings on ICASSP (pp. 845–848).
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., & Kitamura, T. (1999). Simultaneous
modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proceedings of
Eurospeech 99 (pp. 2347–2350).
Chapter 2
Spectral Domain

2.1 Introduction

In a sense, the beginning of the spectral domain representation of sound was made by the ancient Greek philosopher Pythagoras, as early as the sixth century BC, when he wondered why some combinations of musical sounds seemed more beautiful than others and found answers in terms of numerical ratios representing the harmonic overtone series on a string. This is probably the first known query about a dimension of sound other than pitch and loudness. We had to wait more than 2000 years, till 1862, when Helmholtz (in his book "On the Sensations of Tone") first showed, with an apparatus called a resonator, that a musical sound is composed of a number of pure tones. The Helmholtz resonator, as it is now called, consists of a rigid container of known volume, nearly spherical in shape, with a small neck and hole at one end and a larger hole at the other end to admit the sound.
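The resonant frequency of such a container follows from the mass of air in the neck oscillating against the compliance of the air in the cavity; in standard notation (a textbook result, not derived in this book),

\[
f_0 = \frac{c}{2\pi}\sqrt{\frac{A}{V\,L_{\mathrm{eff}}}}
\]

where c is the speed of sound, A the cross-sectional area of the neck, V the cavity volume, and L_eff the effective neck length (the physical length plus an end correction).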
Spectral analysis of speech dates back to the nineteenth century or even earlier: Henry Sweet's study on phonetics (Sweet 1890) and Alexander Graham Bell's effort to make speech visible to deaf people (Bell 1906). One may also include here Hermann von Helmholtz's study on tones. The important milestones in science and technology during this early period are summarized in Flanagan's classical book (Flanagan 1972). All investigations during this period were done in the frequency domain.
The conversion of the time series of the speech signal to the frequency domain is based on three basic methods: Fourier transform, digital filter banks, and linear prediction. The Fourier transform takes a time series or a function of continuous time and maps a specific portion of it into a frequency spectrum. Over this portion, the source is held to be the same (quasi-stationarity). The real theoretical support to the first method came through the seminal paper by Joseph Fourier in 1807 (Fourier 1807). Unfortunately, this transform can be rigorously used only when the series is periodic. The speech signal generally is a nonstationary signal. However, for practical application, it is assumed that short-term (about 10–20 ms) segments of speech are stationary. The short-term speech representation is historically inherited from speech coding applications (Hermansky 1997). The use of formant frequencies, the resonance structures in speech spectra, to recognize vowels can be traced back to around 1950 at AT&T's Bell Labs.
The second method for estimating the spectral envelope is via a filter bank, which separates the signal bandwidth into a number of frequency bands in which the signal energy is measured. As early as 1939, Homer Dudley represented speech by breaking it down into its acoustic components using 10 bandpass filters at the 1939–40 New York World's Fair. Liljencrants developed a speech spectrum analyzer using a 51-channel filter bank (Liljencrants 1965). In India, Ujjwal and Amekar (2012) developed a gammatone filter bank for representing speech.
Another useful method for speech analysis is cepstrum analysis. Here the speech is modeled by a time-varying filter for the vocal tract, excited by an appropriate source. In the frequency domain, the log power spectrum of the output signal is the sum of the log power spectra of the source and the filter. The composite log power spectrum is passed through a low-pass filter to retain only the characteristics of this filter. The resultant spectrum is called the cepstrum, and the coefficients are called cepstral coefficients (Hermansky 1990).
Mel-Frequency Cepstral Coefficients (MFCC), a variant of the cepstral coefficients, are widely used in speech recognition to represent different speech sounds. MFCCs are the result of a cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale (Dautrich et al. 1983).
Linear Predictive Coding is based on the idea that voiced speech is almost periodic and so is predictable. The number of previous samples used for linearly predicting the present sample defines the number of coefficients (weights), or codes, and is equivalent to the number of poles present in the linear system. Therefore, linear prediction theoretically allows us to characterize the speech spectrum (Atal 1974). The coefficients (weighting factors) are called Linear Predictive Coefficients (LPC), and the number of coefficients is called the LPC order. LPC was used in speech recognition as early as 1983.
Perceptual Linear Prediction (PLP) combines the LPC and filter bank approaches by fitting an all-pole model to the set of energies produced by a perceptually motivated filter bank and then computing the cepstrum from the model parameters (Hermansky 1990). This has also been found to be one of the most efficient speech representations in extensive DARPA evaluations of large-vocabulary continuous-speech ASR technology (Cook et al. 1996; Woodland et al. 1996).
We have already noted the three methods for spectral analysis, namely, the Fourier
transform, digital filter banks, and linear prediction. Of these three, the most
commonly used is the Fourier transform, while our ear uses the filter-bank method.
In fact, almost all speech research in reality uses the harmonic analysis of
Helmholtz's era, the nineteenth century, under the name of frequency-domain
analysis. This may be for two reasons: one is legacy, and the other is the very
strong and substantiated belief that the ear does the same. Even when we use the
Fourier transform, we look only at the amplitude spectra; the phase spectra are
neglected. Let us peruse Fig. 2.1a and b. Both are composed of the fundamental and
one harmonic, the same for the two figures; only the phase of the harmonic differs.
The result, as expected, is two different waveforms. The point is that the harmonics
alone do not really represent the signal itself. If we want to represent the signal
fully, we have to take heed of the phase spectra.

Fig. 2.1 Two different waveforms generated from the same two harmonics
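
The point is easy to verify numerically. In the sketch below (Python with NumPy;
the sampling rate, frequencies, and phase offset are arbitrary choices), two
waveforms are built from the same fundamental and second harmonic, differing only
in the harmonic's phase; their amplitude spectra agree while the waveforms do not:

import numpy as np

fs, f0 = 8000, 200                    # sampling rate and fundamental (Hz)
t = np.arange(0, 0.02, 1.0 / fs)      # 20 ms of signal (4 full periods)

wave_a = np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
wave_b = np.sin(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t + np.pi / 2)

amp_a = np.abs(np.fft.rfft(wave_a))   # amplitude spectra ignore phase
amp_b = np.abs(np.fft.rfft(wave_b))
print(np.allclose(amp_a, amp_b))      # True: identical amplitude spectra
print(np.allclose(wave_a, wave_b))    # False: different waveforms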
Be that as it may, we have been using harmonic analysis as spectral analysis for
the last three centuries, with quite satisfactory results in objectively defining
phones and speech-related events across many languages, creating extremely useful
technology for humanity. In the next section, we shall describe in brief the
spectral characteristics of phones in one language, Bangla, to see how beautifully
the spectrum works in their objective representation and how it correlates with
our perception of phones, with an example in the case of Bangla vowels.

2.2 Spectral Structure of Bangla Phones

Let us begin our acquaintance with spectral structure with the quasi-periodic
signal of the vowel [æ]. Figure 2.2 presents the amplitude spectrum (hereinafter
referred to simply as the spectrum). The x-axis represents the frequency of the
constituent harmonic components in Hertz. The vertical axis represents the
amplitude of the harmonics in dB. The maxima of the narrow hills in the graph
represent the amplitudes of the harmonics. The corresponding frequencies of the
harmonics are given by their respective x-values. The harmonic structure of a vowel
has characteristic hills representing the resonances caused by the different
cavities, primarily the two major ones created by the height and the front/back
position of the tongue hump. These hills can be easily visualized if an envelope
(thick line in the figure) is drawn covering the harmonic components. These
resonances are commonly known as formants. As we have seen in the last section,
the articulatory position of a vowel can be determined from the measurement of the
formant frequencies.
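
As a sketch of how such a measurement might be automated, formant frequencies are
commonly estimated from the angles of the complex pole pairs of an all-pole (LPC)
model of the frame; the helper below assumes the lpc function sketched earlier in
this chapter and a NumPy environment, and the 90 Hz floor is just an arbitrary
threshold to discard near-DC poles:

import numpy as np

def formants_from_lpc(a, fs):
    # Prediction polynomial A(z) = 1 - a1*z^-1 - ... - ap*z^-p.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]           # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)  # pole angle -> frequency in Hz
    return np.sort(freqs[freqs > 90.0])         # first entries roughly F1, F2, ...

For a steady vowel frame, the first two returned frequencies give rough estimates
of F1 and F2, the quantities used below to place vowels on the articulatory plane.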

Fig. 2.2 Illustration of formants with respect to the vowel /æ/ (Datta 2018)

Fig. 2.3 Spectrogram of vowel [æ] (Datta 2018)

In general, the first formant is associated with tongue height and the second
formant frequency with the back-to-front position of the tongue hump. It is now
common to use the first two formants as a reliable estimate for objectively
determining the articulatory position of a vowel. Figure 2.8 presents one example
each for the seven Bangla vowels.
Figure 2.3 presents the black-and-white spectrogram of a steady state of the vowel
[æ], followed by the normal spectrum at the right. The x-axis of the spectrogram
represents time, the y-axis frequency, and the grayness gives a comparative idea
of the strength of the energy at a particular time and frequency of the harmonics.
Spectrograms are very useful in understanding the dynamic movement of timbral
quality.

2.3 Spectra of Oral Vowels

An exhaustive study of the formants of Bangla vowels has been done at the Indian
Statistical Institute, Kolkata, and CDAC, Kolkata. It may be pertinent to introduce
the results briefly. Figure 2.4 represents the mean position and an estimate of the
spread of Bangla oral vowels in the F1–F2 plane for data of both sexes pooled
together. The dots represent the mean positions of the vowels. The ovals give an
idea of the spread, where the widths and the heights of the ovals are the standard
deviations of the F2 and F1 values, respectively. Assuming a normal distribution,
the ovals cover only about 68% of the data. That the formant frequencies F1, F2,
and F3 for a vowel closely follow a normal distribution was reported as early as
1978. Though the ovals appear to be disjoint, this is only because they contain
just a part of the data.
As an example of correlating spectral data with perception, one may cite the
technique that enables one to map formant data, together with F0 values, onto the
traditional perceptual evaluation of the category of a vowel utterance in terms of
the height and backness of the tongue. This technique transformed Fig. 2.4 into
Fig. 2.5, which represents Bangla vowels in this perceptual frame.

2.4 Spectra of Nasal Vowels

Nasal vowels are produced when the velum is open and the nasopharynx is coupled
with the oropharynx. Nasals are said to be characterized by nasal formants and
anti-formants. In general, studies reveal the following acoustic cues for the
oral/nasal distinction:
A. Strengthening of F0,
B. Weakening of F1,
C. Strengthening of F2,
D. Raising of F1 and F2, and
E. Presence of nasal formants and anti-formants.

Fig. 2.4 Distribution of Bangla vowels in the F1–F2 plane (Datta 2018)



Fig. 2.5 Perceptual vowel diagram for Bangla vowels drawn from objective data (Datta 2018)

Fig. 2.6 Formants of Bangla oral and nasal vowels (Datta 2018)

As regards cues A to D, studies in SCB reported that the strengthening of F0 on
nasalization is observed for all central and front vowels except [ε̃], the weakening
of F1 for all vowels except [ĩ] and [ɔ̃], and the raising of F2 for all except [õ]
and [ũ]. Examination of spectrograms shows consistent occurrences of nasal
formants. For all vowels taken together, nasal formants are found to be clustered
in the regions of 900, 1250, and 1650 Hz. A study reported that one or two
harmonics between F0 and F1, lying in the neighborhood of 400 Hz, play a pivotal
role in the nasal/oral distinction. Figure 2.6 presents the mean positions of the
nasal vowels and the oral vowels of SCB in the F1–F2 plane.