Swati Johar
Emotion, Affect
and Personality
in Speech
The Bias of
Language and
Paralanguage
SpringerBriefs in Electrical and Computer
Engineering
Speech Technology
Series editor
Amy Neustein, Fort Lee, NJ, USA
Editor’s Note
The authors of this series have been hand-selected. They comprise some of the most outstanding scien-
tists—drawn from academia and private industry—whose research is marked by its novelty, applicabil-
ity, and practicality in providing broad based speech solutions. The SpringerBriefs in Speech Technology
series provides the latest findings in speech technology gleaned from comprehensive literature reviews
and empirical investigations that are performed in both laboratory and real life settings. Some of the top-
ics covered in this series include the presentation of real life commercial deployment of spoken dialog
systems, contemporary methods of speech parameterization, developments in information security for
automated speech, forensic speaker recognition, use of sophisticated speech analytics in call centers, and
an exploration of new methods of soft computing for improving human-computer interaction. Those in
academia, the private sector, the self service industry, law enforcement, and government intelligence,
are among the principal audience for this series, which is designed to serve as an important and essen-
tial reference guide for speech developers, system designers, speech engineers, linguists and others. In
particular, a major audience of readers will consist of researchers and technical experts in the automated
call center industry where speech processing is a key component to the functioning of customer care
contact centers.
Amy Neustein, Ph.D., serves as Editor-in-Chief of the International Journal of Speech Technology
(Springer). She edited the recently published book “Advances in Speech Recognition: Mobile Environments,
Call Centers and Clinics” (Springer 2010), and serves as guest columnist on speech processing
for Womensenews. Dr. Neustein is Founder and CEO of Linguistic Technology Systems, a NJ-based think
tank for intelligent design of advanced natural language based emotion-detection software to improve
human response in monitoring recorded conversations of terror suspects and helpline calls. Dr. Neus-
tein’s work appears in the peer review literature, in industry and mass media publications. Her academic
books, which cover a range of political, social and legal topics, have been cited in the Chronicles of
Higher Education, and have won her a pro Humanitate Literary Award. She serves on the visiting faculty
of the National Judicial College and as a plenary speaker at conferences in artificial intelligence and
computing. Dr. Neustein is a member of MIR (machine intelligence research) Labs, which does advanced
work in computer technology to assist underdeveloped countries in improving their ability to cope with
famine, disease/illness, and political and social affliction. She is a founding member of the New York City
Speech Processing Consortium, a newly formed group of NY-based companies, publishing houses, and
researchers dedicated to advancing speech technology research and development.
Swati Johar
Defence Institute of Psychological
Research
DRDO, Ministry of Defence
New Delhi
India
1 Introduction
1.1 Paralanguage
References
2 Psychology of Voice
2.1 Pitch as a Major Auditory Attribute
2.2 Speech Markers
2.2.1 Emotional Markers in Speech
2.2.2 Personality Markers in Speech
References
3 Language, Communication and Human Behaviour
3.1 Language and Interpersonal Communication
3.2 Language and Coverbal Behaviours
3.3 Understanding Nonverbal Behaviour
References
4 Multimodality and Spoken Dialogue Systems
4.1 Potential Benefits of Multimodal Interfaces
4.2 Multimodality in Spoken Dialogue Systems
4.3 Future Directions
References
5 Emotional Speech Recognition
5.1 Significant Developments in Speech Recognition
5.2 Future Directions in Speech Recognition and Understanding
References
Chapter 1
Introduction
Keywords Nonverbal · Paralanguage · Kinesics · Proxemics · Prosody ·
Suprasegmental · Auditory cues · Vocalizations · Linguistic · Segmental
and psychologists believe that human language evolved from a system of nonverbal
communication [1]. Tannen [2] estimates that as much as 90 % of all human com-
munication is nonverbal, whereas Noam Chomsky and many others argue that verbal
language is an advanced and refined form of an inherited nonlinguistic system [3].
We continuously give and receive wordless signals like body language, which
includes facial expressions, eye contact and posture etc. The gestures we perform,
the way we sit, the way we talk, how close we stand and how much eye contact
we make reflect our real feelings and intentions. The sound of our voice, including
pitch, tone and volume, is also a form of non-verbal communication. Nonverbal
communication focuses on information that is shared but does not contain words,
for example, messages sent through body motions; vocal qualities; and the use of
time, space, artifacts, dress, and even smell. This nonverbal content emphasizes,
complements and substitutes for the verbal message. Therefore, it can be said that
communication is a continuous process wherein a considerable portion of our
behaviour is implicit, learned and innate.
According to Wertheim [4], non-verbal communication may be used to express
emotions, communicate interpersonal relationships, support verbal interaction,
reflect personality and perform rituals, such as greetings and goodbyes.
Unlike verbal messages, no formal set of alphabets or grammar exists for non-
verbal codes. But nonverbal communication may be classified on the basis of the
various channels through which nonverbal signals are sent. These channels are
kinesics, proxemics, haptics, olfactics, physical appearance and paralanguage.
Kinesic behaviour, or body movement, includes gestures, hand and arm move-
ments, leg movements, facial expressions, eye gaze and blinking, and stance or
posture. It deals with the interpretation of any part of the body or the body as a
whole. Knapp and Hall [5] point out that perhaps more than any other part of the
body, the face has the highest nonverbal sending capacity. Through facial expres-
sions, we can communicate our personality; open and close channels of communi-
cation; complement other nonverbal behaviour; and communicate emotional states
[6]. Proxemics is the study of interrelated observations of man’s use of personal
space to communicate with others by virtue of the relative position of bodies. Personal
space during conversation plays an important role and several studies show that
culture, ethnicity and sex greatly influence and affect proxemic distances [7, 8].
Haptic communication is a form of nonverbal communication that describes how
we communicate through the use of touch. Like proxemics, it varies widely across
cultures, sex, situation and relationship of the people involved. Olfactics refers to
people’s sense of smell. There are certain qualities or thoughts attributed to the
specific scents that give them their moral significance. Olfactic research is the least
understood of all due to the lack of sufficient smell vocabulary. It is often said
that the first impression about a person is formed from his physical appearance.
People dress differently across cultures and a person’s appearance communicates
the person’s age, sex and even status. This nonverbal channel of communication
may be manipulated easily and may result in negative sanctions. Each of
the dimensions discussed above explains differences in nonverbal communication
across cultures.
1.1 Paralanguage
to a significant extent, can judge a speaker’s age, sex, race, education, status, geo-
graphic origin, and emotional disposition. Often, paralinguistic qualities, vocaliza-
tions, and nonfluencies reveal a speaker’s emotional state and/or veracity.
According to Knapp and Hall [5], paralanguage may be divided into voice
qualities that include pitch, rhythm, tempo, articulation, and resonance of the
voice and vocalizations that include laughing, crying, sighing, belching, swallow-
ing, clearing of the throat, snoring etc. Trager [6], on the other hand, defines three
kinds of vocalizations: (1) Vocal characterizers include non-language sounds such
as laughing, giggling, yelling, whining and voice breaking. (2) Voice qualifiers are
the qualifying portions of language material and include intensity, pitch height and
extent. (3) Vocal segregates include vocal fillers (such as uh-uh, ooh, hmm), silent
pauses, and other hesitation phenomena.
Various acoustic cues have been observed for discrete emotions and the rela-
tionship between acoustic correlates and acted emotions may be summarised as
follows:
Pitch is the perceived frequency of sound and can affect social meanings. The
different sizes of the vocal folds of men and women influence pitch range, so that
adult male voices are usually lower pitched than female voices. The speaker uses
different patterns of pitch, consciously or unconsciously, to convey different meanings.
Changes in pitch, or the pitch contour, convey shades of meaning such as surprise,
tension and emphasis.
Pause refers to rest, hesitation, or a temporary stop. Every language has patterns
of pauses which vary the tempo of speech. They may enhance delivery
or be filled unnecessarily and distract. Depending on the duration and frequency
of pauses in an expression of speech, one may infer the state of a person. Too
many pauses indicate a lack of control and an over-emotional expression, whereas
silence, which is a pause of longer duration, is considered a failure of dialogue.
Moreover, a person who talks little is considered shy and uncommunicative. A different
category of pauses, called ineffective pauses, conveys various facets of the
personality and affective state of the speaker [13]:
Filled pauses are hesitation sounds that speakers employ to maintain con-
trol of a conversation. They do not alter the meaning of the sentence, do not
add any information and indicate lack of certainty in what is being said. Uh,
um, ah are some of the fillers that are used globally to reformulate or rephrase
representations.
Interjections are short verbal utterances that do not convey any meaning and
are used as fillers. They are a kind of disfluency, with or without accompanying
tension, and are very common. Examples include um, like, I mean and uh; these may
occur either at the start or in the middle of a sentence.
Speech disfluencies are breaks, irregularities, or non-lexical vocables that occur
within the flow of otherwise fluent speech, including false starts (words and sen-
tences that are cut off in the middle), phrases that are restarted and repeated,
grunts, or fillers like uh, erm, and well.
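As a rough illustration of how filled pauses might be quantified, the following Python sketch counts fillers in a plain-text transcript and reports their rate per 100 words. The filler inventory (`FILLERS`) and the function name `filled_pause_stats` are hypothetical choices made for this example, taken from the fillers mentioned above; they are not a method described in the text, and a real inventory would depend on the language and the transcription scheme.

```python
import re
from collections import Counter

# Hypothetical filler inventory based on the examples in the text (uh, um, ah, erm).
FILLERS = {"uh", "um", "ah", "erm"}

def filled_pause_stats(transcript):
    """Return (filler counts, fillers per 100 words) for a plain-text transcript."""
    # Lower-case the text and keep alphabetic tokens (apostrophes allowed).
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w in FILLERS)
    total = sum(counts.values())
    # Guard against an empty transcript to avoid division by zero.
    rate = 100.0 * total / len(words) if words else 0.0
    return counts, rate

counts, rate = filled_pause_stats("Um, I think, uh, we should, um, start again.")
print(dict(counts), round(rate, 1))
```

A count of this kind captures only the lexical side of hesitation; silent pauses would have to be measured from the audio signal itself.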
Stress relates to the relative emphasis placed on certain words within sentences
and conveys both meaning and emotion. There are various ways in which stress
manifests itself in the speech stream and affects articulation.
Volume is the perceived loudness of the speaker and is generally used to show
emotions such as fear or anger. It is a measure of the physical strength of the signal
and also plays a role in influencing the affective state of an individual.
Traditionally, research has been conducted to explore the nonverbal acoustic
manifestations of emotions in a corpus of acted speech. It has become possible to
isolate the role of acoustic information in conveying emotion and several studies
have even made use of nonsensical utterances to factor out all possible lexical or
semantic effects [14–17].
The social implications of all these prosodic features are subject to cultural
interpretation. However, prosody is related to various levels of information, from
linguistic, paralinguistic, to non-linguistic, and, therefore, its acoustic manifesta-
tion is rather complicated with large variations.
In most current speech recognition systems, prosodic features are not utilized
to their full potential. The role of prosodic features in speech recognition increases in
the case of spontaneous speech. As discussed, speech includes a number of irregularities
and hesitations which degrade the performance of recognition systems. As
a result, extracting the prosodic features of these segments or frames of speech
may yield different and useful outcomes that further aid the recognition
process. From this viewpoint, detection of utterance irregularities becomes important
for future work.
To provide an aid for psychological research in this area, a study was taken up
for measuring aspects of non-verbal behaviour through speech [18]. This research
explores the various indicators for non-verbal cues of speech and provides a
method of building a paralinguistic profile of these speech characteristics. The
scope of the study involves short term acoustic feature extraction, decision mak-
ing and analysis of various speech samples based on sections of spaces present and
pitch contours. The values obtained for average pitch, short-term energy, silence
duration, rate of speech and loudness were analyzed for different possibilities and
a decision output was generated based on these five paralinguistic parameters, classifying
the individual as confident, tense, happy or sad.
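Two of the parameters named above, short-term energy and silence duration, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the cited study [18]: the frame length, the energy threshold, the synthetic test signal and the function names are all hypothetical, and the decision rules that map the parameters to a label are not given in the text and are therefore not reproduced.

```python
import math

def short_term_energy(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def silence_duration(energies, frame_len, sample_rate, threshold):
    """Total duration in seconds of frames whose energy falls below the threshold."""
    silent_frames = sum(1 for e in energies if e < threshold)
    return silent_frames * frame_len / sample_rate

# Synthetic signal (assumption): 0.5 s of a 200 Hz tone followed by 0.5 s of silence.
sample_rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(4000)]
signal = tone + [0.0] * 4000

frame_len = 160  # 20 ms frames at 8 kHz, an assumed frame size
energies = short_term_energy(signal, frame_len)
silence = silence_duration(energies, frame_len, sample_rate, threshold=1e-4)
print(len(energies), round(silence, 2))
```

On real recordings the threshold would need to be set relative to the background noise level rather than fixed in advance.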
As evident from above, the investigator must consider internal as well as
external factors which influence behaviour before interpreting a subject’s behav-
iour. The relative contribution of different cultural dimensions to emotion infer-
ences from vocal expressions must be examined in future work. The phenomenon
of vocal expression and content free utterances is of particular interest due to its
likely roots in nonhuman primate vocalizations and the evolution of language and
communication systems. The role of paralinguistic vocalizations has already
been explored in diverse applications. Some examples are as follows: in
cathartic experiences where individuals express themselves more openly [19],
in expressive communication which enhances participants’ cognitive skills and
enables some speech-disabled participants to access words through the melody [20]
and in voice and speech therapy [21], in addition to the use of paralinguistic voice
input for maintaining a near real time experience.
Paralinguistic vocal control may be affected by a number of factors includ-
ing social context, background, physical impairment, state of mind etc. and there
are many aspects of voice which have not yet been tapped and exploited fully to
enhance the user’s experience and speech recognition techniques. Vocal input may
also serve as a useful therapeutic intervention for various disorders and inclusion
of non-speech vocal control may become necessary for reliable speech recognition
in cases of speech impairment. These limitations can be overcome by developing
new generation methods and techniques that will bring a fundamental shift in the
conventional procedures and enable users to interact with a real world interface in
synergy with the technology.
References
17. Banziger T, Scherer KR (2005) The role of intonation in emotional expressions. Speech
Commun 46(3–4):252–267
18. Johar S (2014) Paralinguistic profiling using speech recognition. Int J Speech Technol
17(3):205–209
19. Heron J (1977) Catharsis in human development. Human potential research project.
University of Surrey, Guildford
20. Elliott J (2005) How singing unlocks the brain, BBC News. http://news.bbc.co.uk/1/hi/
health/4448634.stm. Accessed 10 Apr 2015
21. Arizona Health Sciences Library (2006) Health problems of musicians. http://www.ahsl.
arizona.edu/about/ahslexhibits/musicianmedicalmaladies/instruments.cfm. Accessed 22 Feb
2015
Chapter 2
Psychology of Voice
Abstract The sound of every individual’s voice is unique due to the difference in
the size and shape of vocal cords. The vocal folds loosen and tighten resulting in a
change in pitch, volume, timbre, or tone of the sound produced. Analyzing speech
from a physiological perspective, this chapter explores the pitch component of voice
and how influential it can be. Interestingly, information regarding prosody, emotions,
gender and age is affected by pitch and pitch can help in unconsciously divulging
the feelings, moods and emotions. The chapter also highlights vocal behaviour as
a powerful index of emotional and personality markers which are paramount in the
extraction of meaningful information from acoustic signals and contribute to a better
understanding of the psychology of voice and performance capabilities.
The periodicity of the glottal pulse and the time-variations of glottal pulse period
convey the intent, expressional content, intonation and stress in the speech signals
[1]. The time duration of one glottal cycle is defined as the pitch period and its
reciprocal is the pitch or fundamental frequency. Pitch is determined by the length,
tension, mass of the vocal cords and the sub-glottal pressure. It carries information
regarding the prosody or rhythm, emotion, speaking style and accent among many
others. The following information is contained in the pitch signal:
(a) Gender classification aims to predict the gender of the speaker by analyzing
different parameters of the speech signal. It is mainly conveyed by the pitch
value and in part by the vocal tract characteristics. The average pitch for
males is about 110 Hz while for females it is about 200 Hz [2].
(b) Emotional states are correlated with particular physiological states, which in
turn have predictable effects on speech features, especially on pitch, timing
and voice quality. Speech emotion recognition is particularly useful for applications
which require natural man-machine interaction. When a person is
in a state of anger, joy or fear, the speech is fast and loud, with strong high
frequency energy. When someone is sad or bored, slow, low pitched speech
with weak high frequency energy is produced. Pitch variation is often correlated
with loudness variation. Happiness, distress and extreme fear in voice
are also signalled by fluctuations of pitch.
(c) Accents convey information about the status of individual entities in the
discourse to indicate their relative salience. They are also largely conveyed by
changes in the pitch and rhythm of speech. In addition, a certain type of pitch
movement may signal an intonational meaning.
(d) Prosody is a parallel channel of communication for carrying information that
cannot be deduced from the lexical channel. All aspects of prosody are transmitted
by muscle motions, and time-variations of pitch have a smooth relationship
with these muscle tensions. Prosody gives clues to many channels of linguistic and
paralinguistic information and can indicate syntax and people’s attitudes and
feelings. Even hand gestures and eyebrow and face motions can be considered
prosody because they carry information that modifies and can even reverse the
meaning of the lexical channel.
(e) Age and state of health of a speaker are also related to pitch. The biological
fact that the ratio of eye diameter to head diameter varies markedly with age
develops connections between the sound shape, meanings or communicative
intentions, emotions and affect of the speaker. As a result, a visual estimation
of the ratio eye diameter/head diameter is a rough indicator of age and size of
speaker.
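The definition given at the start of this section, pitch as the reciprocal of the pitch period, can be illustrated with a short Python sketch. It estimates the pitch period of a synthetic signal from the peak of its autocorrelation and then compares the resulting frequency with the average values quoted in item (a). The autocorrelation method, the search range of 60 to 400 Hz, the test signal and the nearest-average comparison are all assumptions made for this illustration; none of them is described in the text, and a nearest-average rule is far too crude for real gender classification.

```python
import math

def estimate_f0(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate fundamental frequency (Hz) from the autocorrelation peak."""
    lag_min = int(sample_rate / f_max)  # shortest pitch period considered, in samples
    lag_max = int(sample_rate / f_min)  # longest pitch period considered, in samples
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation of the signal with itself shifted by `lag`.
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # The pitch period is best_lag / sample_rate seconds; F0 is its reciprocal.
    return sample_rate / best_lag

# Average values quoted in the text for male and female speakers.
REFERENCE_F0 = {"male": 110.0, "female": 200.0}

def nearest_reference(f0):
    """Return the reference label whose average F0 is closest (illustrative only)."""
    return min(REFERENCE_F0, key=lambda label: abs(REFERENCE_F0[label] - f0))

# Synthetic test signal (assumption): 0.1 s of a pure 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]
f0 = estimate_f0(signal, sample_rate)
print(round(f0, 1), nearest_reference(f0))
```

Natural speech is only quasi-periodic, so a practical estimator would work frame by frame and would also need a voicing decision before any pitch value is reported.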
2.2 Speech Markers
distinguishing a large range of emotions over a range of human voices and con-
text, adding naturalness to synthesized speech and thereby facilitating effective
emotional speech processing. Since emotion analysis varies with culture, language
and even population, it is essentially a multi-faceted approach and improvement in
speech emotion recognition performance has been achieved by combining gestural
information along with acoustic correlates. Anger, fear, sadness, joy, neutral and
surprise are some of the common emotions identified by current speech dialogue
and processing systems.
Most of the current methods for measurement and analysis of these cues are
intrusive and require specialized equipment and expertise to make explicit and
detailed predictions regarding the states conveyed in emotional speech. Studies on
emotion may focus on the expression of the emotion by the speaker, the acoustic
cues that convey the intended emotion, the perception of these cues and the infer-
ence about the expressed emotion. Several studies have explored affect inferences
from voice cues in listening tests, where the participants are required to judge the
emotions expressed in speech samples using various response formats like forced
choice and quantitative ratings. According to Scherer [5], various content-mask-
ing procedures that disrupt or degrade individual voice cues can be used to study
which voice cues are used by listeners to infer specific emotions.
The existence of various voice profiles for different emotions and the complex
nature of the voice production process make this task quite challenging.
Inconsistent data regarding voice cues to specific
emotions, individual differences among speakers, weak emotional effects and
the interplay of spontaneous and strategic expressions are some sources of variability
that pose practical problems in decoding emotion portrayals. As a result, efforts
are being directed towards cross-cultural studies, implementation of multi-modal
approaches in emotion expression and intensive research collaboration from psy-
chology, acoustics, engineering and computer science to facilitate better under-
standing of how emotions are revealed by various aspects of the voice.
As mentioned before, the scope of voice-based human-machine interaction
expands beyond directed dialogue and simple command and control type
interfaces. Future machines will need to be able to interpret a specific context,
which is determined by many factors including the quality of voice, and produce
the respective output. An analysis of the semantic nature of personality traits and
interpersonal and intra-personal behaviour dispositions reveals the underlying
dimensions regarded as essential determinants of social interactive behaviour.
Controversy that surrounds the concept of personality has forced social and
behavioural scientists to debate the nature of personality and its impact on behav-
iour. Since listeners rely heavily on speech style to attribute personality to the
speaker, the possibility of accurate personality inferences from speech remains
and high neurotics used pronouns and verbs more pervasively. Also, high extro-
verts used more conjunctions overall and low extroverts preferred more nouns and
adjectives.
In order to model personality traits for speech synthesis using different speakers
and to identify one or more of several defined personalities in dynamic situations,
future work will depend on the availability of data and large databases in
order to avoid any influence of the bias of the listener’s perspective. Speech is a
highly complex interaction of communicative as well as informative characteris-
tics that convey information about the speaker’s identity, his emotional state and
the situational context. In addition to pitch and rate, additive models that involve
other vocal factors must be designed to understand how semantic information is
conveyed by the paralinguistic parts of speech. Future personalized speech syn-
thesis systems would require an understanding of how personality is encoded in
spoken communication along with refined methods to analyze speech. Moreover,
capturing emotional states along with the personalities would facilitate a more
holistic system of estimating behaviour from speech. Such parametric synthesis
of speech can be used for diverse commercial applications to indicate personal-
ity impressions that individuals leave on one another and highlight the existence
and psychological significance of personality as an important correlate in social
interaction.
References
13. Brown BL, Strong WJ, Rencher AC (1974) Fifty-four voices from two: the effects of simulta-
neous manipulations of rate, mean fundamental frequency, and variance of fundamental fre-
quency on ratings of personality from speech. J Acoust Soc Am 55:313–318
14. Scherer KR, Scherer U (1981) Speech behaviour and personality. Speech evaluation in psy-
chiatry. Grune & Stratton, New York, pp 115–135
15. Nass C, Lee KM (2001) Does computer-synthesized speech manifest personality?
Experimental tests of recognition, similarity-attraction, and consistency-attraction. J Exp
Psychol Appl 7(3):171–181
16. Mairesse F, Walker MA, Mehl MR, Moore RK (2007) Using linguistic cues for the automatic
recognition of personality in conversation and text. J Artif Intell Res (JAIR) 30:457–500
17. Eysenck H, Eysenck SBG (1991) The Eysenck personality questionnaire-revised. Hodder
and Stoughton, Sevenoaks
18. Isbister K, Nass C (2000) Consistency of personality in interactive characters: verbal cues,
non-verbal cues, and user characteristics. Int J Hum Comput Stud 53:251–267
19. Furnham A (1990) Language and personality. In: Giles H, Robinson W (eds) Handbook of
language and social psychology. Wiley, Chichester, pp 73–95
20. Dewaele JM, Furnham A (1999) Extraversion: the unloved variable in applied linguistic
research. Lang Learn 49:509–544
21. Oberlander J, Gill AJ (2004) Individual differences and implicit language: personality, parts
of speech and pervasiveness. In: Proceedings of the 26th annual conference of the cognitive
science society. Chicago, IL, USA
Chapter 3
Language, Communication and Human
Behaviour
Keywords Coverbal · Linguistic · Phonological · Syntactic · Semantic ·
Interpersonal · Nonverbal · Gesture · Body language
Individual behaviours are deeply embedded in social and institutional contexts. We are
guided as much by what others around us say and do, and by the “rules of the game” as
we are by personal choice [1].
Other people’s beliefs and behaviour can have a strong social influence on our
own behaviour, a phenomenon that has been widely discussed in recent years.
Communications can be effective in highlighting social norms and prompting peo-
ple to act in accordance with them. For example, online forums or communities
where people connect to others in similar circumstances can be particularly help-
ful with regard to sensitive issues, as social proof and reassurance can be provided
in a ‘safe’ and anonymous manner. There exist a number of social psychological
models, specific to particular behaviours that provide a comprehensive picture of
the communication and behaviour relationship [2].
The most recent trend of thought is the idea that all of human behaviour and
emotion originate from the brain and there is an anatomical area of the brain that
is responsible for language and communication in humans.
From the above discussion, it is clear that language and communication per-
vade social life. A person’s need to communicate and form relationships is one of
the most extensively studied of all the human behaviours. It is the primary means
by which we gain access to the contents of others’ minds. In natural sciences, biol-
ogists are only interested in the most observable manifestation of communication,
i.e. language.
Linguists often say that language and communication are different, which is certainly
true, as people do communicate without language and language can be considered
a tool for communication in human societies. But there is also no
doubt that without this capacity for linguistic communication, the nature of human
life would be radically different. Linguists regard language as an abstract set of
principles that specifies the relation between a sequence of sounds and a sequence
of meanings [3]. Language is often implicated in various social phenomena: social
perception, attitude change, social interaction, stereotyping etc.
Any language is made of four systems: phonological, morphological, syntactic
and semantic. The phonological system is concerned with the analysis of an
acoustic signal into a sequence of speech sounds, and the morphological system
deals with the way words are constructed out of these phonological elements,
called phonemes. The syntactic system is concerned with the organization of
words into phrases and sentences, and the semantic system gives meaning to
these units. Particular acts of speech can also be regarded as actions intended to
accomplish a specific purpose, such as assertions and requests. They are typically embedded
in a discourse, and a distinction needs to be drawn between the literal meaning
of an utterance and its intended meaning.
accompany speech that are not strictly linguistic and add meaning to it are catego-
rized as coverbal or broadly, nonverbal behaviours. Gestures, facial expressions,
eye contact, body language are some of the common behaviours that can occur
apart from the context of speech. The coverbal complements the verbal chan-
nel of communication, whereas nonverbal is supplementary to the verbal content
and independent of it. According to Mehrabian and Ferris [21], 55 % of the message
is delivered through body language, that is, the coverbal behaviour, and 38 %
is communicated by the pitch, tone and paralanguage elements, that is, the nonverbal
behaviour. This suggests that coverbal and nonverbal channels are much
more powerful in communicating the credibility of the message than the actual spoken
words, i.e. the verbal channel. Goffman has said: ‘Although an individual can
stop talking, he cannot stop communicating through body expression; he must say
either the right thing or the wrong thing. He cannot say nothing’ [22, p. 35].
In recent years, much research in psychology and psychiatry has been conducted
on nonverbal communication. These efforts have attempted to measure the
occurrence of these behaviours and identify their communicative significance.
Emotional states, attitudes, and other affective and regulatory
information are communicated by behaviours that occur in association with or
accompany words but do not stand alone. The investigation of these non-linguistic
behaviours exhibited by individuals constitutes coverbal or nonverbal
communication research.
A deep understanding of these behaviours should lead to a clear understanding of
the role that communications can play in the establishment of specific and realistic
objectives. Understanding behaviour and its influences will stimulate the debate
about how communications can most effectively influence behaviour and enable us
to harness the most efficient and effective communications channels.
Verbal and nonverbal elements depend on each other for a holistic and correct
interpretation of the act. Distortion caused by one of the elements is compensated
for by the other, so that both what is being said and how it is being said by the
speaker are continuously conveyed. Nonverbal cues help regulate the system by
defining and constraining the pattern of interaction and providing feedback. They may
sometimes convey content and intention more efficiently than linguistic signs, usu-
ally in an independent manner. According to Ekman and Friesen [23], repetition,
contradiction, complementation, accent and regulation are the general functions of
nonverbal behaviour that signal the flow of interaction.
While the study of verbal and non-verbal behaviour has been done indepen-
dently in several disciplines, the relationship between the two has not received the
attention it deserves. The structural relations between units, thematic development,
emotion, and modalization and interpersonal relation between speaker and hearer
are the different levels that need to be considered to understand this phenomenon
of conversational analysis [24]. The fusion of verbal and nonverbal facets appears
to have both genetic and sociocultural consequences, and this virtuous unification
not only distinguishes societies and cultures from one another but also clearly
marks the human species as distinct from other species.
From a pragmatic point of view, prosodic features are important contextualization
cues for speech production and perception. These features may occur at different
times in the speech production cycle: preceding the act of speech, during speech
itself or following the act of speech. According to Trager [25], any communication
contains certain sequences called vocalizations that consist of noise and do not
have the structure of language. They constitute paralanguage and other voice
qualities such as intonation, pitch range and rhythm control. Vocal characterisers,
vocal qualifiers and vocal segregates together constitute paralanguage. Laughing,
yelling, yawning and crying are some of the vocal characterisers, whereas intensity,
pitch height and extent are regarded as vocal qualifiers. Certain parts of speech
neither convey meaning nor fit into proper word frames in language sequences.
Items such as uh, ah and hmm are used as fillers and come under vocal segregates.
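The three-way division described above can be made concrete with a small lookup.
The sketch below is illustrative only: the category sets are seeded with just the
examples named in the text, and the names PARALANGUAGE, classify_paralanguage and
split_fillers, as well as the fallback label "unclassified", are assumptions
introduced here rather than part of Trager's scheme.

```python
# Illustrative lookup for the paralanguage categories named in the text.
# The members are only the examples given there; all names are hypothetical.
PARALANGUAGE = {
    "vocal characteriser": {"laughing", "yelling", "yawning", "crying"},
    "vocal qualifier": {"intensity", "pitch height", "extent"},
    "vocal segregate": {"uh", "ah", "hmm"},
}


def classify_paralanguage(item: str) -> str:
    """Return the category of a paralinguistic item, or 'unclassified'."""
    key = item.strip().lower()
    for category, members in PARALANGUAGE.items():
        if key in members:
            return category
    return "unclassified"


def split_fillers(tokens):
    """Separate vocal segregates (fillers) from the remaining tokens."""
    fillers = [t for t in tokens if classify_paralanguage(t) == "vocal segregate"]
    words = [t for t in tokens if classify_paralanguage(t) != "vocal segregate"]
    return words, fillers
```

For example, split_fillers(["uh", "I", "hmm", "agree"]) separates the fillers
"uh" and "hmm" from the words "I" and "agree".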
Recently, many research studies have focused on the association between nonverbal
behaviours and psychological states. Psychologically oriented approaches cover
all forms of nonverbal behaviour, such as gestures, facial expressions, visual
behaviour and crowd behaviour. They essentially focus on interpreting human
behaviour using statistical measures different from those employed in linguistic
studies. The enrichment of nonverbal communication research through the emergence
of psychologically oriented studies has gained momentum; it deals with the
description of the psychological states of individuals expressing nonverbal
behaviour in speech. Mehrabian [26] suggests that this description of findings
should account for the relationships among behavioural cues, along with the
relationship between these cues and the feelings, personalities and attitudes of
the communicators, keeping in view the situation in which the interaction occurs.
The incredibly dynamic nature of human behaviour and the flaws of social sci-
entific research methodology make the study of human behaviour a challenging
22 3 Language, Communication and Human Behaviour
References
[Figure: components of a multimodal architecture. Labels recoverable from the
original: input modes (speech, visual gesture, motor, touch); analysis, recognition
and verification; fusion of modalities; application interface and interaction;
user, discourse, language and modality modelling; output modes (speech, sound,
graphics); presentation design.]
Multimodal architectures offer a wide range of benefits over other user interfaces
[3]. A single modality does not permit the user to interact effectively across all tasks
and environments [4]. Multiple modalities offer a free choice of input mode,
flexible use of input modes depending on the specifics of the task or environment,
and communication close to human-human communication, which makes the interaction
more natural. They offer high efficiency because the best suited modality is used
for each task. Multiple modalities also increase the accuracy of the user interface
and hence lead to
enhanced error avoidance as one modality can indicate an object more accurately
than some other modality. The ability and preference to use different modes of
communication permits the user to exercise control over their interaction with the
computer. In this respect, they accommodate a wider range of users, tasks and envi-
ronmental situations. According to van Wassenhove et al. [5], humans may process
information faster and in a better way when it is presented in multiple modalities.
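The point that one modality can compensate for the uncertainty of another can be
illustrated with a simple late-fusion rule. This is only a sketch under stated
assumptions: the weighted-sum combination, the function name fuse_scores, the
weights and the candidate scores are all chosen here for illustration, and real
systems use a variety of fusion schemes.

```python
def fuse_scores(modality_scores, weights):
    """Combine per-modality candidate scores by a normalized weighted sum.

    modality_scores: {modality: {candidate: score in [0, 1]}}
    weights:         {modality: positive weight}
    Returns (best_candidate, fused_scores).
    """
    fused = {}
    total_weight = sum(weights[m] for m in modality_scores)
    for modality, scores in modality_scores.items():
        for candidate, score in scores.items():
            fused[candidate] = fused.get(candidate, 0.0) + weights[modality] * score
    # Normalize so the fused scores stay on the same scale as the inputs.
    fused = {c: s / total_weight for c, s in fused.items()}
    best = max(fused, key=fused.get)
    return best, fused
```

If speech alone cannot separate two candidates (say 0.5 each for "lamp" and "map")
but a pointing gesture clearly favours "lamp", the fused score selects "lamp".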
Richard Bolt’s “Put That There” system [6] was the groundbreaking demonstration of
multimodal processing: it integrated voice and gesture inputs to enable a user to
interact naturally with a display. Since then, considerable strides have been made
in developing more complex multimodal systems
bringing together new modalities such as haptics and eventually introducing
mobile computing environments as a rich testbed for multimodality. During the
past decade, there has been significant progress in the development of spoken lan-
guage technology and natural language processing. These technologies are being
further aided by advances in pen-based hardware and software capabilities to
automate telephony and other real world applications. To date, most multimodal
systems combine either speech and pen input [7] or speech and lip movements
[8–10]. This is because speech input offers ease of use and high bandwidth
information. On the other hand, pen input provides a more socially acceptable
form of input and a viable alternative to speech under circumstances of extreme
noise [11, 12]. As a result, such complementary multimodal spoken systems permit
more transparent information seeking and more expressive interaction, with flexible
descriptions of objects and situations. These systems can support greater precision
of spatial information than a speech-only interface and therefore support shorter
and simpler utterances, which results in fewer disfluencies and greater robustness
[13]. Audio-visual integration is another widespread area of research that helps
increase the granularity of the system. Dynamic navigation systems, when combined
with speech recognition, can cover a wider range of environmental conditions in
real life scenarios than is possible when such modes are used separately.
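How a spoken command and a pointing gesture complement each other, as in the
“Put That There” example, can be sketched in a few lines. This is a minimal sketch
under stated assumptions, not Bolt's implementation: the data classes, the
time-stamped inputs and the nearest-in-time pairing rule are hypothetical choices
made for illustration.

```python
from dataclasses import dataclass

DEICTIC_WORDS = {"that", "there", "this", "here"}


@dataclass
class Word:
    text: str
    time: float  # seconds at which the word was spoken


@dataclass
class Pointing:
    target: str  # object or location under the pointer
    time: float  # seconds at which the gesture was sensed


def resolve_deictics(words, gestures):
    """Replace each deictic word with the unused gesture target closest in time."""
    resolved = []
    unused = list(gestures)
    for word in words:
        if word.text.lower() in DEICTIC_WORDS and unused:
            nearest = min(unused, key=lambda g: abs(g.time - word.time))
            unused.remove(nearest)  # each gesture resolves at most one word
            resolved.append(nearest.target)
        else:
            resolved.append(word.text)
    return resolved
```

Speech supplies the action while the gesture supplies the spatial detail, which is
why the spoken utterance can stay short.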
[Figure: spoken dialogue system pipeline. User input speech → automatic speech
recognizer (ASR) → machine readable words → language understanding engine →
semantic meaning → dialogue manager → text generator → text → speech synthesizer →
system output speech.]
modules for merging inputs from different modalities, decomposing the multi-
modal messages to respective output modality and controlling the timing of input
and output signals of the system.
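The flow from user speech to system speech through a recognizer, a language
understanding engine, a dialogue manager, a text generator and a synthesizer can be
sketched as a chain of functions. Every stage below is a toy stand-in: the function
names, the keyword rules and the canned replies are hypothetical and only show how
the output of one stage becomes the input of the next.

```python
# Toy stand-ins for the stages of a spoken dialogue system.
# All function names, rules and replies are hypothetical.

def recognize(audio: str) -> str:
    """ASR stand-in: pretend the audio is already a transcript."""
    return audio.lower().strip()


def understand(words: str) -> dict:
    """Language understanding stand-in: map words to a semantic frame."""
    if "weather" in words:
        return {"intent": "ask_weather"}
    if "time" in words:
        return {"intent": "ask_time"}
    return {"intent": "unknown"}


def manage_dialogue(frame: dict) -> dict:
    """Dialogue manager stand-in: choose the next system action."""
    actions = {"ask_weather": "report_weather", "ask_time": "report_time"}
    return {"action": actions.get(frame["intent"], "ask_repeat")}


def generate_text(action: dict) -> str:
    """Text generator stand-in: turn the action into a sentence."""
    templates = {
        "report_weather": "It is sunny today.",
        "report_time": "It is ten o'clock.",
        "ask_repeat": "Sorry, could you repeat that?",
    }
    return templates[action["action"]]


def synthesize(text: str) -> str:
    """Speech synthesizer stand-in: tag the text as spoken output."""
    return f"<speech>{text}</speech>"


def respond(audio: str) -> str:
    """Run one dialogue turn through all five stages in order."""
    return synthesize(generate_text(manage_dialogue(understand(recognize(audio)))))
```

Keeping each stage behind its own function mirrors the modular structure of such
systems: any one stand-in could be replaced without touching the others.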
There are thousands of modalities in existence, both input and output, that can be
incorporated into interface designs [17]. Modality theory by Bernsen [18] is about
representational modalities and not the devices which machines and humans use
when they exchange information, such as sensors, hands, joysticks etc. It states that:
Given any particular set of information which needs to be exchanged between the user
and system during task performance in context, identify the input/output modalities which
constitute an optimal solution to the representation and exchange of the information [18].
The constructive proposition is that the world of modalities is far more stable
than the world of devices and hence much more fit for theoretical treatment. On the
negative side, the theory does not address the issue of device selection for a
particular set of input/output modalities in a specific application. The Modality
Theory of Bernsen [18] provides a basis for examining arbitrary input/output
modality types and combinations with respect to their capabilities of information
representation and exchange. The Sutcliffe and Faraday method and the Roskilde
method provide a theoretical and methodological basis for information mapping
between user and system based on the modality [19].
Given today’s technological advancements, we need to attend to a much wider
range of modalities and their combinations. Now and in the future, major advances
in new input technologies and algorithms, processing speed, distributed computing
and spoken language technology will introduce a new class of sophisticated
multimodal systems for human-computer interaction. The practical implementation of
Modality Theory can offer a sound theoretical framework for combining existing and
emerging developments, so that modalities are used appropriately in efficient
interaction design and development.
30 4 Multimodality and Spoken Dialogue Systems
4.3 Future Directions
Designing well integrated and robust multimodal systems is a new and emerging
field of interest. These systems represent a variety of platforms and they illus-
trate the diverse and challenging nature of emerging multimodal applications.
Multimodal interaction offers many performance advantages to users, as outlined
in the previous section, but many challenges remain before sophisticated multimodal
interaction becomes an indispensable part of computing. Given the vast potential
that multimodality has for making human-computer interaction more natural, easier
and even more efficient, it becomes important to model human-like sensory
perception and communication patterns. Though the field is developing rapidly,
most systems to date are bimodal, research-level systems. Therefore, developing
multimodal integration methods and architectures that cover a wider range of
methods and modality combinations remains a major issue for future research. Each
unimodal technology, such as speech and sound recognition, haptics, language
understanding, vision and gesture based recognition, and context modeling, is an
active area of research in itself, and studies must explore both the development
and understanding of individual modalities and methods for multimodal integration.
Hardware advances and fundamental improvements in metrics and machine
learning techniques further challenge the development of richer and more person-
alized communication interfaces. Interdisciplinary cooperation is of considerable
importance to better understand multimodal interaction. There are different views
on how to define multimodal user interfaces and how to select, deploy and
coordinate various modes in specific tasks of human computer interaction. To better
understand natural communication modalities, and how the brain identifies the best
suited modalities in a given context to achieve the synergy needed to complete a
task, a comprehensive organization of the literature on psychology, cognitive
science, linguistics, neuroscience and computer vision must be made available. This
literature should be used extensively as a basis for empirical work and for
innovative system designs, guiding the design of new interfaces that are consistent
with human capabilities and perceptions [20–23].
To build systems that perform these tasks with a high level of robustness,
multimodal knowledge acquisition and representation are of paramount importance.
Today, digital libraries and online resources have become a major source of
information for scholars and the general public. It will be necessary to
automatically acquire knowledge and extract information from such huge repositories or
multimodal knowledge bases that will aid in more sophisticated natural language
processing. The enhanced presentation of multimodal data may be facilitated
by developing virtual environments that include human-like virtual characters,
show convincing emotions and mimic the behaviour of real world individuals.
This will provide a blended interface style that combines both active and passive
modes and improves the system’s prediction abilities. Another big challenge in
this technology is to enable access to everything for everyone and to place the
user at the centre of the design process. There is also a danger of cognitive
overload when exposing users to such 3D, virtual and simulated environments.
Autonomous systems with interactive control and wearable devices that comprehend
the user’s actions
and adjust the cognitive load to provide appropriate responses will be capable of
delivering a ubiquitous and personalized computing environment. There is thus a
need for adaptive systems that can adapt to users’ needs automatically and make
the interaction natural and intuitive [13, 24]. Active adaptation technology for
diverse users is necessary before achieving the full potential and bridging the
virtual and physical worlds.
Multimodal interfaces can be seen as the future user interface paradigm that
exploits the power of computing, human computer interaction and psychology.
These systems will acknowledge the diversity of the human race, as such interfaces
will work with multiple linguistic codes and representation systems in addition to
multiple modalities, thereby supporting broader application functionality. The
combinations of behavioural and non-behavioural modalities will continue
to increase leading to the proliferation of more reliable and innovative interactive
techniques [25]. Implementing such compelling, powerful and efficient technol-
ogy on a commercial scale would also require considerable research to develop
the appropriate infrastructure, automated tools and software to support the features
of next generation multimodal applications discussed above. Simulation tools
will need to be developed to permit research in natural field environments [26].
Availability of significant corpora will be critical for achieving rapid progress in
performance and architecture in the area of spoken language processing [27]. In
all of these ways, revolutionizing this technology of human-computer interaction is
a grand challenge, and much work remains to be done. New and more sophisticated
architectures that support more effective natural language and dialogue processing
will need to be formulated, and issues of privacy and security must be considered,
in order to bring this promising state-of-the-art form of human-computer
interaction outside research laboratories. The cross fertilization of ideas and
perspectives among the diverse areas of engineering, linguistics, psychology and
others will facilitate meaningful research across the entire spectrum, and these
novel multimodal interfaces will represent a new multidisciplinary science that
will aim at preserving the diverse languages of the world.
References
4. Larson JA, Oviatt SL, Ferro D (1999) Designing the user interface for pen and speech appli-
cations. In: Conference on Human Factors in Computing Systems, CHI ’99 Workshop.
Philadelphia, Pa
5. van Wassenhove V, Grant KW, Poeppel D (2005) Visual speech speeds up the neural process-
ing of auditory speech. Proc Natl Acad Sci USA 102(4):1181–1186
6. Bolt RA (1980) “Put-that-there”: voice and gesture at the graphics interface. ACM Comput
Graphic 14(3):262–270
7. Oviatt SL, Cohen PR (2000) Multimodal systems that process what comes naturally.
Commun ACM 43(3):45–53
8. Rubin P, Vatikiotis-Bateson E, Benoit C (1998) Special issue on audio-visual speech process-
ing. Speech Commun 26:1–2
9. Stork DG, Hennecke ME (eds) (1995) Speechreading by humans and machines: models, sys-
tems and applications. Springer, New York
10. Benoit C, Martin J-C, Pelachaud C, Schomaker L, Suhm B (2000) Audio-visual and
multimodal speech systems. In: Gibbon D, Moore R (eds) Handbook of standards and resources
for spoken language systems. Kluwer, Norwell, pp 102–203
11. Gong Y (1995) Speech recognition in noisy environments: a survey. Speech Commun
16:261–291
12. Holzman TG (1999) Computer-human interface solutions for emergency medical care.
Interactions 6(3):13–24
13. Oviatt SL, Cohen PR, Wu L, Vergo J, Duncan L, Suhm B, Bers J, Holzman T, Winograd
T, Landay J, Larson J, Ferro D (2000) Designing the user interface for multimodal speech
and gesture applications: state-of-the-art systems and research directions. Human Comput
Interact 15(4):263–322
14. Huang X, Acero A, Chelba C, Deng L, Duchene D, Goodman J, Hon H, Jacoby D, Jiang L,
Loynd R, Mahajan M, Mau P, Meredith S, Mughal S, Neto S, Plumpe M, Wang K, Wang Y
(2000) MiPad: a next-generation PDA prototype. In: Proceedings of the international con-
ference on spoken language processing (ICSLP 2000), vol 3. Chinese Military Friendship
Publishers, Beijing, China, pp 33–36
15. Bangalore S, Hakkani-Tur D, Tur G (2006) Introduction to the special issue on spoken lan-
guage understanding in conversational systems. Speech Commun 48(3–4):233–238
16. López-Cózar R, Araki M (2005) Spoken, multilingual and multimodal dialogue systems.
Development and assessment. Wiley, West Sussex
17. Bernsen NO, Bertels A (1993) A methodology for mapping information from task domains
to interactive modalities, working papers in cognitive science and HCI. University of
Roskilde, Denmark, pp 93–10
18. Bernsen NO (1994) Modality theory in support of multimedia interface design. In:
Proceedings of the AAAI spring symposium on intelligent Multi-Media—Modal systems.
Stanford, March 1994, pp 37–44
19. Faraday P, Sutcliffe A (1993) A method for multimedia interface design, people and comput-
ers, HCI ’93, pp 173–190
20. Oviatt S, Coulston R, Lunsford R (2004) When do we interact multimodally? Cognitive
load and multimodal communication patterns. ACM international conference on multimodal
interfaces. State College, PA, pp 129–136
21. Calvert GA, Spence C, Stein BE (eds) (2004) The handbook of multisensory processing.
MIT Press, Cambridge, MA
22. Ernst M, Bulthoff H (2004) Merging the sense into a robust whole percept. Trends Cogn Sci
8(4):162–169
23. Chen F, Ruiz N, Choi E, Epps J, Khawaja A, Taib R, Yin B, Wang Y (2012) Multimodal
behaviour and interaction as indicators of cognitive load. ACM Trans Interact Intell Syst
2(4):1–36
24. Kumar S, Cohen PR (2000) Towards a fault-tolerant multi-agent system architecture. In:
Fourth international conference on autonomous agents 2000. ACM Press, Barcelona, Spain,
June 2000, pp 459–466
25. Pankanti S, Bolle RM, Jain A (eds) (2000) Biometrics: the future of identification (special
issue). Computer 33(2):46–80
26. Oviatt SL, Pothering J (1998) Interacting with animated characters: research infrastructure
and next-generation interface design. In: Proceedings of the First Workshop on Embodied
Conversational Characters, pp 159–165
27. Cole R, Hirschman L, Atlas L, Beckman M, Biermann A, Bush M, Clements M, Cohen
P, Garcia O, Hanson B, Hermansky H, Levinson S, McKeown K, Morgan N, Novick D,
Ostendorf M, Oviatt S, Price P, Silverman H, Spitz J, Waibel A, Weinstein C, Zahorian S,
Zue V (1995) The challenge of spoken language systems: Research directions for the nine-
ties. IEEE Trans Speech Audio Process 3(1):1–21
Chapter 5
Emotional Speech Recognition
Abstract Recent years have been marked by a growing need for systems that
can grasp human emotions and in particular, recognize emotions. Emotions lie at
the centre of any social communication and form the basis for an intelligent and
meaningful interaction. The chapter further discusses the acoustic correlates of
emotions and describes various techniques and developments imperative to sup-
port speech interfaces that recognize emotional expressions in real world settings.
Significant advancement in the areas of knowledge representation, infrastructure
requirements and algorithm implementation is a prerequisite for modeling effec-
tive future speech recognition systems that are more robust and dynamic in nature.
Humans interact not only through speech, but vision, gaze, body gestures and
expressions contribute critically to emphasize attributes such as emotion, attitude,
mood etc. As a consequence, information exchanges via the natural sensory modes
of sight, sound and touch are steadily being accommodated through new interface
technologies. Making a machine recognize emotions from speech is not a new idea,
but the roles of these multiple modalities and their interplay for natural
interaction are still to be scientifically understood and quantified. Command and
control, dictation, transcription of recorded speech, searching audio documents
and interactive spoken dialogues are some of the potential applications of speech
recognition systems. However, several applications exist where it is beneficial for
computers to recognize human emotions. Stress monitoring, online tutoring and
diagnosis of psychological disorders are prospective areas where a computer’s
functionality may be enhanced to be more aware of the human user’s emotional
and attentional expressions.
Psychology and engineering communities are working towards the development of
automatic ways to analyze gestures, facial expressions, vocal emotions and
physiological signals in order to understand and characterize emotions, as a step
towards achieving intelligent human-computer interaction. The practice of labeling
emotions into different states, together with growing evidence of the importance of
emotions, is leading research towards pattern recognition approaches that use
different modalities as inputs to emotion recognition models. The on-going
availability of large-scale emotional speech data collections will primarily
benefit emotional speech research in future, and the improvement of theoretical
models for speech production [2] and vocal communication of emotion will gain
considerable attention [3].
Today, voice and natural language understanding are at the forefront owing to
steady progress in the technologies needed to help machines understand human
speech, including machine learning and statistical data-mining techniques. The
rapid rise in voice technology, coupled with the combination of more data and more
computing power, has resulted in increased proficiency in the emerging market for
speech interfaces. There is a growing need to integrate pervasiveness into the
process of speech recognition to make it more powerful and more accurate. It is
believed that within a few years speech technology will have to be architected to
run on wearable computers, where the user will be able to communicate without
touching the interface and the system’s response will be based on trigger words.
The availability of new generation parallel computation systems has encouraged
researchers to explore possible approaches to improving automatic speech
recognition. The evolution of these technologies can be mainly attributed to
advances in very large-scale integration (VLSI) and digital signal processing (DSP)
technologies that have allowed complex algorithms such as Hidden Markov Models
(HMMs) to be run in real time.
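To give a sense of the kind of computation an HMM involves, the sketch below
implements the standard forward algorithm, which computes the probability of an
observation sequence under a discrete HMM. It is a generic textbook formulation,
not the method of any system cited here, and the function and parameter names are
assumptions made for this illustration.

```python
def forward_probability(observations, initial, transition, emission):
    """Probability of a non-empty observation sequence under a discrete HMM.

    initial[i]       : probability of starting in state i
    transition[i][j] : probability of moving from state i to state j
    emission[i][o]   : probability of state i emitting symbol o
    """
    n_states = len(initial)
    # alpha[i] = P(observations so far, current state = i)
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[j][obs] * sum(alpha[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)
```

With an invented two-state model, initial = [0.6, 0.4], transition = [[0.7, 0.3],
[0.4, 0.6]] and emission = [[0.5, 0.5], [0.1, 0.9]], the sequence [0, 1] has
probability of approximately 0.2156. The cost grows with the sequence length times
the square of the number of states, which is why fast hardware matters when the
models are large.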
5.1 Significant Developments in Speech Recognition 37
the fusion of multimodal information for emotion recognition. Various other factors
that influence affective data collection are a result of familial, personal or
culturally learned rules. In the majority of situations the environment in which
emotions are recorded is artificial or unreal, which greatly affects the spontaneity
of the subject’s response. Moreover, if the subject knows the purpose of the
experiment, he or she will act in appropriate ways, thereby controlling the real
expression. Besides these concerns there are other social and ethical issues
involved [8]. As a consequence, real emotions are largely overlooked, and
experimentalists and theorists are increasingly shifting efforts towards robust,
context-sensitive, multimodal and adaptive analysis of human nonverbal affective
states, in order to develop systems that are able to monitor human behaviour, adapt
to the current context and user, and are perceptually aware in a ubiquitous
computing environment.
processes spoken language and adapts to non-native accents is needed for speech
recognition and understanding applications to perform and reach a level comparable
to humans. Current HMMs focus on the linguistic information and remove most of the
paralinguistic information from the speech signal. As discussed previously and
shown by speech perception experiments, paralinguistic information plays a crucial
role in human speech perception. Morgan et al. [10] discuss a parametric and
structure based approach which overcomes this limitation and exploits the knowledge
and mechanisms of human speech perception and production by taking into account the
relationship between speaking rate variations and the corresponding changes in the
acoustic features [9].
The communicative intent of a spoken utterance is hidden in its meaning, and a
representation of that meaning could provide feedback for further processing in
applications such as interrogation and emotion recognition. The past few years have
seen unprecedented growth in the computation and storage capabilities of systems,
permitting the use of large training databases with inputs from multiple knowledge
sources. The resulting effects on speech recognition and understanding have been
enormous, with streams of data becoming more heterogeneous and coming from
different modalities. Additionally, we are entering a period in which the resources
are available to integrate different modalities and to exploit heterogeneous
parallelism in algorithms and architectures in a much more significant way.
Multi-core processors allow the incorporation of detailed models of spoken
language, but implementing parallelism in novel computational architectures for
knowledge-rich speech recognition requires further research and exploration [11].
Infrastructure: A speech signal is characterised by many parameters, and thus
maintaining a large corpus becomes critical in modeling a given task, both to
improve performance and to capture the crucial information and tremendous
variability in the speech signal to be decoded. To make systems more powerful and
to understand the nature of speech itself, well-labelled, annotated speech corpora
need to be created for today’s systems to evolve. Consequently, systems must be
designed to be tolerant of labelling errors, and standard conventions for labelling
must be determined for developing future methodologies.
As the internet becomes a major channel of information exchange, large amounts of
readily accessible speech data have become available. YouTube and other media
sharing sites are a rich source of high volume audio data, which may be recorded or
streamed. These resources reflect a more spontaneous and natural form of speech
than present-day systems have typically been developed to recognize, and should
increase the robustness and transcription capabilities of future systems under a
wide range of conditions. As the knowledge sources have increased, a large number
of high quality speech tools to collect, label and process large quantities of data
have also evolved. Both open source and commercial web-based tools have become
valuable for cost-effective processing of data in many languages, and new
initiatives are aimed at eliciting huge speech corpora in different languages to
make a significant impact on the automation of speech and language itself.
Algorithms: The human speech system constantly evolves and adapts to non-native
accents and languages without explicit supervision. Current speech recognition
systems are largely statistical models with built-in knowledge that becomes
obsolete over time or in a particular real world application. They undergo
supervised training and do not learn on their own. There is a need to incorporate
self learning into speech and language processing systems so that they learn from
the data and apply the learned knowledge for specific results. The long term goal
is to create self-adaptive speech technology [9] that copes with changing
environments, dialects, accents, non-speech sounds and so on. A learning system may
perform automatic pattern discovery and generalization through machine learning and
so advance the natural language processing, information retrieval and cognitive
abilities of new, improved speech systems. Klein [12], Park [13] and Venkataraman
[14] explain unsupervised acquisition of speech and natural language across
different cultures.
Another important factor that needs to be taken into account, and that
significantly affects the speech signal, is the context in which the speech is
captured and communicated. Speaker characteristics, speaking style, acoustic
environment, channel of transmission and language characteristics [9] are some of
the factors that introduce variability into the speech signal, and controlling such
factors presents a significant challenge to the speech community. Background noise
is said to be the dominant cause of harmful variability that degrades system
performance. Various filtering techniques are currently applied to remove noise and
distortions [15]. Speaking style and dialect also vary from speaker to speaker, and
current ASR systems adapt to these variations to a certain extent by including a
large database of speakers in the training phase. This approach is very data
intensive and impractical for a large real time application. Modeling and
exploiting the speaking rate of the signal during the recognition process may be a
promising solution to this problem, making the acoustic models of ASR more robust
and effective.
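One widely used way of reducing channel distortion is cepstral mean normalization,
which belongs to the family of channel normalization techniques that reference
[15] addresses. The sketch below is a generic illustration rather than the
procedure from that paper: it subtracts the per-coefficient mean over an utterance,
so that a constant offset added by the channel is removed. The function name and
the list-of-lists feature layout are assumptions.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean across all frames of an utterance.

    frames: list of equal-length lists, one list of cepstral
            coefficients per analysis frame (hypothetical layout).
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Mean of each coefficient over the whole utterance.
    means = [sum(frame[k] for frame in frames) / n_frames for k in range(n_coeffs)]
    return [[frame[k] - means[k] for k in range(n_coeffs)] for frame in frames]
```

Because the same offset appears in every frame, it ends up in the mean and cancels
out, so an utterance and its channel-shifted copy normalize to the same features.
The technique targets constant channel effects; it does not by itself remove
additive background noise.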
To address the grand challenge tasks mentioned above, it is crucial to carry out
research and development in focused directions and promising areas, so that speech
recognition and understanding technology becomes progressively more capable and
turns a number of high-utility applications into reality.
References
1. Reeves B, Nass C (1996) The media equation: how people treat computers, television and
new media like real people and places. Cambridge University Press, Cambridge
2. Flanagan JL (1972) Speech analysis, synthesis, and perception, 2nd edn. Springer, New York
3. Scherer KR (2003) Vocal communication of emotion: a review of research paradigms.
Speech Commun 40:227–256
4. Potamianos G, Neti C, Gravier G, Garg A, Senior A (2003) Recent advances in the automatic
recognition of audiovisual speech. Proc IEEE 91(9):1306–1326
5. Scherer K (1996) Adding the affective dimension: a new look in speech analysis and synthe-
sis. In: Proceeding of international conference on spoken language processing (ICSLP 1996),
pp 1808–1811
6. Chen LS, Tao H, Huang TS, Miyasato T, Nakatsu R (1998) Emotion recognition from
audio-visual information. In: Proceedings of IEEE workshop on multimedia signal processing,
Los Angeles, CA, pp 83–88, 7–9 Dec 1998
7. De Silva L, Ng P (2000) Bimodal emotion recognition. In: Proceedings of automatic face and
gesture recognition, 2000, pp 332–335
8. Shneiderman B (1993) Human values and the future of technology: a declaration of
responsibility. In: Shneiderman B (ed) Sparks of innovation in human-computer interaction,
Ablex Publ, 1(1), Jan 1994, pp 67–71 (ACM Interactions)
9. Baker J, Deng L, Glass J, Khudanpur S, Lee C, Morgan N, O’Shaughnessy D (2009)
Developments and directions in speech recognition and understanding, Part 1 [DSP
Education]. IEEE Signal Process Mag 26(3):75–80
10. Morgan N, Zhu Q, Stolcke A, Sonmez K, Sivadas S, Shinozaki T, Ostendorf M, Jain P,
Hermansky H, Ellis D, Doddington G, Chen B, Cetin O, Bourlard H, Athineos M (2005)
Pushing the envelope—aside. IEEE Signal Process Mag 22(5):81–88
11. Olukotun K (2006) A conversation with John Hennessy and David Patterson. ACM Queue
Mag 4(10):14–22
12. Klein D (2005) The unsupervised learning of natural language structure. PhD thesis, Stanford
University
13. Park A (2006) Unsupervised pattern discovery in speech: applications to word acquisition and
speaker segmentation. PhD thesis, MIT
14. Venkataraman A (2001) A statistical model for word discovery in transcribed speech.
Comput Linguist 27(3):352–372
15. Rosenberg AE, Lee CH, Soong FK (1994) Cepstral channel normalization techniques for
HMM-based speaker verification. In: Proceedings of the IEEE international conference on
acoustics, speech and signal processing, 1994, pp 1835–1838
Chapter 6
Where Speech Recognition Is Going:
Conclusion and Future Scope
Abstract Today, voice and natural language processing are at the forefront of any
human-machine interaction environment. The chapter emphasizes the tremendous
progress in machine learning, statistical data-mining and pattern recognition
approaches that can help make speech interfaces more versatile and pervasive. It
also warns of the impediments that may stand in the way of successfully
implementing acoustically robust natural interfaces as the requirements on speech
interfaces grow. Finally, the chapter underlines the technical advances and research
efforts needed for high-performance real-time speech recognition that will
completely change the way humans interact with their computing devices.
Since the invention of computers and the emergence of technologies that enable
more natural ways of interacting with them, speech recognition technologies have
come a long way, have become more approachable to people, and play a substantial
role in this technological evolution. Since 1990, their performance has improved
dramatically and has reached a level where a ubiquitous user interface is possible
with little or no hardware requirement. During the 1990s and 2000s,
state-of-the-art speech recognition systems used evolved HMM variants,
perceptually motivated versions of cepstral or linear predictive coding feature
vectors, and sophisticated pattern matching and scoring algorithms.
Steady progress in technologies such as machine learning and statistical
data-mining has provided the much-needed thrust to help machines understand
human speech. Devices are becoming more aware and smart,
making it possible to blend complex modeling approaches of powerful symbolic
processing, machine learning that take advantage of big data and knowledge
One of the major obstacles standing in the way of ASR is the lack of an
unambiguous boundary between paralinguistic and prosodic information. Today's
systems are unable to process prosodic cues effectively and infer speaker-specific
qualities such as age, emotion and attitude. Spontaneous human speech encodes
linguistic as well as interpersonal streams of information [2]. The former carries
the verbal meaning of the message, and the latter enriches the speech with
paralinguistic cues. Nonverbal sounds, or nonverbal vocalisations as they are
technically called, are one of the characteristics that distinguish spontaneous
speech from written text, and modeling them can improve speech recognition
accuracy. They occur frequently in spontaneous speech and can provide valuable
paralinguistic cues.
Researchers have divided speech information into various categories over
the years. Laver [3] suggests that paralinguistic signals denote affective
information through tone of voice, whereas extralinguistic signals refer to voice
qualities that identify the individual speaker. Roach et al. [4] define
paralinguistic features as those that are intentional and non-linguistic features as
those that are unintentional. They further place prosodic features that
unambiguously signal linguistic information at one end and vocal features
independent of pitch, such as voice quality, on the
paralinguistic end. Roach et al. [4] divide prosodic features further into tempo,
pitch range, rhythm, pause and intonation. According to Crystal [5], on the other
hand, prosodic features are characterised by variations in pitch, loudness, duration
and silence. Carlson [6] uses the term extralinguistic for inhalation, exhalation
and hesitation and refers to attitudes and emotions as extralinguistic [7]. As can be
seen from the various classifications, prosody can signal both linguistic and
paralinguistic information. Broadly, prosodic information in speech may be divided
into linguistic information, which includes verbal content and lexical stress, and
paralinguistic information, comprising attitude, intention and emotional state.
Thus, one may regard paralinguistic phonetics as a subset of prosody. Fig. 6.1
gives a broad distinction between prosody and paralinguistic phonetics.
Further research has been devoted to nonverbal sounds such as filled pauses and
silence duration and to linguistic disfluencies such as repetitions and fillers. It
confirms that they have a systematic, non-random nature and that modeling them as
regular words improves recognition performance [8]. The results with filled pauses
illustrate their role as markers of linguistic boundaries in modeling more natural
human speech [9, 10].
Prylipko et al. [11] have investigated the potential of a wide range of nonverbals
for language modeling of conversational speech. They conclude that nonverbal
tokens lower the overall perplexity of the full test data, while including nonverbals
in the model as regular words increases the perplexity of verbal tokens. Also,
modeling breath as a regular language model event leads to a substantial
improvement in both perplexity and speech recognition accuracy. Filled pauses
have been shown to play a crucial role as markers of prosodic and linguistic
segment boundaries and to be better predictors of the following words; ignoring
or omitting them from the context makes local perplexity worse. These tokens can
significantly enrich transcriptions with paralinguistic information, which may
further enhance natural speech processing and understanding.
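To make the notion of perplexity used above concrete, the following is a minimal sketch that treats filled pauses as regular words in an add-one smoothed bigram model and compares it with a model from which they have been stripped. The toy sentences, the token inventory in FILLED_PAUSES and the smoothing scheme are illustrative assumptions. They do not reproduce the setup of [11], and the resulting numbers carry no empirical meaning.

```python
import math
from collections import Counter

FILLED_PAUSES = {"uh", "um"}  # hypothetical token inventory

def train_bigram(sentences):
    """Collect history and bigram counts; <s> and </s> mark sentence edges."""
    history_counts, bigram_counts, vocab = Counter(), Counter(), {"</s>"}
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(tokens)
        history_counts.update(padded[:-1])
        bigram_counts.update(zip(padded[:-1], padded[1:]))
    return history_counts, bigram_counts, vocab

def perplexity(sentences, model):
    """Per-token perplexity under an add-one smoothed bigram model."""
    history_counts, bigram_counts, vocab = model
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    log_prob, num_tokens = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, cur in zip(padded[:-1], padded[1:]):
            p = (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + vocab_size)
            log_prob += math.log(p)
            num_tokens += 1
    return math.exp(-log_prob / num_tokens)

def strip_pauses(sentences):
    """Remove filled pauses, as a recognizer that ignores them would."""
    return [[t for t in s if t not in FILLED_PAUSES] for s in sentences]

# Toy data (assumption): transcripts with filled pauses kept as tokens.
train = [["i", "uh", "think", "so"],
         ["uh", "we", "could", "go"],
         ["i", "think", "we", "could"]]
test = [["uh", "i", "think", "we", "could", "go"]]

ppl_with_pauses = perplexity(test, train_bigram(train))
ppl_without_pauses = perplexity(strip_pauses(test), train_bigram(strip_pauses(train)))
print(f"with pauses: {ppl_with_pauses:.2f}, without: {ppl_without_pauses:.2f}")
```

The two figures are computed over different token sequences, so a fair comparison would, as in the text above, also report the perplexity of the verbal tokens alone.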
Research and development in speech recognition has been under way for several
decades and continues to be an active area of research. Speech corpora have
evolved from small, private collections to large, publicly available corpora recorded
under more realistic conditions. Such extensive corpora allow the field to evolve
and mature to the point that applications such as access control based on speaker
verification, a powerful biometric, are making commercial headway, and benchmark
evaluations and research on realistic data shift the research and development
effort to unconstrained situations. Obtaining speech from a wide variety of
channels and acoustic environments allows real-world robustness to be integrated
and new and improved compensation techniques to be developed.
The emergence of variable noise conditions, channels, text-independent speech
and multi-speaker speech indexing emphasizes the need to understand and study
unconstrained situations and tasks. Current systems use low-level spectrum
features, which are susceptible to channel effects and other noises. Systems that
are being developed these days must exploit high-level features such as prosodic
measures and idiolect, which offer improved accuracy and robustness.
The conditions described above call on the entire speech research community to
advance current techniques to achieve the desired technological sophistication.
The most significant paradigm shift required in this direction is the move to
discriminative training of statistical models, for example with the maximum mutual
information criterion, which aims at minimum error [18, 19]. Basic HMM-like
acoustic models, by contrast, rely on likelihood as the matching criterion. Over
the past decade, HMM technology has advanced incrementally to the point where
segmental models [20–24] and structured speech and language models [24–26] are
being employed for commercial use. Linear Discriminant Analysis (LDA) and
Heteroscedastic LDA (HLDA) for feature extraction, determinization and
minimization in decoding graph compilation, and discriminative training are some
of the major recent advances in this area that produce impressive results for many
speakers under different conditions. HLDA [27] and neural net-based features [28]
allow multiple types of feature-based transformations that can be applied both in
parallel and sequentially.
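The LDA feature transformation mentioned above can be sketched as follows. This is a minimal illustration of plain LDA only: it assumes a shared within-class covariance, whereas HLDA [27] relaxes that assumption and is not implemented here. The function name, the synthetic two-class data and the output dimension are illustrative assumptions.

```python
import numpy as np

def lda_projection(features, labels, out_dim):
    """Return a (dim, out_dim) matrix maximizing between- over within-class scatter."""
    dim = features.shape[1]
    global_mean = features.mean(axis=0)
    s_within = np.zeros((dim, dim))
    s_between = np.zeros((dim, dim))
    for c in np.unique(labels):
        x = features[labels == c]
        class_mean = x.mean(axis=0)
        centered = x - class_mean
        s_within += centered.T @ centered
        diff = (class_mean - global_mean)[:, None]
        s_between += len(x) * (diff @ diff.T)
    # Whiten with S_w^(-1/2) so that a symmetric eigenproblem remains.
    # This assumes S_w is positive definite (enough data per dimension).
    evals_w, evecs_w = np.linalg.eigh(s_within)
    whiten = evecs_w @ np.diag(1.0 / np.sqrt(evals_w)) @ evecs_w.T
    evals, evecs = np.linalg.eigh(whiten @ s_between @ whiten)
    order = np.argsort(evals)[::-1][:out_dim]  # largest eigenvalues first
    return whiten @ evecs[:, order]

# Synthetic demo (assumption): two classes that differ along the first axis.
rng = np.random.default_rng(0)
class_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(100, 3))
class_b = rng.normal(loc=[4.0, 0.0, 0.0], scale=1.0, size=(100, 3))
features = np.vstack([class_a, class_b])
labels = np.array([0] * 100 + [1] * 100)

projection = lda_projection(features, labels, out_dim=1)
projected = features @ projection
print(projection.shape, projected.shape)
```

In a recognizer the classes would typically be phonetic or HMM-state labels and the inputs spliced spectral frames; that choice is not specified in the text and is left open here.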
To cope with the wide range of variable conditions in channel, noise, speaker,
vocabulary, accent and recognition context, adaptation models that are more
closely tuned to the individual speaker and environment have been studied
intensively. They enable rapid application integration and are key to the
successful commercial deployment of speech recognition technology. Maximum a
Posteriori (MAP) estimation [29] is the simplest acoustic adaptation technique in
this area. Maximum Likelihood Linear Regression (MLLR) [30], which adjusts the
Gaussians and feature vectors so as to increase the data likelihood, is another
popular adaptation technique.
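The idea behind MAP adaptation can be illustrated with the commonly used update of a single Gaussian mean, which interpolates between the speaker-independent prior mean and the statistics of the adaptation data. The sketch below makes several assumptions: only the mean is adapted, the occupation probabilities are given, and the prior weight tau is a hypothetical value. It is not a full reproduction of the estimator in [29].

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP update of one Gaussian mean.

    prior_mean: (dim,) speaker-independent mean.
    frames:     (num_frames, dim) adaptation feature vectors.
    posteriors: (num_frames,) occupation probability of this Gaussian per frame.
    tau:        prior weight (assumed value); larger means more trust in the prior.
    """
    occupancy = posteriors.sum()
    weighted_sum = posteriors @ frames
    return (tau * prior_mean + weighted_sum) / (tau + occupancy)

# Demo (assumption): ten adaptation frames at 1.0, prior mean at 0.0.
prior_mean = np.zeros(3)
frames = np.ones((10, 3))
posteriors = np.ones(10)

adapted_mean = map_adapt_mean(prior_mean, frames, posteriors, tau=10.0)
print(adapted_mean)

# With no adaptation data the estimate falls back to the prior mean.
no_data_mean = map_adapt_mean(prior_mean, np.zeros((0, 3)), np.zeros(0))
```

With little adaptation data the estimate stays close to the prior, and with more data it moves toward the sample mean. MLLR instead estimates a shared linear transform of the means and is not shown in this sketch.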
Consequently, developing and evaluating complex real-time algorithms and
processing the large public speech corpora requires computational infrastructure
that supports these statistical models and algorithm development and can keep pace
with ever-increasing capabilities. To intelligently extract information from this
huge database of speech, efficient knowledge representation techniques that allow
multiple sources of knowledge to be incorporated into a common
References
8. Schultz T, Rogina I (1995) Acoustic and language modeling of human and nonhuman noises
for human-to-human spontaneous speech recognition. In: Proceedings of ICASSP, IEEE, vol
1, Detroit, pp 293–296
9. Siu M, Ostendorf M (1996) Modeling disfluencies in conversational speech. In: Proceedings
of the 4th international conference on spoken language processing (ICSLP-96), vol I,
Atlanta, pp 386–389
10. Siu MH, Ostendorf M (2000) Variable N-grams and extensions for conversational speech lan-
guage modeling. IEEE Trans Speech Audio Process 8(1):63–75
11. Prylipko D, Vlasenko B, Stolcke A, Wendemuth A (2012) Language modeling of nonverbal
vocalizations in spontaneous speech. In: Proceedings of 15th international conference on
text, speech and dialogue, 2012. LNCS 7499. Springer, Heidelberg, pp 4625–4628
12. Mary ZJ, Tian X, Woods KJ, Poeppel D (2015) Multiple levels of linguistic and paralinguis-
tic features contribute to voice recognition. Sci Rep 5:11475
13. Schötz S (2002) Linguistic & paralinguistic phonetic variation in speaker recognition & text-
to-speech synthesis. GSLT papers: speech technology 1
14. Furui S (1997) Recent advances in speaker recognition. Pattern Recogn Lett 18(9):859–872
15. Klatt D (1987) Review of text-to-speech conversion for English. J Acoust Soc Am
82:737–783
16. Roach P (2000) The emotion in speech project. In: Proceedings of the ISCA workshop on
speech and emotion. Newcastle, Northern Ireland, Sept 2000, pp 53–59
17. Gustafson-Capková S (2001) Emotions in speech: tagset and acoustic correlates. Term
paper in speech technology 1, Swedish National Graduate School of Language Technology
(GSLT), Stockholm University, Department of Linguistics
18. Bahl L, Brown P, de Souza P, Mercer R (1986) Maximum mutual information estimation of
hidden Markov model parameters for speech recognition. In: Proceedings of the IEEE inter-
national conference on acoustics, speech, and signal processing, Tokyo, Japan, pp 49–52
19. He X, Deng L, Wu C (2008) Discriminative learning in sequential pattern recognition. IEEE
Signal Process Mag 25(5):14–36
20. Deng L (1993) A stochastic model of speech incorporating hierarchical nonstationarity. IEEE
Trans Speech Audio Process 1(4):471–475
21. Deng L, Aksmanovic M, Sun D, Wu J (1994) Speech recognition using hidden Markov mod-
els with polynomial regression functions as nonstationary states. IEEE Trans Speech Audio
Process 2:507–520
22. Poritz A (1998) Hidden Markov models: a guided tour. In: Proceedings of the international
conference on acoustics, speech, and signal processing, vol 1, Seattle, WA, pp 1–4
23. Glass J (2003) A probabilistic framework for segment-based speech recognition. In: Russell
M, Bilmes J (eds) New computational paradigms for acoustic modeling in speech recogni-
tion, computer, speech and language (special issue), vol 17(2–3), pp 137–152
24. Deng L, Yu D, Acero A (2006) Structured speech modeling. IEEE Trans Audio, Speech Lang
Process (special issue on Rich Transcription) 14(5):1492–1504
25. Chelba C, Jelinek F (2000) Structured language modeling. Comput Speech Lang 14:283–332
26. Wang Y, Mahajan M, Huang X (2000) A unified context-free grammar and n-gram model
for spoken language processing. In: Proceedings of the international conference on acoustics,
speech, and signal processing, Istanbul, Turkey, vol 3, pp 1639–1642
27. Kumar N, Andreou A (1998) Heteroscedastic analysis and reduced rank HMMs for improved
speech recognition. Speech Commun 26:283–297
28. Morgan N, Zhu Q, Stolcke A, Sonmez K, Sivadas S, Shinozaki T, Ostendorf M, Jain P,
Hermansky H, Ellis D, Doddington G, Chen B, Cetin O, Bourlard H, Athineos M (2005)
Pushing the envelope—Aside. IEEE Signal Process Mag 22:81–88
29. Gauvain J-L, Lee C-H (1997) Maximum a posteriori estimation for multivariate Gaussian
mixture observations of Markov chains. IEEE Trans Speech Audio Process 7:711–720
30. Leggetter C, Woodland P (1995) Maximum likelihood linear regression for speaker adapta-
tion of continuous density hidden Markov models. Comput Speech Lang 9:171–185
Index
C
Communication, 17–19
Coverbal, 19, 20
Coverbal behaviours, 19
E
Emotion, 36–39
Emotional markers, 11
Emotional speech recognition, 35
Extralinguistic, 44
Extroversion, 13
G
Gesture, 20, 21
H
Hidden Markov model (HMM), 36, 37, 39, 43, 44, 47
Human computer intelligent interaction (HCII), 35
L
Linear discriminant analysis (LDA), 47
Linguistic, 2, 3, 5, 18–21
M
Machine learning, 43
Major auditory attribute, 9
Mel-frequency cepstral coefficients (MFCC), 48
Modality theory, 29
Multimodal, 37
Multimodal interfaces, 26
Multimodality, 25–27, 30
N
Neuroticism, 13
Nonverbal, 1, 2, 20–22
O
Obstacles, 44
P
Paralanguage, 2–4
Paralanguage in ASR, 45
Paralinguistic, 44–46
Personality markers, 12
Pervasive, 36
Phonemes, 9
Phonetics, 45, 46
Phonological, 18
Physiology, 13
Pitch, 9–11, 13, 14
Pitch range, 13
Potential benefits, 26
Prosody, 3, 5, 9–11
Proxemics, 2
S
Segmental, 3
Self-adaptive, 40
Semantic, 18
Significant developments, 36
Speech corpora, 46, 47
Speech markers, 11
Speech recognition, 36, 38, 43, 46
Spoken dialogue systems, 25, 27, 28
Suprasegmental, 3
Syntactic, 18
T
Technical advances, 46
U
Understanding, 20, 38
V
Vocalizations, 4, 5