
SPRINGER BRIEFS IN ELECTRICAL AND COMPUTER ENGINEERING
SPEECH TECHNOLOGY

Swati Johar

Emotion, Affect and Personality in Speech
The Bias of Language and Paralanguage
SpringerBriefs in Electrical and Computer
Engineering

Speech Technology

Series editor
Amy Neustein, Fort Lee, NJ, USA
Editor’s Note

The authors of this series have been hand-selected. They comprise some of the most outstanding scien-
tists—drawn from academia and private industry—whose research is marked by its novelty, applicabil-
ity, and practicality in providing broad based speech solutions. The SpringerBriefs in Speech Technology
series provides the latest findings in speech technology gleaned from comprehensive literature reviews
and empirical investigations that are performed in both laboratory and real life settings. Some of the top-
ics covered in this series include the presentation of real life commercial deployment of spoken dialog
systems, contemporary methods of speech parameterization, developments in information security for
automated speech, forensic speaker recognition, use of sophisticated speech analytics in call centers, and
an exploration of new methods of soft computing for improving human-computer interaction. Those in
academia, the private sector, the self service industry, law enforcement, and government intelligence,
are among the principal audience for this series, which is designed to serve as an important and essen-
tial reference guide for speech developers, system designers, speech engineers, linguists and others. In
particular, a major audience of readers will consist of researchers and technical experts in the automated
call center industry where speech processing is a key component to the functioning of customer care
contact centers.

Amy Neustein, Ph.D., serves as Editor-in-Chief of the International Journal of Speech Technology
(Springer). She edited the recently published book “Advances in Speech Recognition: Mobile Environ-
ments, Call Centers and Clinics” (Springer 2010), and serves as guest columnist on speech processing
for Womensenews. Dr. Neustein is Founder and CEO of Linguistic Technology Systems, a NJ-based think
tank for intelligent design of advanced natural language based emotion-detection software to improve
human response in monitoring recorded conversations of terror suspects and helpline calls. Dr. Neus-
tein’s work appears in the peer review literature, in industry and mass media publications. Her academic
books, which cover a range of political, social and legal topics, have been cited in the Chronicle of
Higher Education, and have won her a Pro Humanitate Literary Award. She serves on the visiting faculty
of the National Judicial College and as a plenary speaker at conferences in artificial intelligence and
computing. Dr. Neustein is a member of MIR (machine intelligence research) Labs, which does advanced
work in computer technology to assist underdeveloped countries in improving their ability to cope with
famine, disease/illness, and political and social affliction. She is a founding member of the New York City
Speech Processing Consortium, a newly formed group of NY-based companies, publishing houses, and
researchers dedicated to advancing speech technology research and development.

More information about this series at http://www.springer.com/series/10043


Swati Johar

Emotion, Affect and Personality in Speech
The Bias of Language and Paralanguage

Swati Johar
Defence Institute of Psychological
Research
DRDO, Ministry of Defence
New Delhi
India

ISSN  2191-8112 ISSN  2191-8120  (electronic)


SpringerBriefs in Electrical and Computer Engineering
ISSN  2191-737X ISSN  2191-7388  (electronic)
SpringerBriefs in Speech Technology
ISBN 978-3-319-28045-5 ISBN 978-3-319-28047-9  (eBook)
DOI 10.1007/978-3-319-28047-9

Library of Congress Control Number: 2015958312

© The Author(s) 2016


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained
herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by SpringerNature


The registered company is Springer International Publishing AG Switzerland
Contents

1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Paralanguage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Psychology of Voice. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Pitch as a Major Auditory Attribute. . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Speech Markers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Emotional Markers in Speech . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Personality Markers in Speech. . . . . . . . . . . . . . . . . . . . . . . . 12
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Language, Communication and Human Behaviour . . . . . . . . . . . . . . . 17
3.1 Language and Interpersonal Communication. . . . . . . . . . . . . . . . . . . 18
3.2 Language and Coverbal Behaviours. . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Understanding Nonverbal Behaviour. . . . . . . . . . . . . . . . . . . . . . . . . 20
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4 Multimodality and Spoken Dialogue Systems . . . . . . . . . . . . . . . . . . . . 25
4.1 Potential Benefits of Multimodal Interfaces. . . . . . . . . . . . . . . . . . . . 26
4.2 Multimodality in Spoken Dialogue Systems. . . . . . . . . . . . . . . . . . . 27
4.3 Future Directions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5 Emotional Speech Recognition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
5.1 Significant Developments in Speech Recognition. . . . . . . . . . . . . . . 36
5.2 Future Directions in Speech Recognition and Understanding. . . . . . 38
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


6 Where Speech Recognition Is Going: Conclusion


and Future Scope. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.1 Obstacles in the Implementation and Acceptance of ASR. . . . . . . . . 44
6.2 Role of Paralanguage in ASR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.3 Technical Advances in Speech Recognition. . . . . . . . . . . . . . . . . . . . 46
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
About the Author

Swati Johar is Scientist ‘C’ at the Defence Institute of Psychological Research
(DIPR), Delhi. She is involved in many major research projects from an inter-
disciplinary perspective, including research on image and signal processing. She
completed her M.Tech at BITS, Pilani, and her research work on gestures and
speech recognition has been published in reputed international journals and is
proposed to be integrated with the New Selection System being developed for the
Indian Armed Forces. Emotion recognition and non-verbal behaviour are among
her areas of interest, and she has published scientific articles in peer-reviewed
journals. She has authored several book chapters dealing with human-computer
interaction and technological emergence, and has been an active member of the
Institute of Electrical and Electronics Engineers (IEEE) for more than 3 years.

Chapter 1
Introduction

Abstract  The ability to interpret any kind of human communication correctly is
considered an essential competency of any individual. Human behaviour is influ-
enced by various internal as well as external factors such as genetics, culture,
social norms, attitude etc. and is impacted by certain traits each individual has.
Non-verbal communication possesses immense potential to emphasize the mean-
ing of spoken words through thoughts and feelings especially when the conversa-
tion is not face to face. This chapter elucidates the science of paralanguage as a
means to extract the unintentional meanings from utterances. Various paralinguis-
tic cues have been investigated for discrete emotion recognition and evolution of
language and communication systems.

Keywords Nonverbal · Paralanguage · Kinesics · Proxemics · Prosody · 
Suprasegmental  ·  Auditory cues  · Vocalizations · Linguistic · Segmental

Good communication is the foundation of any successful personal or professional
association. The ability to communicate effectively, both verbally and non-verbally,
ultimately defines an individual. Understanding the different aspects of verbal and
non-verbal communication, and the important roles they play in our interactions
with others becomes the first step to enhancing communication and making it rich
and effective.
Verbal communication represents the literal content of a message, involving
words, spoken, written or signed and is based on an organized structure of words
of a given language. The nonverbal component communicates how the message
is interpreted. It can affect people’s perceptions and exchanges in subtle but sig-
nificant ways. In our interactions with others, we need to interpret a wide range of
social signals to understand the intentions and feelings of others. The implicit or
actual meaning behind someone’s words may be entirely different than the literal
translation.
All of our verbal and nonverbal behaviour create an intricate communication
system through which humans know and understand each other. Many linguists

and psychologists believe that human language evolved from a system of nonverbal
communication [1]. Tannen [2] estimates that as much as 90 % of all human com-
munication is nonverbal, whereas Noam Chomsky and many others argue that verbal
language is an advanced and refined form of an inherited nonlinguistic system [3].
We continuously give and receive wordless signals like body language, which
includes facial expressions, eye contact and posture etc. The gestures we perform,
the way we sit, the way we talk, how close we stand and how much eye contact
we make reflect our real feelings and intentions. The sound of our voice, including
pitch, tone and volume are also forms of non-verbal communication. Nonverbal
communication focuses on information that is shared but does not contain words,
for example, messages sent through body motions; vocal qualities; and the use of
time, space, artifacts, dress, and even smell. This nonverbal content emphasizes,
complements and substitutes the verbal message. Therefore, it can be said that
communication is a continuous process wherein a considerable portion of our
behaviour is implicit, learned and innate.
According to Wertheim [4], non-verbal communication may be used to express
emotions, communicate interpersonal relationships, support verbal interaction,
reflect personality and perform rituals, such as greetings and goodbyes.
Unlike verbal messages, no formal set of alphabets or grammar exists for non-
verbal codes. But nonverbal communication may be classified on the basis of the
various channels through which nonverbal signals are sent. These channels are
kinesics, proxemics, haptics, olfactics, physical appearance and paralanguage.
Kinesic behaviour, or body movement, includes gestures, hand and arm move-
ments, leg movements, facial expressions, eye gaze and blinking, and stance or
posture. It deals with the interpretation of any part of the body or the body as a
whole. Knapp and Hall [5] point out that perhaps more than any other part of the
body, the face has the highest nonverbal sending capacity. Through facial expres-
sions, we can communicate our personality; open and close channels of communi-
cation; complement other nonverbal behaviour; and communicate emotional states
[6]. Proxemics is the study of interrelated observations of man’s use of personal
space to communicate with others by virtue of the relative position of bodies. Personal
space during conversation plays an important role and several studies show that
culture, ethnicity and sex greatly influence and affect proxemic distances [7, 8].
Haptic communication is a form of nonverbal communication that describes how
we communicate through the use of touch. Like proxemics, it varies widely across
cultures, sex, situation and relationship of the people involved. Olfactics refers to
people’s sense of smell. There are certain qualities or thoughts attributed to the
specific scents that give them their moral significance. Olfactic research is the least
understood of all due to the lack of sufficient smell vocabulary. It is often said
that the first impression about a person is formed from his physical appearance.
People dress differently across cultures and a person’s appearance communicates
the person’s age, sex and even status. This form of nonverbal channel of commu-
nication may be manipulated easily and may result in negative sanctions. Each of
these dimensions discussed above explains differences in nonverbal communication
across cultures.

1.1 Paralanguage

Of all the channels of communication, paralanguage has been studied and
explored for many years now and has proven to play a significant role in non-
verbal communication. It is a non-linguistic component in human interaction
that accompanies speech and may be expressed consciously or unconsciously.
Any communication that uses explicit verbal forms is characterized as linguistic.
Linguists treat speech as a fascinating phenomenon because it can communi-
cate meaning both by the meaning of the words said and also by how they are
said. Paralanguage is a non-linguistic and non-verbal communication behaviour in
human interaction that modifies meaning and conveys speaker’s feeling and emo-
tion. It is a vocal (sometimes non-vocal) phenomenon useful to envision intrap-
ersonal communication occurring in the mind of the individual. Vocal qualities
like prosody, stress, intonation etc., beyond the basic verbal message that usually
accompany speech constitute the paralinguistic properties and play an important
role in human speech communication. The study of paralanguage is known as par-
alinguistics, a term coined by Trager in the 1950s [6]. It is believed that the
conversational use of spoken language cannot be properly understood unless para-
linguistic elements are taken into account [9].
A study by psychologist Mehrabian [10] suggests that approximately 38 %
of the impact in most conversations derives from how things are said. Emotional
cues carry important information about the intentions and feelings of others. Even
though a wealth of research has taken place in understanding facial expressions of
emotions, only a handful of studies on the developmental trajectory of interpreting
affective cues in the voice have been carried out, and the area is still immature.
Comparing the presentation of affective information from facial and auditory cues,
several studies have suggested that the auditory modality may be especially vital
for the communication of emotions [11, 12].
The suprasegmental characteristics of speech such as pitch, rhythm, stress and
loudness, called prosody in general are a rich source of information in spoken lan-
guage and tell about the internal state of a speaker, especially the affective state.
They go beyond segmental level analysis which concerns individual sounds or
phonemes and deal with the auditory qualities of sound. Tone, pitch and intonation
convey emotions or politeness. Stress and pauses indicate importance and confi-
dence. All of these features possess linguistic differences as well that can alter the
meanings of messages. Though suprasegmental information is often considered to
be an important component of paralanguage, the two are not synonymous: par-
alanguage can be conveyed via both suprasegmental and segmental information.
The paralinguistic properties of speech play an important role in human speech
communication. There are no utterances or speech signals that lack paralinguistic
properties, since speech requires the presence of a voice that can be modulated.
This voice must have some properties, and all the properties of a voice as such are
paralinguistic. In recent years, researchers have concentrated their studies on the
effects various vocal characteristics have upon listeners and found that listeners,
to a significant extent, can judge a speaker’s age, sex, race, education, status, geo-
graphic origin, and emotional disposition. Often, paralinguistic qualities, vocaliza-
tions, and nonfluencies reveal a speaker’s emotional state and/or veracity.
According to Knapp and Hall [5], paralanguage may be divided into voice
qualities that include pitch, rhythm, tempo, articulation, and resonance of the
voice and vocalizations that include laughing, crying, sighing, belching, swallow-
ing, clearing of the throat, snoring etc. Trager [6], on the other hand, defines three
kinds of vocalizations: (1) Vocal characterizers include non-language sounds such
as laughing, giggling, yelling, whining and voice breaking. (2) Voice qualifiers are
the qualifying portions of language material and include intensity, pitch height and
extent. (3) Vocal segregates include vocal fillers (such as uh-uh, ooh, hmm), silent
pauses, and other hesitation phenomena.
Various acoustic cues have been observed for discrete emotions and the rela-
tionship between acoustic correlates and acted emotions may be summarised as
follows:
Pitch is the perceived frequency of sound and can affect social meanings. The
different sizes of the vocal folds of men and women influence pitch range, so that
adult male voices are usually lower pitched than female voices. The speaker uses
different patterns of pitch, consciously or unconsciously, to convey different mean-
ings. Changes in pitch, or the pitch contour, convey shades of meaning such as sur-
prise, tension or emphasis.
Pause refers to rest, hesitation, or a temporary stop. Every language has pat-
terns of pauses which vary the tempo of speech. Pauses may enhance deliv-
ery or be filled unnecessarily and distract. Depending on the duration and frequency
of pauses in an expression of speech, one may infer the state of a person. Too
many pauses indicate a lack of control and an over-emotional expression, whereas
silence, a pause of longer duration, is considered a failure of dialogue.
Moreover, a person who talks little is considered shy and uncommunicative. A dif-
ferent category of pauses, called ineffective pauses, conveys various facets of the
personality and affective state of the speaker [13]:
Filled pauses are hesitation sounds that speakers employ to maintain con-
trol of a conversation. They do not alter the meaning of the sentence, do not
add any information and indicate lack of certainty in what is being said. Uh,
um, ah are some of the fillers that are used globally to reformulate or rephrase
representations.
Interjections are short verbal utterances that do not convey any meaning and
are used as fillers. These are a kind of disfluency, occurring with or without
tension, and are very common; for example, um, like, I mean, uh. They may
occur either at the start or in the middle of a sentence.
Speech disfluencies are breaks, irregularities, or non-lexical vocables that occur
within the flow of otherwise fluent speech, including false starts (words and sen-
tences that are cut off in the middle), phrases that are restarted and repeated,
grunts, or fillers like uh, erm, and well.
Stress relates to the relative emphasis placed on certain words within sentences
and conveys both meaning and emotion. There are various ways in which stress
manifests itself in the speech stream and affects articulation.
Volume is the perceived loudness of the speaker and is generally used to show
emotions such as fear or anger. It is a measure of the physical strength of the signal
and also plays a role in influencing the affective state of an individual.
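To make these cues concrete, the following minimal Python sketch (not taken from the text; the frame length, hop size and silence threshold are assumed values chosen only for illustration) shows how short-term energy, a common proxy for volume, and a crude silent-pause ratio might be computed from a mono speech signal.

import numpy as np

def short_term_energy(signal, frame_len=400, hop=160):
    """Per-frame energy, e.g. 25 ms frames with a 10 ms hop at 16 kHz."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    return np.array([np.sum(f.astype(float) ** 2) for f in frames])

def pause_ratio(energy, threshold_ratio=0.05):
    """Fraction of frames whose energy falls below a small fraction of the
    peak energy, used here as a crude proxy for silent pauses."""
    threshold = threshold_ratio * energy.max()
    return float(np.mean(energy < threshold))

# Example on a synthetic signal: 1 s of noise followed by 1 s of near-silence.
sr = 16000
speech = np.concatenate([np.random.randn(sr), 0.001 * np.random.randn(sr)])
e = short_term_energy(speech)
print(f"pause ratio = {pause_ratio(e):.2f}")   # roughly 0.5 for this signal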
Traditionally, research has been conducted to explore the nonverbal acoustic
manifestations of emotions in a corpus of acted speech. It has become possible to
isolate the role of acoustic information in conveying emotion and several studies
have even made use of nonsensical utterances to factor out all possible lexical or
semantic effects [14–17].
The social implications of all these prosodic features are subject to cultural
interpretation. However, prosody is related to various levels of information, from
linguistic, paralinguistic, to non-linguistic, and, therefore, its acoustic manifesta-
tion is rather complicated with large variations.
In most current speech recognition systems, prosodic features are not utilized
to their full potential. The role of prosodic features in speech recognition increases in
the case of spontaneous speech. As discussed, speech includes a number of irregu-
larities and hesitations which degrade the performance of recognition systems. As
a result, extracting the prosodic features of these segments or frames of speech
may yield different and useful outcomes that may further aid the recognition
process. From this viewpoint, detection of utterance irregularities becomes important
for future work.
To provide an aid for psychological research in this area, a study was taken up
for measuring aspects of non-verbal behaviour through speech [18]. This research
explores the various indicators for non-verbal cues of speech and provides a
method of building a paralinguistic profile of these speech characteristics. The
scope of the study involves short-term acoustic feature extraction, decision mak-
ing and analysis of various speech samples based on the sections of silence present
and the pitch contours. The values obtained for average pitch, short-term energy,
silence duration, rate of speech and loudness were analyzed for different possibilities,
and a decision output was generated based on these five paralinguistic parameters,
classifying the individual as confident, tensed, happy or sad.
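The cited study reports only the five parameters and the four output labels, not its decision rules. The hypothetical Python sketch below merely illustrates what such a rule-based decision step could look like; the thresholds and rules are invented for illustration and are not taken from the cited work.

from dataclasses import dataclass

@dataclass
class ParalinguisticProfile:
    avg_pitch_hz: float        # average fundamental frequency
    short_term_energy: float   # normalised 0..1
    silence_duration_s: float  # total silent time in the sample
    speech_rate_wps: float     # words (or syllables) per second
    loudness: float            # normalised 0..1

def classify(p: ParalinguisticProfile) -> str:
    """Map the five paralinguistic parameters to one of four coarse labels."""
    if p.silence_duration_s > 2.0 and p.speech_rate_wps < 1.5 and p.loudness < 0.3:
        return "sad"
    if p.avg_pitch_hz > 220 and p.speech_rate_wps > 3.5 and p.silence_duration_s < 0.5:
        return "tensed"
    if p.avg_pitch_hz > 180 and p.loudness > 0.6 and p.short_term_energy > 0.6:
        return "happy"
    return "confident"

print(classify(ParalinguisticProfile(150, 0.7, 0.4, 2.5, 0.7)))  # -> "confident" under these made-up rules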
As evident from above, the investigator must consider internal as well as
external factors which influence behaviour before interpreting a subject’s behav-
iour. The relative contribution of different cultural dimensions to emotion infer-
ences from vocal expressions must be examined in future work. The phenomenon
of vocal expression and content-free utterances is of particular interest due to its
likely roots in nonhuman primate vocalizations and the evolution of language and
communication systems. The role of paralinguistic vocalizations has already been
explored in diverse applications: in cathartic experiences where individuals express
themselves more openly [19], in expressive communication, which enhances
participants' cognitive skills and enables some speech-disabled participants to access
words through the melody [20], and in voice and speech therapy [21], in addition to
the use of paralinguistic voice input for maintaining a near real-time experience.
Paralinguistic vocal control may be affected by a number of factors includ-
ing social context, background, physical impairment, state of mind etc. and there
are many aspects of voice which have not yet been tapped and exploited fully to
enhance the user’s experience and speech recognition techniques. Vocal input may
also serve as a useful therapeutic intervention for various disorders and inclusion
of non-speech vocal control may become necessary for reliable speech recognition
in cases of speech impairment. These limitations can be overcome by developing
new generation methods and techniques that will bring a fundamental shift in the
conventional procedures and enable users to interact with a real world interface in
synergy with the technology.

References

1. Neuliep JW (2005) The nonverbal code. Intercultural communication: a contextual approach,
3rd edn. Sage Publications, Inc, pp 285–330
2. Tannen D (1986) That's not what I meant: how communication style makes or breaks relationships. Ballantine Books, New York
3. Chomsky N (2004) Language and mind: current thoughts on ancient problems. Part I &
Part II. In: Jenkins L (ed) Variation and universals in biolinguistic. Elsevier, Amsterdam, pp
379–405
4. Wertheim EG (1999) The importance of effective communication. Northeastern university.
https://ysrinfo.files.wordpress.com/2012/06/effectivecommunication5.pdf. Accessed 25 Mar
2015
5. Knapp ML, Hall JA (2002) Nonverbal communication in human interaction, 5th edn.
Harcourt Brace, Fort Worth
6. Trager GL (1958) Paralanguage: a first approximation. Stud Linguist 13:1–12
7. Hall EM (1966) The hidden dimension. Doubleday, Anchor
8. Hall EM (1968) Proxemics. Curr Anthropol 9:83–108
9. Abercrombie D (1967) Elements of general phonetics. Edinburgh University Press,
Edinburgh
10. Mehrabian A (1971) Silent messages, 1st edn. Wadsworth, CA
11. Akhtar N, Gernsbacher MA (2008) On privileging the role of gaze in infant social cognition.
Child Dev Perspect 2(2):59–65
12. Baldwin DA, Moses LJ (1996) The ontogeny of social information gathering. Child Dev
67(5):1915–1939
13. Boundless (2015) Pauses. Boundless Communications. https://www.boundless.com/communications/textbooks/boundless-communications-textbook/delivering-the-speech-12/effective-vocal-delivery-64/pauses-256-10661/. Accessed 2 Feb 2015
14. Scherer K (2000) A cross-cultural investigation of emotion inferences from voice and speech:
Implications for speech technology. In: Proceedings of international conference on speech
language processing (ICSLP). Beijing, China, Oct 2000, pp 379–382
15. Tato R, Santos R, Kompe R, Pardo JM (2002) Space improves emotion recognition. In:
Proceedings of international conference on speech language processing (ICSLP). Denver,
Colorado, pp 2029–2032
16. Oudeyer P (2002) The synthesis of cartoon emotional speech. In: Proceedings of speech
prosody. Aix-en-Provence, France, pp 551–554
17. Banziger T, Scherer KR (2005) The role of intonation in emotional expressions. Speech
Commun 46(3–4):252–267
18. Johar S (2014) Paralinguistic profiling using speech recognition. Int J Speech Technol
17(3):205–209
19. Heron J (1977) Catharsis in human development. Human potential research project.
University of Surrey, Guildford
20. Elliott J (2005) How singing unlocks the brain, BBC News. http://news.bbc.co.uk/1/hi/
health/4448634.stm. Accessed 10 Apr 2015
21. Arizona Health Sciences Library (2006) Health problems of musicians. http://www.ahsl.
arizona.edu/about/ahslexhibits/musicianmedicalmaladies/instruments.cfm. Accessed 22 Feb
2015
Chapter 2
Psychology of Voice

Abstract  The sound of every individual’s voice is unique due to the difference in
the size and shape of vocal cords. The vocal folds loosen and tighten resulting in a
change in pitch, volume, timbre, or tone of the sound produced. Analyzing speech
from a physiological perspective, this chapter explores the pitch component of voice
and how influential it can be. Interestingly, information regarding prosody, emotions,
gender and age is carried by pitch, and pitch can unconsciously divulge
feelings, moods and emotions. The chapter also presents vocal behaviour as
a powerful index of emotional and personality markers, which are paramount in the
extraction of meaningful information from acoustic signals and contribute to a better
understanding of the psychology of voice and performance capabilities.

Keywords Physiology · Phonemes · Prosody ·  Pitch range  · Pitch · Emotional


markers  ·  Personality markers  · Benevolence · Extroversion · Neuroticism

Speech is an information-rich signal created at the vocal cords, shaped as it travels
through the vocal tract, and radiated at the speaker's mouth. It exploits frequency-
modulated, amplitude-modulated and time-modulated carriers to convey informa-
tion about words, identity, accent, expression, style of speech, emotion and the state
of the speaker. It is the most natural form of human communication and is tied to
human physiological capability and to sequences of sounds known as
phonemes. It is essentially a non-stationary signal, but can be divided into sound
segments which have some common acoustic properties for a short time interval.
The information conveyed by speech is composed of multilayered temporal and
spectral variation that includes prosody, gender, age, identity, emotional state etc.
To understand speech as a means of communication, to analyze speech for
automatic recognition and extraction of information and to discover some physio-
logical characteristics of the speaker, it is necessary to study how to model speech
and its correlates and various aspects of speech processing. Speech coding, syn-
thesis, recognition, understanding, speaker verification and language translation
are some of the many speech applications that use fundamentals of linguistics,

acoustics, pragmatics, speech perception, representation and various speech measures
and properties, thus making speech processing an extensive theoretical and
experimental area of research.
The noise-like air from the lungs is temporally and spectrally shaped by the fre-
quency of the openings and closings of the glottal folds and forms the source sig-
nal of the speech. As a result, broadly two types of sounds exist: voiced which are
periodic and generated by the vocal cords and unvoiced which are aperiodic and
noisy in nature. Due to a steady supply of pressurized air, the vocal cords open and
close in a quasi-periodic fashion, giving rise to voiced sounds like an ‘e’. In the case of
unvoiced sounds, air passes through some obstacle in the mouth, and this obstacle
leads to a non-uniform, non-periodic flow of air.
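As a rough illustration of this distinction, the following sketch (with assumed thresholds, not drawn from the text) applies the classic observation that voiced frames tend to have high energy and a low zero-crossing rate, while noise-like unvoiced frames show the opposite pattern.

import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1
    return np.mean(signs[:-1] != signs[1:])

def is_voiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Crude frame-level voiced/unvoiced decision from energy and ZCR."""
    energy = np.mean(frame.astype(float) ** 2)
    return energy > energy_thresh and zero_crossing_rate(frame) < zcr_thresh

# Example: a 100 Hz sinusoid (voiced-like) versus white noise (unvoiced-like).
sr = 16000
t = np.arange(0, 0.025, 1 / sr)                     # one 25 ms frame
voiced_frame = 0.5 * np.sin(2 * np.pi * 100 * t)
unvoiced_frame = 0.5 * np.random.randn(len(t))
print(is_voiced(voiced_frame), is_voiced(unvoiced_frame))   # True  False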

2.1 Pitch as a Major Auditory Attribute

The periodicity of the glottal pulse and the time-variations of glottal pulse period
convey the intent, expressional content, intonation and stress in the speech signals
[1]. The time duration of one glottal cycle is defined as the pitch period and its
reciprocal is the pitch or fundamental frequency. Pitch is determined by the length,
tension, mass of the vocal cords and the sub-glottal pressure. It carries information
regarding the prosody or rhythm, emotion, speaking style and accent among many
others. The following information is contained in the pitch signal:
(a) Gender classification aims to predict the gender of the speaker by analyzing
different parameters of the speech signal. It is mainly conveyed by the pitch
value and in part by the vocal tract characteristics. The average pitch for
males is about 110 Hz while for females it is about 200 Hz [2] (an illustrative pitch-estimation sketch follows this list).
(b) Emotional states are correlated with particular physiological states, which in
turn make predictable effects on speech features, especially on pitch, timing
and voice quality. Speech emotion recognition is particularly useful for appli-
cations which require natural man-machine interaction. When a person is
in a state of anger, joy or fear, the speech is fast, loud and has strong high-
frequency energy. When someone is sad or bored, slow, low-pitched speech
with weak high-frequency energy is produced. Pitch variation is often corre-
lated with loudness variation. Happiness, distress and extreme fear in voice
are also signalled by fluctuations of pitch.
(c) Accents convey information about the status of individual entities in the
discourse to indicate their relative salience. It is also largely conveyed by
changes in the pitch and rhythm of speech. In addition, a certain type of pitch
movement may signal an intonational meaning.
(d) Prosody is a parallel channel of communication for carrying information that
cannot be deduced from lexical channel. All aspects of prosody are transmit-
ted by muscle motions and time-variations of pitch have a smooth relationship
with these muscle tensions. It gives clues to many channels of linguistic and
paralinguistic information and can indicate syntax and people’s attitudes and
feelings. Even hand gestures, eyebrow and face motions, can be considered
prosody because they carry information that modifies and can even reverse the
meaning of the lexical channel.
(e) Age and state of health of a speaker is also related to pitch. The biological
fact that the ratio of eye diameter to head diameter varies markedly with age
develops connections between the sound shape, meanings or communicative
intentions, emotions and affect of the speaker. As a result, a visual estimation
of the ratio eye diameter/head diameter is a rough indicator of age and size of
speaker.
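To illustrate how the pitch information discussed above might be extracted in practice, the sketch below estimates the fundamental frequency of a voiced frame by autocorrelation and then applies the crude 110 Hz/200 Hz gender heuristic mentioned in item (a). The frame size, the search range and the heuristic itself are illustrative assumptions, not a prescribed method.

import numpy as np

def estimate_f0(frame, sr, f0_min=70.0, f0_max=400.0):
    """Estimate the fundamental frequency of a voiced frame in Hz by finding
    the autocorrelation peak within a plausible pitch-period range."""
    frame = frame - np.mean(frame)
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min, lag_max = int(sr / f0_max), int(sr / f0_min)
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sr / lag

def guess_gender(f0):
    """Very rough heuristic: closer to 110 Hz -> 'male', closer to 200 Hz -> 'female'."""
    return "male" if abs(f0 - 110) < abs(f0 - 200) else "female"

sr = 16000
t = np.arange(0, 0.04, 1 / sr)                       # 40 ms analysis frame
frame = np.sin(2 * np.pi * 120 * t)                  # synthetic 120 Hz "voice"
f0 = estimate_f0(frame, sr)
print(round(f0, 1), guess_gender(f0))                # ~120.3 male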

2.2 Speech Markers

The knowledge of perception of sound and extraction of meaningful data from
acoustic signals is paramount to understand the relevance and evolution of audio
signal processing. This understanding helps to analyze what comprises the pitch,
timbre etc. and what makes some sounds especially natural or artificial. The short-
comings and pitfalls encountered during sound processing can also be studied with
this knowledge and thus, suggest various extensions that can be made in storing,
producing and modifying speech signals.
Voice has long been considered a measure of emotion and a reflector of person-
ality due to its potential to tap individual differences in emotional states
and personality dispositions [3]. The understanding of the complex interplay of
personality, emotional dynamics and voice production has progressed to a level
where many technological advances today support the voice-psychology association.
The role of psychological processes among voice-disordered groups has also been
long debated and remains controversial.
Speech carries a lot of information over and above the content in the language.
The concept of speech markers has been incorporated into the domain of socio-
linguistics since the 1970s. Most individuals do not have voluntary control over
the characteristics (age, sex, social class etc.) they present to others. These speech
markers are often accompanied by non-linguistic cues permitting interlocutors to
communicate emotions, attitudes and intentions about their own as well as others'
social states.

2.2.1 Emotional Markers in Speech

Emotions can be expressed in voice at the physiological, the articulatory or the
acoustic level. Emotion is intimately connected with cognition, and many physiological
indices change during emotion arousal [4]. There exist a large number of para-
linguistic markers embedded in the acoustic, linguistic and non-verbal content
of speech that are intertwined with prosody and semantics and are effective in
distinguishing a large range of emotions over a range of human voices and con-
text, adding naturalness to synthesized speech and thereby facilitating effective
emotional speech processing. Since emotion analysis varies with culture, language
and even population, it is essentially a multi-faceted approach and improvement in
speech emotion recognition performance has been achieved by combining gestural
information along with acoustic correlates. Anger, fear, sadness, joy, neutral and
surprise are some of the common emotions identified by current speech dialogue
and processing systems.
Most of the current methods for measurement and analysis of these cues are
intrusive and require specialized equipment and expertise to make explicit and
detailed predictions regarding the states conveyed in emotional speech. Studies on
emotion may focus on the expression of the emotion by the speaker, the acoustic
cues that convey the intended emotion, the perception of these cues and the infer-
ence about the expressed emotion. Several studies have explored affect inferences
from voice cues in listening tests, where the participants are required to judge the
emotions expressed in speech samples using various response formats like forced
choice and quantitative ratings. According to Scherer [5], various content-mask-
ing procedures that disrupt or degrade individual voice cues can be used to study
which voice cues are used by listeners to infer specific emotions.
The existence of various voice profiles for different emotions and the complex
nature of the voice production process make this task quite challenging. Inconsistent
data regarding voice cues to specific emotions, individual differences among
speakers, weak emotional effects and the interplay of spontaneous and strategic
expressions are some sources of variability that pose practical problems in deducing
emotion portrayals.
are being directed towards cross-cultural studies, implementation of multi-modal
approaches in emotion expression and intensive research collaboration from psy-
chology, acoustics, engineering and computer science to facilitate better under-
standing of how emotions are revealed by various aspects of the voice.

2.2.2 Personality Markers in Speech

As mentioned before, the scope of voice-based human-machine inter-
action expands beyond directed dialogue and simple command-and-control type
interfaces. Future machines will need to be able to interpret a specific context,
which is determined by many factors including the quality of voice, and produce
the respective output. An analysis of the semantic nature of personality traits and
interpersonal and intra-personal behaviour dispositions reveals the underlying
dimensions regarded as essential determinants of social interactive behaviour.
Controversy that surrounds the concept of personality has forced social and
behavioural scientists to debate the nature of personality and its impact on behav-
iour. Since listeners rely heavily on speech style to attribute personality to the
speaker, the possibility of accurate personality inferences from speech remains
questionable [6]. In speech-based communication, vocal manifestations can be
modeled to establish a psychological categorization of personality traits. These
manifestations are regarded as speech markers of personality that serve as the
basis for personality attribution of the listener corresponding to a specific person-
ality disposition of the speaker [7]. These speech cues reflect the individual dif-
ferences in cognitive processes of the individuals and the relative dominance of
certain emotional and motivational states.
The various prosodic characteristics like pitch level, tempo, speech rate and
loudness can be modeled in a number of different ways to convey speaker’s affec-
tive state and attitude [8, 9]. Speech researchers have demonstrated that emotional
states differ in their paralinguistic expression and observers use these vocal cues
to judge the personality traits and affective states of the speaker. For example, a
speaker’s age can be judged by voice alone [10]. Voice quality, pitch and pitch
range are the important dimensions on which listeners could base their judgments
about speaker age [11]. The pitch measurement varies substantially from child-
hood to adulthood and is also different for men and women. On the other hand,
pitch range appears to remain fairly constant during childhood and increases
from adolescence to adulthood. Pitch range is also an important indicator of sex
and female range is considerably wider than for men. It may also be stated that
the speaker characteristics are relatively permanent as they are closely related to
speaker’s physiology and anatomy. According to Scherer [7], males tend to have
higher pitch levels compared to females and this can be attributed to high degree
of arousal in males. Active emotions like anger and happiness are associated with
fast tempo, and high pitch, whereas low energy state of sadness attributes to slow
tempo lower speech rate and mean pitch. Similarly, major personality dimensions
of benevolence and competence are also largely related to pitch and speech rate.
Lower pitch and faster speech rate are associated with more credibility and hence,
more benevolence [8, 12, 13]. However, deception is strongly related to funda-
mental frequency of voice and an increased frequency signals false utterances and
judges the individual as less truthful. This can also be supported by the fact that
stressful situations tend to raise the voice’s fundamental frequency. From time
to time, correlations of the personality dimensions of introversion-extroversion
and emotional stability have attracted various researchers and numerous studies
have investigated the prosodic parameters pitch range, pitch level, intensity and
tempo to model these dimensions in synthetic as well as natural speech [14–16].
Extroverts are more sociable and interactive and introverts are rather conservative,
quiet and shy. On the other hand, emotional stability or neuroticism is an internal
state of mind rather than interpersonal reaction. Individuals with high neuroticism
are easily overwhelmed by feelings and are said to be less confident and unstable
as compared to low neurotics who are more calm and controlled [17]. In the past,
it has been found that both these personality dimensions significantly influence an
individual’s behaviour in a variety of contexts and therefore, there is considerable
interest in these traits and their manifestations in behaviour [18–20]. Oberlander
and Gill [21] performed a part-of-speech analysis on these two groups and reported
that the neuroticism dimension was more closely related to implicitness and that high
neurotics used pronouns and verbs more pervasively. Also, high extroverts used more
conjunctions overall, while low extroverts preferred more nouns and adjectives.
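As a rough illustration of this kind of word-category analysis, the self-contained sketch below counts per-word rates of two closed classes (pronouns and conjunctions) in a text sample. The word lists are illustrative assumptions; a real study would use a trained part-of-speech tagger covering nouns, verbs and adjectives as well.

from collections import Counter
import re

# Small closed-class word lists, assumed here only for illustration.
PRONOUNS = {"i", "me", "my", "you", "he", "she", "it", "we", "they", "them"}
CONJUNCTIONS = {"and", "but", "or", "nor", "so", "yet"}

def category_rates(text):
    """Return the per-word rate of pronouns and conjunctions in a text sample."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for w in words:
        if w in PRONOUNS:
            counts["pronoun"] += 1
        elif w in CONJUNCTIONS:
            counts["conjunction"] += 1
    return {k: v / len(words) for k, v in counts.items()} if words else {}

sample = "I think I know what I feel, and I will say it, but you may disagree."
print(category_rates(sample))   # {'pronoun': 0.375, 'conjunction': 0.125}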
In order to model personality traits for speech synthesis using different speak-
ers and to identify one or more defined personalities in dynamic situa-
tions, future work will be bound to the availability of data and large databases in
order to avoid any influence of the bias of the listener’s perspective. Speech is a
highly complex interaction of communicative as well as informative characteris-
tics that convey information about the speaker’s identity, his emotional state and
the situational context. In addition to pitch and rate, additive models that involve
other vocal factors must be designed to understand how semantic information is
conveyed by the paralinguistic parts of speech. Future personalized speech syn-
thesis systems would require an understanding of how personality is encoded in
spoken communication along with refined methods to analyze speech. Moreover,
capturing emotional states along with the personalities would facilitate a more
holistic system of estimating behaviour from speech. Such parametrical synthesis
of speech can be used for diverse commercial applications to indicate personal-
ity impressions that individuals leave on one another and highlight the existence
and psychological significance of personality as an important correlate in social
interaction.

References

1. Kashem (2004) Speech processing. http://duet.ac.bd/drkashemweb/Dr.Kashem%20Wev/dr%20kasem-dsp-ps/Chapter13-Speech%20Processing.pdf. Accessed 17 Jun 2015
2. Kawahara H, Matsui H (2003) Auditory morphing based on an elastic perceptual distance
metric in an interference-free time-frequency representation. In: Proceedings of IEEE inter-
national conference on acoustics, speech and signal processing, vol I, pp 256–259
3. Aronson AE (1990) Clinical voice disorders: an interdisciplinary approach, 3rd edn. Thieme,
New York
4. Lindsay PH, Norman DA (1972) Human information processing. Academic Press, New York
and London
5. Scherer KR (2003) Vocal communication of emotion: A review of research paradigms.
Speech Commun 40:227–256
6. Giles H, Powesland PF (1975) Speech style and social evaluation. Academic Press, New York
7. Scherer KR (1979) Personality markers in speech. In: Scherer KR, Giles H (eds) Social markers in speech. Cambridge University Press, Cambridge, pp 147–209
8. Apple W, Krauss RM (1979) Effects of pitch and speech rate on personal attributions. J Appl
Soc Psychol 37:715–727
9. Trouvain J, Barry WJ (2000) The prosody of excitement in horse race commentaries.
Proceedings of ISCA—workshop on “speech and emotion”. Newcastle, Northern Ireland, pp
86–91
10. Allport GW, Cantril H (1934) Judging personality from voice. J Soc Psychol 5:37–55
11. Helfrich H (1979) Age markers in speech. In: Scherer KR, Giles H (eds) Social markers in
speech. Cambridge University Press, Cambridge, pp 63–108
12. Smith B, Brown B, Strong W, Rencher A (1975) Effects of speech rate on personality percep-
tion. Lang Speech 18:145–152
13. Brown BL, Strong WJ, Rencher AC (1974) Fifty-four voices from two: the effects of simulta-
neous manipulations of rate, mean fundamental frequency, and variance of fundamental fre-
quency on ratings of personality from speech. J Acoust Soc Am 55:313–318
14. Scherer KR, Scherer U (1981) Speech behaviour and personality. Speech evaluation in psy-
chiatry. Grune & Stratton, New York, pp 115–135
15. Nass C, Lee KM (2001) Does computer-synthesized speech manifest personality?
Experimental tests of recognition, similarity-attraction, and consistency-attraction. J Exp
Psychol Appl 7(3):171–181
16. Mairesse F, Walker MA, Mehl MR, Moore RK (2007) Using linguistic cues for the automatic
recognition of personality in conversation and text. J Artif Intell Res (JAIR) 30:457–500
17. Eysenck H, Eysenck SBG (1991) The Eysenck personality questionnaire-revised. Hodder
and Stoughton, Sevenoaks
18. Isbister K, Nass C (2000) Consistency of personality in interactive characters: verbal cues,
non-verbal cues, and user characteristics. Int J Hum Comput Stud 53:251–267
19. Furnham A (1990) Language and personality. In: Giles H, Robinson W (eds) Handbook of
language and social psychology. Wiley, Chichester, pp 73–95
20. Dewaele JM, Furnham A (1999) Extraversion: the unloved variable in applied linguistic
research. Lang Learn 49:509–544
21. Oberlander J, Gill AJ (2004) Individual differences and implicit language: personality, parts
of speech and pervasiveness. In: Proceedings of the 26th annual conference of the cognitive
science society. Chicago, IL, USA
Chapter 3
Language, Communication and Human
Behaviour

Abstract  Language has been considered a social behavioural phenomenon and
an indicator of the structure of cognitive processes dealing with functions such as
communicating, imagining, learning and perceiving. Communication is enriched
through coverbal or nonverbal behaviours, and this chapter considers the implica-
tions of these behaviours for significantly improving speech recognition and under-
standing. The association and interdependence of verbal and nonverbal elements
are highlighted, and various research approaches and challenges focusing on the
interpretation of human behaviour through the exploitation of these measures are
discussed.

Keywords Coverbal · Linguistic · Phonological · Syntactic · Semantic · 
Interpersonal  · Nonverbal · Gesture ·  Body language

Individual behaviours are deeply embedded in social and institutional contexts. We are
guided as much by what others around us say and do, and by the “rules of the game” as
we are by personal choice [1].

Human behaviour is a very complex area and is influenced by a huge range of
factors. These factors can be split into three major levels: personal or micro factors,
which are intrinsic to the individual; social or meso factors, which concern the
influence of other people's behaviour on the individual; and lastly, environmental
factors, over which the individual has little or no control. The social psychological
models of human behaviour consider the factors at all three levels to under-
stand the complex and inter-related nature of the mechanisms behind why we do
what we do.
Providing information is the first step towards influencing behaviour change.
While attitude can also influence behaviour, evidence suggests that the link is
not as strong as previously thought. Communication can shift human attitude.
Similarly emotions can have a strong influence on our conscious or unconscious
behaviour. In most of the situations though, emotion influences behaviour change
indirectly and it is important to understand which factors are likely to be affected
by the emotional response.


Other people’s beliefs and behaviour can have a strong social influence on our
own behaviour, a phenomenon that has been widely discussed in recent years.
Communications can be effective in highlighting social norms and prompting peo-
ple to act in accordance with them. For example, online forums or communities
where people connect to others in similar circumstances can be particularly help-
ful with regard to sensitive issues, as social proof and reassurance can be provided
in a ‘safe’ and anonymous manner. There exist a number of social psychological
models, specific to particular behaviours, that provide a comprehensive picture of
the communication and behaviour relationship [2].
The most recent trend of thought is the idea that all of human behaviour and
emotion originate from the brain and there is an anatomical area of the brain that
is responsible for language and communication in humans.
From the above discussion, it is clear that language and communication per-
vade social life. A person’s need to communicate and form relationships is one of
the most extensively studied of all the human behaviours. It is the primary means
by which we gain access to the contents of others’ minds. In natural sciences, biol-
ogists are only interested in the most observable manifestation of communication,
i.e. language.
Linguists often say that language and communication are different which is cer-
tainly true as people do communicate without language and language can be con-
sidered as a tool for communication in human societies. But it also goes without
doubt that without this capacity of linguistic communication, the nature of human
life would be radically different. Linguists regard language as an abstract set of
principles that specifies the relation between a sequence of sounds and a sequence
of meanings [3] and is often implicated in the various social phenomena: social
perception, attitude change, social interaction, stereotyping etc.
Any language is made of four systems: phonological, morphological, syntac-
tic and semantic. The phonological system is concerned with the analysis of an
acoustic signal into a sequence of speech sounds and the morphological system
deals with the way words are constructed out of these phonological elements
called phonemes. The syntactic system is concerned with the organization of words
into phrases and sentences, and the semantic system gives meaning to
these units. Particular acts of speech can also be regarded as actions intended to
accomplish a specific purpose: assertions, requests etc. They are typically embed-
ded in a discourse, and a distinction needs to be drawn between the literal meaning
of an utterance and its intended meaning.

3.1 Language and Interpersonal Communication

Any communication involves two information processing devices. Human com-
munication is a complex and intriguing phenomenon in which both the form (i.e. the
syntax) and the content (i.e. the semantics) of the communication reflect the personal
characteristics of the individuals as well as their social roles and relationships.
In human communication, people are the information processing devices who
modify the physical environment of others and construct mental representations
[4, p. 1]. Interpersonal communication is woven through all aspects of living and
is meaningful only in the context of living. The study of interpersonal commu-
nication is a multidisciplinary activity and includes much of psychology, sociol-
ogy, medicine, social psychology, and many facets of the language studies as
well. Recently, evolutionary explanations of human behaviour have become more
sophisticated in their descriptive and predictive ability which has motivated inter-
personal communication researchers to explore this important theoretic approach.
Work specialization, globalization, organizational restructuring, technology and
cultural diversity contribute to the current emphasis on interpersonal skills. There
is a concrete and complex network of links among the elements such as language,
behaviour and interpersonal skills in the process of communication. The “naive psy-
chology” proposed by the social psychologist Heider suggests that individuals act as
observers and analyzers of human behaviour in everyday life [5].
Krauss and Fussell [6] describe four conceptions of interpersonal
communication:
(i) The Encoding/Decoding paradigm, in which a communicator produces and
interprets information. It allows speakers to create linguistic representations
of the mental representations they want to convey.
(ii) The Intentionalist paradigm distinguishes between a message's literal and
nonliteral meanings. It reflects the communicative intention of the speaker
and expresses the meaning they want to convey.
(iii) The Perspective paradigm derives from the listener's point of view and is
based on the fact that the same message can convey different meanings to dif-
ferent recipients.
(iv) Feedback, which refers to the information available to a source that
permits him or her to make qualitative judgments about communication
effectiveness.
Schwarz, Strack and colleagues have shown that discrepancies between a speak-
er’s intended meaning and receiver’s interpretation can be an important determi-
nant of the receiver’s response [7–10]. There is good evidence that speakers take
the listener’s perspectives in the formulation of messages [11–17]. Feedback and
the knowledge of its availability transforms the communication environment as it
permits speakers to modify the previously formulated messages and redistributes
the cognitive load of message production and comprehension.

3.2 Language and Coverbal Behaviours

Human communication can be viewed as a system of interdependent channels
over which information is transferred. The fact that people do not communicate
by words alone has been well noted for many years [18–20]. Behaviours that
accompany speech, are not strictly linguistic, and add meaning to it are catego-
rized as coverbal or, more broadly, nonverbal behaviours. Gestures, facial expressions,
eye contact, body language are some of the common behaviours that can occur
apart from the context of speech. The coverbal channel complements the verbal channel of communication, whereas the nonverbal channel is supplementary to the verbal content and independent of it. According to Mehrabian and Ferris [21], 55 % of the message is delivered through body language (that is, the coverbal behaviour) and 38 % is communicated by pitch, tone and other paralanguage elements (that is, the nonverbal behaviour). This suggests that the coverbal and nonverbal channels are much
more powerful in communicating the credibility of the message than the real spo-
ken words i.e. the verbal channel. Goffman has said: ‘Although an individual can
stop talking, he cannot stop communicating through body expression; he must say
either the right thing or the wrong thing. He cannot say nothing’ [22, p. 35].
In recent years, much research in psychology and psychiatry has been conducted on nonverbal communication. These efforts have attempted to measure the occurrence of such behaviours and to identify their communicative significance. Emotional states, attitudes, and other affective and regulatory information are conveyed by communicative behaviours that occur in association with or accompany words but do not stand alone; the investigation of these non-linguistic behaviours elicited by individuals constitutes coverbal or nonverbal communication research.
A deep understanding of these behaviours should lead to a clear understanding of the role that communications can play in the establishment of specific and realistic objectives. Understanding behaviour and its influences will stimulate the debate about how communications can most effectively influence behaviour and enable us to harness the most efficient and effective communication channels.

3.3 Understanding Nonverbal Behaviour

Nonverbal behaviour naturally covers a wide range of phenomena such as gestures, facial and eye movements and various other body movements. Generally,
although nonverbal behaviour means acts other than speech, in a broader sense nonverbal behaviour also includes a variety of subtle aspects of speech that may be expressed consciously or unconsciously and that do not belong to an arbitrary conventional code of language. These are referred to as paralinguistic properties and include intensity range, pitch, pauses, speech errors, speech rate etc.
When used in communicative context, these features describe the implied mean-
ings that are not explicitly stated through linguistic units. Robust language inter-
pretation is essential for building reliable and efficient conversational systems.
Conversational systems built so far tend to fail when unreliable and unexpected
inputs are received. As a result, it is essential to explore the use of nonverbal ele-
ments to significantly improve speech understanding and recognition.
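
As an illustration of the paralinguistic properties listed above, the following sketch extracts a few simple prosodic cues from an audio file: a pitch track, short-time energy as a proxy for intensity, and a crude pause ratio. It assumes the librosa library is available; the file name and the pause threshold are illustrative choices, not prescriptions from the text.

```python
import librosa
import numpy as np

# Load a (hypothetical) utterance; sr=None keeps the file's native sampling rate.
y, sr = librosa.load("utterance.wav", sr=None)

# Fundamental frequency (pitch) track via probabilistic YIN; unvoiced frames are NaN.
f0, voiced_flag, voiced_probs = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Short-time energy (a simple proxy for intensity).
rms = librosa.feature.rms(y=y)[0]

# Crude pause estimate: fraction of frames whose energy is well below the mean.
pause_ratio = float(np.mean(rms < 0.1 * rms.mean()))

print("mean pitch (Hz):", float(np.nanmean(f0)))
print("pitch range (Hz):", float(np.nanmax(f0) - np.nanmin(f0)))
print("pause ratio:", pause_ratio)
```

Features of this kind are the raw material that conversational systems could draw on to interpret meanings that are not explicitly stated through linguistic units.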

Verbal and nonverbal elements share a dependence relation for a holistic and correct interpretation of the act. Distortion in one of the elements is compensated by the other, and a continued manifestation is achieved of what is being said by the speaker and how. Nonverbal cues help regulate the system by defin-
ing and constraining the pattern of interaction and providing feedback. They may
sometimes convey content and intention more efficiently than linguistic signs, usu-
ally in an independent manner. According to Ekman and Friesen [23], repetition,
contradiction, complementation, accent and regulation are the general functions of
nonverbal behaviour that signal the flow of interaction.
While the study of verbal and non-verbal behaviour has been done indepen-
dently in several disciplines, the relationship between the two has not received the
attention it deserves. The structural relations between units, thematic development,
emotion, and modalization and interpersonal relation between speaker and hearer
are the different levels that need to be considered to understand this phenomenon
of conversational analysis [24]. The fusion of verbal and nonverbal facets appears
to have both genetic as well as sociocultural consequences and this virtuous uni-
fication not only distinguishes societies and cultures from one another but also
clearly tags the human species as distinct from other species.
From a pragmatic point of view, prosodic features are important contextualization cues for speech production and perception. These features may occur at different times in the speech production cycle: preceding the act of speech, during the act of speech itself, or following it. According to Trager [25], any communication consists of certain sequences called vocalizations that consist of noise and do
not have the structure of language. They constitute paralanguage and other voice
qualities such as intonation, pitch range, rhythm control etc. Vocal characterisers,
vocal qualifiers and vocal segregates together constitute paralanguage. Laughing,
yelling, yawning and crying are some of the vocal characterisers whereas intensity,
pitch height and extent are regarded as vocal qualifiers. There are certain parts of speech that neither convey any meaning nor fit into proper word frames in language sequences. Items such as uh, ah and hmm are used as fillers and come under vocal segregates.
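
Purely as an illustration, Trager's three-way division described above can be captured in a small data structure. The category names and example items below are taken from the text; the helper function is a hypothetical convenience, not part of Trager's scheme.

```python
# Illustrative encoding of Trager's paralanguage categories [25].
PARALANGUAGE = {
    "vocal_characterisers": ["laughing", "yelling", "yawning", "crying"],
    "vocal_qualifiers": ["intensity", "pitch height", "extent"],
    "vocal_segregates": ["uh", "ah", "hmm"],
}

def classify_vocalization(item: str) -> str:
    """Return the paralanguage category an item belongs to, or 'unknown'."""
    for category, members in PARALANGUAGE.items():
        if item in members:
            return category
    return "unknown"

print(classify_vocalization("hmm"))  # -> vocal_segregates
```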
Recently, many research studies have focused on the association between nonverbal behaviours and psychological states. Psychologically oriented approaches cover all forms of nonverbal behaviour such as gestures, facial expressions, visual and crowd behaviour etc. They essentially focus on the interpretation of human behaviour through statistical measures different from those employed in linguis-
tic studies. The enrichment of non verbal communication research through the
emergence of psychologically oriented studies has gained momentum and deals
with the description of psychological states of the individuals expressing non ver-
bal behaviour in speech. Mehrabian [26] suggests that this description of findings
should account for the relationships among behavioural cues along with the rela-
tionship between these cues and feelings, personalities and attitudes of the com-
municators, keeping in view the situation in which interaction occurs.
The incredibly dynamic nature of human behaviour and the flaws of social sci-
entific research methodology make the study of human behaviour a challenging
task. It is important to determine what factors influence a person’s decision to perform a specific behaviour, or the ways in which an existing behaviour can be chan-
neled toward more desirable outcomes. Interventions would be more productive if
carried out in the right direction to understand why people behave the way they do.
The ultimate objective of communication is to influence behaviour and the effec-
tiveness of communication can be evaluated by measuring behaviour before and
after the communication effort. A direct implication of this can be seen in organi-
zational settings where employee values and organizational culture are considered
vital to organizational performance. The personal values of employees are widely
considered to influence their workplace behaviour and to achieve an employee
behaviour change, many communicators work on the principle that they need to
change employee attitudes first. Mass media interventions, on the other hand, can
be used to influence beneficial changes in behaviour both directly and indirectly.
Thousands of years ago people communicated only through spoken languages. Today, however, we have more ways to communicate than at any other period in history, and communication has become instantaneous because of the digital revolu-
tion. This emergence of digital communication and social networking has made
a big impact on behaviour as well. Though every mode of communication has
proved relevant through technological innovation, nonverbal cues promote effec-
tive communication by building trust and clarity in messages. According to
Wertheim [27], non verbal cues have the power to reinforce what is being said or
contradict the verbal message being said. Nonverbal communication is not only crucial in daily communication situations but also for assessors and interpreters. As discussed, it may take various forms, each of which illustrates or replaces
a certain part of the verbal communication. To be able to construe nonverbal cues
properly is the first step to successful communication. There is no doubt that
our world is getting smaller and numerous languages are present on this planet.
However, it is imperative that all those who are hearing or language impaired, and also those who deal with such populations, reap the benefits of effective communi-
cation. The importance of providing the best solution must be acknowledged in
order to ensure accurate communication in each and every circumstance. In order
to accomplish this, access to nonverbal elements is paramount and every effort
should be made to utilize their potential and eventually experience superior and
successful outcomes.

References

1. Jackson T (2005) Motivating sustainable consumption: a review of evidence on consumer behaviour and behavioural change. Report to the Sustainable Development Research
Network, Policy Studies Institute, London
2. Darnton A (2008) GSR behaviour change knowledge review. Reference report: an overview
of behaviour change models and their uses. HMT Publishing Unit, London
3. Krauss RM, Chiu C-Y (1998) Language and social behaviour. In: Gilbert D, Fiske D,
Lindsey G (eds) The handbook of social psychology, vol 2, 4th edn. McGraw-Hill, Boston,
pp 41–88
4. Sperber D, Wilson D (1986) Relevance: communication and cognition. Harvard University Press, Cambridge
5. Heider F (1958) The psychology of interpersonal relations. Wiley, New York
6. Krauss RM, Fussell SR (1996) Social psychological models of interpersonal communica-
tion. In: Higgins ET, Kruglanski A (eds) Social psychology: a handbook of basic principles.
Guilford, New York, pp 655–701
7. Bless H, Strack F, Schwarz N (1993) Informative functions of research procedures: bias and
the logic of conversation. Eur J Social Psychol 23:149–165
8. Schwarz N, Strack F, Hilton DJ, Naderer G (1991) Base rates, representativeness and the
logic of conversation. Social Cogn 9:67–84
9. Strack F, Schwarz N (1992) Communicative influences in standardized question situations:
the case of implicit collaboration. In: Semin G, Fiedler K (eds) Language, interaction and
social cognition. Sage, Beverly Hills, CA, pp 173–193
10. Strack F, Schwarz N, Wänke M (1991) Semantic and pragmatic aspects of context effects in
social and psychological research. Social Cogn 9:111–125
11. Clark HH, Murphy GL (1982) Audience design in meaning and reference. In: Ny JFL,
Kintsch W (eds) Language and comprehension. North Holland Publishing, New York, pp
287–296
12. Fussell SR, Krauss RM (1989) The effects of intended audience on message production and
comprehension: reference in a common ground framework. J Exp Soc Psychol 25:203–219
13. Graumann CF (1989) Perspective setting and taking in verbal interaction. In: Dietrich R,
Graumann CF (eds) Language processing in social context. North-Holland Publishing,
Amsterdam
14. Keysar B (1992) The illusory transparency of intention: linguistic perspective-taking in text.
Cogn Psychol 26:165–208
15. Krauss RM, Fussell SR, Chen Y (1995) Coordination of perspective in dialogue: intraper-
sonal and interpersonal processes. In: Markova I, Graumann CG, Foppa K (eds) Mutualities
in dialogue. Cambridge University Press, Cambridge, pp 124–145
16. Krauss RM, Weinheimer S, Vivehananthan PS (1968) Inner speech and external speech:
characteristics and communication effectiveness of socially and nonsocially encoded mes-
sages. J Pers Soc Psychol 9:295–300
17. Schober MF (1993) Spatial perspective-taking in conversation. Cognition 47:1–24
18. Darwin CR (1872) The expression of the emotions in man and animal. John Murray, London
19. Sapir E (1927) Speech as a personality trait. Am J Sociol 32:892–905
20. Pittenger RE, Smith HL Jr (1957) A basis for some contributions of linguistics to psychiatry.
Psychiatry 20:61–78
21. Mehrabian A, Ferris SR (1967) Inference of attitudes from nonverbal communication in two
channels. J Consult Psychol 31(3):248–252
22. Goffman E (1965) Behaviour in public places. Free Press, New York
23. Ekman P, Friesen WV (1969) The repertoire of nonverbal behavior: categories, origins, usage
and coding. Semiotica 1:49–97
24. Rodrigues IG (2005) Verbal and nonverbal signals in face-to-face interaction: a theoretical
framework for a holistic micro-analysis (The example of a parenthesis). Interacting Bodies,
Lyon, 15–18 June 2005
25. Trager GL (1958) Paralanguage: a first approximation. Studies in Linguistics 13:1–12
26. Mehrabian A (1972) Nonverbal communication. Aldine-Atherton, Illinois
27. Wertheim EG (1999) The importance of effective communication, Northeastern university.
https://ysrinfo.files.wordpress.com/2012/06/effectivecommunication5.pdf. Accessed 26 May
2015
Chapter 4
Multimodality and Spoken Dialogue Systems

Abstract Broaching communication from an interdisciplinary perspective, the present chapter attends to the diverse ways in which multimodal principles can be
applied to current speech systems to seek natural and seamless human computer
interaction capabilities. Offering various avenues to explore and suggesting bene-
fits of multimodal architectures, new perspectives in spoken dialogue systems have
been described. The emergence of technological innovation and inevitable incor-
poration of natural language technologies in future spoken dialogue systems chal-
lenge the future of multimodality to evolve as a natural user interface paradigm
and facilitate consequential multidisciplinary research.

Keywords Multimodality · Human-computer interaction (HCI) · Automatic speech recognition (ASR) · Spoken dialogue systems · Modality theory

Multimodality is an inherent property of human cognition and communication. People perceive with all their senses—vision, hearing, smell, touch, and taste—
and express themselves naturally by voice, gesture, gaze, facial expression, body
posture, and motion. Human interaction with the world is inherently multimodal
[1, 2]. To understand human communication it becomes necessary to encompass
the whole repertoire of communication possibilities and move towards interac-
tive systems that seek to leverage natural human capabilities to communicate via
speech, gesture, touch, facial expression and other modalities. Multimodal interac-
tion lies at the centre of several research areas including computer vision, speech
recognition, artificial intelligence, psychology and many others. A multimodal sys-
tem is one that uses any combination of modalities in parallel, interacts with a user along different communication channels and extracts and conveys meaning automatically. The human-centered view focuses on multimodal perception and control of human input and output channels, i.e. sight, hearing, touch etc. The
system view focuses on synergistic representation of two or more computer input
and output modalities like keyboard, mouse, cameras etc. In this chapter we focus
on the former view as it is much closer to our area of discussion.

Fig. 4.1  An architecture of multimodal user interfaces: input modes (speech, visual gesture, motor (facial or eye) and touch) feed analysis, recognition and verification components; fusion of modalities, user and discourse modelling, and modality and presentation design connect through an application interface to output modes (speech, sound, graphics). Adapted from [3]

Parallel operation can be achieved at different levels of abstraction. Fusion of different types of data and temporal constraints imposed on them are the prerequisites of such systems in order to allow naturalness from the free choice of modalities and result in a
human-computer communication that is close to human-human communication.
Thus, by letting our highly skilled and coordinated human communicative behaviour control interactions with a system more transparently than ever before, multimodal interfaces allow greater accessibility in diverse scenarios and contexts, thus improving the robustness, performance and efficiency of communication.
People accompany their utterances by facial expressions and gestures and
listeners simultaneously use both verbal and non-verbal cues to interpret and
comprehend these messages. The goal is to make the system aware of the user’s environment, react to speech and gestures, and respond with speech, video, text etc., thus enabling a natural interaction with information systems available anytime and
anywhere. They possess the potential to expand computing to more challenging
applications that can accommodate more adverse usage conditions than in the past.
Figure 4.1 shows the basic components of a multimodal interface.

4.1 Potential Benefits of Multimodal Interfaces

Multimodal systems emphasize abstract levels of processing, interaction and context management, knowledge sources and investigations of user’s beliefs, inten-
tions and attitudes. The growing interest in multimodal interface design is largely
influenced by the need for flexible, transparent and powerfully expressive means
of human computer interaction.

Multimodal architectures offer a wide range of benefits over other user interfaces
[3]. A single modality does not permit the user to interact effectively across all tasks
and environments [4]. The multiple modalities offer free choice of modalities, flexible use of input modes depending on the specifics of the task or environment, and communication close to human-human communication, resulting in naturalness. They offer high efficiency as the best suited modality for each task is used. The mul-
tiple modalities also increase the accuracy of the user interface and hence, lead to
enhanced error avoidance as one modality can indicate an object more accurately
than some other modality. The ability and preference to use different modes of
communication permits the user to exercise control over their interaction with the
computer. In this respect, they accommodate a wider range of users, tasks and envi-
ronmental situations. According to van Wassenhove et al. [5], humans may process
information faster and in a better way when it is presented in multiple modalities.
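
As a hedged sketch of how such a system might exploit multiple input modes, the fragment below combines per-candidate scores from two recognisers by a weighted sum (so-called late fusion). The modality weights, candidate labels and scores are invented for illustration; real systems use far richer fusion strategies.

```python
def fuse_modalities(scores_by_modality, weights):
    """Combine per-candidate scores from several modalities by weighted sum."""
    fused = {}
    for modality, scores in scores_by_modality.items():
        w = weights.get(modality, 1.0)
        for candidate, score in scores.items():
            fused[candidate] = fused.get(candidate, 0.0) + w * score
    # Return the best-scoring interpretation and the full fused score table.
    return max(fused, key=fused.get), fused

# Example: a deictic command interpreted from speech and a pointing gesture.
speech_scores = {"move lamp": 0.6, "move map": 0.4}
gesture_scores = {"move lamp": 0.2, "move map": 0.8}
best, fused = fuse_modalities(
    {"speech": speech_scores, "gesture": gesture_scores},
    {"speech": 0.6, "gesture": 0.4},
)
print(best, fused)
```

The point of the sketch is error avoidance: a candidate weakly supported by one modality can still win if the other modality indicates it more accurately.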
Richard Bolt’s “Put That There” system [6] was a groundbreaking demonstration of multimodal processing that integrated voice and gesture inputs to enable a user to interact naturally with a display. Since then, consid-
erable strides have been made in developing more complex multimodal systems
bringing together new modalities such as haptics and eventually introducing
mobile computing environments as a rich testbed for multimodality. During the
past decade, there has been significant progress in the development of spoken lan-
guage technology and natural language processing. These technologies are being
further aided by advances in pen-based hardware and software capabilities to
automate telephony and other real world applications. To date, most multimodal systems combine either speech and pen input [7] or speech and lip move-
ments [8–10]. This is because speech input offers ease of use and high bandwidth
information. On the other hand, pen input provides a more socially acceptable
form of input and a viable alternative to speech under circumstances of extreme
noise [11, 12]. As a result, such complementary multimodal spoken systems permit users to engage in more transparent and expressive information seeking, providing flexible descriptions of objects and situations. These systems can
support greater precision of spatial information than a speech-only interface and
therefore, support shorter and simpler utterances which results in fewer disfluen-
cies and more robustness [13]. Audio visual integration is another widespread area
of research to help increase the granularity of the system. Dynamic navigation sys-
tems when combined with speech recognition can cover a wider range of environ-
mental conditions in real life scenarios that may not be possible to handle when
such modes are used separately.

4.2 Multimodality in Spoken Dialogue Systems

Interpreting human language is a challenging problem in building man-machine interface systems due to the flexibility of human language behaviour. In our interaction with others, we easily and naturally use all of our sensory modalities as we
communicate and exchange information. The information from different modalities can be effortlessly integrated to fuse data to optimally meet the communica-
tion needs. As man-machine interaction systems advance, it becomes increasingly
important to make use of multiple modalities in language and speech systems and
the exploitation of several modalities can increase the naturalness and ease of
human-system communication.
Spoken dialogue research is an area of research that seeks to understand and
advance work in the development of Human-Computer Interfaces to incorpo-
rate both speech and natural language technologies including Automatic Speech
Recognition (ASR) and Text-To-Speech (TTS) technologies. Recent advances and
innovation in spoken dialogue systems have given rise to effective forms of com-
munication and interaction in activities such as billing, purchasing and booking
services. But academic, language and dialogue researchers endeavor to improve
these initial dialogue systems into more natural, flexible and reliable systems that
use language understanding and dialogue management, collaborating with the user
in dialogue to solve a common goal.
Speech is the most natural form of communication between humans and the
most dominant mode of information exchange [14]. Literature provides a num-
ber of reasons for using speech in human-machine interaction. Spoken dialogue
systems incorporate artificial intelligence in speech to adapt machines to humans
thereby providing a natural speech interface. Extracting the underlying meaning
and semantics of what is being spoken is a highly challenging area of research
and future developments in this area will focus on incorporating prosodic features
and pragmatic salience in addition to the current syntax and content dependent
approaches [15].
Multimodal spoken dialogue systems are the systems which mimic human
conversation to the greatest extent by allowing users to express themselves freely.
Multimodal dialogue systems offer additional modes of input and output such as
video and text that allow the user to select the most convenient mode suited for
the current environment and allows presentation of output in the most appropriate
manner [16]. Effective interaction in dialogue systems involves both the presenta-
tion of information and the flow of interactive dialogue.
A spoken dialogue system (Fig. 4.2) consists of various components that enable
the whole system to function properly. It can usually be divided into three parts:
Understanding the user input, decision making, and generating the output speech.
The user’s input is translated into machine readable form by an automatic speech
recognizer (ASR). The recognized words are sent to a language understanding
engine which interprets the semantic meaning of the input. The decision making
is then done by the dialogue manager based on the meaning of the words extracted
and the current state of the dialogue. Finally, a speech synthesizer takes the sen-
tences produced by the language generator in text form and converts them into a
spoken form that can be output using either recorded prompts or audio.
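
A minimal sketch of this three-part pipeline is given below. All component implementations are stand-in stubs; a real system would plug in an ASR engine, a language understanding model, a dialogue manager with proper state tracking, a language generator and a speech synthesizer.

```python
class SpokenDialogueSystem:
    """Skeleton of the pipeline: understand input, decide, generate output speech."""

    def __init__(self, asr, nlu, dialogue_manager, generator, tts):
        self.asr = asr                              # audio -> machine-readable words
        self.nlu = nlu                              # words -> semantic meaning
        self.dialogue_manager = dialogue_manager    # meaning + state -> decision
        self.generator = generator                  # decision -> response text
        self.tts = tts                              # text -> spoken output
        self.state = {}                             # current state of the dialogue

    def turn(self, audio):
        words = self.asr(audio)
        meaning = self.nlu(words)
        decision, self.state = self.dialogue_manager(meaning, self.state)
        text = self.generator(decision)
        return self.tts(text)

# Stub usage: echo-style components standing in for the real modules.
sds = SpokenDialogueSystem(
    asr=lambda audio: "book a table for two",
    nlu=lambda words: {"intent": "book_table", "party_size": 2},
    dialogue_manager=lambda meaning, state: ({"act": "confirm", **meaning}, state),
    generator=lambda decision: f"Booking a table for {decision['party_size']}.",
    tts=lambda text: text,  # a real system would synthesize audio here
)
print(sds.turn(b"<audio bytes>"))
```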
In multimodal systems, input from several modalities must be combined and
it must be decided which channels to use for output and when to produce it,
thus making it more complex and conversational.
Fig. 4.2  Spoken dialogue system: the user’s speech input passes through an automatic speech recognizer (ASR) to machine-readable words, a language understanding engine extracts the semantic meaning, the dialogue manager makes the decision, and a text generator and speech synthesizer produce the spoken system output

Such a system has additional modules for merging inputs from different modalities, decomposing the multi-
modal messages to respective output modality and controlling the timing of input
and output signals of the system.
There are thousands of modalities in existence, both input and output, that can be
incorporated into interface designs [17]. Modality theory by Bernsen [18] is about
representational modalities and not the devices which machines and humans use
when they exchange information, such as sensors, hands, joysticks etc. It states that:
Given any particular set of information which needs to be exchanged between the user
and system during task performance in context, identify the input/output modalities which
constitute an optimal solution to the representation and exchange of the information [18].

The constructive proposition is that the world of modalities is far more stable than the world of devices and hence much more amenable to theoretical treatment; on the negative side, however, the theory does not address the issue of device selection for a particular set of input/output modalities in a specific application. The Modality Theory of Bernsen [18] provides us with a basis for examining arbitrary input/output modality types and combinations with respect to their capabilities for information representation and exchange. The Sutcliffe and Faraday method and the Roskilde method provide a theoretical and methodological basis for information mapping between user and system based on the modality [19].
Given today’s technological advancements, we need to attend to a much wider range of modalities and their combinations. Today and in the future, major advances
in new input technologies and algorithms, processing speed, distributed computing
and spoken language technology will introduce new class of sophisticated multi-
modal systems for human-computer interaction. The practical implementation of
Modality Theory can serve to offer a sound theoretical framework to combine the
existing and emerging developments for appropriate use of modalities in efficient
interaction design and development.

4.3 Future Directions

Designing well integrated and robust multimodal systems is a new and emerging
field of interest. These systems represent a variety of platforms and they illus-
trate the diverse and challenging nature of emerging multimodal applications.
Multimodal interaction offers many performance advantages to users, as outlined in the previous section; still, many challenges remain before sophisticated multimodal interaction becomes an indispensable part of computing. Appreciating
the vast potential that multimodality has in making the human-computer interac-
tion more natural, easier and even more efficient, it becomes important to model
human-like sensory perception and communication patterns. Though the field is
developing rapidly, most of the systems till date are bimodal and research-level
systems. Therefore, developing multimodal integration methods and architectures
to explore a wider range of methods and modality combinations remains a vast
future research issue. Each unimodal technology like speech and sound recogni-
tion, haptics, language understanding, vision and gesture based recognition, con-
text modeling etc. is an active area of research in itself and studies must explore
both the development and understanding of individual modalities and methods for
multi-modal integration.
Hardware advances and fundamental improvements in metrics and machine
learning techniques further challenge the development of richer and more person-
alized communication interfaces. Interdisciplinary cooperation is of considerable
importance to better understand multimodal interaction. There are different views
on how to define multimodal user interfaces and how to select, deploy and coor-
dinate various modes in specific tasks of human computer interaction. To better understand the natural communication modalities, and how the brain identifies the modalities best suited to a given context so as to achieve the synergy needed to complete a task, a comprehensive organization of the literature on psychology, cognitive science, linguistics, neuroscience and computer vision must be made available and utilized extensively. Such a basis can spearhead empirical work and innovative system designs, and proactively guide the design of new interfaces that are consistent with human capabilities and perceptions [20–23].
To build systems that perform these tasks with a high level of robustness, multimodal knowledge acquisition and representation is of paramount importance.
Today, digital libraries and online resources have become a major source of
information for scholars and general public. It will be necessary to automati-
cally acquire knowledge and extract information from such huge repositories or
multimodal knowledge bases that will aid in more sophisticated natural language
processing. The enhanced presentation of multimodal data may be facilitated
by developing virtual environments that include human-like virtual characters,
show convincing emotions and mimic the behaviour of real world individuals.
This will provide a blended interface style that combines both active and passive
modes and improves the system’s prediction abilities. Another big challenge in
this technology is to enable access of everything to everyone and place user in the
centre of the design process. There is also a danger of cognitive overload when exposing users to such 3D, virtual and simulated environments. Autonomous sys-
tems with interactive control and wearable devices that comprehend user’s actions
and adjust the cognitive load to provide appropriate responses will be capable of
delivering a ubiquitous and personalized computing environment. There is thus a
need for adaptive systems that can adapt to users’ needs automatically and make
the interaction natural and intuitive [13, 24]. Active adaptation technology for
diverse users is necessary before achieving the full potential and bridging the virtual and physical worlds.
Multimodal interfaces can be seen as the future user interface paradigm that
exploits the power of computing, human computer interaction and psychology.
The existence of the human race shall be acknowledged by these systems, as such interfaces will work using multiple linguistic codes and representation systems in addition to multiple modalities, thereby supporting broader application functionality. The combinations of behavioural and non-behavioural modalities will continue
to increase leading to the proliferation of more reliable and innovative interactive
techniques [25]. Implementing such compelling, powerful and efficient technol-
ogy on a commercial scale would also require considerable research to develop
the appropriate infrastructure, automated tools and software to support the features
of next generation multimodal applications discussed above. Simulation tools
will need to be developed to permit research in natural field environments [26].
Availability of significant corpora will be critical for achieving rapid progress in
performance and architecture in the area of spoken language processing [27]. In
all of these ways, it becomes a grand challenge and much work needs to be done
to revolutionize this technology of human-computer interaction. New and more
sophisticated architectures that will support more effective natural language and
dialogue processing will need to be formulated and issues of privacy and security
must be considered in order to provide this state-of-the-art promising opportunity
of human-computer interaction outside research laboratories. The cross fertiliza-
tion of ideas and perspectives among the diverse areas of engineering, linguistics
and psychology etc. will facilitate the conduct of meaningful research across the
entire spectrum and these novel multimodal interfaces will represent a new multi-
disciplinary science that will aim at preserving the diverse languages of the world.

References

1. Bunt H, Beun RJ, Borghuis T (1998) Multimodal human–computer communication systems, techniques, and experiments. Lect. Notes Comput. Sci. 1374:39–67
2. Quek F, McNeill D, Bryll R, Duncan S, Ma XF, Kirbas C, McCullough KE, Ansari R (2002)
Multimodal human discourse: gesture and speech. ACM Trans. Comput. Human Interact.
9(3):171–193
3. Maybury MT, Wahlster W (eds) (1998) Readings in intelligent user interfaces. Morgan
Kaufmann Publishers, San Francisco
4. Larson JA, Oviatt SL, Ferro D (1999) Designing the user interface for pen and speech appli-
cations. In: Conference on Human Factors in Computing Systems, CHI ’99 Workshop.
Philadelphia, Pa
5. van Wassenhove V, Grant KW, Poeppel D (2005) Visual speech speeds up the neural process-
ing of auditory speech. Proc Natl Acad Sci USA 102(4):1181–1186
6. Bolt RA (1980) ‘‘Put-that-there’’: voice and gesture at the graphics interface. ACM Comput
Graphic 14(3):262–270
7. Oviatt SL, Cohen PR (2000) Multimodal systems that process what comes naturally.
Commun ACM 43(3):45–53
8. Rubin P, Vatikiotis-Bateson E, Benoit C (1998) Special issue on audio-visual speech process-
ing. Speech Commun 26:1–2
9. Stork DG, Hennecke ME (eds) (1995) Speechreading by humans and machines: models, sys-
tems and applications. Springer, New York
10. Benoit C, Martin J-C, Palachaud C, Schomaker L, Suhm B (2000) Audio-visual and multi-
modal speech systems. In: Gibbon D, Moore R (eds) Handbook of standards and resources
for spoken language systems. Kluwer, Norwell, pp 102–203
11. Gong Y (1995) Speech recognition in noisy environments: a survey. Speech Commun
16:261–291
12. Holzman TG (1999) Computer-human interface solutions for emergency medical care.
Interactions 6(3):13–24
13. Oviatt SL, Cohen PR, Wu L, Vergo J, Duncan L, Suhm B, Bers J, Holzman T, Winograd
T, Landay J, Larson J, Ferro D (2000) Designing the user interface for multimodal speech
and gesture applications: state-of-the-art systems and research directions. Human Comput
Interact 15(4):263–322
14. Huang X, Acero A, Chelba C, Deng L, Duchene D, Goodman J, Hon H, Jacoby D, Jiang L,
Loynd R, Mahajan M, Mau P, Meredith S, Mughal S, Neto S, Plumpe M, Wang K, Wang Y
(2000) MiPad: a next-generation PDA prototype. In: Proceedings of the international con-
ference on spoken language processing (ICSLP 2000), vol 3. Chinese Military Friendship
Publishers, Beijing, China, pp 33–36
15. Bangalore S, Hakkani-Tur D, Tur G (2006) Introduction to the special issue on spoken lan-
guage understanding in conversational systems. Speech Commun 48(3–4):233–238
16. López-Cózar R, Araki M (2005) Spoken, multilingual and multimodal dialogue systems.
Development and assessment. Wiley, West Sussex
17. Bernsen NO, Bertels A (1993) A methodology for mapping information from task domains
to interactive modalities, working papers in cognitive science and HCI. University of
Roskilde, Denmark, pp 93–10
18. Bernsen NO (1994) Modality theory in support of multimedia interface design. In:
Proceedings of the AAAI spring symposium on intelligent Multi-Media—Modal systems.
Stanford, March 1994, pp 37–44
19. Faraday P, Sutcliffe A (1993) A method for multimedia interface design, people and comput-
ers, HCI ’93, pp 173–190
20. Oviatt S, Coulston R, Lundsford R (2004) When do we interact multimodally? Cognitive
load and multimodal communication patterns. ACM international conference on multimodal
interfaces. State College, PA, pp 129–136
21. Calvert GA, Spence C, Stein BE (eds) (2004) The handbook of multisensory processing.
MIT Press, Cambridge, MA
22. Ernst M, Bulthoff H (2004) Merging the sense into a robust whole percept. Trends Cogn Sci
8(4):162–169
23. Chen F, Ruiz N, Choi E, Epps J, Khawaja A, Taib R, Yin B, Wang Y (2012) Multimodal
behaviour and interaction as indicators of cognitive load. ACM Trans Interact Intell Syst
2(4):1–36
24. Kumar S, Cohen PR (2000) Towards a fault-tolerant multi-agent system architecture. In:
Fourth international conference on autonomous agents 2000. ACM Press, Barcelona, Spain,
June 2000, pp 459–466
25. Pankanti S, Bolle RM, Jain A (eds) (2000) Biometrics: the future of identification (special
issue). Computer 33(2):46–80
26. Oviatt SL, Pothering J (1998) Interacting with animated characters: research infrastructure
and next-generation interface design. In: Proceedings of the First Workshop on Embodied
Conversational Characters, pp 159–165
27. Cole R, Hirschman L, Atlas L, Beckman M, Biermann A, Bush M, Clements M, Cohen
P, Garcia O, Hanson B, Hermansky H, Levinson S, McKeown K, Morgan N, Novick D,
Ostendorf M, Oviatt S, Price P, Silverman H, Spitz J, Waibel A, Weinstein C, Zahorian S,
Zue V (1995) The challenge of spoken language systems: Research directions for the nine-
ties. IEEE Trans Speech Audio Process 3(1):1–21
Chapter 5
Emotional Speech Recognition

Abstract Recent years have been marked by a growing need for systems that
can grasp human emotions and in particular, recognize emotions. Emotions lie at
the centre of any social communication and form the basis for an intelligent and
meaningful interaction. The chapter further discusses the acoustic correlates of
emotions and describes various techniques and developments imperative to sup-
port speech interfaces that recognize emotional expressions in real world settings.
Significant advancement in the areas of knowledge representation, infrastructure
requirements and algorithm implementation is a prerequisite for modeling effec-
tive future speech recognition systems that are more robust and dynamic in nature.

Keywords Emotion · Human computer intelligent interaction (HCII) · Pervasive · Hidden Markov models (HMM) · Automatic speech recognition (ASR) · Multimodal · Knowledge representation · Self-adaptive

Speech recognition is the process by which a computer interprets human speech through the extraction, characterization and recognition of information in the speech signal.
At the rudimentary level, speech conveys a message via words but at higher lev-
els, it suggests the emotion, gender and identity of the speaker. There are contents
of speech that carry information, e.g. the prosody of the speech indicates gram-
matical structures, and the stress of a word shows its importance. Research and
development on speech recognition techniques has been undertaken for over a few
decades now and continues to be an active area. The ultimate impact of speech
recognition depends on the degree to which it enables people to communicate
more naturally. It is not sufficient for a machine to look like human but the a­ bility
to acquire, express the emotions and learn to understand the emotions forms the
basis for human computer intelligent interaction (HCII). Several experiments
­conducted by Reeves and Nass [1] where one of the humans is taken out and put
in a computer conclude that for an intelligent interaction, the basic human-human
issues should hold.
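
As a minimal illustration of the extraction and characterization step mentioned above, the following sketch computes Mel-frequency cepstral coefficients (MFCCs), a widely used parameterization of the speech signal, from a waveform. It assumes the librosa library; the file name and parameter choices are illustrative.

```python
import librosa

# Load a (hypothetical) utterance at 16 kHz and compute 13 MFCCs per frame.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
delta = librosa.feature.delta(mfcc)                  # first-order temporal deltas
print(mfcc.shape, delta.shape)
```

Frame-level features of this kind feed the recognition stage, whether the target is the word sequence or, as discussed in this chapter, the speaker's emotional state.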


Humans interact not only through speech, but vision, gaze, body gestures and
expressions contribute critically to emphasize attributes such as emotion, attitude,
mood etc. As a consequence, information exchanges via the natural sensory modes
of sight, sound and touch are steadily being accommodated through new inter-
face technologies. Making a machine recognize emotions from speech is not a new idea, but the roles of these multiple modalities and their interplay in natural interaction are still to be scientifically understood and quantified. Command and
control, dictation, transcription of recorded speech, searching audio documents
and interactive spoken dialogues are some of the potential applications of speech
recognition systems. However, several applications exist where it is beneficial for
computers to recognize human emotions. Stress monitoring, online tutoring and
diagnosis of psychological disorders are prospective areas where a computer’s
functionality may be enhanced to be more aware of the human user’s emotional
and attentional expressions.
Psychology and engineering communities are working towards development of
automatic ways to analyze gestures, facial expressions, vocal emotions and physi-
ological signals to understand and characterize emotions as a goal towards achiev-
ing human-computer intelligent interaction. The growing evidence of the importance of emotions, together with the knowledge gained from labeling emotions into different states, is leading research towards pattern recognition approaches that use different modalities as inputs to emotion recognition models. The on-going availability of large-scale emotional speech data collections will primarily benefit emotional speech research in future, and the improvement of theoretical models of speech production [2] and of the vocal communication of emotion will gain greater attention [3].

5.1 Significant Developments in Speech Recognition

In today’s era, voice and natural language understanding are at the forefront owing
to steady progress in the technologies needed to help machines understand human
speech, including machine learning and statistical data-mining techniques. The
rapid rise in voice technology coupled with the combination of more data with
more computing power has resulted in increased proficiency in the emerging mar-
ket for speech interfaces. There is a growing need to integrate pervasiveness in the
process of speech recognition to make it more powerful and incredibly accurate. It
is believed that within a few years, speech technology will have to be architected to run on wearable computers, where the user will be able to communicate without touching the interface and the system’s response will be based on trigger words. The availability of new-generation parallel computation systems has enthused researchers to explore possible approaches to improve automatic
speech recognition. The evolution of these technologies can be mainly attributed
to the advances in very large-scale integration (VLSI) and digital signal process-
ing (DSP) technologies that have allowed complex algorithms like HMM (Hidden
Markov Models) to be performed in real time.
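
For concreteness, the core HMM computation referred to above can be sketched as the forward algorithm, which evaluates how likely an observation sequence is under a given model and is the kind of routine that DSP hardware executes in real time. The parameter values below are illustrative only.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Return P(obs | model) for a discrete-emission HMM.

    pi:  (N,)   initial state probabilities
    A:   (N, N) transition probabilities, A[i, j] = P(state j | state i)
    B:   (N, M) emission probabilities, B[i, k] = P(symbol k | state i)
    obs: sequence of observed symbol indices
    """
    alpha = pi * B[:, obs[0]]          # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate and weight by the emission
    return alpha.sum()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward(pi, A, B, [0, 1, 2]))
```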

Researchers who plan to apply speech technologies in an application need to take into consideration the acoustic environment where the application will be
used. In various environments, noise tends to be correlated with speech which
exacerbates the situation. Reverberations and correlated noise sources affect the
proper identification of speech amidst noise, and this aspect remains the most important factor requiring an immediate solution. Accounting for such scenarios
shall enhance the accuracy supported by speech recognition in various applica-
tion domains and strengthen the practicality of employing speech recognition tech-
nologies. The focus on providing useful implementations of driving applications
largely depends on real world latency requirements. Improving latency and accu-
racy shall eventually lead to improvement in recognition throughput of the system
as a whole. With the increasing adoption of parallel multicore processors, we can
investigate significant opportunities of speech recognition.
Despite important advances, one necessary ingredient for natural interaction
that has recently gained attention is emotion. Emotions play an important role
in human interaction and allow people to express themselves beyond the verbal
domain. As computers learn to recognize gestures, facial expressions, eye contact
etc., similarly speech or voice can be used to convey linguistic as well as para-
linguistic information. There are cues on the face and in the voice that reveal a
person’s emotions, attitudes and attentional states. By exploring how these cues
can be used to train a machine to recognize human emotion from audio/video, dif-
ferent emotional states can be labelled to help in training models that recognize
emotional expressions. Moreover, it must be investigated how multimodal data
can be treated in realistic scenarios for more efficient emotion recognition. During
the next few years, a large quantum of research will be focussed upon all these
aforementioned issues, steering the academia and professionals towards qualitative
studies of vocal emotions.
Combining audio and visual cues has been studied in recent years for speech
recognition [4]. According to Scherer [5], human ability to recognize emotions
from purely vocal stimuli is about 60 %. Most works have concentrated on the
analysis of human vocal emotions and study of human abilities to recognize vocal
emotions and these studies have been done largely independent of facial expres-
sion recognition. There are situations when speech signals become noisy and cues
from facial movements improve the accuracy of speech recognition. Chen et al. [6] have shown that, in order to accomplish real-time multimodal analysis, the multisensory data should be processed in a context-dependent manner and cannot be considered separately. De Silva and Ng [7] proposed a rule-based system for
classification of audiovisual data into one of the five emotion categories: happiness,
fear, anger, surprise and dislike. They constructed a system using nearest neighbour
method to classify the extracted facial features and acoustic features were classified
using HMM models. Although advances have been made to make the multimodal
analysis of human affective state tractable, there are only a few research efforts
that aim at combining nonverbal modalities into a single distinct system for affect-
sensitive analysis of human behaviour. Difficulty in obtaining authentic data cor-
responding to a particular emotional state is another problem that is holding back
the fusion of multimodal information for emotion recognition. Various other factors that influence affective data collection are a result of familial, personal or culturally learned rules. In the majority of situations the environment in which emotions are recorded is artificial or unreal, which greatly affects the spontaneity of the subject’s response. Moreover, if the subject knows the purpose of the experiment he/she will act in appropriate ways, thereby controlling the real expression. Besides these concerns there are other social and ethical issues involved [8]. As a consequence, real
emotions are largely overlooked and experimentalists and theorists are increasingly
shifting efforts towards robust context-sensitive, multimodal and adaptive analy-
sis of human nonverbal affective states to develop systems that are able to monitor
human behaviour, adapt to the current context and user and are perceptually aware
in a ubiquitous computing environment.
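
In the spirit of the nearest-neighbour classification described above [7], the sketch below trains a k-nearest-neighbour classifier on utterance-level acoustic feature vectors labelled with the five emotion categories. The feature vectors here are synthetic placeholders; a real study would extract features such as pitch and energy statistics from labelled emotional speech.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
emotions = ["happiness", "fear", "anger", "surprise", "dislike"]

# Synthetic training data: 20 utterances per emotion, 6 acoustic features each.
X = np.vstack([rng.normal(loc=i, scale=0.5, size=(20, 6)) for i in range(5)])
y = np.repeat(emotions, 20)

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Classify a new (synthetic) utterance whose features resemble the "anger" cluster.
test_utterance = rng.normal(loc=2, scale=0.5, size=(1, 6))
print(clf.predict(test_utterance))
```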

5.2 Future Directions in Speech Recognition and Understanding

Many developments in the technology of speech recognition and understanding have taken place over the past 20 years and there are a growing number of prac-
tical applications in this sector. As knowledge representation systems and algorithms become more sophisticated and the infrastructure of speech corpora expands, research and development are making the technology more capable and cost-effective, and it is advancing significantly towards challenging multidisciplinary task domains in collaboration with related human language technologies.
The report of the Speech Understanding Working Group [9] highlights a few fertile areas for future research:
Knowledge Representation: The goal of speech understanding is to extract a
symbolic description of those contents of a speech signal which are relevant for a
given application. To comprehend the facets of syntax, semantics and acoustics of
a system, the knowledge about discourse domain and dialogue strategies becomes
necessary to be realized. The speech understanding task thus requires an explicit
representation and efficient use of all this knowledge by the system. A number of
system architectures have been developed and applied to speech understanding and
recognition tasks. The technology of expert systems and the progress in knowledge
representation have influenced the development of powerful systems in the areas of
speech recognition and understanding. For a successful spoken dialogue system,
different knowledge sources must interact cooperatively and collaboration between
signal related recognition processes and symbol related understanding processes
is necessary. This is different from spoken text systems where recognition and
understanding of speech are considered two sequential, non-interacting tasks.
Therefore, how knowledge is organized, activated and focused largely depends on
the required purpose and application of the recognition and understanding system.
Current ASR systems do not draw much benefit from human speech perception,
understanding and cognition. Sufficient understanding on how the human brain
processes spoken language and adapts to non-native accents is needed for speech
recognition and understanding applications to perform and reach a level compa-
rable to humans. Current HMMs focus on the linguistic information and remove
most of the paralinguistic information from the speech signal. As discussed pre-
viously and shown by speech perception experiments, paralinguistic information
plays a crucial role in human speech perception. Morgan et al. [10] discuss a parametric and structure-based approach which overcomes the mentioned limita-
tion and exploits the knowledge and mechanisms of human speech perception and
production by taking into account the relationship between speaking rate varia-
tions and the corresponding changes in the acoustic features [9].
The communicative intent of a spoken utterance is hidden in its meaning, and a representation of meaning could provide feedback for further processing in applications such as interrogation and emotion recognition. The past few years have
seen unprecedented growth in computation and storage capabilities of systems
permitting the use of large training databases with inputs from multiple knowl-
edge sources. The resulting effects on speech recognition and understanding have
been enormous with streams of data becoming more heterogeneous and from dif-
ferent modalities. Additionally, we are coming into a period where the resources
are available to integrate different modalities and ensure heterogeneous parallel-
ism in algorithms and architectures in a much more significant way. Multi-core processors allow the incorporation of detailed models of spoken language, and the implementation of parallelism in novel computational architectures for knowledge-rich speech recognition requires further research and exploration [11].
Infrastructure: A speech signal is characterised by many parameters and thus
maintaining a large corpus becomes critical in modeling a given task to improve
performance and capture crucial information and tremendous variability in the
speech signal to be decoded. In order to make systems more powerful and to understand the nature of speech itself, well-labelled annotated speech corpora need to be created for today’s systems to evolve. Consequently, systems must be designed to be tolerant to labelling errors and, thus, standard conventions for labelling must be determined for developing future methodologies.
As the internet is becoming a major source of information exchange, the availability of a large amount of readily accessible speech data has become a possibility. YouTube and other media sharing sites are a rich source of high vol-
ume audio data which might be recorded or streamed. These resources reflect a
more spontaneous and natural form of speech than present-day systems have
typically been developed to recognize and shall increase the robustness and tran-
scription capabilities of the future systems under wide range of conditions. As the
knowledge sources have increased, a large number of high quality speech tools
to collect, label and process large quantities of data have also evolved. Both open
source and commercial web-based tools have become valuable for cost-effective
processing of data in many languages and new initiatives are being aimed towards
elicitation of huge amounts of speech corpora in different languages to make a sig-
nificant impact on the automation of speech and language itself.

Algorithms: The human speech system constantly evolves and adapts to non-
native accents and languages without explicit supervision. Current speech recog-
nition systems are fairly static models with built-in knowledge that becomes
obsolete over a period of time or in a particular real world application. They
undergo supervised training and do not learn. There is a need to incorporate self
learning into speech and language processing systems to make them learn from the
data and apply the learned knowledge for specific results. The long term goal is to
create self-adaptive speech technology [9] to cope with changing environments,
dialects, accents, non-speech sounds etc. A learned system may perform automatic
pattern discovery and generalization through machine learning and advance the
natural language processing, information retrieval and cognitive abilities of new
improved speech systems. Klein [12], Park [13] and Venkataraman [14] explain
unsupervised acquisition of speech and natural language across different cultures.
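
A hedged sketch of such automatic pattern discovery is unsupervised clustering of acoustic feature vectors, so that recurring sound classes emerge from the data without transcriptions. The data below are synthetic; real work of the kind cited above operates on features extracted from untranscribed speech.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic "frames" drawn from three hidden sound classes (13 features each).
frames = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 13)) for c in (0, 2, 4)])

# Discover three clusters without any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(frames)
labels = kmeans.predict(frames)          # discovered class index per frame
print(np.bincount(labels))               # roughly 100 frames per cluster
```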
Another important feature that needs to be considered, and that significantly affects the speech signal, is the context in which the speech is captured and communicated. Speaker characteristics, speaking style, acoustic environment, channel
of transmission and language characteristics [9] are some of the factors that bring
in variability in the speech signal and controlling such factors presents a significant
challenge to speech community. Background noise is said to be the dominant cause
of harmful variability that degrades the system performance. Various filtering tech-
niques are currently applied to remove noise and distortions [15]. Also, speaking
style and dialect varies for each individual speaker and current ASR systems adapt
to these variations to a certain extent by including a large database of speakers in
the training phase. This approach is very data intensive and impractical for a large
real-time application. Modeling and exploiting the speaking rate of the signal during the recognition process may be seen as a promising mechanism and a solution to this problem, making ASR acoustic models more robust and effective.
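
One concrete example of the filtering techniques cited above [15] is cepstral mean normalization, which removes a stationary channel offset by subtracting the per-utterance mean of each cepstral coefficient. The sketch below uses synthetic features for illustration.

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """cepstra: (n_frames, n_coeffs) array of cepstral feature vectors."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

frames = np.random.randn(200, 13) + 3.0   # toy features with a channel offset
normalized = cepstral_mean_normalize(frames)
print(normalized.mean(axis=0).round(6))   # approximately 0 for every coefficient
```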
To implement these grand challenge tasks mentioned above, it is crucial to
carry out promising research and development in focused directions and prosper-
ous areas to enable this technology of speech recognition and understanding to
become progressively more capable and transform a number of high-utility appli-
cations to reality.

References

1. Reeves B, Nass C (1996) The media equation: how people treat computers, television and
new media like real people and places. Cambridge University Press, Cambridge
2. Flanagan JL (1972) Speech analysis, synthesis, and perception, 2nd edn. Springer, New York
3. Scherer KR (2003) Vocal communication of emotion: a review of research paradigms.
Speech Commun 40:227–256
4. Potamianos G, Neti C, Gravier G, Garg A, Senior A (2003) Recent advances in the automatic
recognition of audiovisual speech. Proc IEEE 91(9):1306–1326
5. Scherer K (1996) Adding the affective dimension: a new look in speech analysis and synthe-
sis. In: Proceeding of international conference on spoken language processing (ICSLP 1996),
pp 1808–1811
6. Chen LS, Tao H, Huang TS, Miyasato T, Nakatsu R (1998) Emotion recognition from audio-
visual information. In Proceedings of IEEE workshop on multimedia signal processing, Los
Angeles, CA, pp 83–88, 7–9 Dec 1998
7. De Silva L, Ng P (2000) Bimodal emotion recognition. In: Proceedings of automatic face and
gesture recognition, 2000, pp 332–335
8. Shneiderman B (1993) Human values and the future of technology: a declaration of respon-
sibility. In: Shneiderman B (ed) Sparks of innovation in human-computer interaction, Ablex
Publ, 1(1), Jan 1994, pp 67–71 (ACM Interactions)
9. Baker J, Deng L, Glass J, Khudanpur S, Lee C, Morgan N, O’Shaughnessy D (2009)
Developments and directions in speech recognition and understanding, Part 1 [DSP
Education]. IEEE Signal Process Mag 26(3):75–80
10. Morgan N, Zhu Q, Stolcke A, Sonmez K, Sivadas S, Shinozaki T, Ostendorf M, Jain P,
Hermansky H, Ellis D, Doddington G, Chen B, Cetin O, Bourlard H, Athineos M (2005)
Pushing the envelope—aside. IEEE Signal Process Mag 22(5):81–88
11. Olukotun K (2006) A conversation with John Hennessy and David Patterson. ACM Queue
Mag 4(10):14–22
12. Klein D (2005) The unsupervised learning of natural language structure. PhD thesis, Stanford
University
13. Park A (2006) Unsupervised pattern discovery in speech: applications to word acquisition and
speaker segmentation. PhD thesis, MIT
14. Venkataraman A (2001) A statistical model for word discovery in transcribed speech.
Comput Linguist 27(3):352–372
15. Rosenberg AE, Lee CH, Soong FK (1994) Cepstral channel normalization techniques for
HMM-based speaker verification. In: Proceedings of the IEEE international conference on
acoustics, speech and signal processing, 1994, pp 1835–1838
Chapter 6
Where Speech Recognition Is Going:
Conclusion and Future Scope

Abstract  Today, voice and natural language processing are at the forefront of any human-machine interaction environment. The chapter emphasizes the tremendous progress that has taken place in machine learning, statistical data-mining and pattern recognition approaches, which can help make speech interfaces more versatile and pervasive. It also examines the impediments that may stand in the way of successfully implementing acoustically robust natural interfaces. Finally, the chapter outlines the technical advances and research efforts needed for high-performance real-time speech recognition that will fundamentally change the way humans interact with their computing devices.

Keywords  Machine learning · Big data · Hidden Markov model (HMM) · Automatic speech recognition (ASR) · Extralinguistic · Paralinguistic · Phonetics · Speech corpora · Linear discriminant analysis (LDA) · Mel-frequency cepstral coefficients (MFCC)

Since the invention of computers and the emergence of technologies that enable more natural ways of interacting with them, speech recognition has come a long way: it has become far more accessible to people and plays a substantial role in this technological evolution. Since 1990, performance has improved dramatically and has reached a level where a largely ubiquitous user interface exists with little or no additional hardware. During the 1990s and 2000s, state-of-the-art speech recognition systems used evolved HMM variants, perceptually motivated cepstral or linear predictive coding feature vectors, and sophisticated pattern-matching and scoring algorithms.
Steady progress in supporting technologies, including machine learning and statistical data-mining techniques, has provided the much-needed thrust to help machines understand human speech. Devices are becoming smarter and more aware, making it possible to blend powerful symbolic processing, machine learning that takes advantage of big data, and knowledge representation to populate design frameworks with observed patterns and instances.
Speech recognition has come a long way since the introduction of the sound spectrograph in 1947. Since then, it has progressed steadily from isolated word recognition with small vocabularies to large-vocabulary continuous speech recognition [1]. Approaches have spanned human aural and spectrogram comparisons, simple template matching, dynamic time warping, and more modern statistical pattern recognition approaches such as neural networks and Hidden Markov Models (HMMs). Over the past decade, automatic speech recognition (ASR) technology has advanced to the point where a number of commercial applications are successfully deployed. There is now a need for speech recognition that handles a large number of alternatives and a dictionary large enough to cover all the words of a given language. The importance of information retrieval for speech recognition must be underlined, and more sophisticated models that can exploit new features must be applied in future work.
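For readers unfamiliar with the template-matching approaches mentioned above, the following is a minimal dynamic time warping sketch that aligns two feature sequences of different lengths; the Euclidean local cost and the quadratic-time formulation are simplifying assumptions rather than a production implementation.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between feature sequences
    x of shape (n, d) and y of shape (m, d)."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(x[i - 1] - y[j - 1])  # Euclidean frame cost
            cost[i, j] = local + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
    return cost[n, m]
```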
ASR systems today still lag behind human speech perception and remain quite sensitive to variations in speech. Although the technology has seen significant developments in applications like dictation, speech commands and speaker localization, its performance has yet to reach a sufficiently natural and convincing level to become a completely acoustically robust and widespread interface.

6.1 Obstacles in the Implementation and Acceptance of ASR

One of the major obstacles in the way of ASR is the lack of an unambiguous boundary between paralinguistic and prosodic information. Today's systems are unable to effectively process prosodic cues and infer specific speaker qualities such as age, emotion and attitude. Spontaneous human speech encodes linguistic as well as interpersonal streams of information [2]: the former carries the verbal meaning of the message, while the latter enriches the speech with paralinguistic cues. Nonverbal sounds, or nonverbal vocalisations as they are technically called, are one of the characteristics of spontaneous speech that distinguish it from written text; when modeled, they can improve speech recognition accuracy. They provide valuable paralinguistic cues and occur frequently in spontaneous speech.
Researchers have divided speech information into various categories over the years. Laver [3] suggests that paralinguistic signals are used to convey affective information through voice tone, while extralinguistic refers to voice qualities that identify the individual speaker. Roach et al. [4] define paralinguistic features as those that are intentional and non-linguistic features as unintentional; they place prosodic features, which unambiguously signal linguistic information, at one end and vocal features independent of pitch, such as voice quality, at the paralinguistic end, and they further divide prosodic features into tempo, pitch range, rhythm, pause and intonation. According to Crystal [5], on the other hand, prosodic features are characterised by variations in pitch, loudness, duration and silence. Carlson [6] uses the term extralinguistic for inhalation, exhalation and hesitation, and also refers to attitudes and emotions as extralinguistic [7]. As these classifications show, prosody can signal both linguistic and paralinguistic information. Broadly, prosodic information in speech may be divided into linguistic information, which includes verbal content and lexical stress, and paralinguistic information, comprising attitude, intention and emotional state. Thus, paralinguistic phonetics may be regarded as a subset of prosody. Fig. 6.1 gives a broad distinction between prosody and paralinguistic phonetics.

Fig. 6.1  Elements of prosody and paralinguistics: linguistic information (message, discourse structure) versus paralinguistic information (expressive, physiological)
Further research has been devoted to nonverbal sounds such as filled pauses and silence durations and to linguistic disfluencies such as repetitions and fillers, confirming that they have a systematic, non-random nature and that modeling them as regular words improves recognition performance [8]. Results with filled pauses illustrate their role as linguistic boundaries when modeling more natural human speech [9, 10].
Prylipko et al. [11] investigated the potential of a wide range of nonverbals for language modeling of conversational speech and conclude that nonverbal tokens lower the overall perplexity of the full test data, while including them in the model as regular words increases the perplexity of verbal tokens. Modeling breath as a regular language-model event also leads to a substantial improvement in both perplexity and speech recognition accuracy. Filled pauses have been shown to play a crucial role as markers of prosodic and linguistic segment boundaries and to be better predictors of the following words; ignoring or omitting them from the context makes local perplexity worse. These tokens can significantly enrich transcriptions with paralinguistic information, which may further enhance natural speech processing and understanding.
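To make the perplexity comparisons above concrete, the toy sketch below computes the perplexity of a smoothed unigram model over a token sequence; the token sets, the add-one smoothing and the unigram simplification are illustrative assumptions and do not reproduce the experiments in [11].

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of an add-alpha smoothed unigram model on test_tokens."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = sum(counts.values())

    def prob(tok):
        return (counts[tok] + alpha) / (total + alpha * len(vocab))

    avg_log_prob = sum(math.log(prob(t)) for t in test_tokens) / len(test_tokens)
    return math.exp(-avg_log_prob)

# Hypothetical transcriptions with and without nonverbal tokens such as <uh>.
with_nonverbals = "well <uh> the <breath> results look good".split()
without_nonverbals = [t for t in with_nonverbals if not t.startswith("<")]
print(unigram_perplexity(with_nonverbals, with_nonverbals))
print(unigram_perplexity(without_nonverbals, without_nonverbals))
```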

6.2 Role of Paralanguage in ASR

Nonspeech paralinguistic features such as voice timbre and average speaking rate may be sufficient for recognizing voices, and in the case of unfamiliar phonology, increased exposure to these features greatly improves voice recognition performance. This has been demonstrated by Zarate, Tian, Woods and Poeppel [12], who investigated the contributions of acoustic, phonological, lexical and semantic information to voice recognition.
Paralinguistic features appear in more than one dimension, and hence it is often difficult to measure them and to isolate them acoustically from other features. Prosodic variations are processed at all levels: syntactic, lexical, semantic and pragmatic. Moreover, there appear to be both inter- and intra-speaker differences in paralinguistic distinctions such as age and emotion, which make automatic speech recognition difficult. An ideal speech recognition or text-to-speech system should be able to handle any paralinguistic variation, but due to the lack of adequate paralinguistic models, we are still unable to produce natural-sounding speech. Insufficient knowledge and understanding of human hearing and insufficient research attempts to handle paralinguistic features are other possible reasons for the limited success with paralinguistic phonetic variation [13]. In the case of automatic speaker recognition, high inter- and intra-speaker variability poses difficulties for the automatic extraction and modeling of prosodic and paralinguistic (suprasegmental) features. The normalisation and adjustment techniques presently in use are unreliable for handling paralinguistic variation, as they remove important speaker-specific features and are unable to differentiate between dissimilar speakers [14]. Similar problems exist for text-to-speech conversion systems; the relationship between paralinguistic and prosodic phonetic variation needs to be formalized, and techniques for predicting intonation, duration and spectral qualities from abstract patterns need to be developed. As an important prerequisite, a standard multidisciplinary categorization system for describing paralinguistic and prosodic features must be proposed to ensure an unambiguous and exhaustive transcription system and to overcome the problems listed above. A better understanding of human listeners' perceptual behaviour in terms of acoustic spectral and waveform features is equally important for constructing efficient theoretical and computational models of paralinguistic features [15]. As speech corpora expand, new corpora must be manually labelled for emotion and annotated paralinguistically [4, 16, 17]. More research is needed to develop a multidisciplinary approach embracing linguistics, phonetics and engineering. The growth of synthesized speech has reduced the gap between phonetics and speech technology, and this gap may shrink further if some of the solutions suggested above are implemented to control paralinguistic features in future speech recognition systems, yielding highly reliable and accurate methods for tapping and retrieving all kinds of paralinguistic information.

6.3 Technical Advances in Speech Recognition

Research and development in speech recognition has been underway for several decades and remains an active area of research. Corpora have evolved from small, private collections to publicly available large corpora recorded under more realistic conditions. Such extensive corpora have allowed the field to mature to the point that applications controlling access to information, using speaker verification as a powerful biometric, are making commercial headway, while benchmark evaluations and research on realistic data shift the research and development effort toward unconstrained situations. Obtaining speech from a wide variety of channels and acoustic environments allows real-world robustness to be integrated and new, improved compensation techniques to be developed.
The emergence of variable noise conditions, channels, text-independent speech and multi-speaker speech indexing emphasizes the need to understand and study unconstrained situations and tasks. Current systems use low-level spectral features, which are susceptible to channel effects and other noise; systems being developed today must also exploit high-level features, such as prosodic measures and idiolect, that offer improved accuracy and robustness.
These conditions direct the speech research community to advance current techniques toward the desired level of sophistication. The most significant paradigm shift required is the adoption of statistical models trained discriminatively, for instance with maximum mutual information or minimum-error criteria [18, 19], whereas basic HMM acoustic models focus on a likelihood matching criterion. Over the past decade, incremental advances in HMM technology have reached the point where segmental models [20–24] and structured speech and language models [24–26] are being employed commercially. Linear Discriminant Analysis (LDA) and Heteroscedastic LDA (HLDA) for feature extraction, the use of determinization and minimization in decoding-graph compilation, and discriminative training are some of the major recent advances in this area, producing impressive results for many speakers under different conditions. HLDA [27] and neural-net-based features [28] allow multiple types of feature transformations that can be applied both in parallel and sequentially.
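For reference, one common way of writing the maximum mutual information (MMI) training criterion mentioned above is the following; the notation is a standard textbook form rather than the specific formulation used in [18, 19]:

$$
\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{r} \log \frac{p_{\lambda}(O_r \mid W_r)\,P(W_r)}{\sum_{W} p_{\lambda}(O_r \mid W)\,P(W)}
$$

where $O_r$ is the $r$-th training utterance, $W_r$ its reference transcription, $p_{\lambda}$ the acoustic model with parameters $\lambda$, $P(W)$ the language-model probability, and the denominator sums over competing hypotheses. Maximizing this objective sharpens the model's discrimination between the correct transcription and its competitors.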
To cope with a wide range of variable conditions for channel, noise, speaker, vocabulary, accent and recognition context, speaker adaptation models that are more closely tuned to the individual and the environment have been intensively studied and remain the focus of a significant amount of research. They enable rapid application integration and are key to the successful commercial deployment of speech recognition technology. Maximum a posteriori (MAP) estimation [29] is the simplest acoustic adaptation technique in this area, and Maximum Likelihood Linear Regression (MLLR) [30], which adjusts the Gaussians and feature vectors so as to increase the data likelihood, is another popular adaptation technique.
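The sketch below illustrates the flavour of MAP adaptation in its simplest form, interpolating a speaker-independent Gaussian mean with statistics from a new speaker's adaptation data; the single-Gaussian view, the hard frame assignment and the value of the prior weight tau are simplifying assumptions, not the full MAP or MLLR procedures of [29, 30].

```python
import numpy as np

def map_adapt_mean(prior_mean, adaptation_frames, tau=10.0):
    """MAP-style update of one Gaussian mean.

    prior_mean        : (d,) mean from the speaker-independent model
    adaptation_frames : (n, d) frames assigned to this Gaussian for the new speaker
    tau               : prior weight; larger values trust the prior model more
    """
    n = len(adaptation_frames)
    if n == 0:
        return prior_mean  # no data: fall back to the speaker-independent mean
    data_mean = adaptation_frames.mean(axis=0)
    return (tau * prior_mean + n * data_mean) / (tau + n)
```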
Consequently, to develop and evaluate complex real-time algorithms and to process the large public speech corpora, computational infrastructure that supports these statistical models and algorithm development needs to be built up to establish ever-increasing capabilities. To extract information intelligently from this huge body of speech, efficient knowledge representation techniques that allow multiple sources of knowledge to be incorporated into a common probabilistic framework for parallel searching are another area to be explored. Mel-Frequency Cepstral Coefficients (MFCC) and Cepstral Mean Subtraction (CMS) have until recently been the common techniques for developing perceptually motivated speech signal representations.
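As a small illustration of these representations, the sketch below extracts MFCCs and applies cepstral mean subtraction; the use of librosa, the file name and the parameter choices are assumptions made for the example.

```python
import librosa  # assumed available; any MFCC implementation would do

# Hypothetical input file, resampled to a typical ASR rate.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 Mel-frequency cepstral coefficients per frame, shape (13, num_frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Cepstral mean subtraction: remove the per-coefficient mean over time,
# which cancels stationary convolutional channel effects.
mfcc_cms = mfcc - mfcc.mean(axis=1, keepdims=True)
```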
The techniques and models discussed above for high-performance, large-scale continuous speech recognition will lose importance if rigorous benchmark evaluations and standards for these procedures and practices are not adopted. A national agency that establishes such standards is critical to developing increasingly powerful and capable systems. Speech recognition has immense potential to establish important associations between humans and machines in the near future. Appropriate units of representation for speech, language and dialogue must be investigated to transform the way we tackle situations. Acoustic robustness, handling exhaustive word libraries and lexicons, and efficient methods of representation are some of the key research challenges for the future. Deploying systems with speech interfaces, even in controlled environments, will also lead to greater innovations in continuous speech recognition. Live speech-to-text and speech-to-speech translation should be the next step toward real-time recognition results. New possibilities for expanding the speech recognition feature set and prosodic models at the acoustic level must be explored to produce natural speech. Interest in various novel applications of speech technology is growing and is leading to the design of efficient dialogue generation, and artificial intelligence techniques can be used to emulate emotion and empathy. Although a great challenge lies ahead in terms of computational power and software sophistication, researchers are confident that speech recognition and understanding will eventually show the way to true artificial intelligence.

References

1. Zweig G, Picheny M (2004) Advances in large vocabulary continuous speech recognition.
Adv Comput 60:249–291
2. Campbell N (2007) On the use of nonverbal speech sounds in human communication. In:
Campbell N (ed) Verbal and nonverbal communication behaviours LNAI, vol 4775. Springer,
New York, pp 117–128
3. Laver J (1980) The phonetic description of voice quality. Cambridge University Press,
Cambridge
4. Roach P, Stibbard R, Osborne J, Arnfield S, Setter J (1998) Transcription of prosodic and
paralinguistic features of emotional speech. J Int Phonetic Assoc 28(1–2):83–94
5. Crystal D (1969) Prosodic systems and intonation in English. Cambridge University Press,
Cambridge
6. Carlson R (2002) Dialogue system. Slide presentation, speech technology, GSLT, Göteborg,
23 Oct 2002. http://www.speech.kth.se/~rolf/gslt/GSLT021023_dialogue.pdf. Accessed 17
August 2015
7. Carlson R, Granström B (1997) Speech synthesis. In: Hardcastle WJ, Laver J (eds) The hand-
book of phonetic sciences. Blackwell Publishers Ltd, Oxford, pp 768–788
8. Schultz T, Rogina I (1995) Acoustic and language modeling of human and nonhuman noises
for human-to-human spontaneous speech recognition. In: Proceedings of ICASSP, IEEE, vol
1, Detroit, pp 293–296
9. Siu M, Ostendorf M (1996) Modeling disfluencies in conversational speech. In: Proceedings
of the 4th international conference on spoken language processing (ICSLP-96), vol I,
Atlanta, pp 386–389
10. Siu MH, Ostendorf M (2000) Variable N-grams and extensions for conversational speech lan-
guage modeling. IEEE Trans Speech Audio Process 8(1):63–75
11. Prylipko D, Vlasenko B, Stolcke A, Wendemuth A (2012) Language modeling of nonverbal
vocalizations in spontaneous speech. In: Proceedings of 15th international conference on
text, speech and dialogue, 2012. LNCS 7499. Springer, Heidelberg, pp 4625–4628
12. Zarate JM, Tian X, Woods KJ, Poeppel D (2015) Multiple levels of linguistic and paralinguis-
tic features contribute to voice recognition. Sci Rep 5:11475
13. Schötz S (2002) Linguistic & paralinguistic phonetic variation in speaker recognition & text-
to-speech synthesis. GSLT papers: speech technology 1
14. Furui S (1997) Recent advances in speaker recognition. Pattern Recogn Lett 18(9):859–872
15. Klatt D (1987) Review of text-to-speech conversion for English. J Acoust Soc Am
82:737–783
16. Roach P (2000) The emotion in speech project. In: Proceedings of the ISCA workshop on
speech and emotion. Newcastle, Northern Ireland, Sept 2000, pp 53–59
17. Gustafson-Capková S (2001) Emotions in speech: tagset and acoustic correlates. Term
paper in speech technology 1, Swedish National Graduate School of Language Technology
(GSLT), Stockholm University, Department of Linguistics
18. Bahl L, Brown P, de Souza P, Mercer R (1986) Maximum mutual information estimation of
hidden Markov model parameters for speech recognition. In: Proceedings of the IEEE inter-
national conference on acoustics, speech, and signal processing, Tokyo, Japan, pp 49–52
19. He X, Deng L, Wu C (2008) Discriminative learning in sequential pattern recognition. IEEE
Signal Process Mag 25(5):14–36
20. Deng L (1993) A stochastic model of speech incorporating hierarchical nonstationarity. IEEE
Trans Speech Audio Process 1(4):471–475
21. Deng L, Aksmanovic M, Sun D, Wu J (1994) Speech recognition using hidden Markov mod-
els with polynomial regression functions as nonstationary states. IEEE Trans Speech Audio
Process 2:507–520
22. Poritz A (1998) Hidden Markov models: a guided tour. In: Proceedings of the international
conference on acoustics, speech, and signal processing, vol 1, Seattle, WA, pp 1–4
23. Glass J (2003) A probabilistic framework for segment-based speech recognition. In: Russell
M, Bilmes J (eds) New computational paradigms for acoustic modeling in speech recogni-
tion, computer, speech and language (special issue), vol 17(2–3), pp 137–152
24. Deng L, Yu D, Acero A (2006) Structured speech modeling. IEEE Trans Audio, Speech Lang
Process (special issue on Rich Transcription) 14(5):1492–1504
25. Chelba C, Jelinek F (2000) Structured language modeling. Comput Speech Lang 14:283–332
26. Wang Y, Mahajan M, Huang X (2000) A unified context-free grammar and n-gram model
for spoken language processing. In: Proceedings of the international conference on acoustics,
speech, and signal processing, Istanbul, Turkey, vol 3, pp 1639–1642
27. Kumar N, Andreou A (1998) Heteroscedastic analysis and reduced rank HMMs for improved
speech recognition. Speech Commun 26:283–297
28. Morgan N, Zhu Q, Stolcke A, Sonmez K, Sivadas S, Shinozaki T, Ostendorf M, Jain P,
Hermansky H, Ellis D, Doddington G, Chen B, Cetin O, Bourlard H, Athineos M (2005)
Pushing the envelope—aside. IEEE Signal Process Mag 22(5):81–88
29. Gauvain J-L, Lee C-H (1997) Maximum a posteriori estimation for multivariate Gaussian
mixture observations of Markov chains. IEEE Trans Speech Audio Process 7:711–720
30. Leggetter C, Woodland P (1995) Maximum likelihood linear regression for speaker adapta-
tion of continuous density hidden Markov models. Comput Speech Lang 9:171–185
Index

A
Acceptance of ASR, 44
Auditory cues, 3
Automatic speech recognition (ASR), 28, 36, 38, 40, 44

B
Benevolence, 13
Big data, 43
Body language, 20

C
Communication, 17–19
Coverbal, 19, 20
Coverbal behaviours, 19

E
Emotion, 36–39
Emotional markers, 11
Emotional speech recognition, 35
Extralinguistic, 44
Extroversion, 13

G
Gesture, 20, 21

H
Hidden Markov model (HMM), 36, 37, 39, 43, 44, 47
Human behaviour, 17
Human computer intelligent interaction (HCII), 35
Human-computer interaction (HCI), 29, 31

I
Implementation, 44
Interpersonal, 18, 21

K
Kinesics, 2
Knowledge representation, 38

L
Linear discriminant analysis (LDA), 47
Linguistic, 2, 3, 5, 18–21

M
Machine learning, 43
Major auditory attribute, 9
Mel-frequency cepstral coefficients (MFCC), 48
Modality theory, 29
Multimodal, 37
Multimodal interfaces, 26
Multimodality, 25–27, 30

N
Neuroticism, 13
Nonverbal, 1, 2, 20–22

O
Obstacles, 44

P
Paralanguage, 2–4
Paralanguage in ASR, 45
Paralinguistic, 44–46
Personality markers, 12
Pervasive, 36
Phonemes, 9
Phonetics, 45, 46
Phonological, 18
Physiology, 13
Pitch, 9–11, 13, 14
Pitch range, 13
Potential benefits, 26
Prosody, 3, 5, 9–11
Proxemics, 2

S
Segmental, 3
Self-adaptive, 40
Semantic, 18
Significant developments, 36
Speech corpora, 46, 47
Speech markers, 11
Speech recognition, 36, 38, 43, 46
Spoken dialogue systems, 25, 27, 28
Suprasegmental, 3
Syntactic, 18

T
Technical advances, 46

U
Understanding, 20, 38

V
Vocalizations, 4, 5