Privacy against Real-Time Speech Emotion Detection via Acoustic Adversarial Evasion of Machine Learning

BRIAN TESTA, YI XIAO, AVERY GUMP, and ASIF SALEKIN, Syracuse University, USA
Emotional Surveillance is an emerging area with wide-reaching privacy concerns. These concerns are exacerbated by
ubiquitous IoT devices with multiple sensors that can support these surveillance use cases. The work presented here considers
one such use case: the use of a speech emotion recognition (SER) classifier tied to a smart speaker. This work demonstrates
the ability to evade black-box SER classifiers tied to a smart speaker without compromising the utility of the smart speaker.
This privacy concern is considered through the lens of adversarial evasion of machine learning. Our solution, Defeating
Acoustic Recognition of Emotion via Genetic Programming (DARE-GP), uses genetic programming to generate non-invasive
additive audio perturbations (AAPs). By constraining the evolution of these AAPs, transcription accuracy can be protected
while simultaneously degrading SER classifier performance. The additive nature of these AAPs, along with an approach
that generates these AAPs for a fixed set of users in an utterance- and user-location-independent manner, supports real-time,
real-world evasion of SER classifiers. DARE-GP's use of spectral features, which underlie the emotional content of speech,
allows the transferability of AAPs to previously unseen black-box SER classifiers. Further, DARE-GP outperforms state-of-
the-art SER evasion techniques and is robust against defenses employed by a knowledgeable adversary. The evaluations in
this work culminate with acoustic evaluations against two off-the-shelf commercial smart speakers, where a single AAP
could evade a black box classifier over 70% of the time. The final evaluation deployed AAP playback on a small-form-factor
system (Raspberry Pi) integrated with a wake-word system to evaluate the efficacy of a real-world, real-time deployment
where DARE-GP is automatically invoked with the smart speaker’s wake word.
CCS Concepts: • Computer systems organization → Embedded systems; Redundancy; Robotics; • Networks → Network
reliability.
ACM Reference Format:
Brian Testa, Yi Xiao, Avery Gump, and Asif Salekin. 2022. Privacy against Real-Time Speech Emotion Detection via Acoustic
Adversarial Evasion of Machine Learning. 1, 1 (November 2022), 27 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Emotional surveillance, also known as affective computing, is defined as "computing that relates to, arises
from, or influences emotions" [50]. Companies like MARS [5] and Kellogg [4] are using emotion detection software
to develop targeted and customized ad-campaigns. Meanwhile, government actors such as China [56] and England
[43] are experimenting with emotion detection software as part of law enforcement activities. In the United States,
concerns regarding the potential abuses of affective computing [19] and the invasive nature of smart devices
tied to emotion recognition classifiers, like the Amazon Halo Band [32], have escalated as high as the United
States Senate. On the world stage, the United Nations Human Rights Council has weighed in [46], acknowledging
"...emotion recognition technologies as an emergent priority in the global right to privacy in the digital age". This
global right to privacy is in tension with our ever-expanding reliance upon smart devices that collect our personal information. Smart speakers, in particular, have access to a wide range of information, including recordings of our speech.
Authors’ address: Brian Testa, bptesta@syr.edu; Yi Xiao, yxiao54@syr.edu; Avery Gump, atgump@syr.edu; Asif Salekin, asalekin@syr.edu,
Syracuse University, 223 Link Hall, Syracuse, New York, USA, 13244.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
permissions@acm.org.
© 2022 Association for Computing Machinery.
XXXX-XXXX/2022/11-ART $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn

In generic human conversations, emotional information can be leaked via an utterance’s transcript [7]. However,
generic smart device commands do not explicitly contain emotional content. To validate this assumption, we
collected over 100 common Alexa commands and submitted their transcripts to Analyze Sentiment in Google
Cloud [24]. Over 90% of these commands were scored with an emotional magnitude in [−0.4, 0.4], which indicates
a weak positive or negative sentiment. Further, among all phrases in the datasets for this work (Section 6), all
but 1% were scored in the same range, demonstrating similar characteristics. While the
text of the commands does not explicitly contain emotional content, there are multiple contemporary works
demonstrating the efficacy of Speech Emotion Recognition (SER) classifiers on utterances of emotionally neutral
phrases [25, 35, 59] that are comparable to smart speaker commands.
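As an illustration, the sentiment check described above could be scripted with the Google Cloud Natural Language Python client along the following lines; the command list shown is a placeholder rather than the actual set of collected commands.

    # Sketch: scoring command transcripts with Google Cloud Natural Language
    # (the "Analyze Sentiment" service cited above). Assumes the google-cloud-language
    # package is installed and credentials are configured; the command list is a
    # placeholder, not the set collected for this work.
    from google.cloud import language_v1

    commands = ["what time is it", "turn off the kitchen lights"]
    client = language_v1.LanguageServiceClient()

    for text in commands:
        document = language_v1.Document(
            content=text, type_=language_v1.Document.Type.PLAIN_TEXT
        )
        sentiment = client.analyze_sentiment(
            request={"document": document}
        ).document_sentiment
        # Values near 0 indicate weak positive/negative sentiment.
        print(f"{text!r}: score={sentiment.score:.2f}, magnitude={sentiment.magnitude:.2f}")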
The presented study aims to address the question: Can we exploit the utility of smart speakers without com-
promising a user’s emotional information? To answer this question, there are several important constraints to
consider:
(1) Black Box Classifiers: It is unlikely that a smart speaker manufacturer would expose details regarding
any attached SER system. This means that any solution will need to work against a black box classifier.
(2) Maintain Audio Utility: The smart speaker still needs to “understand” the user; speech-to-text transcrip-
tion still needs to work in order to maintain the utility of the device.
(3) Real-Time: Smart speakers are typically used in an ad hoc manner without pre-planning for a given
request. This means that a solution needs to work in real-time.
(4) Noninvasive: The solution to this problem should not impose an undue burden upon the users. This
means that the solution should not require significant infrastructure, environment assumptions, or size,
weight, and power (SWaP) [58].
To Defeat a Black Box Classifier, a solution cannot rely upon knowledge of the specific SER classifier, its
gradients, or even the features that it used. Weninger et al. [68] observed a strong correlation between spectral
features of speech and the perceived emotion. This finding, consistent with previous work by Bălan et al. [9],
supports the hypothesis that spectral-based audio perturbations can affect the apparent emotional context of an
utterance without requiring details of a specific black-box SER classifier, and Gao et al. [22] demonstrated that
this could be exploited to defeat SER classifiers using spectral envelope attacks.
Exploiting the link between spectral features and detected emotion requires a mechanism that can be constrained
to maintain the audio’s utility. Specifically, the smart speaker still needs to transcribe the audio. One effective
mechanism to solve this sort of constrained optimization problem is genetic programming [33].
This paper presents an approach called Defeating Acoustic Recognition of Emotion via Genetic Pro-
gramming (DARE-GP). Using genetic programming to generate spectral perturbations constrained by the
need to preserve transcription accuracy, this approach is able to defeat previously-unseen black box models while
maintaining audio utility.
There have been multiple works that have looked at the defeat of non-SER audio classifiers [3, 27, 31, 38], but
these approaches did not work in real-time. These approaches developed bespoke perturbations for each audio
sample that generated evasive variants to fool the target classifier. Even when authors considered real-world
playback of evasive samples [38], the approaches required recording an audio sample outside of the range of the
smart device, calculating the perturbation, generating the evasive sample, and then replaying it within range
of the smart device. DARE-GP instead generates a universal adversarial audio perturbation [2]: a single, robust,
utterance-independent additive acoustic signal composed of spectral perturbations. Development of
this 'additive audio perturbation' (AAP) front-loads all significant processing and allows DARE-GP to work
in real-time. These AAPs are the foundation of the presented approach. Evaluations in Section 7.5 demonstrate
that AAPs are effective in real-world settings and are non-invasive with loudness comparable to the background
noise of typical homes. In addition, this approach also allows for a very simple, SWaP-efficient deployment; a
DARE-GP device only needs to detect a wake word and play the pre-computed AAP.
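The following minimal sketch illustrates how simple such a deployment loop can be; detect_wake_word() is a hypothetical stand-in for a wake-word engine, and the soundfile/sounddevice playback stack is an assumption rather than the implementation used in this work.

    # Sketch of a minimal DARE-GP device loop: wait for the wake word, then play
    # the pre-computed AAP so it mixes acoustically with the user's command.
    # detect_wake_word() is a hypothetical stand-in for any wake-word engine.
    import soundfile as sf
    import sounddevice as sd

    def detect_wake_word() -> bool:
        """Hypothetical blocking helper that returns True when the wake word is heard."""
        raise NotImplementedError

    aap, sample_rate = sf.read("precomputed_aap.wav")  # AAP generated offline

    while True:
        if detect_wake_word():
            sd.play(aap, sample_rate)  # play the AAP alongside the user's speech
            sd.wait()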
Figure 1 shows an employment scenario where a DARE-GP device is used in tandem with a smart speaker
to play a pre-generated AAP in real-time. When the user says the smart speaker’s wake word (e.g. - "Alexa"),
this triggers both the smart speaker and the DARE-GP device. The smart speaker in this scenario hears both the
user's speech and the AAP played by the DARE-GP device. When acoustically mixed with the user's speech, the audio perturbation from the DARE-GP device masks the emotional spectral content while preserving the transcription-relevant content, thereby fooling the SER classifier.

Fig. 1. DARE-GP employment scenario. An adjacent DARE-GP device can be used to prevent reliable inference by a smart
speaker’s back-end SER classifier.

The challenges associated with this work are:


• It is unlikely that any smart speaker provider would provide access to back-end SER classifiers. To address
this, DARE-GP is developed using a surrogate classifier. This requires evaluations to ensure that attacks
against the surrogate model transfer to other, previously unseen SER models.
• Developing an attack to thwart an audio classifier would require iterative testing in an acoustic environment;
acoustic playback and recording are prohibitively slow and must be done serially. To address this, DARE-GP
utilizes digital mixing to measure progress toward a successful attack. This requires evaluations to ensure
that digitally-derived attacks work when played acoustically.
• A smart speaker may not provide access to recordings from the embedded microphone array or the internal
speech-to-text implementation. DARE-GP uses an open-source speech-to-text implementation, and all
AAPs were developed using an off-the-shelf microphone. This requires evaluations to ensure that generated
AAPs would be effective against commercial smart speakers.
• Users do not use smart speakers in pristine, static environments; the environment can have unanticipated
background noise, and the user is not always in the same position when commanding the smart speaker.
AAPs need to be robust against environmental variance and changes in the user’s distance and direction to
the smart speaker.

To summarize, the contributions of this work are:


• DARE-GP is a first-of-its-kind work to deceive black-box SER classifiers while preserving the transcription-
relevant content of the speech; evaluations against state-of-the-art SER evasion techniques have been
performed.
• By targeting the spectral features underlying emotional content in speech, DARE-GP demonstrates at-
tack transferability to a broad set of SER classifiers with heterogeneous classes, topologies, and feature
representations.
• DARE-GP AAPs are universal adversarial perturbations (UAPs) [2] with respect to a fixed set of users;
within that set of users, a single AAP is effective against previously unheard utterances. Combined with
the fact that AAPs are additive in nature, DARE-GP can evade SER classifiers over-air in real-time.
• Demonstrated that a knowledgeable adversary cannot significantly degrade the effectiveness of DARE-GP
through either passive (indirect emotion inference due to true/predicted class relationships) or active
(modification of audio samples to allow true emotion inference) defenses.
• Performed over-air evaluation against two representative commercial smart speakers: a 1st generation
Amazon Echo and an Amazon Echo Dot, from different distances and directions. Notably, we performed an
end-to-end real-time test of DARE-GP in a realistic deployment scenario that demonstrates that an AAP
can be deployed on commercial smart speakers in a usable, automated configuration.
• Performed analysis of classifier decision explanations using SHapley Additive exPlanations (SHAP) [42]
to reason about why select perturbations were effective (or ineffective).
• AAPs are made robust against environmental noise such as reverberation, de-amplification, and reflection
by augmenting the training data with synthetic, noisy variants.

2 RELATED WORKS
There are three key attributes of DARE-GP that are important when comparing this work to other relevant
research; DARE-GP:
• Provides protection when the user interacts with the smart speaker (i.e., when the user anticipates being recorded).
• Protects against emotion inference by black-box models and transfers to previously unseen models.
• Supports real-time protection in a real-world setting.

2.1 Protection of Audio Against Unanticipated Recording


There are several works that address protection from unanticipated recording. These works include both protection
from unauthorized devices and cases where the user wants to prevent known devices from unauthorized recording.
Patronus, developed by Li et al. [37], provided an approach to perform real-time audio jamming that could only
be decoded by devices with the correct "key." Sun et al. developed MicShield [61], which prevented smart speakers
from recording audio outside the window of a command being directed at the smart speaker. In both cases,
the solutions presented prevent access to any audio by unanticipated devices at unplanned times. This should
protect any emotional content of the audio except for when the user invokes an approved device. DARE-GP is
complementary to these approaches by providing protection with respect to emotional content for devices that
are intended to hear a given command.

2.2 Evasion of Black-Box Classifiers using Genetic Programming


There is a large corpus of work related to evading classifiers without requiring detailed insight into the inner-
workings of the evaded classifiers; these works are generally referred to as black-box adversarial evasion. The
use of genetic programming (GP) in DARE-GP was inspired by previous applications of GP towards adversarial
evasion in multiple domains, including: PDF malware [67], WiFi behavior detection [39], and audio speech
recognition (ASR) [63]. Like DARE-GP, these works leveraged GP against black-box models where the evasion
attacks were developed without direct access to the target model’s topology, weights, or gradients.
The PDF malware classifier attack developed by Xu et al. [67] generated additive, stochastic manipulations
on malware-bearing PDF documents to create variants that could evade multiple malware classifiers while
preserving their malicious behaviors. This attack was especially interesting because the fitness evaluation
included PDF evasiveness and continued maliciousness using a virtual detonation chamber [25]; this tension
between evasion and utility is analogous to DARE-GP’s need to evade SER classifiers without breaking the
transcription service. Like DARE-GP, the authors of [67] evaluated their technique against multiple classifiers
of differing topology. Unlike DARE-GP, these attacks were directly developed against each model, whereas
DARE-GP AAPs have been evaluated against previously-unseen black-box models. Further, since the problem
addressed by [67] was not real-time, the authors could generate as many perturbations as necessary to create
evasive PDFs. DARE-GP’s need for real-time evasion for previously unheard utterances required the development
of universal adversarial perturbations.
Liu et al. [39] demonstrated the ability to evade WiFi-based behavior recognition (WBR) systems by introducing
jamming signals whose characteristics were developed using GP. This work targeted multiple WBR systems and
included considerations for transitioning a digitally-derived attack to the real world by addressing the impacts
of real-world delay in the fitness function. DARE-GP also uses additive noise in a real-world environment to
evade black-box classifiers. In addition, like DARE-GP, [39] considered the transferability of attacks to previously
unseen models.
Taori et al. [63] leveraged GP to perform targeted evasion against an ASR system. This work used a new
mutation operation based on gradient estimation to speed convergence. Unlike DARE-GP, this work had direct
access to the target classifier and did not address the transferability of the attack. More importantly, this approach
evolved distinct perturbations for each audio sample over thousands of generations. DARE-GP’s development of
universally adversarial audio perturbations [2] allows DARE-GP to pre-generate AAPs that will create adversarial
audio examples in real-time.

2.3 Adversarial Evasion of Audio Classifiers


There are multiple works related to the evasion of speaker identification systems [2, 27, 31, 37]. There is also
multiple audio speech recognition (ASR) evasion works [2, 3, 11, 63]. Evasion of speech emotion recognition
(SER), like DARE-GP, is far less common; there is one very recent publication in this area [22].
A recent work by Li et al. [38] attacked a white-box speaker identification classifier. This work is interesting
because the resulting attack was deployed in an acoustic environment; the audio perturbations were played in
tandem with a user’s speech to defeat the target speaker identification system. While there are several similarities
between DARE-GP and [38], such as their use of untargeted evasion, there are some very significant differences
that highlight the limitations of [38]:

• DARE-GP provides a single perturbation that creates evasive samples for multiple previously unseen audio
samples; in [38] perturbations were tailored for each sample.
• DARE-GP considers the utility of audio after perturbation by checking against a representative speech-to-
text system; [38] bounded perturbation loudness using a parameter epsilon, but provided no empirical or
theoretical guarantees on audio utility.
• In [38] room impulse response (RIR) [59] is used to aid in making effective acoustic perturbations from
digitally-derived perturbations. During the development of DARE-GP, use of RIR rendered audio unusable;
neither the vosk/kaldi [51] nor Google Speech speech-to-text implementations could consistently extract
the transcripts from these RIR-modified samples.

While this work was interesting, it was not intrinsically real-time; the audio perturbations were calculated
for each specific utterance; the authors performed no evaluation of universal perturbations that could cause
misclassification of a user’s identity for previously unheard utterances.
This was not the case for Gao et al. [22]; in this recent work, the authors demonstrated that three spectral
envelope-based attacks could defeat several previously unseen Speech Emotion Recognition (SER) models. Similar
to DARE-GP, these attacks were computed without any insight into the target classifiers. While this model
independence is desirable, this attack is not executable in a real-time, acoustic environment. The issue is that,
unlike DARE-GP’s AAPs, the spectral envelope filters applied (vocal tract length normalization, modulation
spectrum smoothing and McAdams transformation) cannot be represented as additive noise that can be played
when a user speaks. To deploy these filters in a real-world scenario, the users’ utterances would need to be
recorded, modified, and then replayed within range of the target smart speaker. Further, these filters introduced
significant transcription errors, measured as Word Error Rate (WER). For reference, the target WER for DARE-GP
is 0. A comparison of the effectiveness of DARE-GP with these filters is provided in Section 7.3.

3 THREAT ANALYSIS
3.1 Threat Model
This work addresses the concern of leaked emotions via user audio recorded using a smart device when the user
is executing commands. The SER Provider has full access to the audio associated with any recordings. Further, the
SER Provider’s classifier details are completely private; any evasion approach cannot assume access to the model
or any details regarding features, topology, or weights/hyperparameters.
The protection goals of any solution are the disruption of an unseen SER classifier without breaking the smart
speaker’s ability to transcribe command audio in order to perform its primary functions. Furthermore, this
protection should exist in real-time, meaning that the protection can be employed on demand without requiring
a cumbersome record/modify/replay approach typified by other audio evasion works [3, 27, 31, 38]. Finally,
assuming that the SER Provider has knowledge of DARE-GP, they should not be able to easily defend against it.

3.2 Evaluation Metrics


The following metrics are applicable to any solution aimed at defeating an audio classifier.
• Evasion Success Rate (ESR): The primary evaluation metric is success rate. A sample modified by a
solution is only considered a success if it successfully evades the SER classifier and is transcribed correctly.
• False Label Independence: A knowledgeable SER Provider, given details of a solution’s approach, could
conceivably infer an utterance’s true emotion based upon the false one caused by the solution. For example,
if audio samples with the emotion calm, when perturbed, always resulted in samples with the false emotion
happy, it would be trivial for the SER Provider to discern the true emotion. This metric assesses the strength
of the relationship between the true and the false emotion/class. Two information-theoretic measures are used
for this metric in this work: Normalized Mutual Information (NMI) [21] and the Matthews Correlation
Coefficient (MCC) [12]. A brief computation sketch for these metrics follows this list.
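The sketch below illustrates, with scikit-learn, how ESR and the two false-label-independence measures could be computed; the sample values and label lists are illustrative only.

    # Sketch: computing ESR, MCC, and NMI for a batch of AAP-modified samples.
    # The per-sample booleans and label lists below are illustrative only.
    import numpy as np
    from sklearn.metrics import matthews_corrcoef, normalized_mutual_info_score

    evaded = np.array([True, True, False, True])          # fooled the SER classifier
    transcribed_ok = np.array([True, False, True, True])  # transcript still correct
    esr = np.mean(evaded & transcribed_ok)                # Evasion Success Rate

    true_labels = ["happy", "calm", "angry", "sad"]           # actual emotions
    predicted_labels = ["sad", "surprised", "angry", "calm"]  # classes predicted after the AAP
    mcc = matthews_corrcoef(true_labels, predicted_labels)
    nmi = normalized_mutual_info_score(true_labels, predicted_labels)
    print(f"ESR={esr:.2f}, MCC={mcc:.2f}, NMI={nmi:.2f}")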

4 PROBLEM STATEMENT AND APPROACH OVERVIEW


This paper aims to address the following research question:
In real-time, can DARE-GP cause a smart speaker’s black-box SER classifier to misclassify audio samples for a fixed
set of users without degrading the speech transcription utility of the audio?
To formalize this question:

SER Provider: Stakeholder responsible for operating and training the SER classifier; the purpose of DARE-GP is to prevent this principal's SER classifier from performing accurate inferences about a user's emotions.
Audio Sample: Speech sample with a valid transcript and classification by a target emotion classifier. These are the samples for which we want to create evasive variants.
Evasive Audio Sample: Modified version of an audio sample such that a) the target classifier cannot determine the actual class and b) the transcript is still correct.
Additive Audio Perturbation (AAP): A combination of tones that can be combined with an audio sample to generate a potentially evasive sample.
Tone: An acoustic signal specified by a single frequency, offset, duration, and muzzle.
Frequency: Frequency of a tone in Hz.
Offset: Time in seconds before a given tone starts playing.
Duration: Length of a tone in seconds.
Muzzle: Decrease in the amplitude of a tone.
Genetic Program: An iterative process over multiple generations, breeding a population based upon a process of selection, crossover, and mutation.
Population: Collection of individuals resulting from a generation.
Individual: Member of the population whose parameters can be used to generate an additive audio perturbation (AAP).
Generation: A single iteration of a genetic program.
Selection: Operation to select individuals from the current generation to serve as parents for the next generation.
Fitness Score: A value calculated for each individual based upon problem-specific criteria to identify the healthiest members of the population. This score is used during selection.
Crossover: Operation to combine two parents into new offspring.
Parent: An individual from the current population selected as an input to a crossover operation.
Offspring: An individual generated as the output from a crossover operation.
Mutation: Random modification of an individual.
Table 1. Definitions of Terms

Given: A smart speaker executing speech-to-text transcription system S and a black-box SER classifier C,
within environment E. This classifier has been trained to work on user population U.
Generate: A single additive audio perturbation (AAP), P, that increases the misclassification rate of C on
utterances generated by users in U within the confines of E without impacting the effectiveness of S.
Definitions for additive audio perturbation (AAP) and other terms in this section are provided in Table 1. At a
high level, each AAP is a combination of three tones, each of which is characterized by its frequency, duration,
offset (delay before it starts playing), and muzzle (inversely proportional to the tone’s amplitude).
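To make the tone parameterization concrete, the following sketch renders an AAP from three such tone tuples; the muzzle-to-amplitude mapping (amplitude = 1/muzzle) and the sample rate are illustrative assumptions rather than the exact synthesis used by DARE-GP.

    # Sketch: rendering an AAP from three (frequency, offset, duration, muzzle) tuples.
    # Parameter ranges mirror Table 3; amplitude = 1/muzzle is an illustrative assumption.
    import numpy as np

    SAMPLE_RATE = 16000

    def render_tone(freq_hz, offset_s, duration_s, muzzle, total_s=4.5):
        signal = np.zeros(int(total_s * SAMPLE_RATE))
        t = np.arange(int(duration_s * SAMPLE_RATE)) / SAMPLE_RATE
        tone = (1.0 / muzzle) * np.sin(2 * np.pi * freq_hz * t)
        start = int(offset_s * SAMPLE_RATE)
        end = min(start + len(tone), len(signal))
        signal[start:end] = tone[:end - start]
        return signal

    def render_aap(tones):
        """tones: three (freq_hz, offset_s, duration_s, muzzle) tuples."""
        return sum(render_tone(*tone) for tone in tones)

    aap = render_aap([(440.0, 0.1, 3.0, 50), (1200.0, 0.0, 2.5, 80), (300.0, 0.4, 3.5, 30)])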
There are four sub-problems that need to be addressed which are expounded upon in the Technical Approach
(Section 5):
• Generation of a set of AAPs, P̂, that cause misclassification for a surrogate SER classifier C* within a digital environment Ê without impacting the effectiveness of S.
• Assess the transferability of P̂ from C* to C, the original black-box SER classifier, within a digital environment Ê.
• Given a SER Provider with access to audio samples mixed with P̂, assess their ability to infer the true emotion of the original utterance.
• Assess the efficacy of digitally-developed AAPs when played in the original environment E; that is, assess P̂ → P.

5 TECHNICAL APPROACH
Section 5.1 provides details of how AAP development is represented as a genetic programming problem. Section
5.2 addresses how the transferability of AAPs developed against the surrogate SER classifier C* to representative
black-box SER classifiers was evaluated. Section 5.3 describes the evaluation approach used to compare DARE-GP
with other state-of-the-art audio evasion approaches. Section 5.4 describes the approach to evaluate the efficacy
of state-of-the-art defenses against DARE-GP. Finally, Section 5.5 addresses over-air acoustic evaluations of
AAPs against commercial smart speakers.

5.1 Generation of Additive Audio Perturbations


Table 1 and Figure 2 present the details of the genetic programming approach used to generate Additive Audio
Perturbations (AAPs). As shown in Figure 2, the approach requires a labeled set of audio samples (i.e., dataset)
from the users. There are two constraints on this dataset: samples need to be correctly classified by the SER
classifier or by some surrogate classifier, and the transcription service needs to be able to correctly extract text
transcripts from these samples. Naturally misclassified samples are not interesting because they are already
misclassified. Likewise, audio samples that cannot be transcribed correctly are not useful because they do not
support constraining the AAPs to limit transcription errors.

Fig. 2. Genetic Programming Adversarial Evasion Workflow

An AAP consists of three tones, each of which is represented by four parameters. Hence, as described in Table
1, an individual is a list of twelve real numbers scaled to [0,1] (each representing a parameter value), while an
AAP is the audio time-series created from those parameters. As the first step, a population of individuals (Table
1), each of which has the parameters necessary to generate an AAP, is initialized. These individuals are then
assigned fitness scores by generating their AAPs, combining those AAPs with audio samples from the training
set to generate potentially evasive audio samples, and then assessing the fitness of those samples with respect
to the classifier C* and the utility of those samples with respect to the transcription system S. These are then
evolved using the genetic programming algorithm described in Figure 2, via selection, crossover, and mutation.
This process is iterated over multiple generations, with each generation’s individuals generating slightly better
AAPs. The following subsections provide additional detail on this high-level workflow.

5.1.1 Fitness Score. The fitness score measures the relative health of individuals in a population and is calculated
using a fitness function. For DARE-GP, this fitness function is composed of two components: a deception score
and a transcription score. The deception score, given in Equation 1, measures the extent to which the AAP
generated by an individual fools the surrogate classifier C*.

$$\mathit{deception}(ind) = \frac{\sum_{x \in S}\big[\mathit{smax}(x, c_x) - \mathit{smax}(x \oplus ind, c_x)\big]}{|S|} + \mathit{bonus}(ind) \tag{1}$$

$$\mathit{bonus}(ind) = \begin{cases} b, & \text{if } \exists\, c_{other} \neq c_{true} : \mathit{smax}(x \oplus ind, c_{true}) < \mathit{smax}(x \oplus ind, c_{other}) \\ 0, & \text{otherwise} \end{cases}$$

where:
S = a random subset of the audio training data
b = a bonus awarded for each successful misclassification
c_x = the actual class for x
smax(x, c) = the value of class c in the softmax result from simple_classifier.predict(x)
x ⊕ ind = the result of mixing the audio sample x with the AAP ind

The first term of the deception score is the mean decrease in the actual class' score from the softmax in our surrogate classifier. The second term is a bonus applied for each misclassification. The purpose of this term is to incentivize additional successful misclassifications rather than unnecessary increases in the confidence of misclassifications. The transcription score in Equation 2 penalizes transcription errors:

$$\mathit{transcription}(ind) = \begin{cases} 0, & \text{if } \exists\, x \in S : \mathit{tscr}(x \oplus ind) \neq \mathit{tscr}(x) \\ 0, & \text{if } \exists\, x \in S : \mathit{conf}(x \oplus ind) < t \\ 1 - \dfrac{\sum_{x \in S}\big[\mathit{conf}(x) - \mathit{conf}(x \oplus ind)\big]}{|S|}, & \text{otherwise} \end{cases} \tag{2}$$

where:
tscr(x) = the transcript extracted from audio sample x
conf(x) = the confidence of the transcription result for x, scaled to [0,1]
t = the minimum acceptable transcription confidence.

The transcription score goes to 0 if any of the audio samples, when mixed with an individual's AAP, cannot be
transcribed correctly. In addition, if any of the transcription confidence scores fall below a specified confidence
threshold, then the transcription score goes to 0. Otherwise, the score is 1 minus the mean confidence decrease
as a result of applying the AAP to audio samples in S. The fitness score is calculated as the product of the
deception(ind), which is a real number, and transcription(ind), which is a real number in [0, 1].

$$\mathit{fitness}(ind) = \mathit{deception}(ind) \times \mathit{transcription}(ind) \tag{3}$$


Taking the product of the deception and transcription scores allows for incremental improvements to classifier
deception while harshly penalizing transcription errors. This approach was taken due to an obvious complication;
a sufficiently loud AAP could drown out the speech in the audio samples, thus providing near-perfect evasive
quality while rendering the audio effectively useless.
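The following condensed sketch shows how Equations 1-3 could be evaluated in code; surrogate_softmax, transcribe, and mix are hypothetical stand-ins for the surrogate classifier, the speech-to-text service, and digital mixing described above.

    # Condensed sketch of Equations 1-3. surrogate_softmax() (returning a dict of
    # class -> probability), transcribe() (returning text and a confidence in [0, 1]),
    # and mix() are hypothetical stand-ins; samples carry .audio, .label, and
    # .transcript attributes for illustration.
    def fitness(ind_aap, samples, b=0.1, t=0.5):
        drop, bonus, conf_drop = 0.0, 0.0, 0.0
        for x in samples:
            mixed = mix(x.audio, ind_aap)                      # x ⊕ ind
            clean_scores = surrogate_softmax(x.audio)
            mixed_scores = surrogate_softmax(mixed)
            drop += clean_scores[x.label] - mixed_scores[x.label]
            if max(mixed_scores, key=mixed_scores.get) != x.label:
                bonus += b                                     # reward each misclassification
            text, conf = transcribe(mixed)
            if text != x.transcript or conf < t:
                return 0.0                                     # transcription score is 0
            _, clean_conf = transcribe(x.audio)
            conf_drop += clean_conf - conf
        deception = drop / len(samples) + bonus
        transcription = 1.0 - conf_drop / len(samples)
        return deception * transcription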
5.1.2 Genetic Programming Steps. As discussed above, the refinement of individuals and the generated AAPs
is driven by evolution. The high-level algorithm is given in Figure 3. In each generation (i.e., evolution), the
population of individuals goes through the selection, crossover, and mutation evolutionary steps.

Fig. 3. High-level genetic programming steps for a single generation. First, selection is performed to cull weak individuals and create duplicates of strong ones. Next, crossover is performed between selected pairs of individuals. Finally, mutations are randomly applied to these population members.

Selection: This step selects a subset of individuals from a population to carry forward into the next generation.
Selection is performed using a tournament selection method [34] that allows for a tradeoff between individual
strength and population diversity in the next iteration of AAP generation. Figure 4 presents an example where a
population of five undergoes a tournament selection with 𝑛𝑆𝑒𝑙 = 3. The two weakest individuals, with fitness
values 6 and 0, are culled as a result of this round of selection. The most-fit individuals (fitness of 20 and 18) are
copied into the offspring two times each. These individuals in the resulting population are then considered for
crossover.

Fig. 4. Example evolutionary selection of a size 5 population using a tournament of size 3. The participants in each tournament
are selected randomly, and the winner has the highest fitness score. The resulting Selected Population is then carried forward
for crossover and mutation operations.
Crossover: In the genetic crossover, offspring “. . . are created by exchanging the genes of parents among
themselves.” [44] This step is a method to generate new individuals, and thus new AAPs, by combining the
attributes of two existing individuals to create two new individuals. Each individual is paired with another
arbitrary individual from the population. Each attribute of these parents is swapped with probability 𝑝𝐶𝑋 to
generate the offspring. For example, as shown in Figure 5, the individuals shown at the bottom are the same as
their parents, except that they have swapped durations for one tone within their respective AAPs.
Mutation: Finally, the population members resulting from the previous two operations are considered for
mutation, where new individuals are generated by randomly modifying select individuals. Mutation introduces
the greatest amount of variability in the population; a given individual can undergo any number of changes,
leading to significant improvement or degradation. While this can allow a stagnant population to become more
productive [54], it can also sabotage a productive population if the probability of mutation is too high. In DARE-GP,
mutations are performed by randomly shuffling an individual’s attributes with probability 𝑝𝑀𝑈𝑇 . Figure 6 gives
an example of mutation. The individuals here have twelve attributes, four for each tone comprising an AAP. In
this case, one mutation occurred which exchanged attributes between two of the tones while the third was left
unchanged.
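These three steps map directly onto standard genetic-programming toolkits. The sketch below realizes one generation with the DEAP library's tournament selection, uniform crossover, and index-shuffle mutation; DEAP itself is an assumption here, not necessarily the toolkit used in this work, and evaluate() is a hypothetical wrapper around the fitness function above.

    # Sketch of one generation (selection, crossover, mutation) using DEAP.
    # Each individual is a list of 12 values in [0, 1] (Table 1).
    import random
    from deap import base, creator, tools

    creator.create("FitnessMax", base.Fitness, weights=(1.0,))
    creator.create("Individual", list, fitness=creator.FitnessMax)

    toolbox = base.Toolbox()
    toolbox.register("attr", random.random)
    toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr, n=12)
    toolbox.register("population", tools.initRepeat, list, toolbox.individual)
    toolbox.register("select", tools.selTournament, tournsize=3)    # nSel
    toolbox.register("mate", tools.cxUniform, indpb=0.5)            # pCX
    toolbox.register("mutate", tools.mutShuffleIndexes, indpb=0.1)  # pMUT

    def one_generation(population, evaluate):
        for ind in population:
            ind.fitness.values = (evaluate(ind),)
        offspring = [toolbox.clone(i) for i in toolbox.select(population, len(population))]
        for a, b in zip(offspring[::2], offspring[1::2]):
            toolbox.mate(a, b)      # swap attributes between paired parents
        for ind in offspring:
            toolbox.mutate(ind)     # randomly shuffle an individual's attributes
        return offspring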
5.1.3 AAP Downselect. Once the genetic program is run for the final time, resulting in one last population, the
most promising individual is selected using the fitness score defined in Section 5.1.1. As previously discussed,

Fig. 5. A simple example crossover between two individuals. The resulting offspring are the same as the parents except that
they have exchanged the duration of the first two tones.

Fig. 6. A simple example of mutation where the duration for tone 1 and tone 2 were swapped.

the goal of this work is to generate a single AAP that would result in the most misclassifications. In this case, the
fitness is calculated over a validation set pulled from the training dataset, and the individual that generates the
AAP with the highest fitness score is selected.

5.2 Transferability of Additive Audio Perturbations (AAPs)


The development of AAPs in Section 5.1 exclusively leverages C*, a surrogate model created for this purpose. It is
important to emphasize that, while the topology and weights of this model are available during AAP development,
the only privileged characteristics of C* used are the raw scores from its softmax layer during fitness evaluation.
That said, evasion of C* is an interim step towards evasion of C, the smart speaker’s black-box SER classifier. To
assess this transferability, the final AAP with the highest fitness is used without modification to deceive several
off-the-shelf SER models.
To address transcription preservation, DARE-GP was developed using an open-source transcription model;
this allowed the development of AAPs in a device-independent manner. DARE-GP was developed under the
assumption that a commercial transcription solution would be at least as robust to background noise as the
open-source implementation used for these experiments. This assumption was validated with the experiments in
Section 7.5.
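For illustration, the sketch below shows how a transcript and a rough confidence value could be obtained from an open-source recognizer such as vosk; the model path is a placeholder, and averaging per-word confidences is an assumed aggregation scheme rather than the exact definition of conf(x) used in this work.

    # Sketch: extracting a transcript and a rough confidence with the vosk
    # (Kaldi-based) Python bindings.
    import json
    import wave
    from vosk import Model, KaldiRecognizer

    model = Model("model")                 # path to a downloaded vosk model (placeholder)
    wav = wave.open("sample.wav", "rb")
    rec = KaldiRecognizer(model, wav.getframerate())
    rec.SetWords(True)                     # request per-word confidences

    while True:
        data = wav.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)

    result = json.loads(rec.FinalResult())
    transcript = result.get("text", "")
    words = result.get("result", [])
    confidence = sum(w["conf"] for w in words) / len(words) if words else 0.0
    print(transcript, confidence)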

5.3 State-of-the-Art Comparisons


There are not currently any directly-comparable works to DARE-GP. At the time of writing, there is one published
work evading black-box audio emotion detection by Gao et al. [22]. This work proposed three spectral envelope
filters to evade an SER classifier. As mentioned previously (Section 2), the evasion attacks presented in this work
are not real-time and were not evaluated acoustically.
This work is compared directly to DARE-GP by attempting to generate evasive variants for a set of audio
samples and then calculating the ESR with respect to the black-box classifiers described above.

5.4 Defenses Against DARE-GP


There are two types of defenses against DARE-GP covered in this work: passive and active.
In the case of a passive defense, the SER provider does not directly attempt to defeat DARE-GP. Rather, in a
passive defense, the SER provider would attempt to infer the true class of an AAP-modified audio sample. One
way to attempt this would be to identify relationships between the true and false classes for a given sample. For
example, if the application of an AAP always made happy audio samples sound surprised, then a knowledgeable
SER Provider could still perform emotional inference with reasonable accuracy.
To assess this, an analysis of the relationships between the actual and predicted classes of DARE-GP’s evasive
samples was performed. The assessment performed considered the black-box SER classifiers discussed in Section
5.2. As mentioned in Section 3, this false label independence is assessed using two measures: the Matthews
Correlation Coefficient (MCC) [12] and Normalized Mutual Information (NMI) [21].
Active defenses against audio evasion attacks have been proposed in multiple works. Since the defender does
not have access to ground truth labels for AAP-modified audio samples, adversarial training is not an option.
Hussain et al. [26] developed WaveGuard, a set of defenses for ASR systems. These defenses were a set of audio
transformations that should not change the transcript extracted by an ASR model; any mismatches between
the transcripts of these transformed audio variants would indicate an attempted evasion. The main difference
between the use of these defenses for SER versus ASR is the expected outcome. In the ASR use case, these defenses
were used as detection mechanisms, whereas in this work a defense is considered successful if an SER classifier,
when presented with the defended audio, predicts the correct emotion.
To bolster these defenses, two additional defenses are introduced: noise reduction and AAP frequency bandpass
filters. Noise Reduction is performed using the off-the-shelf noisereduce library in Python. AAP frequency
bandpass filters assume that the SER Provider could detect roughly where the AAP’s tones are in an audio sample.
This is implemented by zeroing out the AAP tone frequency bands +/- 5Hz.
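A minimal sketch of these two added defenses follows; the noisereduce call uses that library's documented interface, while the notch filter that zeroes the spectrum within ±5 Hz of each suspected tone is an illustrative implementation.

    # Sketch of the two added defenses: spectral-gating noise reduction via the
    # noisereduce library, and a narrow notch around each suspected AAP tone.
    import numpy as np
    import noisereduce as nr

    def defend_noise_reduction(audio, sample_rate):
        return nr.reduce_noise(y=audio, sr=sample_rate)

    def defend_notch(audio, sample_rate, tone_freqs_hz, width_hz=5.0):
        spectrum = np.fft.rfft(audio)
        freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
        for f0 in tone_freqs_hz:
            spectrum[np.abs(freqs - f0) <= width_hz] = 0.0  # zero the band f0 ± 5 Hz
        return np.fft.irfft(spectrum, n=len(audio))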
To evaluate the efficacy of these defenses, an AAP is applied to a set of audio samples which are then evaluated
for ESR against the black-box classifiers discussed above. The defenses are applied to any successfully evasive
samples and are then presented to the classifiers again. If the defense reverted the sample to the correct class,
then the sample is no longer considered evasive. The resulting decrease in ESR is used as the evaluation metric.

5.5 Evaluating Against a Commercial Smart Speaker


Finally, AAPs were evaluated in an acoustic setting against two real smart speakers: a 1st-generation Amazon
Echo and an Amazon Echo Dot. AAPs for these evaluations are developed as described in Section 5.1. The
primary change in these evaluations is the acoustic mixing of AAPs and audio samples; AAPs and audio samples
are played simultaneously and recorded in a real-world environment. Evaluations performed include changes
in user location with respect to two different smart speakers and automated AAP playback triggered by the
smart speaker’s wake word. When calculating the ESR for a given AAP, the smart speakers’ speech-to-text
transcription is used instead of the vosk/kaldi implementation used for training. This allowed the assessment of
the transferability of AAPs to a new speech-to-text implementation and a new microphone (the smart speaker’s

Fig. 7. (A) DARE-GP evaluation setup with Amazon Echo. (B) DARE-GP evaluation setup with Amazon Dot. (C) DARE-GP
test setup with Amazon Dot and 3rd-party microphone for AAP and utterance capture. (D) DARE-GP evaluation setup
with Amazon Dot and DARE-GP device implemented with a Raspberry Pi. Direction markings 1-8 used for AAP direction
evaluation.
Dataset | Speakers | Distinct Utterances | # of Classes | Total Recordings | Source
RAVDESS | 24 | 2 | 8 | 1440 | [40]
TESS | 2 | 200 | 5 | 1800 | [17]
Table 2. Dataset details

microphone array). Additional details on these evaluations are provided in Section 7.5. The audio recording
configurations used in this work are shown in Figure 7.

6 DATASETS
The experiments described in this paper were performed using two standard audio datasets: the Ryerson Audio-
Visual Database of Emotional Speech and Song (RAVDESS) [40], and the Toronto Emotional Speech Set (TESS)
[17] datasets. Details of both of these datasets are provided in Table 2.
The spoken part of the RAVDESS dataset (no sung content) consists of 1,440 samples from 24 speakers. The
speakers were actors, split evenly between males and females. The speakers performed two separate utterances
while demonstrating 8 different emotions. The subset of the TESS dataset used in this paper consisted of 1800
audio samples generated by two actresses. These samples spanned five emotions (neutral, angry, happy, sad, and
fearful) and included 200 utterances. This dataset was utilized in assessing the extensibility of AAPs to multiple,
previously unheard utterances.
Synthetic Data Variants: In addition to the base datasets, synthetic variants of the RAVDESS data were
generated by artificially introducing background environmental noise, reverberation, and audio amplification
and de-amplification (following [55]) to the audio files. They were generated to simulate environmental diversity
at training time. These variants were used exclusively for RAVDESS dataset training and were excluded during
the evaluation.
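A simplified sketch of how such variants could be generated is shown below; the noise level, gain values, and single-echo reverberation are illustrative stand-ins and do not reproduce the augmentation of [55].

    # Sketch: generating noisy, reverberant, and re-amplified variants of a clean sample.
    import numpy as np

    rng = np.random.default_rng(0)

    def add_background_noise(audio, snr_db=20.0):
        noise = rng.normal(0.0, 1.0, len(audio))
        scale = np.std(audio) / (np.std(noise) * 10 ** (snr_db / 20.0))
        return audio + scale * noise

    def add_reverb(audio, sample_rate, delay_s=0.03, decay=0.4):
        delayed = np.zeros_like(audio)
        d = int(delay_s * sample_rate)
        delayed[d:] = audio[:-d] * decay        # single decayed echo as a crude reverb
        return audio + delayed

    def change_gain(audio, gain_db):
        return audio * 10 ** (gain_db / 20.0)   # positive = amplify, negative = de-amplify

    def make_variants(audio, sample_rate):
        return [add_background_noise(audio),
                add_reverb(audio, sample_rate),
                change_gain(audio, +6.0),
                change_gain(audio, -6.0)]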
The TESS dataset was also played and recorded in the target environment for over-air acoustic experiments
(Section 7.5). This recorded dataset simulated how user data would be collected in the home, along with any
background noise, reverberation, etc. typical in such an environment. The recorded TESS data was used during
training and validation for Section 7.5's evaluations.
Before performing any experiments, specific samples were culled if a) the speech-to-text library was unable
to extract a correct transcript or b) the surrogate SER classifier misclassified the sample before any AAPs were
applied. This culling process ensured that AAPs were not penalized for audio samples that were already untranscribable, and that ESR values were not inflated by samples the surrogate classifier misclassified naturally.

7 EVALUATION
The following experiments address five research questions:
1. Can DARE-GP successfully evade a known, black-box SER classifier? (Section 7.1)
Addressing research question 1 proves that the disruption in spectral information can degrade conventional
SER classifiers’ performance without degrading the transcription utility of the audio. The following research
questions relate to previously-unseen, black-box classifiers; the surrogate SER classifier (used in addressing
research question 1) is used during training, and the previously-unseen, black-box classifiers are used during
evaluation.
2. Do AAPs developed against the surrogate SER classifier transfer to previously unseen, black-box SER
classifiers? (Section 7.2)
3. How does DARE-GP compare against comparable audio evasion techniques? (Section 7.3)
4. Can someone familiar with DARE-GP defend against AAP-caused misclassification? (Section 7.4)
The previous research questions were evaluated digitally. The following question considers the real-world
deployment of digitally-generated AAPs.
5. Can digitally-developed AAPs work acoustically (over-air) against a real smart speaker? (Section 7.5)

The metaparameter values (discussed in Section 5) are provided in Table 3. Most of these parameters were
found empirically based on the desired level of diversity and AAP effectiveness. The AAP Frequency Range was
set to limit AAP frequencies to the typical range of human speech. The AAP Muzzle range was determined
based on initial assessments of the effect of AAP loudness on transcription scores. The range of muzzle values was
limited to protect transcription accuracy and prevent AAPs from becoming irritating to users. Most importantly,
all metaparameter values were derived without any access to the black-box SER classifiers and these values
were invariant across all of the evaluations presented in this work. This consistency prevented the introduction
of any hidden bias based on knowledge of the black-box classifiers’ inner workings.
Parameter | Value
Tournament Selection Size (nSel) | 3
Crossover Probability (pCX) | 0.5
Mutation Probability (pMUT) | 0.1
AAP Frequency | [100, 4000] Hz
AAP Duration | [2.5, 4.0] s
AAP Offset | [0.0, 0.5] s
AAP Muzzle | [25, 150]
Table 3. AAP Generation Meta Parameter Values

AAP Training Data | AAP Test Data | ESR
RAVDESS - No Variants | RAVDESS - No Variants | 0.471
 | Male Speakers | 0.468
 | Female Speakers | 0.474
TESS | TESS | 0.62
 | Speaker 1 | 0.683
 | Speaker 2 | 0.632
RAVDESS with Variants | RAVDESS - No Variants | 0.51
 | Male Speakers | 0.52
 | Female Speakers | 0.505
Table 4. AAP evasion results against the surrogate classifier. All evaluations were performed with a disjoint 80/20 train/test split.

7.1 Evasion of Surrogate SER Classifier


This section evaluates the effectiveness of AAPs when evading a known surrogate SER classifier; this classifier
was available as a black-box at training time and could be queried as part of the AAP fitness evaluation. The
surrogate SER classifier is a simple DNN that demonstrates state-of-the-art performance using Mel-frequency
cepstral coefficients (MFCCs) as the feature representation; MFCCs are a widely used feature in many audio
classification use cases [10, 14, 20, 41, 53, 69]. Additional details of this classifier are provided in Table 5.
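For context, the following sketch shows MFCC extraction with librosa and a small dense network in Keras; the layer sizes are illustrative and do not reproduce the surrogate classifier's exact 7-layer topology.

    # Sketch: MFCC feature extraction and a small dense SER classifier.
    # Layer sizes and n_mfcc are illustrative; Table 5 describes the actual surrogate.
    import numpy as np
    import librosa
    from tensorflow import keras

    def mfcc_features(path, n_mfcc=40):
        y, sr = librosa.load(path, sr=None)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return np.mean(mfcc, axis=1)            # average over time -> fixed-length vector

    def build_classifier(n_features=40, n_classes=8):
        model = keras.Sequential([
            keras.layers.Input(shape=(n_features,)),
            keras.layers.Dense(256, activation="relu"),
            keras.layers.Dense(128, activation="relu"),
            keras.layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model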
The measure of effectiveness used was ESR (defined in Section 3.2). These evaluations used 80/20 train/test
splits on the RAVDESS and TESS datasets. The RAVDESS dataset evaluation tested the generated AAPs' efficacy
across a broad user base. The TESS dataset evaluation tested the AAPs’ efficacy against previously unheard
utterances. In addition, one evaluation was performed by training AAPs using the RAVDESS dataset augmented
with synthetic variants designed to simulate background noise and reverberations in an acoustic environment.
These variants did not introduce additional emotional content, but the distortions decreased transcription
confidence in the samples. By training with these samples, generated AAPs should be less disruptive to speech
transcription, thus improving the ESR.
In all the evaluations, the goal was the development of a single AAP that generated evasive audio samples for
the users. The results of these evaluations are provided in Table 4.
There were a few interesting takeaways from this section’s initial experiments. First, the overall approach
worked; with no built-in knowledge of the classifier other than a presumed relevance of spectral features and
access to the classifier’s score, DARE-GP was able to generate a single AAP that created adversarial variants for
47% of the RAVDESS evaluation dataset. There was no significant skewing of the evasive samples; they were
spread fairly evenly across all of the speakers in the dataset and were not tied to any specific utterance. Further,
the AAP was effective for both male and female users in the dataset. This also demonstrated that even
with a very broad set of users (24 in this case), a single AAP could significantly degrade an SER classifier’s
performance for all of those users.
Next, during the TESS dataset evaluation, the generated single AAP caused misclassification on just over half
of the evaluation dataset with no noticeable skewing in favor of one speaker or another. Since TESS has 200
distinct utterances, the train/test split was performed with respect to these utterances; none of the utterances
used for testing were available during training. This demonstrated that AAP generation is also effective against
previously unheard utterances, which is a crucial attribute of this approach.
Finally, the RAVDESS evaluation trained with synthetic variants resulted in a 15% increase in the number of
evasive samples generated from a single AAP. Examination of these results showed that roughly the same number
of possibly evasive samples fooled the SER classifier; the increase in evasive samples was due to a decrease in
transcription errors on these samples. The synthetic variants tended to have lower transcription confidences
before perturbations were applied; this meant that the perturbations trained using these variants evolved down
more conservative paths so that they would not drive the transcription confidence below the threshold used in
these experiments (0.5).
Notably, this section's evaluation demonstrates that disrupting the spectral attributes of speech through
AAPs can degrade SER classifiers' ability to perceive emotional content without impacting the utility of the audio.
Moreover, generated AAPs were effective against previously unheard utterances, and their effectiveness was not
noticeably skewed based on speaker or speaker gender. Based on these findings, the following sections evaluate
the practical challenges (research questions 2-5) of deploying generated AAPs in real applications.

7.2 AAP Transferability


This section’s evaluations assess the transferability of AAPs from the surrogate classifier C* to a previously-unseen,
black-box classifier, C. It is important to note that these black-box classifiers have different topologies, activation
functions, classes, and/or feature representations. The details of these black box SER classifiers are provided in
Table 5. Once again, the measure of effectiveness is ESR, but in this case, the classification evasion portion of
ESR was calculated against the black-box classifiers. These evaluations used a combined RAVDESS and TESS
dataset with an 80/20 train/test split.
Notably, the same AAP, which was developed using only the surrogate classifier, was used to evade all of the
black box classifiers. The AAP was not developed or modified in any way to address the specifics of the black-box
classifiers.

Classifier | Description | Training Data | Classes | Accuracy
Simple DNN (the surrogate classifier) | A 7-layer, densely connected DNN constructed for this project. This model is used in the AAP fitness function. | RAVDESS, TESS | 8: neutral, calm, happy, sad, angry, fearful, disgust, surprised | 0.877
Data Flair [15] | 2-layer, multi-layer perceptron model published as part of the data-flair machine learning training program. | RAVDESS | 4: calm, happy, fearful, disgust | 0.665
Speech Emotion Analyzer (SEA) [52] | An 18-layer convolutional neural network (CNN) capable of detecting five different male/female emotions from audio speech. | RAVDESS, SAVEE | 10: angry, calm, fearful, happy, sad (for both male and female) | 0.921
Speech Emotion Classification with PyTorch - CNN-LSTM (SEC) [30] | Uses a bi-directional LSTM attention mechanism and a CNN to classify audio samples based upon their Mel spectrograms. The model was retrained due to low accuracy on TESS data. | RAVDESS, TESS | 8: neutral, calm, happy, sad, angry, fearful, disgust, surprised | 0.916
Convolutional Neural Network with Multi-scale Area Attention (CNN-MAA) [22] | SER based upon an implementation by Gao et al. [22] as a target for adversarial evasion. This classifier, based upon the work of Xu et al. [70], uses MFCC features. | RAVDESS, TESS | 8: neutral, calm, happy, sad, angry, fearful, disgust, surprised | 0.952
Time Delay Neural Network (TDNN) | A TDNN SER classifier built specifically for this work. The input to this classifier is a 3s spectrogram. | RAVDESS, TESS | 8: neutral, calm, happy, sad, angry, fearful, disgust, surprised | 0.678
Residual Neural Network (RESNET) | A RESNET SER classifier built specifically for this work. The input to this classifier is a 3s spectrogram. | RAVDESS, TESS | 8: neutral, calm, happy, sad, angry, fearful, disgust, surprised | 0.773
Table 5. Classifiers for Transferability Evaluation

These classifiers used different combinations of Mel-frequency cepstral coefficients (MFCCs) [48], mean Chroma
[57] and mean Mel-scaled spectrogram [60] values for their features. In addition, the class labels and network
topologies were different between classifiers. Since DARE-GP is oblivious to these details, application to these
other classifiers was completely transparent. The results of applying these AAPs are presented in Table 6; the
ESR was calculated on samples that the classifiers originally classified correctly.
The ESRs against the SEC, TDNN, and RESNET classifiers were particularly significant; these classifiers used
Mel-scaled spectrograms for their feature representation and were the least similar to the surrogate classifier
in both topology and feature representation. These results underscore the appeal of DARE-GP; since AAPs are
generated to create spectral noise, they are transferable to previously unseen models with different topologies,
classes, and feature representations, without modification.

7.3 Comparison with Other Evasion Attacks


The evaluations in this section compare DARE-GP with the state-of-the-art black-box SER evasion attacks proposed by Gao et al. [22]. That work introduced three attacks on SER classifiers using three spectral envelope distortions: Vocal Tract Length Normalization (VTLN) [36], the McAdams transformation [45], and Modulation Spectrum Smoothing (MSS) [62].
VTLN [29] distorts the voice content by warping the spectral curves. The McAdams transformation [49]
modifies the formants of speech by performing linear predictive coding (LPC) [28] on the signal and changing the
frequency of harmonics of the speech. MSS modifies the temporal fluctuation of speech by applying a short-time
Fourier transform (STFT) [18] on the speech, a zero-phase low pass filter on the log amplitude spectrogram, and
reverse-STFT to combine smoothed amplitude with the original phase spectrogram.
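As a rough illustration (not the authors' exact implementation), MSS can be approximated by low-pass filtering the log-magnitude spectrogram along time and resynthesizing with the original phase:

    # Simplified sketch of modulation spectrum smoothing (MSS): low-pass filter the
    # log-magnitude STFT along time, keep the original phase, and invert.
    # The smoothing window is an illustrative choice, not the setting from [22].
    import numpy as np
    import librosa
    from scipy.ndimage import uniform_filter1d

    def modulation_spectrum_smoothing(audio, n_fft=1024, hop=256, window_frames=9):
        stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop)
        log_mag = np.log(np.abs(stft) + 1e-8)
        phase = np.angle(stft)
        smoothed = uniform_filter1d(log_mag, size=window_frames, axis=1)  # low-pass over time
        return librosa.istft(np.exp(smoothed) * np.exp(1j * phase),
                             hop_length=hop, length=len(audio))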
It is important to note that none of these attacks are executable in a real-time, acoustic environment (see Section 2.3); hence, they are not true baselines or direct competitors for this paper. However, at the time of writing, they were the only relevant work; therefore, they are compared against DARE-GP in this section.
To compare DARE-GP to the attacks in [22], a combined RAVDESS/TESS dataset was used. Four different
potentially evasive samples were generated for each audio sample by applying an AAP (i.e., our approach) and
each of the three spectral envelope distortions described above. The results are provided in Table 6.

Evasion Success Rate (ESR) per classifier

Attack         Simple DNN   Data Flair   SEA     SEC     CNN MAA   TDNN    RESNET
VTLN [22]      0.097        0.441        0.235   0.638   0.333     0.516   0.661
McAdams [22]   0.056        0.062        0.061   0.571   0.323     0.542   0.538
MSS [22]       0.0          0.0          0.01    0.663   0.402     0.673   0.508
DARE-GP        0.472        0.852        0.675   0.687   0.686     0.698   0.692
Table 6. ESR values for various attacks against the SER classifiers considered in this work.

In the targeted speech samples (i.e., datasets), emotion is not conveyed through emotional words or context; it is conveyed by the underlying tone, which is carried by the temporal content of speech. Existing literature [64, 65] shows that the third and fourth formants convey the emotional content of speech, and the three spectral envelope attacks distort these spectral attributes, including the temporal content and formant information, which is why they can deceive SER classifiers. However, these transformations are generic and apply to all frequency ranges and harmonics, so they also distort the speech's transcription utility, which results in a relatively lower ESR. Moreover, the spectral envelope attacks perform relatively well against SER classifiers that take a spectrogram directly as input (i.e., SEC, TDNN, and RESNET), but they are less effective against classifiers that apply MFCC extraction to the acoustic data (i.e., DNN, Flair, SEA, CNN-MAA), since that transformation filters out some of the distortion during feature extraction. In contrast, DARE-GP introduces additive noise only on the specific spectral aspects that convey emotional information (through generated tones at specific frequencies and amplitudes); it does not disrupt the transcription-relevant information, nor does it modify the overall spectral attributes. As a result, DARE-GP outperforms the spectral envelope attacks and, more importantly, performs well against all of the SER classifiers. These evaluations further demonstrate DARE-GP's generalizability and efficacy against different SER classifiers, even outperforming attacks that require off-line spectral transformations [22].

7.4 Defenses Against DARE-GP


DARE-GP is essentially an evasion attack against an unknown SER classifier. This attack's effectiveness has been demonstrated above, but it raises the question of whether a knowledgeable SER Provider, one familiar with the details of DARE-GP, could deploy countermeasures that would allow it to perform accurate emotion inference. This question is considered from two different vantage points: passive and active defenses. The datasets and AAP used in these evaluations were the same as those used in Section 7.3.

Classifier    Mean MCC   Mean MCC - AAP   Mean NMI   Mean NMI - AAP
Data Flair    0.304      0.093            0.267      0.174
SEA           0.911      0.290            0.351      0.241
SEC           0.695      0.144            0.653      0.089
CNN MAA       0.916      0.220            0.822      0.243
TDNN          0.638      -0.135           0.457      0.150
RESNET        0.754      -0.121           0.642      0.108
Table 7. Matthews Correlation Coefficients (MCCs) and Normalized Mutual Information (NMI) for baseline classifier performance and when the classifiers were presented with AAP-modified audio samples.
7.4.1 Passive Defense - True Class Inference. Passive defenses attempt to infer the true class of an audio sample
without modifying the audio. The SER Provider could exploit strong relationships between false and true classes to
defeat DARE-GP. As mentioned in Section 3, the evaluation metrics used to assess the false label independence
(the absence of such strong relationships) are the Matthews Correlation Coefficient (MCC) [12] and Normalized
Mutual Information (NMI) [21].
The MCC is a special case of the Pearson Correlation Coefficient used for classification problems. As with the Pearson Correlation Coefficient, the score lies in the range [-1, 1], where 1, -1, and 0 indicate strong positive correlation, strong negative correlation, and no correlation, respectively. The baseline and perturbed class distributions were
also assessed using their NMI. NMI is scaled in the range [0, 1]; a value of 1 indicates that one can create a perfect
one-to-one mapping from predicted class labels to actual class labels. A value of 0 indicates that there is no
dependency between the predicted and actual labels.
Using the sklearn.metrics implementations of MCC and NMI, values for the black-box classifiers were calculated
under normal circumstances and when samples were perturbed by an AAP. The values in Table 7 aggregate the
values for the unified RAVDESS and TESS dataset.
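As a concrete reference, the sklearn.metrics calls in question can be used as in the following sketch; the label arrays shown are placeholders rather than outputs from this work's evaluations.

```python
from sklearn.metrics import matthews_corrcoef, normalized_mutual_info_score

# Placeholder arrays: true emotion labels and the labels predicted by a
# black-box SER classifier for the same (AAP-perturbed) audio samples.
y_true = ["happy", "sad", "angry", "neutral", "fearful", "disgust"]
y_pred = ["angry", "happy", "sad", "angry", "surprised", "calm"]

# MCC in [-1, 1]: values near 0 mean the predictions carry little
# relationship to the true classes.
mcc = matthews_corrcoef(y_true, y_pred)

# NMI in [0, 1]: values near 0 mean knowing the predicted label reveals
# little about the true label.
nmi = normalized_mutual_info_score(y_true, y_pred)

print(f"MCC: {mcc:.3f}, NMI: {nmi:.3f}")
```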
The application of an AAP significantly degraded both the MCC and NMI between the true and predicted classes for all of the classifiers; these values indicate very weak correlations between the true and predicted classes. In the case of CNN MAA, over 68% of the misclassifications were predicted as angry or surprised, while the SEC classifier misclassified over 80% of the AAP-perturbed audio samples as sad or angry. The RESNET and SEC classifiers strongly favored one class for misclassification (sad and happy, respectively). The Data Flair, SEA, and TDNN classifiers, on the other hand, did not strongly favor any class; their misclassifications were spread fairly evenly across all possible classes.
Further investigation into why these classifiers’ misclassifications fell into these distributions is left as future
work. The main takeaway from these results is that the AAP-driven misclassifications do not reveal information
that would allow the SER Provider to reverse engineer the actual classes from these forced misclassifications.
7.4.2 Active Defense. Active defenses attempt to mitigate the impacts of DARE-GP before attempting to infer
the true class. One common approach, adversarial training, is not viable because the SER Provider would not
have access to the ground truth emotional content associated with AAP-modified audio samples. That said, there
are multiple other approaches that could be used to mitigate the impacts of AAPs on a SER classifier. This work
considers several active defenses:
• Audio Resampling [26]
• Mel Spectrogram Extraction and Inversion [26]
• Noise Reduction
• AAP Band Pass

ESR with no defense, and Δ in ESR after applying each defense

             No Defense   Audio Resampling   Mel Spectrogram Extraction   Noise Reduction   AAP Band Pass
Data Flair   0.839        0.000              +0.011                       -0.024            0.000
SEA          0.675        0.000              -0.011                       -0.013            -0.001
SEC          0.686        0.000              -0.004                       -0.002            -0.040
CNN MAA      0.680        -0.001             -0.005                       -0.095            -0.076
TDNN         0.690        -0.093             -0.067                       -0.096            -0.056
RESNET       0.656        -0.018             -0.051                       -0.049            0.056
Table 8. The effects of various defenses against DARE-GP on the effective ESR against each SER classifier. Rather than using a single AAP, an AAP was randomly selected from the top 5 generated AAPs.
Firstly, a simple countermeasure to many defenses is swapping amongst multiple AAPs at runtime. For these
evaluations, rather than selecting the best AAP from the final generation, the top five AAPs were selected. During
the evaluation, one of these was selected at random for mixing with a target audio sample. From this point
on, ESR was calculated against each black-box SER classifier, consistent with the approach used in Section 7.2.
Table 8 shows the ESR represented by randomly selecting from the top five AAPs against each black-box SER
classifier when no defense was in place and when each of the defenses listed above was applied.
These defenses were largely ineffective. The failure of audio resampling was not entirely surprising; the tones that comprise AAPs are constant over their duration and are present regardless of the sample rate. Mel spectrogram extraction and inversion was similarly ineffective. The surrogate classifier and several of the black-box classifiers use MFCCs as their feature representations; if AAPs were significantly degraded during the MFCC extraction process, they would not have been effective against any of these MFCC-based classifiers in the first place. One interesting result was the improved ESR against Data Flair when this defense was applied. Data Flair uses 40 MFCCs as its feature representation, whereas the defense used 22 MFCCs to be consistent with [22]; this lossy preprocessing degraded Data Flair's performance on a few audio samples, which improved the ESR.
Noise reduction was also fairly ineffective; we believe this is because the AAP is composed of tones within the range of human speech, which helped mask the AAP from the noise reduction algorithm. Finally, band-pass filtering was considered; a band-pass filter is "a linear transformation of the data that leaves intact the components of the data within a specified band of frequencies and eliminates all other components" [13]. In this case, filters were applied to remove the frequencies within +/- 10Hz of the AAP tone frequencies. Again, because the AAP frequencies are interspersed with those of human speech, removing these bands also impacted the SER classifiers. The relative importance of various frequency bands of the audio samples to classifier results is discussed further in Section 8.
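To make two of these defenses concrete, below is a minimal sketch assuming SciPy and librosa; the sample rate and AAP tone frequencies are illustrative placeholders, not the values used in these evaluations.

```python
import librosa
from scipy.signal import butter, sosfiltfilt

def resample_defense(y, sr, low_sr=8000):
    """Down-sample and then up-sample the audio (lossy resampling defense)."""
    y_low = librosa.resample(y, orig_sr=sr, target_sr=low_sr)
    return librosa.resample(y_low, orig_sr=low_sr, target_sr=sr)

def band_stop_defense(y, sr, aap_freqs, width=10.0):
    """Remove +/- `width` Hz around each (assumed known) AAP tone frequency."""
    for f in aap_freqs:
        sos = butter(4, [f - width, f + width], btype="bandstop", fs=sr, output="sos")
        y = sosfiltfilt(sos, y)
    return y

# Illustrative placeholders: a 16 kHz recording and two assumed AAP tone frequencies.
# y, sr = librosa.load("perturbed_utterance.wav", sr=16000)
# y_defended = band_stop_defense(resample_defense(y, sr), sr, aap_freqs=[220.0, 740.0])
```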

7.5 Acoustic AAP Evaluations


The final set of evaluations assessed whether or not digitally-generated AAPs could be effective as over-air,
playable sounds. ESR is the metric used for these evaluations. Since these evaluations involved commercial smart
speakers, the ESR calculation used the Alexa-based speech-to-text implementation rather than the vosk/kaldi
libraries used in the rest of this work. These evaluations used the RAVDESS ‘with synthetic variants’ data for
training, followed by additional training and validation with the partial TESS dataset to select the best AAPs for
the target users and environments, and, finally, evaluation using the rest of the TESS dataset. RAVDESS was used
to train "factory default" AAPs without any insight into the target environment or users. The TESS data (the
portion used to train and select the best AAPs) acted as user-supplied audio samples to tailor the "factory default"
AAPs to the expected users. 80% of the RAVDESS dataset was used for training, and the train/validation/evaluation split on the TESS dataset was 20%/20%/60%. Details regarding AAP training, refinement, validation, and evaluation are provided in Figure 8.
The evaluations presented in this section are:
• Real-time playback of AAPs during user utterances at a real smart speaker from a fixed distance of 1ft
(Section 7.5.1)
• Evaluation of AAPs when the user is speaking from different distances (Section 7.5.2)
• Evaluation of AAPs when the user is speaking from different directions relative to the DARE-GP device
(Section 7.5.3)
• Automated, real-time AAP invocation triggered by the smart speaker’s wake word (Section 7.5.4)

Fig. 8. (A) Pre-train a generic population using the “canonical” dataset. This training was performed over 40 generations. (B)
Starting with this pre-trained population, tailor for the targeted users by training using a subset of the “user” dataset, and
generate a tailored population. This training was performed over 10 generations. (C) Play the AAPs generated by this tailored
population at varying loudness in environment E and record them. The loudness settings for this evaluation were: 100%,
93.75%, 81.25%, 68.75%, and 50% of maximum playback volume. (D) Pick the top AAP/loudness combinations by digitally
mixing with the “user” validation data and calculating the ESR for each AAP/loudness combination. (E) Acoustically mix the
best AAP/loudness with the “user” evaluation dataset and calculate the ESR.
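As a rough sketch of step D, selecting the best AAP/loudness combination can be framed as digitally mixing each recorded AAP with the validation utterances and scoring the mixes. In the snippet below, classify_emotion and transcribe are hypothetical stand-ins for the surrogate SER classifier and the speech-to-text utility, and simple waveform addition is an assumption about how the mixing is done.

```python
def esr(candidate_aap, validation_set, classify_emotion, transcribe):
    """Fraction of utterances misclassified while the transcript stays correct.

    validation_set is assumed to contain only samples the classifier originally
    classified correctly; waveforms are assumed to be NumPy arrays at one rate.
    """
    successes = 0
    for utterance, true_emotion, true_text in validation_set:
        # Digitally mix the recorded AAP with the utterance (truncate to match).
        n = min(len(utterance), len(candidate_aap))
        mixed = utterance[:n] + candidate_aap[:n]
        evaded = classify_emotion(mixed) != true_emotion
        intact = transcribe(mixed) == true_text
        successes += int(evaded and intact)
    return successes / len(validation_set)

def pick_best(recorded_aaps, validation_set, classify_emotion, transcribe):
    """Return the recorded AAP/loudness combination with the highest validation ESR."""
    return max(recorded_aaps,
               key=lambda rec: esr(rec["waveform"], validation_set,
                                   classify_emotion, transcribe))
```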

7.5.1 Acoustic Evaluation on Commercial Smart Speakers. The first set of evaluations considers the effectiveness
of digitally-derived AAPs when played in an over-air acoustic setting. These evaluations were performed against
two commercial smart speakers as shown in Figures 7 (A) and (B). In step C (Figure 8), the candidate AAPs are
played and recorded in the environment so that, in step D, they can be digitally mixed with the validation dataset
to identify the most promising AAP/loudness combination. Notably, it would not be reasonable to have an end
user record data using their smart speaker and then download the samples to perform step D; there is no API
to access such recordings. To address this, a separate microphone was used. This microphone was adjacent to
the Alexa device and was at a 100-degree angle (empirically determined) with respect to the speaker playing
the AAPs (see Figure 7 (C)). This setup allowed the microphone to record sound based on the room’s acoustics
and the smart speaker’s audio conduction [66]. This microphone acted as a surrogate capturing the AAPs on
behalf of the smart speaker. These recorded AAPs were then mixed with the validation dataset, and the best
were selected for the evaluation phase. To evaluate the ultimate utility of the AAPs, the separate microphone was removed, and the audio samples from the evaluation set were played simultaneously with the AAPs and recorded by the smart speaker using a custom Alexa skill that treated each AAP/audio sample recording as an interaction. These audio files and the corresponding Alexa transcripts were manually downloaded from the Alexa console [8]. The ESR was then calculated for each AAP by counting a misclassified sample as a successful evasion only if the transcript extracted by Alexa was correct.


Fig. 9. Acoustic mixing for Amazon Echo and Amazon Dot. Training: RAVDESS for 40 generations, then TESS for 10 generations; TESS data train/validation/eval split of 25/25/50.

Fig. 10. Acoustic mixing for Amazon Echo and Amazon Dot with the user at 1ft, 2ft, 3ft, 5ft and 9ft. Alexa transcripts were used when calculating ESR.

Figure 9 shows the evaluation results for the configurations shown in Figure 7 (A) and (B) when the "user"
was 1ft away from the speaker. The mean ESRs for the Amazon Echo and Amazon Dot were 0.779 and 0.736, respectively. Both had comparable SER classifier evasion rates (around 0.9) with very low transcription error rates, likely due to the robust microphone arrays present in smart speakers.
These were the first trials to play AAPs acoustically, providing the first opportunity to assess how invasive the AAPs were. During these evaluations, the mean background noise in the room was 46dB, and the loudest AAP generated in these experiments peaked at 50dB, which is below the normal conversation range [1].
7.5.2 Evaluation of AAP Robustness at Different Distances. The next evaluation used the two smart speakers
again with the "user" at different distances. In these cases, the AAPs and loudness used in Section 7.5.1 (Figure
8, step E) were used at each distance to assess how well these AAPs would work if the user were interacting
with the smart speaker from different locations in the room. The results of these evaluations are provided in
Figure 10; the 1ft distance data from Figure 9 is included here for easy comparison.
Notably, even at a distance of 9ft, there is minimal degradation in the AAP's effectiveness. There are two reasons for this robustness. First, irrespective of the "user's" distance from the smart speaker, the AAPs are always played from a device touching the smart speaker; this means the smart speaker only needs to be robust with respect to capturing the user's speech from different ranges, which is a primary requirement of these devices. Second, during training with the canonical RAVDESS data, the training data was augmented with synthetic variants, which aided AAP robustness by simulating the background noise and reverb inherent in a real environment.
7.5.3 Evaluation of AAP Robustness from Different Directions. In addition to distance, the direction from which
a user speaks could also impact the effectiveness of an AAP. To evaluate this, the same audio samples used
in Section 7.5.2 were played at 5ft and 9ft from the smart speaker at each of the directions shown in Figure
7 (D). The AAP was played from the small speaker adjacent to the smart speaker as shown in Figure 7 (D),
with no change to AAP or utterance loudness at the respective speakers. The resulting ESR of this AAP at each
distance/direction combination is provided in Table 9.
User direction with respect to the smart speaker and the DARE-GP device does play some role in AAP effectiveness. The trials in Table 9 showed a slight degradation when the user was aligned closely with the DARE-GP speaker; this alignment introduced relatively more disruption to the speech signal, degrading transcription efficacy and thus lowering the ESR. The effect was more pronounced in the 9ft test, which was not surprising considering the larger variance observed at 9ft in the distance evaluation (see Figure 10).


Distance    Direction 1     2       3       4       5       6       7       8       Mean     Variance
5ft         0.876           0.852   0.848   0.864   0.86    0.836   0.844   0.852   0.854    0.00012
9ft         0.72            0.708   0.74    0.764   0.604   0.656   0.772   0.752   0.7145   0.00294
Table 9. ESR values for an AAP at each distance/direction combination (directions as shown in Figure 7 (D)).

7.5.4 Automated AAP Invocation. The final evaluation performed on DARE-GP was an end-to-end configuration
to show that an AAP could be deployed in a usable, automated configuration. To achieve this, an off-the-shelf
wake word detection engine [47] was used to listen for "Alexa". Upon hearing the wake word, this system then
played a pre-determined AAP at a pre-determined volume while a user’s utterance was played for an Amazon
Dot. The utterance playback simulating the user was performed using the same setup shown in Figure 7 (D).
The wake word detection and AAP playback code were executed on a Raspberry Pi 3 Model B with a quad-core 1.2GHz Broadcom BCM2837 64-bit CPU and 1GB of RAM. In this configuration, DARE-GP consumes less than 3.7W of power during peak usage.
The evaluation for this trial used the same AAP and volume as the first trial at 1ft using the Amazon Dot smart
speaker from Section 7.5.1. The original ESR for this trial was 0.76. When executing the same AAP in a fully
automated manner, the resulting ESR was 0.738. The average time between wake word detection and the start of
AAP playback was 0.15 seconds.
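The following is a minimal sketch of this kind of wake-word-triggered playback loop using the Porcupine engine [47]; the Picovoice access key, AAP file name, and the use of the pyaudio and simpleaudio packages are assumptions, not the exact implementation used in this work.

```python
import struct
import pyaudio
import pvporcupine
import simpleaudio

# Assumed values: a Picovoice access key and a pre-generated AAP wave file.
ACCESS_KEY = "YOUR_PICOVOICE_ACCESS_KEY"
AAP_WAV = "best_aap.wav"

porcupine = pvporcupine.create(access_key=ACCESS_KEY, keywords=["alexa"])
aap = simpleaudio.WaveObject.from_wave_file(AAP_WAV)

pa = pyaudio.PyAudio()
stream = pa.open(rate=porcupine.sample_rate, channels=1, format=pyaudio.paInt16,
                 input=True, frames_per_buffer=porcupine.frame_length)

try:
    while True:
        # Read one audio frame from the microphone and check it for the wake word.
        frame = stream.read(porcupine.frame_length, exception_on_overflow=False)
        pcm = struct.unpack_from("h" * porcupine.frame_length, frame)
        if porcupine.process(pcm) >= 0:
            # Wake word heard: start AAP playback while the user speaks.
            aap.play()
finally:
    stream.close()
    pa.terminate()
    porcupine.delete()
```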
This evaluation, along with the previous ones, demonstrates that DARE-GP's digitally-generated AAPs can transition successfully to an acoustic setting with multiple commercial smart speakers. In fact, the approach benefits from characteristics of a realistic setting (i.e., the superior microphones and speech-to-text utilities in the smart speakers), resulting in a very effective solution. Further, it demonstrates that this approach's use of a pre-generated, universal AAP allows real-time deployment on a very small form factor.

8 DISCUSSION
There were some important experiment design choices and constraints that are worth discussing in more detail:
Use of Recorded Sound and Off-the-Shelf Models: For this work, evaluations used previously recorded audio rather than collecting a new set of emotive user data. This approach provided a benchmark against a set of known datasets, rather than possibly introducing some inadvertent bias that would lead to unrealistic results. Likewise, the off-the-shelf SER classifiers, none of which had been selected when this approach was conceived, were used for a similar reason. The use of existing data and existing SER classifiers demonstrates DARE-GP's ability to generalize.
Surrogate Model: The surrogate SER model used to train DARE-GP is significant because of its distinction from the black-box models. This model is a simple DNN that uses MFCC features. While its 0.877 accuracy is acceptable for learning the emotion-relevant spectral speech attributes, it is not as high as that of the black-box models. Additionally, it has different classes than Flair and SEA, and different features than Flair, SEC, RESNET, and TDNN. This is important for the utility of DARE-GP; by targeting an underlying attribute of a user's speech, its spectral components, DARE-GP was effective despite the differences between the surrogate and black-box models' implementations and performance.
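For context, a surrogate of this kind can be sketched as follows; the feature dimensionality, layer widths, and training details below are illustrative assumptions, not the exact surrogate architecture used here.

```python
import numpy as np
import librosa
import tensorflow as tf

def mfcc_features(path, n_mfcc=40):
    """Mean MFCC vector for one utterance (a common, simple SER feature)."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)

# Illustrative DNN over mean-MFCC features for 8 emotion classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(40,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(8, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=50, validation_split=0.1)  # X: (N, 40) MFCC means
```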
Targeted versus Untargeted Evasion: DARE-GP and the AAPs that it generates are an example of untargeted evasion; an evasive sample has fooled the SER classifier as long as the predicted class differs from the actual class. A more challenging approach would have been targeted evasion, where an AAP attempts to push all audio samples to a specific class. Based on the objectives of DARE-GP, untargeted evasion was sufficient. As demonstrated in Section 5.2, the actual and false classes are sufficiently independent of one another, which would prevent the SER Provider from inferring the actual emotion associated with a given utterance. This means, for example, that no successful targeted advertising is possible using the emotional content of any speaker's utterance.
Effects of Loudness on AAP Performance: User loudness was considered when evaluating the effectiveness of AAPs. No formal evaluations were performed in this area, but some observations were made. As stated in Section 7.1, there was no significant skew in AAP performance based on speaker identifier in either the RAVDESS or TESS datasets, but there were significant differences in the loudness of the audio samples associated with those users. For example, the TESS dataset had two users, both female; for neutral utterances, Speakers 1's and 2's mean loudness levels were approximately 78dB SPL and 54dB SPL, respectively, both higher than the peak of the loudest AAP. In addition, the evaluations in Sections 7.5.2 and 7.5.3 address differences in user location with respect to both the smart speaker and the DARE-GP device. A more detailed analysis of the effects of user loudness is a valid topic for future work.
Optimization of Genetic Programming Environment: The genetic programming (GP) metaparameters
and evolution operations determine both the speed of convergence to, and the quality of, a solution to a given
problem. In this work, the primary focus was on the upper and lower bounds of AAP ranges, which determined the
search space for DARE-GP. These ranges and the evolution operations used here empirically provided good results. Additional investigation into alternate evolutionary operations (e.g., self-adaptive simulated binary crossover [16]) is left as future work.
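As a sketch of what this bounded search space looks like in practice, the snippet below encodes an AAP as a list of (frequency, amplitude) tone genes and applies simple bounded crossover and mutation; the specific bounds, tone count, and operators shown are illustrative assumptions, not the exact DARE-GP configuration.

```python
import random

# Illustrative bounds on each tone gene: (frequency in Hz, relative amplitude).
FREQ_BOUNDS = (100.0, 3000.0)
AMP_BOUNDS = (0.0, 0.3)
N_TONES = 4

def random_aap():
    """One candidate AAP: a fixed number of bounded (frequency, amplitude) tones."""
    return [(random.uniform(*FREQ_BOUNDS), random.uniform(*AMP_BOUNDS))
            for _ in range(N_TONES)]

def crossover(parent_a, parent_b):
    """Uniform crossover: each tone gene comes from one of the two parents."""
    return [random.choice(pair) for pair in zip(parent_a, parent_b)]

def mutate(aap, rate=0.2):
    """Re-sample individual tone genes within the allowed bounds."""
    return [(random.uniform(*FREQ_BOUNDS), random.uniform(*AMP_BOUNDS))
            if random.random() < rate else gene
            for gene in aap]

# population = [random_aap() for _ in range(50)]  # evolved against a fitness function
```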
Fine Tuning of SER Models: As stated previously, AAP development did not make any assumptions regarding
the distribution of training data used for the black box SER models. That said, there was an assumption that these
models were normally effective on unperturbed audio for the intended users. None of the models considered in
this work were initially effective for the TESS dataset; in order to improve this native performance, incremental
training was performed on each of the SER models with a subset of the TESS dataset. The results presented in
this paper were achieved against these retrained SER models.
Insights on How AAPs Generate Evasive Samples: DARE-GP generates tones at varying frequencies and intensities and is able to produce evasive audio samples for multiple classifiers. It is important to understand which Mel-spectrogram frequencies are critical to the SER classifiers' accurate inferences, since introducing AAP tones at those frequencies would cause successful misclassifications. Kernel SHAP, a model-agnostic interpretation framework, was employed to attain such an understanding. Kernel SHAP approximates the conditional expectations of Shapley values for deep learning models by perturbing different portions of the input and observing the impact on the output [42].
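A minimal sketch of how Kernel SHAP can be applied to a spectrogram-based SER model is shown below; the model function is a dummy stand-in, and the spectrogram shape, background set, and nsamples value are illustrative assumptions rather than the settings used in this analysis.

```python
import numpy as np
import shap

n_mels, n_frames = 32, 94   # illustrative Mel-spectrogram shape (~3s of audio)

def ser_model_predict(flat_specs):
    # Dummy stand-in for the trained SER model: returns the probability of the
    # originally predicted emotion class for each flattened spectrogram.
    return np.full(len(flat_specs), 0.5)

# The background set approximates the conditional expectations; both arrays are
# random stand-ins for real (flattened) Mel-spectrograms.
background = np.random.rand(20, n_mels * n_frames)
sample = np.random.rand(1, n_mels * n_frames)

explainer = shap.KernelExplainer(ser_model_predict, background)
shap_values = explainer.shap_values(sample, nsamples=200)   # shape (1, n_features)

# Sum absolute SHAP values across time frames to rank Mel frequency bands by
# their contribution to the classification.
freq_importance = np.abs(np.asarray(shap_values)).reshape(n_mels, n_frames).sum(axis=1)
```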
Different frequencies were perturbed in the audio samples' Mel-spectrograms to generate associated SHAP importance scores. These scores indicate which audio frequencies contribute more (high scores) or less (low scores) toward accurate classification. Figure 11 shows the Shapley interpretation of the simple DNN SER model for the TESS dataset. From left to right, the figure shows the spectrograms of two different audio samples, the SHAP explanations of their classification results, and the spectrograms of successfully and unsuccessfully evasive perturbed variations of these audio samples. SHAP explanations are shown in two formats: a color map, where red coordinates push toward correct classification and blue toward incorrect classification, and a bar chart, where the rightward and leftward magnitudes correspond to the respective coordinates' importance toward correct and incorrect classification.
In the first sample, the successful evasions had AAP tones aligned with the blue regions (or leftward bars) in the SHAP explanation, meaning that those regions contributed negatively toward the correct classification; AAP tones introduce higher amplitude at those frequencies (i.e., regions), causing SER misclassifications. In the second sample, both successful evasions have a tone overlapping the third and fourth formants (i.e., F3 and F4) of speech. Formants are the frequency peaks in the spectrum with a high degree of energy, and F3 and F4 have been associated with emotional speech expression [64, 65]. The AAP tones introduce noise at those formant frequencies to mislead the SER classifiers. Finally, none of the AAP tones overlaps with the first two formants (F1 and F2), which are closely tied to the articulation of vowels and consonants and to effective speech transcription [6, 23, 71]. This demonstrates how the AAP generation approach ensures higher speech transcription efficacy.

Fig. 11. Spectrograms of two TESS audio samples, SHAP explanations of the SIMPLE DNN classification of these samples, and example perturbed instances, one successfully evasive, the other not.

9 CONCLUSION
The objective of this work was to reduce the privacy loss, with respect to speaker emotions, incurred when using a smart speaker monitored by an SER Provider. The presented DARE-GP approach pre-calculates AAPs that, for a set of users, are utterance- and user-location-independent. These attributes allowed DARE-GP to defeat SER classifiers in real time, as demonstrated in Section 7.1. Every evaluation using the TESS dataset demonstrated this utterance independence, while the evaluations in Sections 7.5.2 and 7.5.3 showed only moderate degradation of an AAP's effectiveness due to the user's distance and direction. DARE-GP targets spectral features that underlie the emotional content of speech; the evaluations in Section 7.2 demonstrate that this approach allows AAPs to transfer to a broad, heterogeneous set of black-box SER classifiers. The evaluations in Sections 7.3 and 7.4 demonstrate DARE-GP's superior performance compared to state-of-the-art SER attacks and its robustness against defenses from a knowledgeable SER Provider. Finally, DARE-GP was shown to be effective in a real-time, real-world scenario in Section 7.5.4, where a SWaP-efficient form factor using wake word detection automatically deployed AAPs whenever a user interacted with a smart speaker, achieving an ESR of 0.738 against an Amazon Dot.

REFERENCES
[1] 2022. Normal conversation loudness. https://tinyurl.com/9r4xrz24
[2] Sajjad Abdoli, Luiz G Hafemann, Jerome Rony, Ismail Ben Ayed, Patrick Cardinal, and Alessandro L Koerich. 2019. Universal adversarial
audio perturbations. arXiv preprint arXiv:1908.03173 (2019).


[3] Hadi Abdullah, Kevin Warren, Vincent Bindschaedler, Nicolas Papernot, and Patrick Traynor. 2021. Sok: The faults in our asrs: An
overview of attacks against automatic speech recognition and speaker identification systems. In 2021 IEEE symposium on security and
privacy (SP). IEEE, 730–747.
[4] Affectiva. 2017. A Market for Emotions. Retrieved June 2, 2022 from https://www.affectiva.com/news-item/a-market-for-emotions/
[5] Affectiva. 2019. MARS. Retrieved June 2, 2022 from https://www.affectiva.com/success-story/mars/
[6] A Ali, S Bhatti, and MS Mian. 2006. Formants based analysis for speech recognition. In 2006 IEEE International Conference on Engineering
of Intelligent Systems. IEEE, 1–3.
[7] Nourah Alswaidan and Mohamed El Bachir Menai. 2020. A survey of state-of-the-art approaches for emotion recognition in text.
Knowledge and Information Systems 62, 8 (2020), 2937–2987.
[8] Amazon. 2022. What Is Automatic Speech Recognition? Retrieved May 14, 2022 from https://developer.amazon.com/en-US/alexa/alexa-
skills-kit/asr
[9] Oana Bălan, Gabriela Moise, Livia Petrescu, Alin Moldoveanu, Marius Leordeanu, and Florica Moldoveanu. 2019. Emotion classification
based on biophysical signals and machine learning techniques. Symmetry 12, 1 (2019), 21.
[10] Thomas L Blum, Douglas F Keislar, James A Wheaton, and Erling H Wold. 1999. Method and article of manufacture for content-based
analysis, storage, retrieval, and segmentation of audio information. US Patent 5,918,223.
[11] Nicholas Carlini and David Wagner. 2018. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE security and
privacy workshops (SPW). IEEE, 1–7.
[12] Davide Chicco and Giuseppe Jurman. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy
in binary classification evaluation. BMC genomics 21, 1 (2020), 1–13.
[13] Lawrence J Christiano and Terry J Fitzgerald. 2003. The band pass filter. international economic review 44, 2 (2003), 435–465.
[14] Michael Cowling and Renate Sitte. 2003. Comparison of techniques for environmental sound recognition. Pattern recognition letters 24,
15 (2003), 2895–2907.
[15] data flair.training. 2019. Python Mini Project – Speech Emotion Recognition with librosa. Retrieved May 14, 2022 from https://data-
flair.training/blogs/python-mini-project-speech-emotion-recognition/
[16] Kalyanmoy Deb, Karthik Sindhya, and Tatsuya Okabe. 2007. Self-adaptive simulated binary crossover for real-parameter optimization.
In Proceedings of the 9th annual conference on genetic and evolutionary computation. 1187–1194.
[17] Kate Dupuis and M Kathleen Pichora-Fuller. 2010. Toronto emotional speech set (TESS)-Younger talker_Happy. (2010).
[18] Lutfiye Durak and Orhan Arikan. 2003. Short-time Fourier transform: two fundamental properties and an optimal implementation. IEEE
Transactions on Signal Processing 51, 5 (2003), 1231–1242.
[19] Alex Engler. 2021. Why President Biden should ban affective computing in federal law enforcement. Retrieved May 14, 2022 from https:
//www.brookings.edu/blog/techtank/2021/08/04/why-president-biden-should-ban-affective-computing-in-federal-law-enforcement/
[20] Antti J Eronen, Vesa T Peltonen, Juha T Tuomi, Anssi P Klapuri, Seppo Fagerlund, Timo Sorsa, Gaëtan Lorho, and Jyri Huopaniemi.
2005. Audio-based context recognition. IEEE Transactions on Audio, Speech, and Language Processing 14, 1 (2005), 321–329.
[21] Pablo A Estévez, Michel Tesmer, Claudio A Perez, and Jacek M Zurada. 2009. Normalized mutual information feature selection. IEEE
Transactions on neural networks 20, 2 (2009), 189–201.
[22] Jinxing Gao, Diqun Yan, and Mingyu Dong. 2022. Black-box adversarial attacks through speech distortion for speech emotion recognition.
EURASIP Journal on Audio, Speech, and Music Processing 2022, 1 (2022), 1–10.
[23] Cédric Gendrot and Martine Adda-Decker. 2005. Impact of duration on F1/F2 formant values of oral vowels: an automatic analysis of
large broadcast news corpora in French and German. In Ninth European Conference on Speech Communication and Technology.
[24] Google. 2022. Analyzing Sentiment | Cloud Natural Language API | Google Cloud. Retrieved November 14, 2022 from https:
//cloud.google.com/natural-language/docs/analyzing-sentiment
[25] Claudio Guarnieri. 2012. Cuckoo Sandbox - Malware Analysis. Retrieved May 14, 2022 from https://cuckoosandbox.org/
[26] Shehzeen Hussain, Paarth Neekhara, Shlomo Dubnov, Julian McAuley, and Farinaz Koushanfar. 2021. {WaveGuard}: Understanding
and Mitigating Audio Adversarial Examples. In 30th USENIX Security Symposium (USENIX Security 21). 2273–2290.
[27] Eugene Ilyushin and Dmitry Namiot. 2021. An approach to the automatic enhancement of the robustness of ML models to external
influences on the example of the problem of biometric speaker identification by voice. International Journal of Open Information
Technologies 9, 6 (2021), 11–19.
[28] Fumitada Itakura. 1968. Analysis synthesis telephony based on the maximum likelihood method. Reports of the 6th Int. Cong. Acoust., 1968 (1968).
[29] Keith Johnson. 2018. Vocal tract length normalization. UC Berkeley PhonLab Annual Report 14, 1 (2018).
[30] Kosta Jovanovic. 2021. GitHub - Data-Science-kosta/Speech-Emotion-Classification-with-PyTorch. Retrieved May 14, 2022 from
https://github.com/Data-Science-kosta/Speech-Emotion-Classification-with-PyTorch
[31] Sachin S Kajarekar, Harry Bratt, Elizabeth Shriberg, and Rafael De Leon. 2006. A study of intentional voice modifications for evading
automatic speaker recognition. In 2006 IEEE Odyssey-The Speaker and Language Recognition Workshop. IEEE, 1–6.


[32] Amy Klobuchar. 2020. Following Privacy Concerns Surrounding Amazon Halo, Klobuchar Urges Administration to Take Action to
Protect Personal Health Data. Retrieved May 14, 2022 from https://www.klobuchar.senate.gov/public/index.cfm/2020/12/following-
privacy-concerns-surrounding-amazon-halo-klobuchar-urges-administration-to-take-action-to-protect-personal-health-data
[33] John R Koza. 1995. Survey of genetic algorithms and genetic programming. In Wescon conference record. WESTERN PERIODICALS
COMPANY, 589–594.
[34] John R Koza and Riccardo Poli. 2005. Genetic programming. In Search methodologies. Springer, 127–164.
[35] Yogesh Kulkarni and Krisha Bhambani. 2021. Kryptonite: An Adversarial Attack Using Regional Focus. In International Conference on
Applied Cryptography and Network Security. Springer, 463–481.
[36] Li Lee and Richard Rose. 1998. A frequency warping approach to speaker normalization. IEEE Transactions on speech and audio processing
6, 1 (1998), 49–60.
[37] Lingkun Li, Manni Liu, Yuguang Yao, Fan Dang, Zhichao Cao, and Yunhao Liu. 2020. Patronus: Preventing unauthorized speech
recordings with support for selective unscrambling. In Proceedings of the 18th Conference on Embedded Networked Sensor Systems.
245–257.
[38] Zhuohang Li, Cong Shi, Yi Xie, Jian Liu, Bo Yuan, and Yingying Chen. 2020. Practical adversarial attacks against speaker recognition
systems. In Proceedings of the 21st international workshop on mobile computing systems and applications. 9–14.
[39] Jianwei Liu, Yinghui He, Chaowei Xiao, Jinsong Han, Le Cheng, and Kui Ren. 2022. Physical-World Attack towards WiFi-based Behavior
Recognition. In IEEE INFOCOM 2022-IEEE Conference on Computer Communications. IEEE, 400–409.
[40] Steven R Livingstone and Frank A Russo. 2018. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A
dynamic, multimodal set of facial and vocal expressions in North American English. PloS one 13, 5 (2018), e0196391.
[41] Beth Logan. 2000. Mel frequency cepstral coefficients for music modeling. In In International Symposium on Music Information Retrieval.
Citeseer.
[42] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in neural information processing
systems 30 (2017).
[43] Thomas Macaulay. 2020. British police to trial facial recognition system that detects your mood. Retrieved May 14, 2022 from
https://thenextweb.com/news/british-police-to-trial-facial-recognition-system-that-detects-your-mood
[44] Vijini Mallawaarachchi. 2017. Introduction to Genetic Algorithms — Including Example Code. Retrieved May 14, 2022 from
https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3
[45] Stephen Edward McAdams. 1984. Spectral fusion, spectral parsing and the formation of auditory images. Stanford university.
[46] Andrew McStay. 2021. UN Human Rights Council adoption of ’Right to privacy in the digital age’: Emotion recognition technologies
acknowledged. Retrieved June 3, 2022 from https://emotionalai.org/news/2021/10/25/un-human-rights-council-adoption-of-right-to-
privacy-in-the-digital-age-emotion-recognition-technologies-now-an-emergent-priority
[47] Eric Mikulin. 2022. GitHub - Picovoice/porcupine: On-device wake word detection powered by deep learning. Retrieved June 20, 2022
from https://github.com/Picovoice/porcupine
[48] Nelson Mogran, Hervé Bourlard, and Hynek Hermansky. 2004. Automatic speech recognition: An auditory perspective. In Speech
processing in the auditory system. Springer, 309–338.
[49] Jose Patino, Natalia Tomashenko, Massimiliano Todisco, Andreas Nautsch, and Nicholas Evans. 2020. Speaker anonymisation using the
McAdams coefficient. arXiv preprint arXiv:2011.01130 (2020).
[50] Rosalind W Picard. 2000. Affective computing. MIT press.
[51] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek,
Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and
understanding. IEEE Signal Processing Society.
[52] Mitesh Puthran. 2021. GitHub - MiteshPuthran/Speech-Emotion-Analyzer: The neural network model is capable of detecting five
different male/female emotions from audio speeches. (Deep Learning, NLP, Python). Retrieved October 20, 2022 from https://github.
com/MiteshPuthran/Speech-Emotion-Analyzer
[53] Asma Rabaoui, Zied Lachiri, and Noureddine Ellouze. 2007. Towards an optimal feature set for robustness improvement of sounds
classification in a HMM-based classifier adapted to real world background noise. In Proc. 4th Int. Multi-Conf. Systems, Signals & Devices.
[54] Amirhossein Rajabi and Carsten Witt. 2021. Stagnation detection with randomized local search. In European Conference on Evolutionary
Computation in Combinatorial Optimization (Part of EvoStar). Springer, 152–168.
[55] Asif Salekin, Shabnam Ghaffarzadegan, Zhe Feng, and John Stankovic. 2019. A real-time audio monitoring framework with limited data
for constrained devices. In 2019 15th International Conference on Distributed Computing in Sensor Systems (DCOSS). IEEE, 98–105.
[56] Jackie Salo. 2021. China using ‘emotion recognition technology’ for surveillance. Retrieved May 14, 2022 from https://nypost.com/
2021/03/04/china-using-emotion-recognition-technology-in-surveillance/
[57] Roger N Shepard. 1964. Circularity in judgments of relative pitch. The journal of the acoustical society of America 36, 12 (1964), 2346–2353.
[58] Cary R Spitzer and Leanna Rierson. 2017. RTCA DO-297/EUROCAE ED-124 Integrated Modular Avionics (IMA) Design Guidance and
Certification Considerations. In Digital Avionics Handbook. CRC Press, 614–623.


[59] Guy-Bart Stan, Jean-Jacques Embrechts, and Dominique Archambeau. 2002. Comparison of different impulse response measurement
techniques. Journal of the Audio engineering society 50, 4 (2002), 249–262.
[60] Stanley Smith Stevens, John Volkmann, and Edwin Broomell Newman. 1937. A scale for the measurement of the psychological magnitude
pitch. The journal of the acoustical society of america 8, 3 (1937), 185–190.
[61] Ke Sun, Chen Chen, and Xinyu Zhang. 2020. " Alexa, stop spying on me!" speech privacy protection against voice assistants. In
Proceedings of the 18th conference on embedded networked sensor systems. 298–311.
[62] Shinnosuke Takamichi, Kazuhiro Kobayashi, Kou Tanaka, Tomoki Toda, and Satoshi Nakamura. 2015. The NAIST text-to-speech system
for the Blizzard Challenge 2015. In Proc. Blizzard Challenge workshop, Vol. 2. Berlin, Germany.
[63] Rohan Taori, Amog Kamsetty, Brenton Chu, and Nikita Vemuri. 2019. Targeted adversarial examples for black box audio systems. In
2019 IEEE security and privacy workshops (SPW). IEEE, 15–20.
[64] Teija Waaramaa, Paavo Alku, and Anne-Maria Laukkanen. 2006. The role of F3 in the vocal expression of emotions. Logopedics
Phoniatrics Vocology 31, 4 (2006), 153–156.
[65] Teija Waaramaa, Anne-Maria Laukkanen, Paavo Alku, and Eero Väyrynen. 2008. Monopitched expression of emotions in different
vowels. Folia Phoniatrica et Logopaedica 60, 5 (2008), 249–255.
[66] Martin Weik. 2012. Communications standard dictionary. Springer Science & Business Media.
[67] Weilin Xu, Yanjun Qi, and David Evans. 2016. Automatically evading classifiers: A case study on PDF malware classifiers. In Proceedings
of the 23rd Annual Network and Distributed System Security Symposium.
[68] Felix Weninger, Florian Eyben, Björn W Schuller, Marcello Mortillaro, and Klaus R Scherer. 2013. On the acoustics of emotion in audio:
what speech, music, and sound have in common. Frontiers in psychology 4 (2013), 292.
[69] Chung-Hsien Wu and Chia-Hsin Hsieh. 2006. Multiple change-point audio segmentation and classification using an MDL-based
Gaussian model. IEEE Transactions on Audio, Speech, and Language Processing 14, 2 (2006), 647–657.
[70] Mingke Xu, Fan Zhang, Xiaodong Cui, and Wei Zhang. 2021. Speech emotion recognition with multiscale area attention and data
augmentation. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6319–6323.
[71] Yanli Zheng, Richard Sproat, Liang Gu, Izhak Shafran, Haolang Zhou, Yi Su, Daniel Jurafsky, Rebecca Starr, and Su-Youn Yoon. 2005.
Accent detection and speech recognition for Shanghai-accented Mandarin.. In Interspeech. Citeseer, 217–220.
