

VoicePAT: An Efficient Open-source Evaluation Toolkit for Voice Privacy Research

Sarina Meyer¹*, Student Member, IEEE, Xiaoxiao Miao²,³*, Member, IEEE, and Ngoc Thang Vu¹, Member, IEEE

¹ Institute for Natural Language Processing, University of Stuttgart, Germany
² Singapore Institute of Technology, Singapore
³ National Institute of Informatics, Japan

Corresponding author: Sarina Meyer (email: sarina.meyer@ims.uni-stuttgart.de).
* Equal contribution. This study is supported by the Carl Zeiss Foundation, JST CREST Grants (JPMJCR18A6), and MEXT KAKENHI Grant (22K21319).
arXiv:2309.08049v2 [cs.SD], 22 Dec 2023

ABSTRACT Speaker anonymization is the task of modifying a speech recording such that the original speaker cannot be identified anymore. Since the first Voice Privacy Challenge in 2020, which also released an evaluation framework, the popularity of this research topic has been continually increasing. However, the comparison and combination of different anonymization approaches remain challenging due to the complexity of evaluation and the absence of user-friendly research frameworks. We therefore propose an efficient speaker anonymization and evaluation framework based on a modular and easily extendable structure, written almost fully in Python. The framework facilitates the orchestration of several anonymization approaches in parallel and allows for interfacing between different techniques. Furthermore, we propose modifications to common evaluation methods which improve the quality of the evaluation and reduce their computation time by 65 to 95%, depending on the metric. Our code is fully open source.

INDEX TERMS speaker anonymization, voice privacy, privacy evaluation

I. INTRODUCTION
Speaker anonymization [1] is a task in which speech recordings are automatically modified such that the original speaker becomes unidentifiable from the audio, usually by changing the voice in the direction of an artificial target speaker. The goal of this process is to preserve the speaker's voice privacy while keeping enough information from the input to use the anonymized audio in downstream tasks (e.g., speech recognition [2]). In order to foster research on this topic, the Voice Privacy Challenge (VPC) has been held in 2020 and 2022 [1], [3]. The open-source framework accompanying the challenges, consisting of code bases, baseline and evaluation models, datasets, and techniques, has had a great influence on the field, with most approaches using at least part of the VPC framework.

The primary goal of the VPC framework is to address the challenges associated with voice privacy research. However, it has two key drawbacks: (1) The heavy reliance on the Kaldi toolkit [4] results in processes that are complicated and opaque. (2) The static structure of this framework introduces many redundant computations, leading to inefficiencies.

Given the current importance of voice privacy research in society and politics, there is demand for a framework that allows quick ideation and experimentation without the burden of infrastructure issues, in order to promote collaboration among researchers. Although alternative evaluation frameworks [5], [6] exist, they do not adhere to the standard evaluation protocol introduced by the VPC. We thus propose a robust and modular framework for speaker anonymization. The framework is implemented almost fully in Python and consists of two branches, (1) anonymization and (2) evaluation, as shown in Figure 1. Both branches exhibit a modular structure in which single components can be skipped or added. The methods and models in each component can be easily exchanged for alternatives. The control over the composition of a pipeline is done exclusively via configuration files such that different speaker anonymization systems (SASs) or evaluation metrics can be compared with minimal effort.


[Figure 1: diagram of the proposed pipeline, consisting of the anonymization branch with its three steps (1. disentanglement of original speech into content, prosody, and speaker features; 2. modification of the original features, e.g., by a speaker anonymizer or other speech modification; 3. generation of anonymized speech via speech synthesis) and the evaluation branch with privacy metrics (ASV-EER under the OA, AA-lazy, and AA-semi conditions, plus other privacy metrics such as Cllr) and utility metrics (ASR-WER, G_VD, and other metrics such as pitch correlation or emotion preservation). Both branches are controlled by configuration files that select the models used in each module.]

FIGURE 1: Proposed framework, consisting of separate anonymization and evaluation branches and example configurations for each branch.

We extend and improve commonly used evaluation methods by employing the ESPnet [7] and SpeechBrain [8] toolkits for evaluation instead of Kaldi [4]. Furthermore, we significantly speed up the evaluation by combining data reduction and finetuning techniques for privacy and utility metrics. Overall, we believe this framework will be an essential tool for advancing research in voice privacy and facilitating participation in future Voice Privacy Challenges in the field.

Our contributions are as follows:
• We propose improvements to standard evaluation methods for speaker anonymization which make these evaluations more efficient, easier to use, and more feasible for intermediate evaluations.
• We show through experiments on two primary VPC baselines and two state-of-the-art SASs that these improvements provide stronger privacy tests, while requiring 65 to 95% less time to execute.
• We release all code in a new toolkit, VoicePAT¹ (Voice Privacy Anonymization Toolkit), which further includes pipelines for running different state-of-the-art SASs.
• Due to its modularity, extending VoicePAT with more anonymization systems and evaluation metrics is easily possible, enabling the comparison of different approaches within one framework and probing the effectiveness of the anonymization using a more diverse attack landscape.

¹ https://github.com/DigitalPhonetics/VoicePAT

II. BACKGROUND AND RELATED WORK
Before describing our proposed framework and evaluation, we will first give some background about the current state of the topic, existing evaluation metrics, and frameworks.

A. SPEAKER ANONYMIZATION
The goal of an SAS [1], [3] is to automatically modify the voice in an audio recording to make the original speaker unidentifiable. Following the VPC baseline models, approaches can be categorized into methods based on disentanglement and those based on digital signal processing. Whereas the latter perceptively modify speech to conceal the original speaker's identity [1], [3], they are less effective than disentanglement-based methods [9]–[19], which involve three steps, as shown on the left of Figure 1:

(1) Disentanglement of original speech: (i) frame-level content feature extraction via automatic speech recognition (ASR) models [2], [20] or self-supervised learning (SSL) based content encoders [12], [13]; (ii) frame-level prosody feature extraction, e.g., F0 using the YAAPT algorithm [21]; (iii) utterance-level speaker embedding extraction, e.g., from pre-trained x-vector [22] or ECAPA-TDNN [23] models for automatic speaker verification (ASV).

(2) Modification of original features: This is a crucial step to hide the original speaker's identity. Most works focus on modifying the original speaker embeddings, assuming that identity is mainly encoded in them. Typically, a selection-based speaker anonymizer [24] replaces an original speaker vector with an anonymized speaker vector. This anonymized vector is a mean speaker vector derived from a randomly chosen set of speaker vectors in an external pool. However, the averaging process on speaker vectors leads to limited speaker diversity in the generated anonymized voices and to serious speaker privacy leakage when facing stronger attackers [14]. To mitigate these problems, recent works adopt DNN-based anonymizers. For example, [10], [11] introduce a Wasserstein generative adversarial network (GAN) that is trained to turn random noise into artificial speaker embeddings that follow a similar distribution as the original speaker vectors. Another approach employs an orthogonal Householder neural network (OHNN) to rotate original speaker vectors, ensuring that anonymized vectors stay in the original space and maintain speech naturalness. The parameters of the OHNN-based anonymizer are trained using classification and similarity losses, encouraging distinct speaker identities [14].
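To make the selection-based scheme concrete, the following is a minimal sketch of such an anonymizer. It assumes a NumPy matrix of pre-computed pool x-vectors; the distance measure and candidate sizes are simplified compared to the design choices in [24], and the function name is illustrative rather than part of the VoicePAT API.

```python
import numpy as np

def select_anonymized_xvector(original: np.ndarray, pool: np.ndarray,
                              n_farthest: int = 200, n_select: int = 100,
                              seed: int = 0) -> np.ndarray:
    """Replace a speaker vector by the mean of randomly chosen far-away pool vectors."""
    rng = np.random.default_rng(seed)
    # rank external pool speakers by cosine similarity to the original speaker
    sims = pool @ original / (np.linalg.norm(pool, axis=1) * np.linalg.norm(original))
    farthest = np.argsort(sims)[:n_farthest]        # least similar candidates
    chosen = rng.choice(farthest, size=n_select, replace=False)
    # the anonymized identity is the average of the chosen pool vectors
    return pool[chosen].mean(axis=0)
```

Averaging many pool vectors in this way is precisely what limits the diversity of the resulting anonymized voices, as discussed above.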
(3) Generation of anonymized speech: The anonymized speaker, prosody, and content features are then fed into a speech synthesis model [25], [26] to generate anonymized speech.

The VPC introduced two primary disentanglement-based SASs.


In BL 1.a, speech is disentangled into speaker identity by a pre-trained x-vector, fundamental frequency by the YAAPT algorithm, and linguistic information by a pre-trained factorized time delay neural network (TDNN-F) based ASR acoustic model [2], [20]. Then, the selection-based speaker anonymization scheme [24] modifies the source x-vector to hide speaker information. The speech synthesis acoustic model (SS AM) generates Mel-filterbank features using the anonymized pseudo x-vector, F0, and linguistic features, followed by a neural source-filter (NSF)-based waveform generation model [25] to synthesize anonymized speech. Similar to BL 1.a, BL 1.b replaces the traditional speech synthesis pipeline (SS AM + NSF) with a unified HiFi-GAN [26] NSF model as the waveform generator.

B. EVALUATION
Speaker anonymization commonly has two objectives: (1) privacy: protecting the identity of a speaker, and (2) utility: keeping other attributes of the original audio that are needed for use in downstream applications (e.g., linguistic content, naturalness, prosody, speaker emotion) unchanged. The challenge is to optimize an SAS to achieve a trade-off between both objectives, whereby the weighting between them and the utility assessment metrics depend on the application.

1) SPEAKER PRIVACY PROTECTION
To evaluate how effectively the speaker identity is concealed against different attackers, it is most common to compute the equal error rate (EER) using ASV evaluation models. For this, an attacker compares the anonymized trial utterances processed by users against enrollment utterances in different attack conditions, with either a model trained on original (ASV_eval) or on anonymized (ASV_eval^anon) data [1], [3]:
• Unprotected (OO): a baseline condition to assess the effectiveness of the ASV_eval attacker when both enrollment and trial utterances are not anonymized (i.e., original).
• Ignorant (OA): attackers are unaware of the anonymization; they use original enrollment data and ASV_eval to infer the identity of anonymized trial utterances.
• Lazy-informed (AA-lazy): attackers use anonymized enrollment speech, generated with the same SAS but inaccurate parameters, and ASV_eval to detect identities.
• Semi-informed (AA-semi): the attacker is similar to the lazy-informed one, but employs a more powerful ASV_eval^anon model trained on anonymized speech, which helps to reduce the mismatch between the original and anonymized speech when inferring the speaker identity.
For a successful anonymization, an EER close to 50% is targeted. An alternative to the EER metric is the log-likelihood-ratio cost function C_llr, decomposed into a discrimination loss (C_llr^min) and a calibration loss (C_llr − C_llr^min) [1]. Other privacy metrics include the linkability (D_↔^sys) between two utterances [27], [28], the de-identification (DeID) based on voice similarity matrices [29], and the expected (D_ECE) and worst-case (log(l)) privacy disclosure metrics of the ZEBRA framework [30].
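All of these attack conditions report the EER, i.e., the operating point at which the false acceptance and false rejection rates of the attacker are equal. A minimal, toolkit-independent sketch of this computation from ASV trial scores could look as follows (assuming higher scores indicate same-speaker trials):

```python
import numpy as np

def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER in % from trial scores and binary labels (1 = same speaker, 0 = different)."""
    thresholds = np.unique(scores)
    fars, frrs = [], []
    for t in thresholds:
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))   # false acceptance rate
        frrs.append(np.mean(~accept[labels == 1]))  # false rejection rate
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))            # threshold where FAR is closest to FRR
    return 100.0 * (fars[idx] + frrs[idx]) / 2.0
```

An EER near 50% means the attacker cannot link anonymized trials to enrollment speakers better than chance, whereas an EER near 0% on original data (the OO condition) indicates a strong attacker.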
2) SPEECH UTILITY PRESERVATION
As most applications require the anonymization to retain the linguistic content of the speech, the primary evaluation for speech utility is performed with ASR and measured as the word error rate (WER). The VPC introduces two models for this: ASR_eval (used in the A-lazy condition) is trained on the original data, and ASR_eval^anon (A-semi) is trained on the anonymized data. ASR_eval^anon tests the best-case scenario in which the downstream ASR knows how anonymization affects the audio quality and can adapt to it, whereas ASR_eval might simulate a more realistic condition in which such information is not known. The lower the WER, the better.

Another common utility metric is the gain of voice distinctiveness G_VD [28], which assesses how well the ability to distinguish different speakers is kept during anonymization. If the voice distinctiveness in the anonymized space is the same as in the original space, the G_VD is close to zero. If it is improved, the score is above zero, otherwise below. The metric is closely related to the privacy metrics and is computed using ASV_eval and voice similarity matrices.

Depending on the application, other speech utility metrics could be included, e.g., prosody preservation via pitch correlation as done in the VPC 2022 [3]. For instance, a simple anonymization approach could be to transcribe speech with an ASR model and synthesize it back to speech from the transcription using a text-to-speech (TTS) system. This conceals the speaker identity but also removes other paralinguistic attributes like emotion and health status, which are important for health applications.

For simplicity, we will focus on EER as a privacy indicator, and on WER and G_VD for measuring utility in speech recognition tasks in our experiments.
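For reference, the WER is the word-level Levenshtein distance between reference and hypothesis transcripts, normalized by the reference length; a self-contained sketch (not the ESPnet or Kaldi scoring code) is shown below.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in %, computed with a standard dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                  # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                                  # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)
```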
C. EXISTING FRAMEWORKS AND THEIR LIMITATIONS
To support participation in the challenges, the VPC published an open-source framework with code for all baselines and evaluation metrics. However, this framework lacks the flexibility to skip single steps in its run pipeline, rerun only parts of it, or add new metrics. Most of the algorithms are written in the C++-based Kaldi toolkit, which is challenging to maintain and lacks compatibility with standard Python-based speech processing models. Furthermore, the evaluation models included in the framework can take several days for computations. Combined with the difficulty of skipping previously computed calculations, performing a full anonymization with subsequent evaluation in the VPC framework is complicated and expensive, potentially discouraging new researchers from working on SAS development. Motivated by similar concerns about the framework, [5], [6] recently presented alternative evaluation frameworks written in Python and exhibiting modular and extendable structures. However, they do not test their frameworks with standard SAS approaches nor compare their evaluation metrics with the ones in the VPC framework. This makes it difficult to assess the quality of their improvements.


We thus find a lack of suitable anonymization and evaluation frameworks for this topic. Hence, we propose a new alternative in this paper.

III. PROPOSED EFFICIENT FRAMEWORK
The proposed framework for speaker anonymization research consists of two pipeline branches, shown in Figure 1: one for the anonymization process and one for evaluation. Both branches consist of several modules, which can be instantiated with different models. All parameters for selecting the order and type of modules, as well as all other settings for running a pipeline, are given in configuration files. Modules can thus be exchanged, extended, or skipped if the general objective allows it. In this way, the framework provides a flexible option to modify existing approaches, combine ideas from different systems, and test the effect of single components in a controlled fashion.

A. ANONYMIZATION BRANCH
The goal of the anonymization branch is to provide a platform for researchers developing an SAS to evaluate their ideas quickly. Ideally, if they only want to test a minor change like a different speaker embedding modification mechanism, they would only have to add a new model (a Python class) to the speaker embedding modification module and adjust the configuration file. If the modification is radical, they might need to add a new module and pipeline.

Generally, an SAS consists of the following components: (1) a configuration file, (2) a pipeline, (3) a collection of modules, and (4) one specific model in each module. The configuration file specifies the pipeline (e.g., the GAN-based pipeline of [11]), which then defines the obligatory and optional modules and their processing order. Each module can be instantiated with different models or approaches, e.g., different speech synthesis models. The selection of one model per module and the inclusion of optional modules are specified in the configuration file. By default, the output of each module is saved to disk. This makes it possible to skip the computation of one module if it has been computed before and its input has not changed, and thus to test minor modifications more efficiently.
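As a purely hypothetical illustration (the class name and configuration keys below are placeholders, not the actual VoicePAT interface), adding a new speaker embedding modification method could amount to one small Python class plus a configuration entry that points to it:

```python
import numpy as np

class NoiseAnonymizer:
    """Toy speaker-embedding anonymizer that adds Gaussian noise to the original vector."""

    def __init__(self, scale: float = 0.1, seed: int = 0):
        self.scale = scale
        self.rng = np.random.default_rng(seed)

    def anonymize(self, speaker_embedding: np.ndarray) -> np.ndarray:
        noise = self.rng.normal(0.0, self.scale, size=speaker_embedding.shape)
        return speaker_embedding + noise

# Hypothetical configuration entry selecting this model for the
# speaker embedding modification module, e.g. in YAML:
#   modules:
#     speaker_embeddings:
#       anonymizer: NoiseAnonymizer
#       scale: 0.1
```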
B. EVALUATION BRANCH
Following the standards of voice privacy research, the evaluation branch is divided into two modules: privacy and utility. Each module corresponds to one evaluation aspect and can consist of one or several metrics. For example, ASV is one module of the privacy evaluation and is mainly measured by one metric, the EER; however, multiple models can be used to calculate it. Similar to the anonymization branch, all settings are again set in configuration files. To further improve the efficiency of the proposed framework, we employ more powerful ASV and ASR models, explore their training strategies, and modify the computation of G_VD, as described below.

1) Evaluation models
In the VPC framework, EER and G_VD are computed using a Kaldi-based x-vector speaker encoder with a PLDA distance model [22], and WERs are computed using a Kaldi-based TDNN-F model [2]. As ASV and ASR technology develops, it is important to examine the impact of advanced models on the respective evaluation results. Aiming to find the most reliable choice, we propose using evaluation models based on the state-of-the-art toolkits SpeechBrain [8] for ASV and ESPnet [7] for end-to-end ASR, as shown in Figure 2.

Both toolkits are developed using PyTorch [31] in the context of research, and they meet two requirements for our purposes: (1) They provide a user-friendly approach to modify training recipes, which cover a wide range of hyperparameters and architecture choices for the models. (2) Both toolkits are continuously developed, ensuring that they regularly incorporate the latest advancements in both ASV and end-to-end ASR techniques. For ASV evaluation models, we present choices including the SpeechBrain-based x-vector and the cutting-edge ECAPA-TDNN, featuring both cosine and PLDA back-ends. For ASR evaluation models, we provide an ESPnet-based Transformer ASR model combining Connectionist Temporal Classification (CTC) with an Attention Encoder-Decoder (AED) [32]. A Transformer-based language model is trained once using LibriSpeech-train-clean-360 [33] and used for decoding.

2) Modifications for G_VD
The gain of voice distinctiveness metric G_VD [29] is defined as the diagonal dominance ratio of two voice similarity matrices, one for the original speaker space and one for the anonymized one. In the VPC framework, those similarity scores are computed by the ASV_eval model trained on the original data. ASV_eval yields more reliable scores for the original data but introduces a mismatch when applied to anonymized speech. We are interested in exploring different evaluation models for computing the similarity scores: (1) using the ASV_eval model; (2) using the ASV_eval^anon model; (3) using ASV_eval for original data and ASV_eval^anon for anonymized data, to see whether this could improve the accuracy of the similarity scores for the different types of data. Furthermore, different from the VPC framework, where all utterances of each speaker are considered for the similarity computation, we enhance efficiency by randomly selecting 5 utterances per speaker to compute the log-likelihood ratios between two speakers.
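As an illustration of this metric, the following sketch computes G_VD from two speaker-by-speaker voice similarity matrices, following the diagonal dominance definition of [29]; how the pairwise ASV scores that fill the matrices are obtained is omitted here.

```python
import numpy as np

def diagonal_dominance(m: np.ndarray) -> float:
    """Absolute difference between mean within-speaker and mean between-speaker similarity."""
    n = m.shape[0]
    diag = np.mean(np.diag(m))
    off_diag = (np.sum(m) - np.sum(np.diag(m))) / (n * (n - 1))
    return abs(diag - off_diag)

def gain_of_voice_distinctiveness(sim_orig: np.ndarray, sim_anon: np.ndarray) -> float:
    """G_VD in dB: 0 means distinctiveness is preserved, negative means it is reduced."""
    return 10.0 * np.log10(diagonal_dominance(sim_anon) / diagonal_dominance(sim_orig))
```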
3) Training strategy of ASV_eval^anon and ASR_eval^anon models
Four models are required for privacy and utility evaluation, as described in Section II.B: ASV_eval and ASR_eval, trained on the original LibriSpeech-train-clean-360 dataset, are directly provided by the VPC platform. Thus, we will assess only their evaluation times, not the training time. ASV_eval^anon and ASR_eval^anon are trained from scratch like the original models, but on the anonymized LibriSpeech-train-clean-360 processed by the same evaluated SAS. This requires extra time for anonymizing the training data and conducting the training process, in addition to the evaluation time.


[Figure 2: bar charts comparing the time in hours needed by the Kaldi-based VPC evaluation models and the proposed SpeechBrain/ESPnet-based models. (a) Privacy: ASV_eval^anon training, EER evaluation, and G_VD evaluation for x-vector and ECAPA-TDNN encoders with PLDA or cosine back-ends; "All" trains from scratch on all anonymized data, "Utt-50" finetunes ASV_eval using part of the anonymized data, and "G_VD-utt-5" selects 5 utterances per speaker to compute G_VD. (b) Utility: ASR training and evaluation.]

FIGURE 2: Comparisons of VPC and proposed privacy and utility evaluation models. Note that although G_VD is a utility metric, we plot the G_VD evaluation time in the privacy subplot as it is computed by ASV_eval.

Since the anonymization of the entire training dataset is quite time-consuming, we aim to explore the impact of obtaining ASV_eval^anon and ASR_eval^anon by finetuning the pretrained ASV_eval and ASR_eval, respectively, using only a subset of the anonymized data, to eliminate the necessity of anonymizing the entire dataset. We consider two techniques for data reduction: (1) choosing a limited number of utterances from all the speakers, and (2) selecting all utterances from a specific subset of speakers. By using this approach, we can balance the trade-off between anonymization time, training time, and the effectiveness of ASV_eval^anon and ASR_eval^anon.
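A sketch of the two data reduction strategies is given below, assuming the training manifest is a mapping from speaker IDs to lists of utterance IDs; only the selected subset would then be anonymized and used for finetuning. The function names are illustrative.

```python
import random

def limit_utterances_per_speaker(manifest: dict, n_utts: int, seed: int = 0) -> dict:
    """Strategy (1): keep at most n_utts random utterances from every speaker."""
    rng = random.Random(seed)
    return {spk: rng.sample(utts, min(n_utts, len(utts))) for spk, utts in manifest.items()}

def limit_speakers(manifest: dict, n_speakers: int, seed: int = 0) -> dict:
    """Strategy (2): keep all utterances of a random subset of speakers."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(manifest), min(n_speakers, len(manifest)))
    return {spk: manifest[spk] for spk in chosen}
```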
IV. EXPERIMENTS
In this section, we compare the effectiveness of different evaluation models in measuring the privacy and utility performance of various SASs. The evaluation models used here are trained on either the entire original or the entire anonymized LibriSpeech-train-clean-360 dataset. Once the evaluation models are chosen, the focus shifts to exploring the training strategy for the semi-informed evaluation models, particularly concerning the amount of training data.

To draw more general conclusions, we choose diverse disentanglement-based SASs, including both traditional and state-of-the-art methods, to generate anonymized speech and evaluate it using our proposed VoicePAT. The specifications of these SASs are listed in Table 1. All SASs follow the three steps described in Section II.A with different realizations of each component and different anonymization techniques on the speaker embeddings. All experiments are performed on the VCTK [34] and LibriSpeech [33] test sets as given by the VPC. The reported results are consistently the average score over both. Further settings, e.g., training hyperparameters, can be found in our source code. All time measurements apply to experiments conducted on a single NVIDIA A100 GPU, except for the ASR evaluation, which uses 4 GPUs.

TABLE 1: Specifications for the evaluated SASs. GST means global style token [11], ROH means random orthogonal Householder [14].

System      | Content encoder        | Speaker encoder | Speech synthesis       | Speaker anon.
BL 1.a [3]  | TDNN-F                 | x-vector        | SS-AM + NSF            | Select.
BL 1.b [3]  | TDNN-F                 | x-vector        | HiFi-GAN + NSF         | Select.
GAN [11]    | Branchformer + CTC/AED | GST-based       | FastSpeech2 + HiFi-GAN | GAN
OHNN [14]   | SSL                    | ECAPA           | HiFi-GAN               | ROH

A. CHOICE OF EVALUATION MODELS
1) ASV evaluation models

TABLE 2: Comparison of four privacy attack models using the x-vector and ECAPA speaker encoders with PLDA and cosine as distance measures. Privacy scores for each SAS and attack condition are given as EER in %. ↑ means higher values are better, while ↓ means lower values are better.

SAS    | Eval              | OO ↓ | OA ↑  | AA-lazy ↑ | AA-semi ↑
BL 1.a | x-vector + cosine | 9.24 | 54.38 | 26.39     | 11.73
       | x-vector + PLDA   | 7.18 | 52.56 | 26.41     |  8.66
       | ECAPA + cosine    | 3.11 | 52.85 | 25.65     |  7.62
       | ECAPA + PLDA      | 3.49 | 51.62 | 25.06     |  7.69
BL 1.b | x-vector + cosine | 9.24 | 52.30 | 25.31     | 11.33
       | x-vector + PLDA   | 7.18 | 51.49 | 25.98     |  8.27
       | ECAPA + cosine    | 3.11 | 49.82 | 24.67     |  7.21
       | ECAPA + PLDA      | 3.49 | 50.84 | 23.10     |  8.51
OHNN   | x-vector + cosine | 9.24 | 49.65 | 45.55     | 47.09
       | x-vector + PLDA   | 7.14 | 49.34 | 45.55     | 42.47
       | ECAPA + cosine    | 3.11 | 48.22 | 44.26     | 44.62
       | ECAPA + PLDA      | 3.63 | 50.18 | 44.44     | 41.47
GAN    | x-vector + cosine | 9.24 | 53.05 | 48.57     | 43.11
       | x-vector + PLDA   | 7.03 | 52.80 | 48.71     | 44.12
       | ECAPA + cosine    | 3.11 | 51.45 | 46.11     | 44.61
       | ECAPA + PLDA      | 3.68 | 50.85 | 47.57     | 47.20

Table 2 lists the mean EERs for the various SASs under all conditions, computed by different evaluation models. First, no matter which ASV attacker is used, the EERs of BL 1.a and BL 1.b decrease to around 25% under the AA-lazy and to 7%–11% under the AA-semi condition, indicating severe privacy leakage when facing the stronger, semi-informed attack model ASV_eval^anon. However, the EERs of the OHNN- and GAN-based SASs stay above 40% across all attack conditions, showing remarkable privacy protection capabilities.


TABLE 3: EERs and WERs obtained by the proposed and the VPC evaluation models. The proposed models are ECAPA-TDNN + cosine for ASV and Transformer-based CTC/AED for ASR. -l and -s stand for -lazy and -semi.

                  |  EER (%) ↑                     |  WER (%) ↓
SAS    | Eval     | OO ↓ | OA    | AA-l  | AA-s   | O    | A-l   | A-s
BL 1.a | VPC      | 3.58 | 50.07 | 30.31 | 10.31  | 8.48 |  8.96 | 7.79
       | Proposed | 3.11 | 52.85 | 25.65 |  7.62  | 7.24 |  8.00 | 7.66
BL 1.b | VPC      | 3.58 | 51.11 | 28.98 |  9.93  | 8.48 | 10.16 | 7.61
       | Proposed | 3.11 | 49.82 | 24.67 |  7.21  | 7.24 |  8.42 | 7.50
OHNN   | VPC      | 3.58 | 49.93 | 45.94 | 42.15  | 8.48 | 10.24 | 8.01
       | Proposed | 3.11 | 48.22 | 44.26 | 48.61  | 7.24 |  7.92 | 7.70
GAN    | VPC      | 3.58 | 51.09 | 46.49 | 44.33  | 8.48 |  8.47 | 6.86
       | Proposed | 3.11 | 51.45 | 46.11 | 48.30  | 7.24 |  8.00 | 7.03

In order to decide which evaluation model provides the best results, we look at the OO condition, where no SAS is applied and very low EERs are expected on this original data. The model using ECAPA-TDNN and cosine distance achieves the lowest EER with 3.11% and can therefore be considered the best ASV evaluation model. This is consistent with findings in the ASV field [35], [36]. Accordingly, we choose ECAPA-TDNN + cosine as the primary proposed ASV evaluation model in all following experiments. In the central columns of Table 3, we compare the EERs of this proposed model to those of the x-vector + PLDA model of the VPC toolkit. Compared to the VPC model, the proposed one consistently achieves lower or similar EERs across all conditions and SASs. This means that the proposed ASV model is a stronger attacker, which is reasonable as it exhibits a more powerful ability to infer the speaker's identity.

Moreover, the proposed ASV model reduces the time needed to train and perform the evaluation (Figure 2a). Instead of 6 hours for training the VPC model (which includes the x-vector encoder and the PLDA), our proposed ECAPA-TDNN + cosine model only requires 2 hours. The effect on evaluation time is smaller, though still noticeable: the time needed is reduced from 30 minutes to 20.
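The cosine back-end used by this proposed ASV evaluation model is conceptually simple: each trial score is the cosine similarity between an enrollment and a trial embedding produced by the speaker encoder. The following generic NumPy sketch illustrates the scoring (the actual embeddings come from the SpeechBrain ECAPA-TDNN model); the resulting scores can be fed to the EER computation sketched in Section II.B.

```python
import numpy as np

def cosine_scores(enroll_emb: np.ndarray, trial_emb: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between paired enrollment and trial embeddings."""
    enroll = enroll_emb / np.linalg.norm(enroll_emb, axis=1, keepdims=True)
    trial = trial_emb / np.linalg.norm(trial_emb, axis=1, keepdims=True)
    return np.sum(enroll * trial, axis=1)
```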
2) ASR evaluation models
The right columns of Table 3 summarize the WERs for both the original (O) and anonymized data (A-lazy, A-semi), using the VPC (TDNN-F) or the proposed (Transformer-based CTC/AED) evaluation models. One common trend for all SASs is that the proposed ASR model achieves notably lower WERs in comparison to the VPC model. For the A-semi condition, the utilization of ASR_eval^anon can reduce the mismatch between original and anonymized data, further decreasing the WERs. Another interesting observation is that for the VPC model, the WERs of A-semi decoded by ASR_eval^anon are consistently lower than those of the O condition decoded by ASR_eval. In contrast, for the proposed model, the original data yield the lowest WERs for most SASs, regardless of the training data of the ASR evaluation models, except for the GAN-based SAS. Possible reasons could be either that the ASR_eval provided by the VPC was not adequately trained or that the structure of this model is not powerful enough. In contrast, our proposed ASR model is more powerful in achieving accurate results.

Regarding training and evaluation time for the ASR models (Figure 2b), we observe that our model increases the time for ASR evaluation from 5 hours for the VPC model to 6 hours. However, the training of the proposed model takes only 20 hours, significantly less than the 72 hours required by the VPC model. Moreover, comparing the A-lazy and A-semi results shows that training the ASR model on anonymized data, resulting in the ASR_eval^anon model, has less effect for the proposed model than for the VPC one. It can therefore be argued that training the ASR_eval^anon model for each evaluation and using the A-semi condition is not necessary with the proposed model. This is further supported by arguing that using an ASR model specifically trained on anonymized data may be unrealistic for actual applications. Thus, by using a more robust ASR model and reverting to only the A-lazy condition, we can effectively reduce the evaluation time by 92%, from 77 hours for the VPC to 6 hours².

² In the A-semi condition, we have a reduction of 66% (from 77 h to 26 h).

B. GAIN OF VOICE DISTINCTIVENESS

TABLE 4: Comparison of G_VD obtained by the VPC and proposed ECAPA-TDNN + cosine evaluation models. The evaluation models can be either only ASV_eval, only ASV_eval^anon, or a combination of both (ASV_eval for original and ASV_eval^anon for anonymized data). "# utts per spk = 5" means randomly selecting 5 utterances per speaker for the similarity computation.

SAS    | Eval     | # utts per spk | ASV_eval | ASV_eval^anon | Both
BL 1.a | VPC      | all            | -7.71    |  0.18         | -1.11
       | Proposed | all            | -6.57    |  1.37         |  0.03
       | Proposed | 5              | -6.79    |  1.46         |  0.11
BL 1.b | VPC      | all            | -7.29    |  0.13         | -1.09
       | Proposed | all            | -7.40    |  1.72         |  0.62
       | Proposed | 5              | -7.39    |  1.80         |  0.72
OHNN   | VPC      | all            | -1.34    | -0.11         | -1.10
       | Proposed | all            | -1.53    |  1.69         |  0.23
       | Proposed | 5              | -1.57    |  1.65         |  0.15
GAN    | VPC      | all            | -0.84    |  0.68         |  0.11
       | Proposed | all            | -1.74    |  2.64         |  2.03
       | Proposed | 5              | -1.62    |  2.63         |  1.96

Table 4 lists the G_VD results for the various SASs computed by the VPC and proposed evaluation models. Looking at the G_VD achieved by ASV_eval, we can see: (1) Anonymized speech generated through the OHNN- and GAN-based models exhibits higher voice distinctiveness than BL 1.a and BL 1.b, with G_VD closer to zero.


(2) The proposed model achieves either similar or lower G_VD values compared to the VPC model. (3) Comparing the proposed model using all utterances to that using 5 utterances per speaker reveals similar results, suggesting that the use of 5 utterances may be sufficient for small test sets with limited voice variation³. At the same time, using only 5 utterances per speaker reduces the time for computing the G_VD drastically from 6 hours to only 20 minutes (Figure 2a). Thus, we can speed up the G_VD evaluation by 95% without reducing the result quality.

³ LibriSpeech and VCTK contain read speech segments extracted from longer recordings.

However, when employing ASV_eval^anon or a combination of both ASV_eval and ASV_eval^anon, the G_VD is significantly higher and often above zero, indicating a positive gain in voice distinctiveness. The difference between using the VPC models and the proposed ones is more notable, whereby the proposed models suggest almost no difference between the SASs anymore. This shows that G_VD is an unstable metric that highly depends on the model used for evaluation. In order to measure voice distinctiveness precisely, it may be necessary to consider downstream tasks like speaker diarization [37], instead of relying solely on the G_VD metric.

C. TRAINING STRATEGY FOR ASV_eval^anon AND ASR_eval^anon MODELS
Figure 3 shows the influence of the different data reduction strategies on the privacy scores and evaluation efficiency for BL 1.a. The experiments are conducted across various ASV_eval^anon models, trained using different amounts of anonymized speech data.

It can be observed from the results in Figure 3a that the more we decrease the amount of training data, the more the EER increases. This is especially problematic for #utts per spk=10 because its scores might suggest that the SAS's privacy protection is better than it actually is. For WER, the effect of data reduction is negligible, as it only changes from 7.66% WER (all) to 7.91% (#utts per spk=10) in the worst case.

The biggest impact of the different training strategies for ASV_eval^anon can be seen in Figure 3b. It shows that the largest factor in the time needed for privacy evaluation comes from the requirement of having to anonymize the training data for each evaluation run. Decreasing the amount of training data therefore means less time being spent on anonymizing it; thus, the time cost of an evaluation decreases linearly with the reduction of data.

Based on these results, we found that #utts per spk=50 provides the best balance between EER increase and cost⁴. This data reduction decreases the total time for ASV evaluation from 16 hours (14.25 hours for anonymization, 2 hours for training) to 7 hours (6.25 hours for anonymization, 49 minutes for training). Compared to the 20.25 hours needed in the VPC evaluation framework (14.25 hours for anonymization, 6 hours for training), this reduces the time needed for creating the ASV_eval^anon model by 65%. Therefore, we recommend using finetuning with this setting at least during SAS development, and only reverting to training on all data from scratch for the final evaluation.

⁴ We validated this finding using other SASs as well. We omitted them due to limited space.

V. ANALYSIS
This section delves deeper into the privacy results by examining the resynthesis performance and summarizing the rankings obtained when employing various evaluation models for the different SASs.

A. RESYNTHESIS
In the privacy evaluation, we obtain one privacy score (EER) for each model and attack condition. However, as all tested SASs employ a speech synthesis step after the actual anonymization method, it is not clear whether the anonymization power of an SAS actually comes from this anonymization method or from the synthesis. We therefore explore testing the resynthesis performance of each SAS. For this, we generate a new version of the evaluation data per SAS in which the anonymization method is skipped and instead the original speaker vector is used for synthesis. We evaluate two new conditions for ASV_eval: OR and RR-lazy, with original or resynthesized enrollment data, respectively, and resynthesized trial data. For ASR_eval, we test the decoding performance on the resynthesized data (R-lazy). In both cases, we compare to the performance on the original data.
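The following is a hedged illustration of how such a resynthesis condition can be produced; the function and attribute names are placeholders for whatever interface a given SAS exposes, not the actual VoicePAT code.

```python
def resynthesize(utterance, sas):
    """Run an SAS but skip the anonymization step, keeping the original speaker embedding."""
    content = sas.extract_content(utterance)             # step 1: disentanglement
    prosody = sas.extract_prosody(utterance)
    speaker = sas.extract_speaker_embedding(utterance)
    # step 2 (modification of the speaker embedding) is deliberately skipped
    return sas.synthesize(content, prosody, speaker)     # step 3: generation
```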
TABLE 5: Comparison of EERs (%) and WERs (%) on the resynthesized conditions without anonymization. The proposed models are ECAPA-TDNN + cosine for ASV and Transformer-based CTC/AED for ASR.

                  |  EER (%)                 |  WER (%)
SAS    | Eval     | OO   | OR    | RR-lazy   | O    | R-lazy
BL 1.a | VPC      | 3.58 | 17.53 | 13.24     | 8.48 | 10.06
       | Proposed | 3.11 | 22.47 | 14.10     | 7.24 | 11.00
BL 1.b | VPC      | 3.58 | 21.25 | 14.62     | 8.48 | 12.73
       | Proposed | 3.11 | 28.45 | 15.28     | 7.24 |  9.75
OHNN   | VPC      | 3.30 |  7.60 |  4.82     | 8.48 |  9.45
       | Proposed | 3.11 | 10.05 |  4.28     | 7.24 |  7.58
GAN    | VPC      | 3.29 | 23.09 | 16.08     | 8.48 |  8.16
       | Proposed | 3.11 | 27.09 | 17.68     | 7.24 |  7.85

Table 5 shows the results. Except for the OHNN-based SAS, all SASs clearly exploit the synthesis to increase the privacy protection and do not rely only on their anonymization method. The OHNN-based SAS, on the other hand, shows almost no identification loss during resynthesis. However, the synthesis in all SASs leads to an increase in WER and thus to reduced intelligibility.


[Figure 3: (a) error rates (EER of ASV_eval^anon and WER of ASR_eval^anon) and (b) anonymization, training, and total time together with the amount of training data, for training strategies that use all data, 90/50/10 utterances per speaker, or 500/200 speakers.]

FIGURE 3: Effect of different training strategies for ASV_eval^anon and ASR_eval^anon on (a) evaluation metrics for BL 1.a, and (b) data and time efficiency. The strategies involve finetuning with data reduction, either by restricting the number of utterances per speaker or the number of speakers. They are compared against using all data for training the models from scratch.

Using the synthesis to increase the privacy protection is not necessarily a drawback of the BL 1.a, BL 1.b, and GAN-based SASs. However, researchers might not be aware of this effect and might put too much focus on optimizing their anonymization method instead of the synthesis. It also decreases the control one has over the actual outcome of the SAS. A related issue was already observed by [38] regarding the vocoder drift of speaker vectors during anonymization.

B. EFFECT ON RANKING
In this paper, we presented new evaluation models and strategies for training the semi-informed attackers as alternatives to the evaluation framework of the VPC. We have shown that the perceived strength of an SAS depends on the models it has been evaluated with; however, it is also important to analyze how the relative performance of multiple SASs in comparison (i.e., their ranking) is influenced by the choice of evaluation models.

Comparing the scores of the different SASs in the lazy- and semi-informed attack conditions as shown in Table 3 reveals consistently the same ranking for the level of privacy protection (with the GAN- and OHNN-based systems partly sharing the same rank): (1) GAN-based, (2) OHNN-based, (3) BL 1.a, and (4) BL 1.b. This holds regardless of whether the VPC evaluation models or the proposed ones are applied, and also regardless of the training strategy. For ASR evaluation, on the other hand, the ranking of the SASs' performances does not stay consistent but changes depending on the ASR model used for evaluation. However, the WER scores of all SASs are relatively similar to each other when using the proposed evaluation models. Thus, we argue that this change in ranking for utility evaluation is a rather small effect.

VI. DISCUSSION
Which evaluation models should we consider? For privacy evaluation, we proposed and evaluated various attacker models against a selection of SASs and found that the choice of attack model did not influence the ranking of the privacy protection ability of the chosen SASs, although they produced different privacy scores. However, we only tested a limited number of attackers from the same ASV family. It is possible that an attacker using a different technique, e.g., conformer-based [39] or SSL-based ASV models [40], might result in a different trend.

Moreover, the choice of attack conditions is still heavily based on assumptions. It is unclear whether semi-informed attackers are realistic or what we could assume about the knowledge and dedication of real attackers. Hopefully, challenges like a Voice Privacy Attacker Challenge⁵ will provide new insights and perspectives.

⁵ A Voice Privacy Attacker Challenge was initially planned for a workshop at INTERSPEECH 2023; see https://www.voiceprivacychallenge.org.

Overall, we saw a clear trade-off between the quality of objective results and their usability in terms of the time required for computation, at least for the privacy metrics. Retraining ASV_eval^anon from scratch on the full anonymized training data seems to lead to the strongest attacker. However, it is costly, and this cost can be significantly reduced by a finetuning strategy that leads to only a minimal reduction in attacker performance. We propose using this alternative training strategy during SAS development to speed up voice privacy research. However, a full retraining on all data might still be a better option for a final assessment of the full privacy capabilities.

What are the drawbacks of the current evaluation metrics? According to our experiments, G_VD is a more problematic metric. We tested three model approaches (ASV_eval, ASV_eval^anon, and the combination of both), but their results differ considerably. It is unclear which approach is better suited for measuring the preservation of voice distinctiveness in anonymized speech. We conclude that we need a more robust alternative, e.g., using the anonymized dataset to perform downstream speaker verification or speaker diarization tasks.

What is missing? Currently, there are no definitions or measurable criteria for success or for the guarantee of full privacy protection through anonymization, since all existing evaluations rely on assumptions and specific attack models.


It is unknown when an SAS could be considered good enough for use on data where privacy protection matters, or how the remaining privacy risk of current systems can be accurately measured⁶.

⁶ A first approach for privacy risk assessment was proposed in the ZEBRA framework [30] but is so far not used as a standard metric in research.

In summary, being open source, the proposed framework serves as a platform for unifying researchers and research on this topic. Researchers can add their own SASs and evaluation metrics to the framework such that large-scale and extensive evaluations and comparisons become possible without much additional effort. In this way, we hope that this framework will help towards finding answers to the questions above, and towards the development of powerful anonymization tools.

VII. CONCLUSION
We proposed a new Python-based and modular open-source framework for speaker anonymization research. It allows combining, managing, and evaluating several anonymization approaches within one platform that is simple to apply and extend. We further presented various improvements to standard evaluation techniques for speaker anonymization. Specifically, we exchanged the previous Kaldi-based evaluation models for more powerful techniques using the ESPnet and SpeechBrain toolkits. Moreover, we showed that we could decrease the time required for evaluation by up to 95% by reducing training and test data while keeping the quality of the evaluations at comparable levels. We anticipate that these changes to common development and evaluation procedures will significantly facilitate and support speaker anonymization research in the near future.

REFERENCES
[1] N. Tomashenko, X. Wang, E. Vincent, J. Patino, B. M. L. Srivastava, P.-G. Noé, A. Nautsch, N. Evans, J. Yamagishi, B. O'Brien, et al., "The VoicePrivacy 2020 challenge: Results and findings," Computer Speech & Language, vol. 74, pp. 101362, Jul. 2022.
[2] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Proc. Interspeech, 2018, pp. 3743–3747.
[3] N. Tomashenko, X. Wang, X. Miao, H. Nourtel, P. Champion, M. Todisco, E. Vincent, N. Evans, J. Yamagishi, and J. F. Bonastre, "The VoicePrivacy 2022 challenge evaluation plan," arXiv preprint arXiv:2203.12468, 2022.
[4] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. IEEE ASRU, Dec. 2011.
[5] C. Franzreb, T. Polzehl, and S. Moeller, "A comprehensive evaluation framework for speaker anonymization systems," in Proc. 3rd Symposium on Security and Privacy in Speech Communication, 2023, pp. 65–72.
[6] Z. Shaohu, L. Zhouyu, and D. Anupam, "VoicePM: A robust privacy measurement on voice anonymity," in Proc. 16th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec), 2023, pp. 215–226.
[7] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in Proc. Interspeech, 2018, pp. 2207–2211.
[8] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, "SpeechBrain: A general-purpose speech toolkit," arXiv preprint arXiv:2106.04624, 2021.
[9] S. Meyer, F. Lux, P. Denisov, J. Koch, P. Tilli, and N. T. Vu, "Speaker anonymization with phonetic intermediate representations," in Proc. Interspeech, 2022, pp. 4925–4929.
[10] S. Meyer, P. Tilli, P. Denisov, F. Lux, J. Koch, and N. T. Vu, "Anonymizing speech with generative adversarial networks to preserve speaker privacy," in Proc. IEEE SLT, 2023, pp. 912–919.
[11] S. Meyer, F. Lux, J. Koch, P. Denisov, P. Tilli, and N. T. Vu, "Prosody is not identity: A speaker anonymization approach using prosody cloning," in Proc. IEEE ICASSP, 2023, pp. 1–5.
[12] X. Miao, X. Wang, E. Cooper, J. Yamagishi, and N. Tomashenko, "Language-independent speaker anonymization approach using self-supervised pre-trained models," in Proc. Odyssey, 2022, pp. 279–286.
[13] X. Miao, X. Wang, E. Cooper, J. Yamagishi, and N. Tomashenko, "Analyzing language-independent speaker anonymization framework under unseen conditions," in Proc. Interspeech, 2022, pp. 4426–4430.
[14] X. Miao, X. Wang, E. Cooper, J. Yamagishi, and N. Tomashenko, "Speaker anonymization using orthogonal Householder neural network," IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 31, pp. 3681–3695, 2023.
[15] C. O. Mawalim, K. Galajit, J. Karnjana, S. Kidani, and M. Unoki, "Speaker anonymization by modifying fundamental frequency and x-vector singular value," Computer Speech & Language, vol. 73, pp. 101326, 2022.
[16] P. Champion, A. Larcher, and D. Jouvet, "Are disentangled representations all you need to build speaker anonymization systems?," in Proc. Interspeech, 2022, pp. 2793–2797.
[17] A. S. Shamsabadi, B. M. L. Srivastava, A. Bellet, N. Vauquier, E. Vincent, M. Maouche, M. Tommasi, and N. Papernot, "Differentially private speaker anonymization," Proc. Privacy Enhancing Technologies, vol. 2023, no. 1, pp. 98–114, Jan. 2023.
[18] H. Turner, G. Lovisotto, and I. Martinovic, "Generating identities with mixture models for speaker anonymization," Computer Speech & Language, vol. 72, pp. 101318, 2022.
[19] J. Yao, Q. Wang, L. Zhang, P. Guo, Y. Liang, and L. Xie, "NWPU-ASLP system for the VoicePrivacy 2022 challenge," in Proc. 2nd Symposium on Security and Privacy in Speech Communication, 2022.
[20] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. Interspeech, 2015, pp. 3214–3218.
[21] K. Kasi and S. A. Zahorian, "Yet another algorithm for pitch tracking," in Proc. IEEE ICASSP, 2002, vol. 1, pp. 361–364.
[22] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. IEEE ICASSP, 2018, pp. 5329–5333.
[23] B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Proc. Interspeech, 2020, pp. 3830–3834.
[24] B. M. L. Srivastava, N. Tomashenko, X. Wang, E. Vincent, J. Yamagishi, M. Maouche, A. Bellet, and M. Tommasi, "Design choices for x-vector based speaker anonymization," in Proc. Interspeech, 2020, pp. 1713–1717.
[25] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter-based waveform model for statistical parametric speech synthesis," in Proc. IEEE ICASSP, 2019, pp. 5916–5920.
[26] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. NeurIPS, 2020, pp. 17022–17033.
[27] M. Gomez-Barrero, J. Galbally, C. Rathgeb, and C. Busch, "General framework to evaluate unlinkability in biometric template protection systems," IEEE Trans. Information Forensics and Security, vol. 13, no. 6, pp. 1406–1420, 2018.
[28] M. Maouche, B. M. L. Srivastava, N. Vauquier, A. Bellet, M. Tommasi, and E. Vincent, "A comparative study of speech anonymization metrics," in Proc. Interspeech, 2020, pp. 1708–1712.


[29] P.-G. Noé, J.-F. Bonastre, D. Matrouf, N. Tomashenko, A. Nautsch, and N. Evans, "Speech pseudonymisation assessment using voice similarity matrices," in Proc. Interspeech, 2020, pp. 1718–1722.
[30] A. Nautsch, J. Patino, N. Tomashenko, J. Yamagishi, P.-G. Noé, J.-F. Bonastre, M. Todisco, and N. Evans, "The privacy ZEBRA: Zero evidence biometric recognition assessment," in Proc. Interspeech, 2020, pp. 1698–1702.
[31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
[32] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[33] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE ICASSP, 2015, pp. 5206–5210.
[34] J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," 2019.
[35] Q. Wang, K. A. Lee, and T. Liu, "Scoring of large-margin embeddings for speaker verification: Cosine or PLDA?," in Proc. Interspeech, 2022, pp. 600–604.
[36] Z. Li, R. Xiao, H. Chen, Z. Zhao, W. Wang, and P. Zhang, "How to make embeddings suitable for PLDA," Computer Speech & Language, vol. 81, pp. 101523, 2023.
[37] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals, "Speaker diarization: A review of recent research," IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 356–370, 2012.
[38] M. Panariello, M. Todisco, and N. Evans, "Vocoder drift in x-vector-based speaker anonymization," in Proc. Interspeech, 2023, pp. 2863–2867.
[39] Y. Zhang, Z. Lv, H. Wu, S. Zhang, P. Hu, Z. Wu, H.-y. Lee, and H. Meng, "MFA-Conformer: Multi-scale feature aggregation conformer for automatic speaker verification," in Proc. Interspeech, 2022, pp. 306–310.
[40] S. Chen, Y. Wu, C. Wang, Z. Chen, Z. Chen, S. Liu, J. Wu, Y. Qiao, F. Wei, J. Li, and X. Yu, "UniSpeech-SAT: Universal speech representation learning with speaker aware pre-training," in Proc. IEEE ICASSP, 2022, pp. 6152–6156.

