VoicePAT: An Efficient Open-Source Evaluation Toolkit for Voice Privacy Research
Received XX Month, XXXX; revised XX Month, XXXX; accepted XX Month, XXXX; Date of publication XX Month, XXXX; date of
current version XX Month, XXXX.
Digital Object Identifier 10.1109/XXXX.2022.1234567
²Singapore Institute of Technology, Singapore
³National Institute of Informatics, Japan
Corresponding author: Sarina Meyer (email: sarina.meyer@ims.uni-stuttgart.de).
* Equal contribution. This study is supported by the Carl Zeiss Foundation, JST CREST Grants (JPMJCR18A6), and MEXT KAKENHI Grant (22K21319).
ABSTRACT Speaker anonymization is the task of modifying a speech recording such that the original speaker can no longer be identified. Since the first Voice Privacy Challenge in 2020, which came with the release of a framework, the popularity of this research topic has been increasing continually. However, the comparison and combination of different anonymization approaches remain challenging due to the complexity of evaluation and the absence of user-friendly research frameworks. We therefore propose an efficient speaker anonymization and evaluation framework based on a modular and easily extendable structure, written almost fully in Python. The framework facilitates the orchestration of several anonymization approaches in parallel and allows for interfacing between different techniques. Furthermore, we propose modifications to common evaluation methods which improve the quality of the evaluation and reduce its computation time by 65 to 95%, depending on the metric. Our code is fully open source.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
[Figure 1 graphic: the anonymization branch comprises (1) disentanglement of the original speech into content features, prosody features, and speaker embeddings, (2) modification of the original features by a speaker anonymizer and other modification methods, and (3) generation of anonymized speech with a speech synthesis model; the evaluation branch comprises privacy metrics (ASV-EER in the OA, AA-lazy, and AA-semi conditions; other metrics, e.g., Cllr) and utility metrics (ASR-WER, GVD; other metrics, e.g., pitch correlation and emotion).]
FIGURE 1: Proposed framework, consisting of separate anonymization and evaluation branches and example configurations for each branch.
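To make the orchestration idea concrete, the sketch below mirrors the two-branch structure of Figure 1 as a plain Python configuration. It is purely illustrative: all keys, module names, and values are hypothetical placeholders and do not reproduce VoicePAT's actual configuration schema.

```python
# Illustrative two-branch setup in the spirit of Figure 1.
# All keys and module names are hypothetical placeholders,
# not VoicePAT's actual configuration schema.
pipeline_config = {
    "anonymization": {
        "content_extractor": "asr_bottleneck",    # or an SSL-based content encoder
        "prosody_extractor": "yaapt_f0",
        "speaker_extractor": "ecapa_tdnn",
        "speaker_anonymizer": "gan_embeddings",   # e.g., pool selection, GAN, or OHNN
        "synthesis": "hifigan",
    },
    "evaluation": {
        "privacy": {"asv_model": "ecapa_cosine", "attack": "lazy-informed"},
        "utility": {"asr_model": "transformer_ctc_aed", "metrics": ["wer", "gvd"]},
    },
}
```

Swapping one entry, for example the speaker anonymizer, is then a configuration change rather than a code change, which is the kind of interchangeability between techniques the framework aims for.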
effort. We extend and improve commonly used evaluation methods by employing the ESPnet [7] and SpeechBrain [8] toolkits for evaluation instead of Kaldi [4]. Furthermore, we significantly speed up the evaluation by combining data reduction and finetuning techniques for privacy and utility metrics. Overall, we believe this framework will be an essential tool for advancing research in voice privacy and facilitating participation in future Voice Privacy Challenges in the field.
Our contributions are as follows:
• We propose improvements to standard evaluation methods for speaker anonymization which make these evaluations more efficient, easier to use, and more feasible for intermediate evaluations.
• We show through experiments on two primary VPC baselines and two state-of-the-art SASs that these improvements provide stronger privacy tests, while requiring 65 to 95% less time to execute.
• We release all code in a new toolkit, VoicePAT¹ (Voice Privacy Anonymization Toolkit), which further includes pipelines for running different state-of-the-art SASs.
• Due to its modularity, extending VoicePAT with more anonymization systems and evaluation metrics is easily possible, enabling the comparison of different approaches within one framework and probing the effectiveness of the anonymization using a more diverse attack landscape.

II. Background and Related Work
Before describing our proposed framework and evaluation, we will first give some background about the current state of the topic, existing evaluation metrics, and frameworks.

A. SPEAKER ANONYMIZATION
The goal of an SAS [1], [3] is to automatically modify the voice in an audio recording to make the original speaker unidentifiable. Following the VPC baseline models, approaches can be categorized into methods based on disentanglement and those based on digital signal processing. Whereas the latter perceptively modify speech to conceal the original speaker's identity [1], [3], they are less effective than disentanglement-based methods [9]–[19], which involve three steps as shown on the left of Figure 1:
(1) Disentanglement of original speech: (i) frame-level content feature extraction via automatic speech recognition (ASR) models [2], [20] or self-supervised learning (SSL)-based content encoders [12], [13]; (ii) frame-level prosody feature extraction, e.g., F0 using the YAAPT algorithm [21]; (iii) utterance-level speaker embedding extraction, e.g., from pre-trained x-vector [22] or ECAPA-TDNN [23] models for automatic speaker verification (ASV).
(2) Modification of original features: This is a crucial step to hide the original speaker's identity. Most works focus on modifying the original speaker embeddings, assuming that identity is mainly encoded in them. Typically, a selection-based speaker anonymizer [24] replaces the original speaker embedding with an anonymized speaker vector. This anonymized vector is a mean speaker vector derived from a randomly chosen set of speaker vectors in an external pool. However, the averaging process on speaker vectors leads to limited speaker diversity in the generated anonymized voices and serious speaker privacy leakage problems when facing stronger attackers [14]. To mitigate these problems, recent works adopt DNN-based anonymizers. For example, [10], [11] introduce a Wasserstein generative adversarial network (GAN) that is trained to turn random noise into artificial speaker embeddings that follow a similar distribution as the original speaker vectors. Another approach employs an orthogonal Householder neural network (OHNN) to rotate original speaker vectors, ensuring that the anonymized vectors stay in the original space and maintain speech naturalness. The parameters of the OHNN-based anonymizer are trained using classification and similarity losses, encouraging distinct speaker identities [14].
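To make the selection-based anonymizer [24] concrete, the following minimal sketch (our own illustration, not code from any of the cited systems) picks the pool vectors farthest from the original speaker, keeps a random subset, and returns their mean as the anonymized embedding; pool handling, distance metric, and normalization are simplified compared to the VPC recipe.

```python
import numpy as np

def selection_based_anonymizer(orig_emb: np.ndarray,
                               pool: np.ndarray,
                               n_candidates: int = 200,
                               n_average: int = 100,
                               rng: np.random.Generator | None = None) -> np.ndarray:
    """Sketch of a selection-based speaker anonymizer: average a random
    subset of the pool vectors that lie farthest (by cosine distance)
    from the original speaker embedding."""
    rng = rng or np.random.default_rng()
    pool_n = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    orig_n = orig_emb / np.linalg.norm(orig_emb)
    dist = 1.0 - pool_n @ orig_n                    # cosine distance to every pool speaker
    candidates = np.argsort(dist)[-n_candidates:]   # farthest candidates
    chosen = rng.choice(candidates, size=n_average, replace=False)
    return pool[chosen].mean(axis=0)                # mean vector = anonymized speaker
```

Because the result is always an average over pool vectors, different speakers can end up with similar anonymized embeddings, which illustrates the limited-diversity issue mentioned above.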
(3) Generation of anonymized speech: The anonymized speaker, prosody, and content features are then fed into a speech synthesis model [25], [26] to generate anonymized speech.
The VPC introduced two primary disentanglement-based SASs. In BL 1.a, speech is disentangled into speaker identity by a pre-trained x-vector, fundamental frequency by YAAPT
¹ https://github.com/DigitalPhonetics/VoicePAT
FIGURE 2: Comparisons of VPC and proposed privacy and utility evaluation models. Note that although GVD is a utility metric, we plotted GVD evaluation time in the privacy subplot as it is computed by ASVeval.
TABLE 3: EERs and WERs obtained by the proposed and the VPC evaluation models. The proposed models are ECAPA-TDNN + cosine for ASV and a transformer-based CTC/AED model for ASR. -l and -s stand for -lazy and -semi.

                      EER (%) ↑                         WER (%) ↓
SAS      Eval         OO ↓    OA      AA-l    AA-s      O       A-l     A-s
BL 1.a   VPC          3.58    50.07   30.31   10.31     8.48    8.96    7.79
         Proposed     3.11    52.85   25.65    7.62     7.24    8.00    7.66
BL 1.b   VPC          3.58    51.11   28.98    9.93     8.48    10.16   7.61
         Proposed     3.11    49.82   24.67    7.21     7.24    8.42    7.50
OHNN     VPC          3.58    49.93   45.94   42.15     8.48    10.24   8.01
         Proposed     3.11    48.22   44.26   48.61     7.24    7.92    7.70
GAN      VPC          3.58    51.09   46.49   44.33     8.48    8.47    6.86
         Proposed     3.11    51.45   46.11   48.30     7.24    8.00    7.03

TABLE 4: Comparisons of GVD obtained by the VPC and proposed ECAPA-TDNN + cosine evaluation models. The evaluation models can be either only ASVeval, only ASVeval^anon, or a combination of both (ASVeval for original and ASVeval^anon for anonymized data). "# utts per spk = 5" means randomly selecting 5 utterances per speaker for the similarity computation.

SAS      Eval        # utts per spk    ASVeval    ASVeval^anon    Both
BL 1.a   VPC         all               -7.71       0.18           -1.11
         Proposed    all               -6.57       1.37            0.03
         Proposed    5                 -6.79       1.46            0.11
BL 1.b   VPC         all               -7.29       0.13           -1.09
         Proposed    all               -7.40       1.72            0.62
         Proposed    5                 -7.39       1.80            0.72
OHNN     VPC         all               -1.34      -0.11           -1.10
         Proposed    all               -1.53       1.69            0.23
         Proposed    5                 -1.57       1.65            0.15
GAN      VPC         all               -0.84       0.68            0.11
         Proposed    all               -1.74       2.64            2.03
         Proposed    5                 -1.62       2.63            1.96

To decide which evaluation model provides the best results, we look at the OO condition, where no SAS is applied and very low EERs are expected on this original data. The model using ECAPA-TDNN and cosine distance achieves the lowest EER of 3.11% and can therefore be considered the best ASV evaluation model. This is consistent with findings in the ASV field [35], [36].
Accordingly, we choose ECAPA-TDNN + cosine as the primary proposed ASV evaluation model in all following experiments. In the central columns of Table 3, we compare the EERs of this proposed model to those of the x-vector + PLDA model of the VPC toolkit. Compared to the VPC model, the proposed one consistently achieves lower or similar EERs across all conditions and SASs. This means that the proposed ASV model is a stronger attacker, which is reasonable as it exhibits a more powerful ability to infer the speaker's identity.
Moreover, the proposed ASV model reduces the time needed to train and perform the evaluation (Figure 2a). Instead of the 6 hours required to train the VPC model (which includes the x-vector encoder and the PLDA), our proposed ECAPA-TDNN + cosine model only requires 2 hours. The effect on evaluation time is smaller, though still noticeable: it reduces the time needed from 30 minutes to 20.
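As a minimal sketch of such an ECAPA-TDNN + cosine attacker (our own illustration built on SpeechBrain's pretrained VoxCeleb model, not the exact VoicePAT implementation; file paths and the trial-list format are placeholders), scoring and EER computation can look as follows:

```python
import numpy as np
import torch
import torchaudio
from sklearn.metrics import roc_curve
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in newer releases

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained/ecapa")

def embed(wav_path: str) -> torch.Tensor:
    wav, _ = torchaudio.load(wav_path)
    return encoder.encode_batch(wav).squeeze()        # 192-dim ECAPA embedding

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))              # operating point where FAR ~= FRR
    return float((fpr[idx] + fnr[idx]) / 2.0)

def evaluate(trials):
    """trials: iterable of (enrollment_wav, test_wav, is_same_speaker)."""
    scores, labels = [], []
    for enroll, test, same in trials:
        e1, e2 = embed(enroll), embed(test)
        scores.append(torch.nn.functional.cosine_similarity(e1, e2, dim=0).item())
        labels.append(int(same))
    return equal_error_rate(np.array(scores), np.array(labels))
```

For the semi-informed conditions, the same scoring applies after the encoder has been retrained or finetuned on anonymized data.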
2) ASR evaluation models
The right columns of Table 3 summarize the WERs for both original (O) and anonymized (A-lazy, A-semi) data, using the VPC (TDNN-F) or proposed (transformer-based CTC/AED) evaluation models. One common trend for all SASs is that the proposed ASR model achieves notably lower WERs than the VPC model. For the A-semi condition, the use of ASReval^anon can reduce the mismatch between original and anonymized data, further decreasing the WERs. Another interesting observation is that for the VPC model, the WERs of A-semi decoded by ASReval^anon are consistently lower than those of the O condition decoded by ASReval. In contrast, for the proposed model, the original data yield the lowest WERs for most SASs, regardless of the training data of the ASR evaluation models, except for the GAN-based SAS. Possible reasons could be either that the ASReval provided by the VPC was not adequately trained or that the structure of this model is not powerful enough. In contrast, our proposed ASR model is more powerful in achieving accurate results.
Regarding training and evaluation time for the ASR models (Figure 2b), we observe that our model increases the time for ASR evaluation from 5 hours for the VPC model to 6 hours. However, training the proposed model takes only 20 hours, significantly less than the 72 hours required by the VPC model. Moreover, comparing the A-lazy and A-semi results shows that training the ASR model on anonymized data, resulting in the ASReval^anon model, has less effect for the proposed model than for the one from the VPC. It can therefore be argued that training the ASReval^anon model for each evaluation and using the A-semi condition is not necessary with the proposed model. This is further supported by the argument that using an ASR model specifically trained on anonymized data may be unrealistic for actual applications. Thus, by using a more robust ASR model and reverting to only the A-lazy condition, we can effectively reduce the evaluation time by 92%, from 77 hours for the VPC to 6 hours².
² In the A-semi condition, the reduction is 66% (from 77 h to 26 h).
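For the utility side, the WER computation itself is straightforward once hypotheses have been decoded by the ASR evaluation model (decoding with the ESPnet CTC/AED model is omitted here); the following self-contained sketch computes a corpus-level WER via word-level Levenshtein alignment:

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Minimum number of word substitutions, deletions, and insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)], len(ref)

def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """pairs: (reference, hypothesis) per utterance; returns WER in %."""
    edits = words = 0
    for ref, hyp in pairs:
        e, n = word_errors(ref, hyp)
        edits, words = edits + e, words + n
    return 100.0 * edits / max(words, 1)
```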
B. GAIN OF VOICE DISTINCTIVENESS
Table 4 lists the GVD results for the various SASs computed by the VPC and proposed evaluation models. Looking at the GVD achieved by ASVeval, we can see: (1) Anonymized speech generated through the OHNN- and GAN-based models exhibits higher voice distinctiveness than BL 1.a and BL
FIGURE 3: Effect of different training strategies for ASVeval^anon and ASReval^anon on (a) evaluation metrics for BL 1.a, and (b) data and time efficiency. The strategies involve finetuning with data reduction, either by restricting the number of utterances per speaker or the number of speakers. They are compared against using all data for training the models from scratch.
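The data-reduction strategies compared in Figure 3 amount to subsampling the anonymized training material before finetuning the evaluation models. A minimal sketch (hypothetical manifest format, not the toolkit's actual data loader):

```python
import random
from collections import defaultdict

def reduce_training_data(utterances: list[tuple[str, str]],
                         utts_per_spk: int | None = None,
                         num_speakers: int | None = None,
                         seed: int = 0) -> list[tuple[str, str]]:
    """utterances: (speaker_id, wav_path) pairs of anonymized training data.
    Cap either the number of utterances per speaker or the number of
    speakers; the reduced set is then used to finetune ASVeval^anon or
    ASReval^anon instead of training them from scratch on all data."""
    rng = random.Random(seed)
    by_spk = defaultdict(list)
    for spk, wav in utterances:
        by_spk[spk].append(wav)
    speakers = list(by_spk)
    if num_speakers is not None:
        speakers = rng.sample(speakers, min(num_speakers, len(speakers)))
    reduced = []
    for spk in speakers:
        wavs = by_spk[spk]
        if utts_per_spk is not None and len(wavs) > utts_per_spk:
            wavs = rng.sample(wavs, utts_per_spk)
        reduced.extend((spk, w) for w in wavs)
    return reduced
```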
Using the synthesis to increase the privacy protection is not necessarily a drawback of the BL 1.a, BL 1.b, and GAN-based SASs. However, researchers might not be aware of this effect and might put too much focus on optimizing their anonymization method instead of the synthesis. It also decreases the control one has over the actual outcome of the SAS. A related issue was already observed by [38] regarding the vocoder drift of speaker vectors during anonymization.

B. EFFECT ON RANKING
In this paper, we presented new evaluation models and strategies for training the lazy-informed attackers as alternatives to the evaluation framework of the VPC. We have shown that the perceived strength of an SAS depends on the models it has been evaluated with. However, it is also important to analyze how the relative performance of multiple SASs in comparison (i.e., their ranking) is influenced by the choice of evaluation models.
Comparing the scores of the different SASs in the lazy- and semi-informed attack conditions, as shown in Table 3, consistently reveals the same ranking for the level of privacy protection (with the GAN- and OHNN-based systems partly in equal places): (1) GAN-based, (2) OHNN-based, (3) BL 1.a, and (4) BL 1.b. This holds regardless of whether the VPC evaluation models or the proposed ones are applied, and also regardless of the training strategy. For ASR evaluation, on the other hand, the ranking of the SASs' performances does not stay consistent but changes depending on the ASR model used for evaluation. However, the WER scores of all SASs are relatively similar to each other when using the proposed evaluation models. Thus, we argue that this change in ranking for utility evaluation is a rather small effect.

VI. DISCUSSION
Which evaluation models should we consider? For privacy evaluation, we proposed and evaluated various attacker models against a selection of SASs and found that the choice of attack model did not influence the ranking of privacy protection ability for the chosen SASs, although they produced different privacy scores. However, we only tested a limited number of attackers from the same ASV family. It is possible that an attacker using a different technique, e.g., conformer-based [39] or SSL-based ASV models [40], might result in a different trend.
Moreover, the choice of attack conditions is still heavily based on assumptions. It is unclear whether semi-informed attackers are realistic or what we could assume about the knowledge and dedication of real attackers. Hopefully, challenges like a Voice Privacy Attacker Challenge⁵ will provide new insights and perspectives.
⁵ A Voice Privacy Attacker Challenge was initially planned for a workshop at INTERSPEECH 2023, see https://www.voiceprivacychallenge.org.
Overall, we saw a particular trade-off between the quality of objective results and their usability in terms of the time required for computation, at least for the privacy metrics. Retraining ASVeval^anon from scratch on the full anonymized training data seems to lead to the strongest attacker. However, it is costly, and the cost can be significantly reduced by a finetuning strategy that leads to only a minimal reduction in attacker performance. We propose using this alternative training strategy during SAS development to speed up voice privacy research. However, a full retraining on all data might still be a better option for a final assessment of the full privacy capabilities.
What are the drawbacks of the current evaluation metrics? According to our experiments, GVD is a more problematic metric. We tested three model approaches (ASVeval, ASVeval^anon, and the combination of both), but their results differ considerably. It is unclear which approach is better suited for measuring the preservation of voice distinctiveness among anonymized speech. We conclude that we need a more robust alternative, e.g., to use the anonymized dataset to perform downstream speaker verification or speaker diarization tasks.
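For reference, the sketch below shows one way to obtain a GVD-style score from per-speaker similarity matrices, following the diagonal-dominance idea of the VPC definition; it is a simplified illustration (cosine similarities instead of calibrated LLR scores, no handling of self-trials), not the exact evaluation code.

```python
import numpy as np

def similarity_matrix(emb_a: dict[str, np.ndarray],
                      emb_b: dict[str, np.ndarray]) -> np.ndarray:
    """Mean cosine similarity between the utterance embeddings of every
    speaker pair; emb_*: speaker_id -> array of shape (n_utts, dim)."""
    speakers = sorted(emb_a)
    M = np.zeros((len(speakers), len(speakers)))
    for i, si in enumerate(speakers):
        A = emb_a[si] / np.linalg.norm(emb_a[si], axis=1, keepdims=True)
        for j, sj in enumerate(speakers):
            B = emb_b[sj] / np.linalg.norm(emb_b[sj], axis=1, keepdims=True)
            M[i, j] = (A @ B.T).mean()
    return M

def diagonal_dominance(M: np.ndarray) -> float:
    n = M.shape[0]
    off_diag = (M.sum() - np.trace(M)) / (n * (n - 1))
    return abs(np.trace(M) / n - off_diag)

def gain_of_voice_distinctiveness(M_oo: np.ndarray, M_aa: np.ndarray) -> float:
    """GVD in dB: 0 means distinctiveness is preserved, negative values
    mean the anonymized voices are less distinct than the originals."""
    return 10.0 * np.log10(diagonal_dominance(M_aa) / diagonal_dominance(M_oo))
```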
What is missing? Currently, there are no definitions or measurable criteria for success or the guarantee of full
privacy protection through anonymization, since all existing evaluations rely on assumptions and specific attack models. It is unknown when an SAS could be considered good enough for use on data where privacy protection matters, or how the remaining privacy risk of current systems can be accurately measured⁶.
⁶ A first approach for privacy risk assessment was proposed in the ZEBRA framework [30] but is so far not used as a standard metric in research.
In summary, being open source, the proposed framework serves as a platform for unifying researchers and research on this topic. Researchers can add their own SASs and evaluation metrics to the framework such that large-scale and extensive evaluations and comparisons become possible without much additional effort. In this way, we hope that this framework will help towards finding answers to the questions above, and towards the development of powerful anonymization tools.

VII. CONCLUSION
We proposed a new Python-based and modular open-source framework for speaker anonymization research. It allows combining, managing, and evaluating several anonymization approaches within one platform that is simple to apply and extend. We further presented various improvements to standard evaluation techniques for speaker anonymization. Specifically, we exchanged the previous Kaldi-based evaluation models for more powerful techniques using the ESPnet and SpeechBrain toolkits. Moreover, we showed that we could decrease the time required for evaluation by up to 95% by reducing training and test data while keeping the quality of the evaluations at comparable levels. We anticipate that these changes to common development and evaluation procedures will significantly facilitate and support speaker anonymization research in the near future.

REFERENCES
[1] N. Tomashenko, X. Wang, E. Vincent, J. Patino, B. M. L. Srivastava, P.-G. Noé, A. Nautsch, N. Evans, J. Yamagishi, B. O'Brien, et al., "The VoicePrivacy 2020 challenge: Results and findings," Computer Speech & Language, vol. 74, p. 101362, Jul. 2022.
[2] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Proc. Interspeech, 2018, pp. 3743–3747.
[3] N. Tomashenko, X. Wang, X. Miao, H. Nourtel, P. Champion, M. Todisco, E. Vincent, N. Evans, J. Yamagishi, and J. F. Bonastre, "The VoicePrivacy 2022 challenge evaluation plan," arXiv preprint arXiv:2203.12468, 2022.
[4] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in Proc. IEEE ASRU, Dec. 2011.
[5] C. Franzreb, T. Polzehl, and S. Moeller, "A comprehensive evaluation framework for speaker anonymization systems," in Proc. 3rd Symposium on Security and Privacy in Speech Communication, 2023, pp. 65–72.
[6] S. Zhang, Z. Li, and A. Das, "VoicePM: A robust privacy measurement on voice anonymity," in Proc. 16th ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec), 2023, pp. 215–226.
[7] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in Proc. Interspeech, 2018, pp. 2207–2211.
[8] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, "SpeechBrain: A general-purpose speech toolkit," arXiv preprint arXiv:2106.04624, 2021.
[9] S. Meyer, F. Lux, P. Denisov, J. Koch, P. Tilli, and N. T. Vu, "Speaker anonymization with phonetic intermediate representations," in Proc. Interspeech, 2022, pp. 4925–4929.
[10] S. Meyer, P. Tilli, P. Denisov, F. Lux, J. Koch, and N. T. Vu, "Anonymizing speech with generative adversarial networks to preserve speaker privacy," in Proc. IEEE SLT, 2023, pp. 912–919.
[11] S. Meyer, F. Lux, J. Koch, P. Denisov, P. Tilli, and N. T. Vu, "Prosody is not identity: A speaker anonymization approach using prosody cloning," in Proc. IEEE ICASSP, 2023, pp. 1–5.
[12] X. Miao, X. Wang, E. Cooper, J. Yamagishi, and N. Tomashenko, "Language-independent speaker anonymization approach using self-supervised pre-trained models," in Proc. Odyssey, 2022, pp. 279–286.
[13] X. Miao, X. Wang, E. Cooper, J. Yamagishi, and N. Tomashenko, "Analyzing language-independent speaker anonymization framework under unseen conditions," in Proc. Interspeech, 2022, pp. 4426–4430.
[14] X. Miao, X. Wang, E. Cooper, J. Yamagishi, and N. Tomashenko, "Speaker anonymization using orthogonal Householder neural network," IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 31, pp. 3681–3695, 2023.
[15] C. O. Mawalim, K. Galajit, J. Karnjana, S. Kidani, and M. Unoki, "Speaker anonymization by modifying fundamental frequency and x-vector singular value," Computer Speech & Language, vol. 73, p. 101326, 2022.
[16] P. Champion, A. Larcher, and D. Jouvet, "Are disentangled representations all you need to build speaker anonymization systems?," in Proc. Interspeech, 2022, pp. 2793–2797.
[17] A. S. Shamsabadi, B. M. L. Srivastava, A. Bellet, N. Vauquier, E. Vincent, M. Maouche, M. Tommasi, and N. Papernot, "Differentially private speaker anonymization," Proc. Privacy Enhancing Technologies, vol. 2023, no. 1, pp. 98–114, Jan. 2023.
[18] H. Turner, G. Lovisotto, and I. Martinovic, "Generating identities with mixture models for speaker anonymization," Computer Speech & Language, vol. 72, p. 101318, 2022.
[19] J. Yao, Q. Wang, L. Zhang, P. Guo, Y. Liang, and L. Xie, "NWPU-ASLP system for the VoicePrivacy 2022 challenge," in Proc. 2nd Symp. on Security and Privacy in Speech Communication, 2022.
[20] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Proc. Interspeech, 2015, pp. 3214–3218.
[21] K. Kasi and S. A. Zahorian, "Yet another algorithm for pitch tracking," in Proc. IEEE ICASSP, 2002, vol. 1, pp. 361–364.
[22] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. IEEE ICASSP, 2018, pp. 5329–5333.
[23] B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Proc. Interspeech, 2020, pp. 3830–3834.
[24] B. M. L. Srivastava, N. Tomashenko, X. Wang, E. Vincent, J. Yamagishi, M. Maouche, A. Bellet, and M. Tommasi, "Design choices for x-vector based speaker anonymization," in Proc. Interspeech, 2020, pp. 1713–1717.
[25] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter-based waveform model for statistical parametric speech synthesis," in Proc. IEEE ICASSP, 2019, pp. 5916–5920.
[26] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. NeurIPS, 2020, pp. 17022–17033.
[27] M. Gomez-Barrero, J. Galbally, C. Rathgeb, and C. Busch, "General framework to evaluate unlinkability in biometric template protection systems," IEEE Trans. Information Forensics and Security, vol. 13, no. 6, pp. 1406–1420, 2018.
[28] M. Maouche, B. M. L. Srivastava, N. Vauquier, A. Bellet, M. Tommasi, and E. Vincent, "A comparative study of speech anonymization metrics," in Proc. Interspeech, 2020, pp. 1708–1712.