This work has been funded by the German Research Foundation (DFG) in the transregio project Crossmodal Learning (TRR 169).
Fig. 1: Performance results when trained and tested on WSJ0-C3 of the proposed CRP, StoRM-BBED, the predictive baseline, and the generative baseline as a function of NFEs. In this work, the NFE is the number of NCSN++ evaluations. The number of discretization steps nsteps is chosen as described in Section 4.4. We have nsteps = NFE for CRP and the generative baseline. For StoRM-BBED, we have nsteps + 1 = NFE (see Section 4.5). All solid lines use the discretization schedule described in Section 4.4. The dotted blue line uses the discretization schedule from Section 4.2.
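The NFE bookkeeping from the caption can be made concrete with a small sketch. The function names below and the comments mentioning a score network and a predictive network are illustrative assumptions, not the paper's actual implementation; the sketch only shows why a plain reverse-SDE sampler yields NFE = nsteps while a StoRM-style sampler, which runs its predictive network once up front, yields NFE = nsteps + 1.

```python
def nfe_plain_reverse_sde(n_steps: int) -> int:
    # One NCSN++ (score network) evaluation per reverse diffusion step,
    # so NFE = n_steps for CRP and the generative baseline.
    nfe = 0
    for _ in range(n_steps):
        nfe += 1  # a call like score_net(x_t, t) would happen here
    return nfe


def nfe_storm_like(n_steps: int) -> int:
    # A StoRM-style sampler first runs its predictive part once to obtain
    # an initial estimate, then solves the reverse SDE for n_steps steps,
    # so NFE = n_steps + 1.
    nfe = 1  # a call like predictive_net(y) would happen here
    for _ in range(n_steps):
        nfe += 1  # a call like score_net(x_t, t) would happen here
    return nfe
```

Under this counting, both samplers spend the same budget of 5 network evaluations when CRP takes 5 reverse steps and StoRM-BBED takes 4.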
We train with a batch size of 16, log the averaged PESQ of 10 randomly selected files from the validation set during training, and select the best-performing model for testing. We do not employ EMA for tracking the weights.

StoRM: As the proposed method uses a generative loss and a predictive loss, we compare it against StoRM [12], which uses the exact same generative and predictive losses. StoRM has a predictive part and a generative part. In our setup, the predictive part uses the predictive baseline's NCSN++ architecture, and the generative part is given by the generative baseline. We call this modified version StoRM-BBED. For inference, we use the same discretization schedule as for the proposed method, described in Section 4.4. Note that the predictive part must always be executed once before solving the reverse SDE; therefore, NFE = nsteps + 1 for StoRM-BBED.

4.6. Datasets

To analyze the generalization performance of the proposed method for SE, we cross-evaluate on two different datasets. Specifically, we train either on the VoiceBank-DEMAND (VBD) dataset or on the WSJ0-CHiME3 (WSJ0-C3) dataset, but we test only on WSJ0-C3. We choose to test on WSJ0-C3 rather than on VBD because the utterances of VBD have a short average length of 2.5 seconds, which is not ideal for evaluation with PESQ or POLQA [24].

VBD: The publicly available VBD dataset [25] is commonly employed as a benchmark in single-channel SE tasks. Note that this dataset does not have a validation set. To ensure unbiased validation, we split the training data into a training set and a validation set, specifically reserving the speakers “p226” and “p287” for validation. This has also been done in previous works [8, 13].

WSJ0-C3: The WSJ0-C3 dataset, also used in [8, 13], mixes clean speech utterances from the Wall Street Journal dataset [26] with noise signals from the CHiME3 dataset [27] at a signal-to-noise ratio (SNR) sampled uniformly between 0 and 20 dB. The dataset is split into a training set (12777 files), a validation set (1206 files), and a test set (651 files).

5. RESULTS

We first show that the proposed method remains relatively stable when lowering the NFEs. Second, we show that the proposed method with a single NFE is not simply reduced to the predictive baseline. Last, we show that the proposed method with only 5 NFEs performs as well as the generative baseline with 60 NFEs.

First, Fig. 1 shows the performance results of the proposed CRP, the generative BBED baseline, the predictive baseline, and StoRM-BBED. In all plots of Fig. 1 we clearly observe that the generative baseline (blue solid line) and StoRM-BBED (yellow solid line) drop in performance when reducing the NFEs to a few steps. For instance, both the generative baseline and StoRM-BBED achieve only around 1 on the MOS scale with NFE = 1 or NFE = 2, respectively. Note that a 1 on the MOS scale denotes “bad” quality according to ITU-T P.800. This is in contrast to the proposed CRP method, where performance remains relatively stable and achieves “good” (POLQA and WVMOS) or “fair” (PESQ) quality (see the black solid line in Fig. 1).

Second, if only one NFE is applied, the performance of the proposed method seems to be reduced to that of the predictive baseline. For instance, in the POLQA plot, we report a value of 3.87 for the proposed method at NFE = 1 and a very similar value of 3.85 for the predictive baseline. However, the advantage of the generative approach becomes visible in Tab. 1, where we show results for the mismatched case, i.e., we train on the VBD training set but test on the WSJ0-C3 test set. We observe that the proposed generative method outperforms the predictive baseline by 0.15 in PESQ even with only one NFE. These results imply that the better generalization to unseen data observed when comparing diffusion models against predictive models in [8, 28] carries over to the proposed approach even when NFE = 1. Last, we find that the proposed method with 5 NFEs achieves virtually the same performance as the generative baseline with 60 NFEs, both in the matched case (see Fig. 1, compare the solid black line to the dotted blue line) and in the mismatched case (see Tab. 1, compare the first row to the last row).

6. CONCLUSIONS

In this work, we proposed to fine-tune a score-based diffusion model by optimizing a predictive loss to correct errors caused by solving the reverse process. We find that the proposed method remains relatively stable in performance when lowering the number of function evaluations (NFEs). In contrast, the pure generative approach and StoRM-BBED drastically reduce in performance when lowering the NFEs to a few or a single reverse diffusion step. Moreover, we showed that, in contrast to a predictive baseline, the proposed method achieves improved generalization to unseen data even with one diffusion step.
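The uniform-SNR mixing used to build WSJ0-C3-style data (Section 4.6) can be sketched as follows. This is a minimal illustration of mixing a noise signal into clean speech at a target SNR, assuming NumPy and a simple power-based gain; the dataset's actual generation scripts may differ in details such as resampling, file pairing, and level normalization.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the mixture has the requested SNR, then add it to `speech`."""
    noise = noise[: len(speech)]          # truncate noise to the speech length
    p_speech = np.mean(speech ** 2)       # speech power
    p_noise = np.mean(noise ** 2)         # noise power before scaling
    # Choose gain so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise


# Illustrative usage with random signals standing in for WSJ0 speech and CHiME3 noise.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
snr_db = rng.uniform(0.0, 20.0)           # SNR drawn uniformly from [0, 20] dB
noisy = mix_at_snr(speech, noise, snr_db)
```

By construction, the achieved SNR of the mixture equals `snr_db` exactly, since the noise gain is solved from the power ratio in closed form.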
References

[1] R. C. Hendriks, T. Gerkmann, and J. Jensen, DFT-domain based single-microphone noise reduction for speech enhancement: A survey of the state-of-the-art. Morgan & Claypool, 2013.

[2] T. Gerkmann and E. Vincent, “Spectral masking and filtering,” in Audio Source Separation and Speech Enhancement (E. Vincent, T. Virtanen, and S. Gannot, eds.), John Wiley & Sons, 2018.

[3] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE Trans. on Audio, Speech, and Language Proc. (TASLP), vol. 26, no. 10, pp. 1702–1726, 2018.

[4] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE Trans. on Audio, Speech, and Language Proc. (TASLP), vol. 27, no. 8, pp. 1256–1266, 2019.

[5] Y.-J. Lu, Y. Tsao, and S. Watanabe, “A study on speech enhancement based on diffusion probabilistic model,” IEEE Asia-Pacific Signal and Inf. Proc. Assoc. Annual Summit and Conf. (APSIPA ASC), pp. 659–666, 2021.

[6] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, “Conditional diffusion probabilistic model for speech enhancement,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2022.

[7] S. Welker, J. Richter, and T. Gerkmann, “Speech enhancement with score-based generative models in the complex STFT domain,” Interspeech, 2022.

[8] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023.

[9] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Inf. Proc. Systems (NeurIPS), vol. 33, pp. 6840–6851, 2020.

[10] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” Int. Conf. on Learning Representations (ICLR), 2021.

[11] P. Vincent, “A connection between score matching and denoising autoencoders,” Neural Computation, vol. 23, no. 7, pp. 1661–1674, 2011.

[12] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, 2023.

[13] B. Lay, S. Welker, J. Richter, and T. Gerkmann, “Reducing the prior mismatch of stochastic differential equations for diffusion-based speech enhancement,” Interspeech, 2023.

[14] T. Gerkmann and R. Martin, “Empirical distributions of DFT-domain speech coefficients based on estimated speech variances,” Int. Workshop on Acoustic Echo and Noise Control, 2010.

[15] I. Karatzas and S. E. Shreve, Brownian Motion and Stochastic Calculus. Springer, 2nd ed., 1996.

[16] W. Rudin, Real and Complex Analysis. McGraw-Hill, Inc., 3rd ed., 1987.

[17] B. D. Anderson, “Reverse-time diffusion equation models,” Stochastic Processes and their Applications, vol. 12, no. 3, pp. 313–326, 1982.

[18] C. M. Bender and S. A. Orszag, Advanced Mathematical Methods for Scientists and Engineers. McGraw-Hill, 1978.

[19] Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution,” in Advances in Neural Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019.

[20] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), vol. 2, pp. 749–752, 2001.

[21] ITU-T Rec. P.863, “Perceptual objective listening quality prediction,” Int. Telecom. Union (ITU), 2018.

[22] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?,” in IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp. 626–630, 2019.

[23] Y. Song and S. Ermon, “Improved techniques for training score-based generative models,” Advances in Neural Inf. Proc. Systems (NeurIPS), vol. 33, pp. 12438–12448, 2020.

[24] ITU-T, “Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs.”

[25] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Investigating RNN-based speech enhancement methods for noise-robust text-to-speech,” ISCA Speech Synthesis Workshop (SSW), pp. 146–152, 2016.

[26] J. S. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) Complete.”

[27] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third CHiME speech separation and recognition challenge: Dataset, task and baselines,” IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 504–511, 2015.

[28] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration,” IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp. 1–5, 2023.