SINGLE AND FEW-STEP DIFFUSION FOR GENERATIVE SPEECH ENHANCEMENT

Bunlong Lay, Jean-Marie Lemercier, Julius Richter, Timo Gerkmann

Signal Processing, University of Hamburg, Hamburg, Germany


bunlong.lay@uni-hamburg.de

ABSTRACT

Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score estimation is called multiple times to solve the iterative reverse process. This results in a slow inference process and causes discretization errors that accumulate over the sampling trajectory. In this paper, we address these limitations through a two-stage training approach. In the first stage, we train the diffusion model in the usual way using the generative denoising score matching loss. In the second stage, we compute the enhanced signal by solving the reverse process and compare the resulting estimate to the clean speech target using a predictive loss. We show that this second training stage enables the model to achieve the same performance as the baseline using only 5 function evaluations instead of 60. While the performance of usual generative diffusion algorithms drops dramatically when lowering the number of function evaluations (NFEs) towards single-step diffusion, we show that our proposed method keeps a steady performance and therefore largely outperforms the diffusion baseline in this setting, and also generalizes better than its predictive counterpart.

Index Terms— Speech enhancement, diffusion models, stochastic differential equations.

1. INTRODUCTION

The objective of speech enhancement (SE) is to retrieve the original clean speech signal from a noisy mixture that is affected by environmental noise [1]. Traditional approaches attempt to leverage statistical relationships between the clean speech signal and the surrounding environmental noise [2]. In recent years, various machine learning techniques have been introduced that treat SE as a discriminative learning task [3, 4], resulting in so-called predictive approaches.

In contrast to predictive approaches, which learn a direct mapping from noisy to clean speech, generative approaches learn a prior distribution over clean speech data. Recently, so-called score-based generative models, also known as diffusion models, have been introduced for SE [5–8]. The fundamental concept behind these models is to iteratively add Gaussian noise to the data using a discrete and fixed Markov chain, known as the forward process, thereby transforming the data into a tractable distribution such as the standard normal distribution. A neural network is then trained to invert this diffusion process in a so-called reverse process [9]. By letting the step size between two discrete Markov chain states approach zero, the discrete Markov chain becomes a continuous-time stochastic process that can be described using stochastic differential equations (SDEs). This offers more flexibility and opportunities than approaches based on discrete Markov chains [10]. For example, it allows the use of general-purpose SDE solvers to numerically integrate the reverse process, which can affect the performance and the number of iteration steps. Solving the reverse process, also called the reverse SDE when employing SDEs, amounts to enhancing noisy mixtures, as the reverse SDE transforms the distribution of noisy speech into the distribution of clean speech. To solve the reverse SDE at inference, it is essential to evaluate the score function of the perturbed data, which is typically intractable. Thus, a neural network, also called the score model, is trained to approximate the required score function. The score model is typically trained using the denoising score matching (DSM) loss [11], which has shown great success for SE [8, 12, 13]. However, solving the reverse SDE differs from the training condition in various aspects. First, there exists the so-called prior mismatch [13], which we define as the difference between the terminating distribution of the forward process and the initial distribution used to solve the reverse SDE. Second, the reverse process introduces a discretization error when using a numerical solver. This error can be mitigated by using more sophisticated solvers, at the cost of increased NFEs. Last, the score model itself is not a perfect estimator of the score function and produces errors. As the aforementioned errors accumulate over the reverse trajectory, the obtained diffusion states can deviate from the distribution seen during training, which further impairs the score model performance.

This work aims to mitigate these errors by employing a two-stage training. We first train using the DSM objective and then fine-tune the score model with a predictive loss to correct the discretization and score model errors. To do so, we pick a reverse diffusion solver, e.g., the Euler-Maruyama (EuM) first-order method in our case, to obtain a clean speech estimate. Then, we fine-tune the score model parameters with respect to the mean squared error (MSE) loss between the estimate and the clean speech data. Backpropagating through the whole reverse trajectory is computationally infeasible when using many steps. Therefore, we propose to only use the gradients accumulated during the last reverse diffusion step to update the parameters.

We compare our approach against (a) our generative baseline BBED [13], (b) a predictive baseline using a similar network architecture as used for score estimation, as well as (c) the StoRM approach [12], which also combines a generative loss and a predictive loss. We show that the proposed method is very robust to a decrease in the NFEs used for reverse diffusion, as opposed to the purely generative approach and StoRM [12]. In fact, with only 5 NFEs the proposed method performs virtually the same as the generative baseline with 60 NFEs. Furthermore, when taking only one reverse diffusion step, we achieve competitive SE performance and show that the proposed generative approach generalizes better to unseen data than the predictive baseline.

This work has been funded by the German Research Foundation (DFG) in the transregio project Crossmodal Learning (TRR 169).
2. BACKGROUND

2.1. Notation and data representation

The task of SE is to estimate the clean speech signal S from a noisy mixture Y = S + N, where N is environmental noise. All variables in bold are the coefficients of a compressed complex-valued spectrogram obtained through the short-time Fourier transform (STFT) and magnitude compression, e.g., Y ∈ C^d with d = KF, where K is the number of STFT frames and F the number of frequency bins. As in [8], we use the magnitude compression

    c̄ = β |c|^α e^{i∠(c)}    (1)

with parameters β, α > 0 to compensate for the typically heavy-tailed distribution of STFT speech magnitudes [14]. We set β = 0.15 and α = 0.5 as in previous work [8, 12, 13].
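For illustration, here is a minimal NumPy sketch of the magnitude compression in (1) and its inverse; the function names are hypothetical and the actual implementation of [8] may differ in detail.

```python
import numpy as np

def compress(C, beta=0.15, alpha=0.5):
    """Magnitude compression of Eq. (1): c_bar = beta * |c|^alpha * exp(i*angle(c))."""
    return beta * np.abs(C) ** alpha * np.exp(1j * np.angle(C))

def decompress(C_bar, beta=0.15, alpha=0.5):
    """Inverse of Eq. (1), mapping an enhanced spectrogram back before the inverse STFT."""
    return (np.abs(C_bar) / beta) ** (1.0 / alpha) * np.exp(1j * np.angle(C_bar))
```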
2.2. Stochastic Differential Equations

Following [7, 8, 13], we model the forward process of the score-based generative model with an SDE defined on 0 ≤ t < T_max:

    dX_t = f(X_t, Y) dt + g(t) dw,    (2)

where w is the standard Wiener process [15], X_t is the current process state with initial condition X_0 = S, and t is a continuous diffusion time variable describing the progress of the process, ending at the last diffusion time T_max. The drift term f(X_t, Y) dt can be integrated by Lebesgue integration [16], and the diffusion term g(t) dw follows Itô integration [15]. The diffusion coefficient g regulates the amount of Gaussian noise that is added to the process, and the drift f affects the mean and the variance of X_t in the case of linear SDEs (see [15, Eqs. (6.10), (6.11)]). The process state X_t follows a Gaussian distribution [15, Section 5] called the perturbation kernel:

    p_{0t}(X_t | X_0, Y) = N_C(X_t; µ(X_0, Y, t), σ(t)^2 I).    (3)

Under mild regularity conditions, each forward SDE (2) can be associated with a reverse SDE [17]:

    dX̃_t = [ −f(X̃_t, Y) + g(t)^2 ∇_{X̃_t} log p_t(X̃_t | Y) ] dt + g(t) dw̃,    (4)

where dw̃ is a Wiener process going backward through the diffusion time. In particular, the reverse process starts at t = T and ends at t = 0. Here, T < T_max is a parameter that needs to be set for practical reasons, as the last diffusion time T_max is only reached in the limit. The score function ∇_{X_t} log p_t(X_t | Y) is approximated by a neural network called the score model s_θ(X_t, Y, t), which is parameterized by a set of parameters θ. Assuming s_θ is available, we can generate an estimate of the clean speech data from Y by solving the reverse SDE.
2.3. Solving the reverse process

To solve the reverse SDE, the first-order EuM method can be employed to keep the NFEs low. For the EuM method, a discretization schedule (t_N, t_{N−1}, ..., t_0 = 0) for the reverse process is chosen. The reverse process used for inference starts with X̃_{t_N} ∼ N_C(Y, σ(t_N)^2 I) and iteratively computes [10]

    X̃_{t_i} = X̃_{t_{i+1}} − [ f(X̃_{t_{i+1}}, Y) − g(t_i)^2 s_θ(X̃_{t_{i+1}}, Y, t_i) ] ∆t(i) + g(t_i) √(∆t(i)) Z_{t_i},    (5)

where ∆t(i) = t_{i+1} − t_i and Z_{t_i} ∼ N_C(0, I). When fixing the step size ∆t(i), the reverse starting point 0 < t_rsp := t_N ≤ T can be used for trading performance for computational speed. For optimal performance, one should set t_rsp = T to reduce the so-called prior mismatch [13] between the terminating forward distribution and the initial reverse distribution. Using a smaller t_rsp reduces the number of iterations, but may also degrade the performance [13]. The last iteration outputs X̃_{t_0}, approximating the clean speech signal.
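For illustration, a minimal PyTorch-style sketch of the EuM iteration in (5). The callables f, g, sigma and score_model are assumed to implement the drift, diffusion coefficient, perturbation-kernel standard deviation and score model introduced above; note that this sketch evaluates the coefficients and the score at the current diffusion time, so the exact time-indexing convention should be aligned with the reference implementation.

```python
import torch

def euler_maruyama_sample(score_model, Y, f, g, sigma, t_schedule):
    """Sketch of the reverse iteration of Eq. (5).

    t_schedule : decreasing diffusion times (t_N, ..., t_0 = 0), e.g. a 1-D tensor.
    f, g, sigma: drift, diffusion coefficient and perturbation-kernel std of the SDE.
    """
    # Initialization X_{t_N} ~ N_C(Y, sigma(t_N)^2 I): noisy mixture plus Gaussian noise.
    X = Y + sigma(t_schedule[0]) * torch.randn_like(Y)
    for t_cur, t_next in zip(t_schedule[:-1], t_schedule[1:]):
        dt = t_cur - t_next                      # step size Delta t(i)
        score = score_model(X, Y, t_cur)         # one NFE per iteration
        X = X - (f(X, Y) - g(t_cur) ** 2 * score) * dt
        X = X + g(t_cur) * dt ** 0.5 * torch.randn_like(X)
    return X                                     # approximates the clean speech X_0
```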
2.4. Brownian Bridge with Exponential Diffusion Coefficient (BBED)

Recently, the BBED SDE [13] has been shown to outperform existing SDEs [7, 8] for the task of SE in terms of several intrusive instrumental performance metrics. The BBED drift and diffusion coefficients are given by

    f(X_t, Y) = (Y − X_t) / (1 − t),    (6)

and

    g(t) = c k^t,  where c, k > 0,    (7)

for 0 ≤ t ≤ T < 1, where T, as detailed in Section 2.2, is the terminating time for the forward SDE, which we set slightly smaller than 1 to avoid division by zero in (6). The closed-form solution for the variance of the perturbation kernel can be computed from [15, (6.11)]:

    σ(t)^2 = (1 − t) c^2 [ (k^{2t} − 1 + t) + log(k^{2k^2}) (1 − t) E ],    (8)

    E = Ei[2(t − 1) log(k)] − Ei[−2 log(k)],    (9)

where Ei[·] denotes the exponential integral function [18]. The variance admits one peak between 0 and 1 and is zero for t ∈ {0, 1}. The mean can be computed from [15, (6.10)] and linearly interpolates between the noisy mixture Y and the clean speech signal X_0:

    µ(X_0, Y, t) = (1 − t) X_0 + t Y.    (10)
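As an illustration, a NumPy sketch of the BBED quantities, assuming the forms in (6)–(10) and the parameterization k = 2.6, c = 0.51 from Section 4.2; the helper names are hypothetical.

```python
import numpy as np
from scipy.special import expi  # exponential integral Ei

def bbed_drift(Xt, Y, t):
    """Drift of Eq. (6)."""
    return (Y - Xt) / (1.0 - t)

def bbed_g_and_sigma(t, c=0.51, k=2.6):
    """Diffusion coefficient of Eq. (7) and perturbation-kernel std of Eqs. (8)-(9)."""
    g = c * k ** t
    E = expi(2.0 * (t - 1.0) * np.log(k)) - expi(-2.0 * np.log(k))           # Eq. (9)
    # 2 * k**2 * log(k) equals log(k^{2 k^2}) in Eq. (8)
    var = (1.0 - t) * c ** 2 * ((k ** (2.0 * t) - 1.0 + t)
                                + 2.0 * k ** 2 * np.log(k) * (1.0 - t) * E)  # Eq. (8)
    return g, np.sqrt(var)

def bbed_mean(X0, Y, t):
    """Mean of Eq. (10): linear interpolation between clean speech and noisy mixture."""
    return (1.0 - t) * X0 + t * Y
```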
3. PROPOSED TWO-STAGE TRAINING

To reduce the NFEs during inference, we now propose a two-stage training procedure. We first train the score model on the DSM loss [7, 8, 19] and then fine-tune it on a predictive loss, which we denote in the following as the correcting the reverse process (CRP) loss.

3.1. Denoising Score Matching (DSM)

Following Vincent [11], we fit the score model to the score of the perturbation kernel,

    ∇_{X_t} log p_{0t}(X_t | X_0, Y) = −(X_t − µ(X_0, Y, t)) / σ(t)^2 = −Z / σ(t),    (11)

where the last equality holds because X_t = µ(X_0, Y, t) + σ(t) Z with Z ∼ N_C(0, I). At each training step we proceed as follows: 1) we sample a training pair of clean speech and noisy speech (X_0, Y), 2) we sample t uniformly from [t_ϵ, T], where t_ϵ > 0 is a small hyperparameter that assures numerical stability [8], and 3) we compute X_t = µ(X_0, Y, t) + σ(t) Z. Finally, we obtain the DSM loss by minimizing the distance between the term on the right in (11) and the score model in the L2 norm:

    arg min_θ  E_{t, (X_0, Y), Z, X_t | (X_0, Y)} [ ‖ s_θ(X_t, Y, t) + Z / σ(t) ‖_2^2 ].    (12)
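A compact PyTorch-style sketch of one DSM training step following 1)–3) and (12); mean_fn and sigma_fn are assumed helpers returning µ(X_0, Y, t) and σ(t), and the handling of complex spectrograms (complex tensors vs. stacked real and imaginary channels) is left to the underlying implementation.

```python
import torch

def dsm_loss(score_model, X0, Y, mean_fn, sigma_fn, t_eps=0.03, T=0.999):
    """One DSM training step, Eqs. (11)-(12)."""
    # 1) (X0, Y) is the sampled training pair provided by the data loader.
    # 2) Sample t uniformly from [t_eps, T], one value per batch element.
    t = t_eps + (T - t_eps) * torch.rand(X0.shape[0], device=X0.device)
    t_b = t.view(-1, *([1] * (X0.dim() - 1)))   # broadcast over the spectrogram dimensions
    # 3) Draw X_t from the perturbation kernel: X_t = mu(X0, Y, t) + sigma(t) * Z.
    Z = torch.randn_like(X0)
    sigma = sigma_fn(t_b)
    Xt = mean_fn(X0, Y, t_b) + sigma * Z
    # Eq. (12): the score model should match -Z / sigma(t).
    err = score_model(Xt, Y, t) + Z / sigma
    return err.abs().pow(2).flatten(start_dim=1).sum(dim=1).mean()
```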
3.2. Correcting the Reverse Process (CRP)

By solving the reverse process as described in Section 2.3, we obtain an estimate X̃_0 of the clean speech signal X_0. This estimate, however, contains errors from several sources. First, there is the discretization error of the EuM method. Second, there are errors caused by the score model over the discretization schedule. Moreover, there is the prior mismatch, as discussed in Section 2.3. This mismatch and the aforementioned error sources are not addressed by the DSM loss in (12). This is because in the third step of computing the DSM loss, we calculate X_t using the solution of the forward SDE given by (3), whereas during inference we compute the solution X̃_t of the reverse SDE. After training on the DSM loss, we therefore propose to retrain the score model to adapt to these errors and the prior mismatch. Specifically, we first fix a reverse starting point t_rsp and a discretization schedule (t_rsp = t_N, ..., t_0 = 0). Then, for each training pair of clean speech and noisy speech (X_0, Y), we compute X̃_0 based on (5) by starting at t_rsp and taking steps according to the fixed discretization schedule, i.e., we specifically run the inference process.

In the second training stage, we then optimize the correcting the reverse process (CRP) loss

    arg min_θ  E_{(X_0, Y), X̃_0 | (X_0, Y)} [ ‖ X_0 − X̃_0 ‖_2^2 ],    (13)

by updating the weights based on the gradients of the last score model call s_θ(X̃_{t_1}, Y, t_1) used to obtain X̃_0. This also means that when we iterate through (5), the gradients of all other score model calls s_θ(X̃_{t_k}, Y, t_k), k = N, ..., 2, are not used to update the weights of the neural network. The decision to exclusively update the weights in the last score model call is driven by the fact that the accumulated discretization and score model errors can be corrected by the last score model call while keeping the memory cost low.
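A corresponding sketch of the CRP fine-tuning objective (13), where every reverse step except the last is executed without gradient tracking; the callables are the same hypothetical helpers as in the sketches above.

```python
import torch

def crp_loss(score_model, X0, Y, f, g, sigma, t_schedule):
    """CRP objective of Eq. (13): solve the reverse process, but backpropagate
    only through the last score model call (Section 3.2)."""
    with torch.no_grad():
        # All reverse steps except the last one, without building a computation graph.
        X = Y + sigma(t_schedule[0]) * torch.randn_like(Y)
        for t_cur, t_next in zip(t_schedule[:-2], t_schedule[1:-1]):
            dt = t_cur - t_next
            score = score_model(X, Y, t_cur)
            X = X - (f(X, Y) - g(t_cur) ** 2 * score) * dt
            X = X + g(t_cur) * dt ** 0.5 * torch.randn_like(X)
    # Final step t_1 -> t_0 = 0: only this call s_theta(X_{t_1}, Y, t_1) keeps its gradients.
    t_1, t_0 = t_schedule[-2], t_schedule[-1]
    X0_hat = X - (f(X, Y) - g(t_1) ** 2 * score_model(X, Y, t_1)) * (t_1 - t_0)
    # (The final noise injection of Eq. (5) is omitted here to obtain a point estimate.)
    return (X0 - X0_hat).abs().pow(2).mean()     # squared-error loss of Eq. (13)
```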
4. EXPERIMENTAL SETUP

4.1. Metrics

We evaluate the performance on the perceptual metrics wideband PESQ [20] and POLQA [21], and on the energy-based metric SI-SDR [22]. We also evaluate on the reference-free metric WVMOS, which uses a neural network to predict MOS values.

Since all methods considered in this work use NCSN++ architectures, we define the NFEs as the number of NCSN++ calls needed to produce the enhanced file. It is a measure of the computational expense during inference.
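For the intrusive metrics, a small sketch using the publicly available pesq package for wideband PESQ together with a direct implementation of SI-SDR following [22]; POLQA [21] and WVMOS require separate tools and are not shown here.

```python
import numpy as np
from pesq import pesq  # pip install pesq; 'wb' selects the wideband mode

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB for time-domain signals, following [22]."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target, noise = alpha * reference, estimate - alpha * reference
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

def evaluate_pair(clean, enhanced, fs=16000):
    """Wideband PESQ and SI-SDR for one utterance sampled at 16 kHz."""
    return {"pesq_wb": pesq(fs, clean, enhanced, "wb"), "si_sdr": si_sdr(clean, enhanced)}
```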
4.2. BBED parameterization

A parameterization for the BBED SDE has already been grid-searched in [13]. We therefore simply set k = 2.6, c = 0.51 and T = 0.999, as is done in [13].

For inference, we employ the predictor-corrector scheme [10]. The predictor is the EuM method and the corrector is the annealed Langevin dynamics method. We start the reverse process at 0.999 and use 30 discretization steps, resulting in 60 NFEs.

4.3. Training on the DSM loss

For the score model s_θ(X_t, Y, t), we employ the Noise Conditional Score Network (NCSN++) architecture (see [8, 10] for more details). The first training stage mainly follows the setup from [13]. More precisely, we first train the model on the DSM loss function defined in (12) with a learning rate of 10^{-4} using the ADAM optimizer with a decay of 10^{-6}. In addition, an exponential moving average (EMA) of the weights of NCSN++ is tracked with a decay of 0.999, to be used for sampling [23]. Moreover, a lower bound for the diffusion time t is set to t_ϵ = 0.03 to avoid numerical instability (see [8, 13] for details). Furthermore, we train for 200 epochs with a batch size of 16, log the averaged PESQ of 10 randomly selected files from the validation set during training, and select the best-performing model for the second training stage.

4.4. Training on the CRP loss

In the second training stage, we train on (13) with the same ADAM and exponential moving average configurations as in the first training stage. We use a batch size of 16 and train for only 10 epochs. We also validate during training on 10 randomly selected files from the validation set and choose the best-performing model for testing.

To train on the CRP loss in (13), we have to choose a discretization schedule. To this end, we set the reverse starting point t_rsp = 0.5. This choice is based on [13], where it has been observed that BBED can perform as well for t_rsp = 0.5 as for t_rsp = 0.999 while reducing the NFEs. The discretization schedule is set as follows: we take n_steps ∈ {1, 2, 3, 4, 5} discretization steps, where the first n_steps − 1 steps are spaced uniformly from t_rsp to t_ϵ and the last step is taken from t_ϵ to 0. When n_steps = 1, we take only one step from t_rsp to 0 directly. We do not use a corrector when training on the CRP loss, therefore NFE = n_steps. In addition, we use the same discretization schedule for testing as for training.
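One possible way to construct this discretization schedule is sketched below; whether a "step" counts grid points or intervals is an assumption of this sketch and should be matched to the actual setup.

```python
import torch

def crp_schedule(n_steps, t_rsp=0.5, t_eps=0.03):
    """Schedule of Section 4.4: n_steps - 1 uniform steps from t_rsp to t_eps,
    plus a final step from t_eps to 0; for n_steps = 1, a single step t_rsp -> 0."""
    if n_steps == 1:
        return torch.tensor([t_rsp, 0.0])
    ts = torch.linspace(t_rsp, t_eps, n_steps)   # grid t_N = t_rsp, ..., t_1 = t_eps
    return torch.cat([ts, torch.zeros(1)])       # append t_0 = 0

# Example: crp_schedule(5) -> tensor([0.5000, 0.3825, 0.2650, 0.1475, 0.0300, 0.0000])
```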
4.5. Baselines

Generative baseline: We call the BBED SDE parameterized as in Section 4.2 and trained as described in Section 4.3 the generative baseline.

Predictive baseline: We investigate the benefits of the generative BBED baseline over a predictive baseline by training the score model architecture on a predictive loss. We therefore train NCSN++ on the MSE loss between the enhanced and the clean speech signal, as is done in the second training stage in (13), and also use the same magnitude-compressed STFT representation from Section 2.1 as the input of the predictive NCSN++ baseline. Moreover, since there is no diffusion time embedding t in the input for the predictive task, we removed the noise-conditioned layers. The modified architecture differs by less than 1 percent from its original size and therefore hardly changes the capacity of the network. Similar to training on the DSM loss, we use the ADAM optimizer with a learning rate of 10^{-4} and a decay of 10^{-6} for a maximum of 200 epochs. We train with a batch size of 16, log the averaged PESQ of 10 randomly selected files from the validation set during training, and select the best-performing model for testing. We do not employ EMA for tracking the weights.

[Fig. 1 shows four panels plotting POLQA, PESQ, WVMOS and SI-SDR (in dB) against the NFEs.]

Fig. 1: Performance results of the proposed CRP, StoRM-BBED, the predictive baseline and the generative baseline as a function of the NFEs, when trained and tested on WSJ0-C3. In this work, the NFEs denote the number of NCSN++ evaluations. The number of discretization steps n_steps is chosen as described in Section 4.4. We have n_steps = NFE for CRP and the generative baseline. For StoRM-BBED, we have n_steps + 1 = NFE (see Section 4.5). All solid lines use the same discretization schedule as described in Section 4.4. The dotted blue line uses the discretization schedule of Section 4.2.

StoRM: As the proposed method uses a generative loss and a predictive loss, we want to compare it against StoRM [12], which also uses the exact same generative and predictive losses. StoRM has a predictive part and a generative part. In our setup, the predictive part uses the predictive baseline's NCSN++ architecture, and the generative part is given by the generative baseline. We call this modified version StoRM-BBED. For inference, we use the same discretization schedule as for the proposed method described in Section 4.4. Note that the predictive part must always be executed once before solving the reverse SDE. Therefore, NFE = n_steps + 1 for StoRM-BBED.

4.6. Datasets

To analyze the generalization performance of the proposed method for SE, we cross-evaluate on two different datasets. Specifically, we train either on the VoiceBank-DEMAND dataset (VBD) or on the WSJ0-CHiME3 (WSJ0-C3) dataset, but we test only on WSJ0-C3. We choose to test on WSJ0-C3 rather than on VBD because the utterances of VBD have a short average length of 2.5 seconds, which is not ideal for evaluation with PESQ or POLQA [24].

VBD: The publicly available VBD dataset [25] is commonly employed as a benchmark in single-channel SE tasks. Note that this dataset does not have a validation set. To ensure unbiased validation, we split the training data into a training set and a validation set. For validation, we specifically reserve the speakers "p226" and "p287" from the dataset. This has also been done in previous works [8, 13].

WSJ0-C3: The WSJ0-C3 dataset, which is also used in [8, 13], mixes clean speech utterances from the Wall Street Journal dataset [26] with noise signals from the CHiME3 dataset [27] at a uniformly sampled signal-to-noise ratio (SNR) between 0 and 20 dB. The dataset is split into a training set (12777 files), a validation set (1206 files) and a test set (651 files).
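As an illustration of this mixing procedure, a small sketch that scales a noise signal to an SNR drawn uniformly from 0 to 20 dB before adding it to a clean utterance; the actual file selection and preprocessing used in [8, 13] are not reproduced here.

```python
import numpy as np

def mix_at_random_snr(speech, noise, snr_range=(0.0, 20.0), rng=None):
    """Return speech + scaled noise with an SNR (in dB) drawn uniformly from snr_range."""
    rng = np.random.default_rng() if rng is None else rng
    snr_db = rng.uniform(*snr_range)
    noise = noise[: len(speech)]                       # assumes the noise is long enough
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise, snr_db
```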
5. RESULTS

We first show that the proposed method remains relatively stable when lowering the NFEs. Second, we show that the proposed method with a single NFE is not simply reduced to the predictive baseline. Last, we show that the proposed method with only 5 NFEs performs as well as the generative baseline with 60 NFEs.

First, Fig. 1 shows the performance results of the proposed CRP, the generative BBED baseline, the predictive baseline and StoRM-BBED. In all plots of Fig. 1 we clearly observe that the generative baseline (blue solid line) and StoRM-BBED (yellow solid line) drop in performance when reducing the NFEs to few steps. For instance, both the generative baseline and StoRM-BBED achieve only around 1 on the MOS scale with NFE = 1 and NFE = 2, respectively. Note that a 1 on the MOS scale denotes "bad" quality according to ITU-T P.800. This is in contrast to the proposed CRP method, whose performance remains relatively stable and achieves "good" (POLQA and WVMOS) or "fair" (PESQ) quality (see the black solid line in Fig. 1).

Second, if only one NFE is applied, the performance of the proposed method seems to be reduced to that of the predictive baseline. For instance, in the POLQA plot, we report a value of 3.87 for the proposed method at NFE = 1 and a very similar value of 3.85 for the predictive baseline. However, the advantage of the generative approach becomes visible in Tab. 1, where we show results for the mismatched case, i.e., we train on the VBD training set but test on the WSJ0-C3 test set. We observe that the proposed generative method outperforms the predictive baseline by 0.15 in PESQ, also with only one NFE. These results imply that the better generalization to unseen data observed when comparing diffusion models against predictive models in [8, 28] carries over to the proposed approach even when NFE = 1.

Method                     NFEs   POLQA          PESQ
generative baseline [13]   60     3.61 ± 0.65    2.36 ± 0.58
predictive baseline        1      3.39 ± 0.73    2.14 ± 0.57
proposed CRP               1      3.47 ± 0.71    2.29 ± 0.61
proposed CRP               5      3.65 ± 0.65    2.55 ± 0.59

Table 1: Speech enhancement results (mean and standard deviation) obtained for WSJ0-C3 when trained on VBD (mismatch case).

Last, we find that the proposed method with only 5 NFEs achieves virtually the same performance as the generative baseline with 60 NFEs, both in the matched case (see Fig. 1, compare the solid black line to the dotted blue line) and in the mismatched case (see Tab. 1, compare the first row to the last row).

6. CONCLUSIONS

In this work, we proposed to fine-tune a score-based diffusion model by optimizing a predictive loss that corrects errors caused by solving the reverse process. We find that the proposed method remains relatively stable in performance when lowering the number of function evaluations (NFEs). In contrast, the pure generative approach and StoRM-BBED degrade drastically in performance when lowering the NFEs to a few or a single reverse diffusion step. Moreover, we showed that, in contrast to a predictive baseline, the proposed method achieves improved generalization to unseen data even with one diffusion step.
References

[1] R. C. Hendriks, T. Gerkmann, and J. Jensen, DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement: A Survey of the State-of-the-Art. Morgan & Claypool, 2013.
[2] T. Gerkmann and E. Vincent, "Spectral masking and filtering," in Audio Source Separation and Speech Enhancement (E. Vincent, T. Virtanen, and S. Gannot, eds.), John Wiley & Sons, 2018.
[3] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE Trans. on Audio, Speech, and Language Proc. (TASLP), vol. 26, no. 10, pp. 1702–1726, 2018.
[4] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE Trans. on Audio, Speech, and Language Proc. (TASLP), vol. 27, no. 8, pp. 1256–1266, 2019.
[5] Y.-J. Lu, Y. Tsao, and S. Watanabe, "A study on speech enhancement based on diffusion probabilistic model," IEEE Asia-Pacific Signal and Inf. Proc. Assoc. Annual Summit and Conf. (APSIPA ASC), pp. 659–666, 2021.
[6] Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, "Conditional diffusion probabilistic model for speech enhancement," IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), 2022.
[7] S. Welker, J. Richter, and T. Gerkmann, "Speech enhancement with score-based generative models in the complex STFT domain," Interspeech, 2022.
[8] J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, "Speech enhancement and dereverberation with diffusion-based generative models," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351–2364, 2023.
[9] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in Neural Inf. Proc. Systems (NeurIPS), vol. 33, pp. 6840–6851, 2020.
[10] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, "Score-based generative modeling through stochastic differential equations," Int. Conf. on Learning Representations (ICLR), 2021.
[11] P. Vincent, "A connection between score matching and denoising autoencoders," Neural Computation, vol. 23, no. 7, pp. 1661–1674, 2011.
[12] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, "StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724–2737, 2023.
[13] B. Lay, S. Welker, J. Richter, and T. Gerkmann, "Reducing the prior mismatch of stochastic differential equations for diffusion-based speech enhancement," Interspeech, 2023.
[14] T. Gerkmann and R. Martin, "Empirical distributions of DFT-domain speech coefficients based on estimated speech variances," Int. Workshop on Acoustic Echo and Noise Control, 2010.
[15] I. Karatzas and S. E. Shreve, Brownian Motion and Stochastic Calculus. Springer, 2nd ed., 1996.
[16] W. Rudin, Real and Complex Analysis. McGraw-Hill, Inc., 3rd ed., 1987.
[17] B. D. Anderson, "Reverse-time diffusion equation models," Stochastic Processes and their Applications, vol. 12, no. 3, pp. 313–326, 1982.
[18] C. M. Bender and S. A. Orszag, Advanced Mathematical Methods for Scientists and Engineers. McGraw-Hill, 1978.
[19] Y. Song and S. Ermon, "Generative modeling by estimating gradients of the data distribution," in Advances in Neural Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019.
[20] A. Rix, J. Beerends, M. Hollier, and A. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), vol. 2, pp. 749–752, 2001.
[21] ITU-T Rec. P.863, "Perceptual objective listening quality prediction," Int. Telecom. Union (ITU), 2018.
[22] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – half-baked or well done?," in IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp. 626–630, 2019.
[23] Y. Song and S. Ermon, "Improved techniques for training score-based generative models," Advances in Neural Inf. Proc. Systems (NeurIPS), vol. 33, pp. 12438–12448, 2020.
[24] ITU-T, "Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs."
[25] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based speech enhancement methods for noise-robust text-to-speech," ISCA Speech Synthesis Workshop (SSW), pp. 146–152, 2016.
[26] J. S. Garofolo, D. Graff, D. Paul, and D. Pallett, "CSR-I (WSJ0) Complete."
[27] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, "The third CHiME speech separation and recognition challenge: Dataset, task and baselines," IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 504–511, 2015.
[28] J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, "Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration," IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp. 1–5, 2023.
