Forensic Audio
Lino García et al.
Summary. Adaptive filtering techniques find application in any area where the modeled signals or systems are constantly changing. An adaptive filter is a system whose structure is alterable or adjustable in such a way that its behavior or performance improves through contact with its environment [16]. This chapter focuses on adaptive filtering techniques for forensic audio applications. Multichannel multirate specialized structures are presented as general cases. Five approaches are studied: spectral equalization, adaptive linear prediction (ALP), adaptive noise cancellation (ANC), beamforming, and deconvolution or dereverberation. Objective and subjective measurements for the evaluation of intelligibility after speech enhancement are reviewed.
1 Introduction
Forensic audio uses science and technology to assist historical, criminal and civil investigations in courts of law. Sophisticated technologies and scientific methodologies of forensic audio cover several specialties: voice and sound identification, audibility analysis, audio enhancement, authenticity analysis and others. The use of audio recordings of telephone conversations and interviews, as well as covert surveillance recordings, is an integral part of law enforcement. Such recordings are frequently of poor quality, but they are admissible if they are intelligible and meet the rules of evidence. It can be equally important that the recording is listenable and that the information it contains is easily discerned by the jury. Written transcripts can be vital evidence, and these can only be made with confidence if the recording is intelligible [5]. Forensic filtering can be used in the laboratory to reject noise and interference, as well as to restore, clarify or enhance the audio information to assist law enforcement agencies in criminal investigation, civil investigation and the court process.
There are a number of possible degradations that can be found in a speech recording and that can affect its quality. On the one hand, the signal arriving at the microphone usually incorporates multiple sources: the desired signal plus other unwanted signals generally termed noise. On the other hand, there are different sources of distortion that can reduce the clarity of the desired signal: amplitude distortion caused by the electronics; frequency distortion caused by either the electronics or the acoustic environment; and time-domain distortion due to reflection and reverberation in the acoustic environment.
Adaptive filters have traditionally found a field of application in noise and
reverberation reduction, thanks to their ability to cope with changes in the
signals or the sound propagation conditions in the room where the recording
takes place. This chapter is an advanced tutorial on multichannel adaptive filtering techniques suitable for forensic audio, providing the relevant theoretical foundation. The use of more than one microphone is useful for audio surveillance purposes; this is possible when the room where the recording will be made can be prepared in advance. Monochannel adaptive filtering can be seen as a particular case of the more complex and general multichannel adaptive filtering. The different adaptive filtering techniques are presented on a common foundation that is useful in other forensic disciplines.
This chapter is organized as follows: in Sect. 1.1, we introduce a formal
definition of the forensic audio scenario from the multiple-input and multiple-
output (MIMO) perspective and the terminology that is used throughout the
chapter. In Sect. 1.2, signals and systems related to forensic audio are briefly
summarized. Section 1.3 discusses quality measurements. Section 2 is dedicated to the theoretical foundation of adaptive filters. In Sects. 2.1 and 2.2 the filter structures and the adaptation algorithms are discussed. The different cost functions, stochastic estimations and optimization strategies over transversal and lattice structures are shown. In Sect. 3 specialized structures based on multirate techniques are presented. These schemes yield computationally efficient algorithms suitable for the very long impulse responses involved in forensic audio applications and for real-time implementations. Two approaches are considered in Sects. 3.1 and 3.2: subband and frequency-domain partitioned adaptive filtering, respectively. In Sect. 3.3 the partitioned convolution is described, and in Sect. 3.4 a delayless approach for real-time applications is discussed. Section 4 focuses on the adaptive filtering techniques for forensic audio applications: spectral equalization, linear prediction, noise cancellation, beamforming and deconvolution.
[Fig. 1: block diagram of the forensic audio scenario: sources s1(n)…sI(n) pass through the room mixing system V to produce the microphone signals x1(n)…xP(n), to which the recording noise r(n) is added; the adaptive filter matrix W produces the outputs y1(n)…yO(n).]
The box on the left represents a room where the evidence is being recorded. V is a P × LI matrix that contains the acoustic impulse responses (AIR) between the I sources and the P microphones¹,
\mathbf{V} = \begin{bmatrix}
\mathbf{v}_{11} & \mathbf{v}_{12} & \cdots & \mathbf{v}_{1I} \\
\mathbf{v}_{21} & \mathbf{v}_{22} & \cdots & \mathbf{v}_{2I} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{v}_{P1} & \mathbf{v}_{P2} & \cdots & \mathbf{v}_{PI}
\end{bmatrix}, \qquad
\mathbf{v}_{pi} = \begin{bmatrix} v_{pi1} & v_{pi2} & \cdots & v_{piL} \end{bmatrix}.   (1)
Sources can be desired signals (to enhance) or noise and interference signals (to attenuate). x_p(n), p = 1…P, is a corrupted or poor-quality signal that is to be improved (P = 1 corresponds to the single-channel case). r(n) is an additive noise or interference signal due to the recording device. The goal of forensic filtering is to obtain a matrix W so that y_o(n) ≈ ŝ_i(n) corresponds to a restored or enhanced signal.
The signals in Fig. 1 are related by
x(n) = Vs(n) + r(n), (2)
y(n) = Wx(n). (3)
s(n) is an LI × 1 vector that collects the source signals,

\mathbf{s}(n) = \begin{bmatrix} \mathbf{s}_1^T(n) & \mathbf{s}_2^T(n) & \cdots & \mathbf{s}_I^T(n) \end{bmatrix}^T, \qquad
\mathbf{s}_i(n) = \begin{bmatrix} s_i(n) & s_i(n-1) & \cdots & s_i(n-L+1) \end{bmatrix}^T.   (4)
x(n) is a P × 1 vector that corresponds to the convolutive system output
excited by s(n) and the adaptive filter input of order O × LP . xp (n) is an
¹ Note that the discontinuous lines represent only the direct path and some first reflections between the s1(n) source and the microphone with output signal x1(n). Each v_pi vector represents the AIR between positions i = 1…I and p = 1…P and is constantly changing depending on the position of both source and microphone (e.g. a mobile recording device), the angle between them, the radiation pattern, etc.
Speech
Speech is produced by inhaling air into the lungs and exhaling it through a vibrating glottis and the vocal tract. The random, noise-like air flow from the lungs is spectrally shaped and amplified by the vibrations of the glottal cords and the resonance of the vocal tract. The effect of the glottal cords and the vocal tract is to introduce a measure of correlation and predictability into the random variations of the air from the lungs [29]. The speech signal (voice) is formed by silence, noisy or fricative segments and harmonic or occlusive segments, and is highly modulated.
Fig. 2. Example of a speech word. In the upper part, a 0.5-second segment of the speech signal corresponding to the word “sota” [’so ta] is depicted in the time domain. The signal was sampled at 8192 Hz. The superimposed solid light gray line corresponds to the normalized energy of the segment and shows the low-frequency envelope that defines the rate at which phonemes are uttered, while the dotted dark gray line corresponds to the normalized zero-crossing rate. In the middle part, an autocorrelation analysis of the speech signal is depicted. In the lower part, a time–frequency analysis of the same segment is depicted. Dark colors represent areas with high energy; light colors display areas with low energy.
At the bottom of Fig. 2 each one of these parts can be seen. The first, fricative segment, corresponding to ‘s’, is a noisy, relatively low-pass segment, although there is no particular spectral region with preponderant energy. The next segment, corresponding to ‘o’, is clearly harmonic. The energy is concentrated in narrow frequency bands which correspond to the formants, and it is directly related to the autocorrelation (a measure of predictability, shown in the middle of Fig. 2). The next segment, corresponding to silence, has a low energy contribution (probably due to the recording device, since no anechoic chamber was used for the recording). The next segment, corresponding to ‘t’, is of very short duration and weakly harmonic. The last segment, corresponding to the ‘a’ phoneme, is clearly harmonic but with a different contribution from the ‘o’ phoneme. The probability density function (pdf) of the speech signal in the time domain is close to Laplacian, and the signal can be assumed to be sparse².
Noise
The noises and interferences present in an evidence recording can be strong, subtle and/or time-varying, and they may occur in the acoustic environment where the microphones are located. Some common examples of acoustic noises and interferences include: air conditioning and fan hum, reverberation and echoes, engines and other machinery, wind and rain, radio and TV, live music, background speech in public places, other talkers, vehicular traffic and road noise, etc. Noise and unwanted sounds may lead to listener fatigue and confusion for untrained listeners. Another class of noise is related to the distortion that can be introduced by the recording devices (bandwidth distortion). The two cases are approached differently: in the first case the noise and interference signal can be considered as just another input signal s_i(n); in the second case the noise r(n) is added equally to all the channels. Figure 3 shows an example of speech affected by additive noise.
Fig. 3. Example of a contaminated speech word. The same speech segment depicted in Fig. 2 is contaminated with a broadband noise signal at a signal-to-noise ratio (SNR) of −3 dB. It is difficult to recognize the occlusive segments; however, in spite of the high level of the noise, the speech signal is intelligible and perfectly recognizable (even down to SNR = −10 dB).
² A signal is sparse if only a few of its samples are significantly different from zero. Sparseness is usually modeled by a Laplacian pdf.
Fig. 4. Example of a reverberated speech word. The same speech segment depicted in Fig. 2 is filtered to simulate a recording of the evidence in a room with the corresponding AIR. Note the harmonic distortion.
AIR
There are several classes of adaptive filtering [17] that can be useful for forensic audio, as will be shown in Sect. 4. The differences among them are based on the external connections to the filter.
[Fig. 5: the three classes of adaptive filter applications: (a) estimator, in which the filter parameters are the output; (b) predictor; (c) joint-process estimator with inputs x(n) and d(n).]
In the estimator application [see Fig. 5(a)], the internal parameters of the adaptive filter are used as the estimate.

In the predictor application [see Fig. 5(b)], the filter is used to filter an input signal, x(n), in order to minimize the size of the output signal, e(n) = x(n) − y(n), within the constraints of the filter structure. A predictor structure is a linear weighting of some finite number of past input samples, used to estimate or predict the current input sample.
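As an illustration of the predictor structure (not taken from the chapter), the weights of a linear predictor can be computed by least squares on a synthetic autoregressive signal; the order, the AR coefficients [1.0, −0.5] and the sample count below are arbitrary choices:

```python
import numpy as np

# Illustrative least-squares predictor: estimate the current sample from the
# L previous ones. The AR(2) coefficients [1.0, -0.5] are made-up values.
rng = np.random.default_rng(0)
N, L = 10000, 2                       # samples and predictor order (arbitrary)
x = np.zeros(N)
for n in range(2, N):
    x[n] = 1.0 * x[n-1] - 0.5 * x[n-2] + 0.1 * rng.standard_normal()

# Rows of X hold the past samples [x(n-1), ..., x(n-L)]; d holds x(n).
X = np.column_stack([x[L-1-l : N-1-l] for l in range(L)])
d = x[L:]
w, *_ = np.linalg.lstsq(X, d, rcond=None)
print(w)                              # close to the generating coefficients
```

With enough data the estimated weights approach the coefficients used to generate the signal, which is exactly the prediction of the current sample from past samples described above.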
In the joint-process estimator application [see Fig. 5(c)] there are two inputs, x(n) and d(n). The objective is usually to minimize the size of the output signal, e(n) = d(n) − y(n), in which case the objective of the adaptive filter itself is to generate y(n), an estimate of d(n), based on a filtered version of x(n) [17].
Transversal
The transversal filter, tapped-delay-line filter or finite impulse response (FIR) filter is the most suitable and most commonly employed structure for an adaptive filter. The utility of this structure derives from its simplicity and generality. Its transfer function can be changed easily by controlling the L coefficients w_l, l = 1…L. There is a simple linear relationship between the parameters of the filter and the transfer function, W(z) = \sum_{l=1}^{L} w_l z^{-l}. In an FIR filter, each output value y(n) is determined by a finite weighted combination of the L previous values of the input signal,

y(n) = \sum_{l=1}^{L} w_l \, x(n-l+1) = \mathbf{w}^H \mathbf{x}(n) = \langle \mathbf{w}, \mathbf{x}(n) \rangle,   (8)

with \mathbf{w} = [w_1 \; w_2 \; \cdots \; w_L]^T and \mathbf{x}(n) = [x(n) \; x(n-1) \; \cdots \; x(n-L+1)]^T. Equation (8) is called the finite convolution sum⁴.
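The finite convolution sum (8) can be sketched directly; the weights and the input signal below are toy values chosen for illustration:

```python
import numpy as np

# Finite convolution sum (8): y(n) = <w, x(n)> over a sliding window.
# Toy coefficients; real forensic filters are much longer and adaptive.
w = np.array([0.5, 0.3, 0.2])              # L = 3 filter weights
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # input signal

L = len(w)
# For each n, the window [x(n), x(n-1), ..., x(n-L+1)] is weighted and summed.
y = np.array([w @ x[n-L+1:n+1][::-1] for n in range(L-1, len(x))])
print(y)   # [2.3 3.3 4.3]
```

Each output sample mixes the current input with the L − 1 previous ones, which is the tapped-delay-line picture of the transversal filter.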
[Figure: transversal filter: the input x(n) feeds a tapped delay line of z⁻¹ elements; the taps are weighted by w₁…w_L and summed to produce y(n), which is compared with the desired signal d(n). In the multichannel version each input x_p(n) has its own filter w_p with output y_p(n).]
Lattice
The lattice filter is an alternative to the transversal filter structure for the realization of a predictor [10]. Figure 8 shows a lattice-ladder joint-process estimator consisting of L − 1 stages; the number L − 1 is referred to as the predictor order. The coefficients of the lattice structure, k_l, l = 1…L−1, are commonly called the reflection or PARCOR coefficients. In this framework, instead of applying the input signal directly to a tapped delay line, a linear prediction lattice structure is placed between the input and the taps. The L observations in the x(n) vector are replaced by the set of backward prediction errors b(n).

The prediction problem is to find the projection of x(n) on the subspace spanned by x(n − 1), …, x(n − L). The lattice filter is a consequence of finding a new set of vectors which also span this subspace, at the same time as they
Fig. 8. Multistage lattice filter.
Applying the projection theorem, and knowing that the optimal backward and forward prediction errors have the same norm, ‖f_l(n)‖² = ‖b_l(n)‖², 1 ≤ l ≤ L, the reflection coefficients can be obtained.
Once a filter structure has been selected, an adaptation algorithm must also be chosen. From a control engineering point of view, forensic filtering is a system identification problem that can be solved by choosing an optimality criterion or cost function J(w) in a block or recursive approach. Several alternatives are available, and they generally trade increased complexity for improved performance (speed of adaptation and accuracy of the transfer function after adaptation, or misalignment⁵).
Cost Functions
Cost functions are related to the statistics of the signals involved and depend on some error signal. The error signal e(n) depends on the specific structure and the adaptive filtering strategy, but it is usually some kind of similarity measure between the target signal s_i(n) and the estimated signal y_o(n) ≈ ŝ_i(n) (for I = O). The most common cost functions are listed in Table 1.
Table 1. Cost functions.

J(w)                                      Comments
E{e²(n)}                                  Mean squared error (MSE); statistical mean operator
(1/N) Σ_{n=0}^{N−1} e²(n)                 MSE estimator; the true MSE is normally unknown
e²(n)                                     Instantaneous squared error
|e(n)|                                    Absolute error; instantaneous module error
Σ_{m=0}^{n} λ^{n−m} e²(m)                 Least squares (weighted sum of squared errors)
E{‖f_l(n)‖² + ‖b_l(n)‖²}                  Mean squared prediction errors (for a lattice structure)
⁵ The misalignment for a transversal filter is defined as ‖v − w‖²/‖v‖². A conversion between lattice and transversal filters is possible and useful to set up an equivalent misalignment measure for lattice structures.
Stochastic Estimation
where N represents the memory size. The input signal matrix of the multichannel adaptive filter has the form

\mathbf{X}(n) = \begin{bmatrix} \mathbf{X}_1^T(n) & \mathbf{X}_2^T(n) & \cdots & \mathbf{X}_P^T(n) \end{bmatrix}^T.   (29)
In the most general case (with memory order N), the input signal X(n) is a matrix of size LP × N. For N = 1 (memoryless) and P = 1 (single channel), (29) reduces to (5).
There are adaptive algorithms that use memory N > 1 to modify the coefficients of the filter, not only in the direction of the input signal x(n), but within the hyperplane spanned by the input vector x(n) and its N − 1 immediate predecessors [x(n) x(n−1) ⋯ x(n−N+1)] per channel. The block adaptation algorithm updates its coefficients once every N samples. The new estimate w(n+1) is updated from the previous estimate w(n) plus an adaptation step, or gradient term, obtained from the minimization of the cost function J(w). These algorithms have infinite memory. The trade-off between convergence speed and accuracy is intimately tied to the length of the memory of the algorithm.
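A minimal sketch of such a recursive update is the least-mean-squares (LMS) rule, w(n+1) = w(n) + µ e(n) x(n), shown here identifying a made-up system; the step size, filter length and excitation are arbitrary illustrative choices, not values from the chapter:

```python
import numpy as np

# LMS sketch of the recursive update w(n+1) = w(n) + mu * e(n) * x(n)
# for a joint-process estimator; the plant v and step size mu are arbitrary.
rng = np.random.default_rng(1)
v = np.array([0.8, -0.4, 0.2])         # unknown system to identify
L, mu, N = len(v), 0.05, 5000
w = np.zeros(L)                        # adaptive filter weights
x = rng.standard_normal(N)             # white excitation

for n in range(L - 1, N):
    xn = x[n-L+1 : n+1][::-1]          # current input vector x(n)
    d = v @ xn                         # desired signal from the unknown system
    e = d - w @ xn                     # a-priori error e(n) = d(n) - y(n)
    w = w + mu * e * xn                # gradient-descent (LMS) update

print(w)                               # converges close to v
```

The weights drift toward the unknown system at a rate governed by µ, illustrating the convergence-speed/accuracy trade-off mentioned above.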
The error of the joint-process estimator using a transversal filter with memory can be written as a vector. Applying the MSE as the cost function, the unknown-system solution leads to the normal or Wiener–Hopf equation. The energy of the error vector (the sum of the squared elements of the error vector) is given by the inner product⁶

J(\mathbf{w}) = \|\mathbf{e}\|^2 = \mathbf{e}^H \mathbf{e} = (\mathbf{d} - \mathbf{X}^H \mathbf{w})^H (\mathbf{d} - \mathbf{X}^H \mathbf{w}),   (33)

\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = -2\mathbf{X}\mathbf{d} + 2\mathbf{X}\mathbf{X}^H \mathbf{w}.   (34)

The Wiener filter coefficients are obtained by setting the gradient of the squared-error function to zero, which yields

\mathbf{w} = \left(\mathbf{X}\mathbf{X}^H\right)^{-1} \mathbf{X}\mathbf{d} = \mathbf{R}^{-1}\mathbf{r}.   (35)
The method that estimates the autocorrelation matrix as in (38), with the samples organized as in (27), is known as the covariance method. The resulting matrix is positive semidefinite, but it is not Toeplitz. The exponentially windowed method uses a recursive estimation according to a certain forgetting factor λ in the range 0 < λ < 1,
Optimization strategies
The vector g(n) = ∇J(w) is the gradient evaluated at w(n), and the
matrix H(n) = ∇2 J(w) is the Hessian of the cost function evaluated at
w(n).
The first-order adaptation strategies choose a starting point w(0) and an increment ∆w(n) = µ(n)g(n); two decisions must be taken: the movement direction g(n), in which the cost function decreases fastest, and the step size µ(n) in that direction. The iteration stops when a certain error level is reached, ∆w(n) < ξ. Both parameters, µ(n) and g(n), are determined by the cost function. Second-order methods generate values close to the solution in a minimum number of steps but, unlike the first-order methods, the second-order derivatives are computationally very expensive. Adaptive filters and their performance are characterized by the selection criteria for the µ(n) and g(n) parameters.

⁷ Sampled data of the random variable are used.
\mathbf{q}_l = \begin{cases} -\mathbf{g}_l & l = 1, \\ -\mathbf{g}_l + \beta_l \mathbf{q}_{l-1} & l > 1, \end{cases}   (56)

\mathbf{g}_l = \begin{cases} \begin{bmatrix} \mathbf{u}_l^T & \mathbf{v}_l^T \end{bmatrix}^T & l = 1, \\ \alpha \mathbf{g}_{l-1} + (1-\alpha) \begin{bmatrix} \mathbf{u}_l^T & \mathbf{v}_l^T \end{bmatrix}^T & l > 1, \end{cases}   (57)

\beta_l = \frac{\|\mathbf{g}_l\|^2}{\|\mathbf{g}_{l-1}\|^2},   (58)

\mathbf{w}_{l+1} = \mathbf{w}_l + \mu \mathbf{u}_l,   (59)

\mathbf{K}_{l+1} = \mathbf{K}_l + \lambda_l \mathbf{V}_l.   (60)
The time index n has been removed for simplicity. 0 < α < 1 is a forgetting factor which weights the importance of the innovation through the low-pass filtering in (57). The gradient selection is very important: a mean value that uses the more recent coefficients is needed to estimate the gradient and to generate more than one conjugate direction vector (57).
allows the complexity to be reduced [26]. Depending on how the data and the filters are organized, these approaches may improve performance and avoid end-to-end delay. Multirate schemes adapt the filters in smaller sections at lower computational cost; this is necessary only for real-time implementations. Two approaches are considered. The subband adaptive filtering approach splits the spectrum of the signal into a number of subbands that can be adapted independently, so that the filtering can be carried out over the full band. The frequency-domain adaptive filtering approach partitions the signal in the time domain and projects it into a transformed domain (e.g. frequency) with better properties for adaptive processing. In both cases the input signals are transformed into a more desirable form before adaptive processing, and the adaptive algorithms operate in transformed domains whose basis functions orthogonalize the input signal, speeding up the convergence. The partitioned convolution is necessary for fullband delayless convolution and can be seen as an efficient frequency-domain convolution.
from the subband adaptive filters for each channel, w_pm, p = 1…P, m = 1…M/2 [25]. The subband filters are very short, of length C = (L + N − 1)/K − N/K + 1, which makes it possible to use much more complex algorithms. Although the input signal vector per channel, x_p(n), has size L × 1, it acts as a delay line which, at each iteration k, is updated with K samples. ↓K is an operator denoting downsampling by a factor K, and ↑K upsampling by a factor K. g_m is the synthesis filter of subband m, obtained by modulating a prototype filter. H is a polyphase matrix of a generalized discrete Fourier transform (GDFT) of an
Fig. 10. Subband adaptive filtering. This configuration is known as open-loop because the error is in the time domain. An alternative closed-loop configuration can be used, where the error is in the subband domain. Gray boxes correspond to efficient polyphase implementations. See details in [25].
y(n) = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{m=0}^{K-1} x_p(n - qK - m)\, w_{p(qK+m)},   (62)
⁹ The prototype filter is a low-pass filter; the band-pass filters are obtained by modulating the prototype filter.

where the total filter length L for each channel is a multiple of the length of each segment, L = QK, K ≤ L. Thus, using the appropriate data-sectioning procedure, the Q linear convolutions (per channel) of the filter can be carried out independently in the frequency domain, with a total delay of K samples instead of the QK samples needed by standard FDAF implementations. Figure 11 shows the block diagram of the algorithm using the overlap-save method. In the frequency domain, with matrix notation, (62) can be expressed as
Y = X ⊗ W, (63)
e = d − y, (64)
with \mathbf{d} = [d(mK) \; d(mK+1) \; \cdots \; d((m+1)K-1)]^T. The error in the frequency domain (for the update of the filter coefficients) can be obtained as

\mathbf{E} = \mathbf{F} \begin{bmatrix} \mathbf{0}_{K\times 1} \\ \mathbf{e} \end{bmatrix}.   (65)
G = −X∗ ⊗ E. (67)
This is the unconstrained version of the algorithm, which saves two FFTs from the computational burden at the cost of decreased convergence speed. The constrained version basically performs a gradient projection: the gradient matrix is transformed into the time domain and transformed back into the frequency domain keeping only the first K elements of G,

\mathbf{G} = \mathbf{F} \begin{bmatrix} \mathbf{G} \\ \mathbf{0}_{K\times Q\times P} \end{bmatrix}.   (68)
Fig. 12. Partitioned convolution. Each output signal block is produced by taking only the last L − K samples of the block.
The filtering operation can be made delayless by computing the first block in the time domain (direct convolution) while the rest of the blocks operate in the frequency domain [22]. The fast convolution starts after the samples have been processed by the direct convolution. The direct convolution allows samples to be delivered to the output while data is incoming. This approach is applicable to the multirate frameworks described.
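A uniformly partitioned overlap-save convolution along these lines can be sketched as follows; `partitioned_convolve` is a hypothetical helper name, the partition size and signals are toy values, and a real-time implementation would stream blocks as they arrive rather than process a whole signal at once:

```python
import numpy as np

# Sketch of uniformly partitioned convolution (overlap-save style): a long
# filter h is split into Q blocks of K taps; each block is applied by FFT
# multiplication against a delayed input-block spectrum.
def partitioned_convolve(x, h, K):
    Q = len(h) // K                              # number of partitions (len(h) = QK)
    H = [np.fft.rfft(h[q*K:(q+1)*K], 2*K) for q in range(Q)]
    y = np.zeros(len(x) + len(h) - 1)
    buf = np.zeros(2*K)                          # sliding overlap-save input buffer
    nblocks = int(np.ceil(len(y) / K))
    xpad = np.concatenate([x, np.zeros(nblocks*K - len(x) + len(h))])
    X = []                                       # spectra of past input blocks
    for b in range(nblocks):
        buf = np.concatenate([buf[K:], xpad[b*K:(b+1)*K]])  # shift K new samples in
        X.insert(0, np.fft.rfft(buf))            # newest block spectrum first
        X = X[:Q]                                # keep only the Q needed blocks
        acc = sum(Hq * Xq for Hq, Xq in zip(H, X))
        yblk = np.fft.irfft(acc)[K:]             # keep last K samples (overlap-save)
        y[b*K:(b+1)*K] = yblk[:len(y)-b*K] if (b+1)*K > len(y) else yblk
    return y

x = np.random.default_rng(3).standard_normal(64)
h = np.random.default_rng(4).standard_normal(16)
print(np.allclose(partitioned_convolve(x, h, K=4), np.convolve(x, h)))  # True
```

The output of each partition is implicitly delayed by qK samples through the stored block spectra, so the total I/O delay is only K samples per block, as described above.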
Once the theoretical foundations of adaptive filtering have been reviewed, the most important techniques that can be applied to forensic filtering are introduced.
Adaptive spectral equalization is widely used for noise suppression and corresponds to the single-input single-output (SISO) estimator application (class (a), Fig. 5); a single microphone, P = 1, is employed. This approach estimates a noiseprint spectrum and subtracts it from the whole signal in the frequency domain.
[Figure: spectral equalization scheme: the noisy signal is transformed by T, multiplied by the spectral gains w produced by a Wiener filter estimator driven by the noise estimate d(n), and transformed back by T⁻¹.]
The Wiener filter estimator is the result of estimating s(n) with y(n) so as to minimize the MSE ‖y(n) − s(n)‖², given y = Qx and x = s + r, which results in

\mathbf{q} \cong \frac{|\mathbf{x}|^2 - |\mathbf{d}|^2}{|\mathbf{x}|^2},   (69)
where Q = diag[q₁ q₂ ⋯ q_M] is a diagonal matrix which contains the spectral gains in the frequency domain; normally T is a short-time Fourier transform (STFT), suitable for non-stationary signals, and T⁻¹ is its inverse. In this case the algorithm is known as short-time spectral attenuation (STSA). The M × 1 vector q contains the main diagonal of Q. d is the noise spectrum (normally unknown), so an estimate of the noiseprint spectrum d = r̂ from the mixture x (the noisy signal) is necessary, taken in intervals where speech is absent and only noise is present. The STFT is defined as x_k(m) = \sum_{n=1}^{N} h(n)\, x(k-n)\, W_M^{-mn}, m = 0 ⋯ M−1, where k is the time index about which the short-time spectrum is computed, m is the discrete frequency index, h(n) is an analysis window, N dictates the duration over which the transform is computed, and M is the number of frequency bins at which the STFT is computed. For stationary signals the squared magnitude of the STFT provides a sample estimate of the power spectrum of the underlying random process. The form (69) is basic to nearly all the noise reduction methods investigated over the last forty years [12]. The specific form used to obtain Q is known as the suppression rule.
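A minimal STSA sketch using the suppression rule (69) might look as follows, assuming a Hann-windowed STFT with 50% overlap; the frame size, the synthetic tone-plus-noise signals, and taking the noiseprint directly from the true noise are simplifications for illustration (a real system estimates the noiseprint from speech-free intervals):

```python
import numpy as np

# Magnitude-domain spectral subtraction sketch (STSA, suppression rule (69)).
def spectral_subtract(x, noise, M=256, hop=128, beta=1.0):
    win = np.hanning(M)
    # Average noiseprint power spectrum |d|^2 over frames of the noise signal.
    d2 = np.mean([np.abs(np.fft.rfft(noise[i:i+M] * win))**2
                  for i in range(0, len(noise) - M, hop)], axis=0)
    y = np.zeros(len(x))
    for i in range(0, len(x) - M, hop):
        X = np.fft.rfft(x[i:i+M] * win)
        # Spectral gain q = (|x|^2 - beta*|d|^2)/|x|^2, clipped at zero.
        q = np.maximum(1.0 - beta * d2 / np.maximum(np.abs(X)**2, 1e-12), 0.0)
        y[i:i+M] += np.fft.irfft(q * X)          # overlap-add; noisy phase kept
    return y

rng = np.random.default_rng(5)
s = np.sin(2*np.pi*0.05*np.arange(4000))          # "speech" stand-in: a tone
r = 0.5 * rng.standard_normal(4000)               # broadband noise
y = spectral_subtract(s + r, noise=r)
# The enhanced signal is much closer to s than the noisy mixture was.
```

The gains q attenuate bins whose power is close to the noiseprint while leaving high-SNR bins nearly untouched, which is the essence of the suppression rule.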
Power Subtraction
The phase of the noisy signal, ∠x, can be used in place of ∠s if the SNR is reasonably high. α is an exponent and β is a parameter introduced to control the amount of noise to be subtracted (β = 1 for full subtraction and β > 1 for over-subtraction). A paramount issue in spectral subtraction is to obtain a good noise estimate; its accuracy greatly affects the noise reduction performance [3].
Adaptive noise cancellation (ANC) cancels the primary unwanted noise r(n) by introducing a canceling antinoise of equal amplitude but opposite phase, using a reference signal. This reference signal is derived from one or more sensors located at points near the noise and interference sources, where the signal of interest is weak or undetectable. A typical ANC configuration is depicted in Fig. 15. Two microphones are used, P = 2. The primary input d(n) = s(n) + r(n) collects the sum of the unwanted noise r(n) and the speech signal s(n), and the auxiliary or reference input measures the noise signal x(n) = r(n). ANC corresponds to the multiple-input single-output (MISO) joint-process estimator application (class (c), Fig. 5) with at least two microphones, P = 2.
[Fig. 15: typical ANC configuration: the primary input d(n) = s(n) + r(n) and the reference input x(n) feed the adaptive filter w, whose output y(n) is subtracted from d(n) to give the error e(n).]
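An LMS-based sketch of this two-microphone configuration is shown below; the noise path [0.9, 0.4], the tone standing in for speech, and all parameter values are made-up choices for illustration:

```python
import numpy as np

# ANC sketch: primary d(n) = s(n) + r(n), reference x(n) correlated with r(n);
# LMS drives y(n) -> r(n), so the error e(n) = d(n) - y(n) -> s(n).
rng = np.random.default_rng(6)
N, L, mu = 20000, 4, 0.01
s = np.sin(2*np.pi*0.01*np.arange(N))        # desired speech stand-in
x = rng.standard_normal(N)                   # reference noise pickup
r = np.convolve(x, [0.9, 0.4])[:N]           # noise as it reaches the primary mic
d = s + r                                    # primary microphone signal

w = np.zeros(L)
e = np.zeros(N)
for n in range(L - 1, N):
    xn = x[n-L+1:n+1][::-1]
    y = w @ xn                               # antinoise estimate
    e[n] = d[n] - y                          # enhanced output ~ s(n)
    w += mu * e[n] * xn                      # LMS update

# After convergence the error signal is dominated by the speech component.
```

Because the speech s(n) is uncorrelated with the reference x(n), the filter converges to the noise path and the residual e(n) approximates the clean speech.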
4.4 Beamforming
[Figure: beamformer structure: microphone signals x₁(n)…x_P(n), arriving with angle θ_i and sensor spacing δ (delay δ cos θ_i), pass through a fixed beamformer (FB, filters h₁…h_P and g₁…g_P), an adaptive blocking matrix (ABM, filters b₁…b_P) and an adaptive interference canceller (AIC) to form the output y(n) and the error e(n) against d(n).]
4.5 Deconvolution
Both blind signal separation (BSS), also known as blind source separation, and multichannel blind deconvolution (MBD) are inverse problems with similarities and subtle differences between them. In MBD only one source is considered, and thus the system is single-input single-output (SISO), while in BSS there are always multiple independent sources and the mixing system is MIMO. The interest of MBD is to deconvolve the source from the AIRs, while the task of BSS is twofold: on the one hand the sources must be separated; on the other hand the sources must be deconvolved from the multiple AIRs, since each sensor collects a combination of every original source convolved with different filters (AIRs) according to (2) [13].
[Figure: multichannel blind deconvolution (top) and blind source factor separation (BSFS, bottom): a source s_i(n), with delayed reference z^{-D} d(n) = s_i(n), passes through the AIRs v₁…v_P; the microphone signals x₁(n)…x_P(n), with additive noise r(n), are filtered by w₁…w_P to produce y(n).]
to achieve the minimum-norm solution. In the first, the right generalized inverse of V is estimated and then applied to the set of microphone signals x(n). Another class of algorithms employs the sparseness of the speech signal to design better inversion strategies and identify the minimum-norm solution. Many techniques of convolutive BSS have been developed by extending methods originally designed for blind deconvolution of a single channel. A usual practice is to use the blind source factor separation (BSFS) technique, where one source (factor) is separated from the mixtures, and to combine it with a deflationary approach, where the sources are extracted one by one after deflating, i.e. removing, them from the mixed signals. The MIMO FIR filter W used for BSS becomes the multiple-input single-output (MISO) filter depicted in Fig. 18. The output y(n) corresponds to (11), and the tap-stacked column vector containing all the demixing filter weights defined in (7) is obtained as
\mathbf{u} = \mathbf{R}\mathbf{p}, \qquad \mathbf{w} = \frac{\mathbf{u}}{\sqrt{\mathbf{u}^H \mathbf{R} \mathbf{u}}},   (71)
where R is a block matrix whose blocks are the correlation matrices R_pq between the p-th and q-th channels defined in (36), and p is a block vector whose blocks are the cross-cumulant vectors p = cum{x(n), y(n), …, y(n)} [13]. The second step in (71) is just the normalization of the output signal y(n); this becomes apparent when left-multiplying by x(n).
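The normalization step of (71) can be verified numerically; R and p below are random stand-ins rather than cumulants estimated from real signals:

```python
import numpy as np

# Update (71): u = R p, then w = u / sqrt(u^H R u), so that the output
# variance w^H R w equals one.
rng = np.random.default_rng(7)
A = rng.standard_normal((6, 6))
R = A @ A.T + 1e-3 * np.eye(6)        # a valid (positive definite) correlation matrix
p = rng.standard_normal(6)            # stand-in for the cross-cumulant vector

u = R @ p
w = u / np.sqrt(u @ R @ u)            # normalization step of (71)
print(np.isclose(w @ R @ w, 1.0))     # True: unit output variance
```

Since w^H R w = u^H R u / (u^H R u) = 1, the extracted output y(n) is constrained to unit variance regardless of the scaling of p.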
The deflationary BSS algorithm for i = 1…I sources can be summarized as follows. One source is extracted with the BSFS iterative scheme (71) until convergence, and the microphone signals are filtered with the filters estimated by the BSFS method (11). The contribution of the extracted source to the mixtures x_p, p = 1…P, is then estimated (with the LS criterion); the contribution of the o-th source to the i-th mixture is computed using the estimated filter b, c(n) = ⟨b, y(n)⟩ with y(n) = [y(n) y(n−1) ⋯ y(n−B+1)], B ≪ L. Finally, the contribution c(n) is deflated from the p-th mixture, x_p(n) = x_p(n) − c(n), p = 1…P. This method is very suitable for forensic audio applications where only one source should be extracted, i.e. speech.
It is possible to consider the deflationary BSFS (DBSFS) structure as a GSC. The ABM exactly corresponds to the deflating filters of the deflationary approach. By comparing the different parts, i.e. the BSFS block and the fixed beamformer, it can be concluded that it may be possible to construct algorithms similar to those of the GSC.
5 Conclusions
This chapter is an advanced tutorial on multichannel adaptive filtering for forensic audio. Different techniques have been examined on a common foundation.
References

1. Armelloni E, Giottoli C, Farina A (2003) Implementation of real-time partitioned convolution on a DSP board. In: 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp 71–74
2. Barnett PW, Knight RD (1995) The common intelligibility scale. Proc IOA 17(7):199–204
3. Benesty J, Huang Y (2003) Adaptive Signal Processing (Applications to Real-World Problems). Springer, Berlin Heidelberg New York
4. Beracoechea J (2007) Codificación de Audio Multicanal para entornos de tipo Ventana Acústica. PhD Thesis, Universidad Politécnica de Madrid
5. Betts D, French A, Hicks C, Reid G (2005) The role of adaptive filtering in audio surveillance. In: AES 26th International Conference
6. Boray GK, Srinath MD (1992) Conjugate gradient techniques for adaptive filtering. IEEE Transactions on Circuits and Systems 39(1):1–10
7. Chau EY-H (2001) Adaptive Noise Reduction Using a Cascaded Hybrid Neural Network. MS Thesis, University of Guelph, Ontario
8. Crochiere RE, Rabiner LR (1983) Multirate Digital Signal Processing. Prentice-Hall
9. French NR, Steinberg JC (1947) Factors governing the intelligibility of speech sounds. The Journal of the Acoustical Society of America 19:90–119
10. Friedlander B (1982) Lattice filters for adaptive processing. Proceedings of the IEEE 70(8):829–867