Perceptual Coding
ABSTRACT Total Least Squares (TLS) algorithms automatically decompose (audio) frames into a number of exponentially damped sinusoids. This can provide more efficient modeling than plain sinusoidal modeling, especially in the case of transitional frames. Straightforward implementations of TLS optimize an SNR criterion. In our implementation we apply TLS in a subband scheme in which the number of damped sinusoids is both frame and subband dependent. This is made possible through the use of perceptual information provided by the MPEG-1 psycho-acoustic model I. Experiments on different audio tracks provide proof of concept for our perceptual ESM, and illustrate the significant reduction in modeling components compared to a non-perceptual ESM.
1 INTRODUCTION
A fairly general model that is often used to represent speech and audio signals is based on an AM/FM representation, as shown in eq. (1). In this model, the signal is represented as a sum of sinusoidal components with time-varying amplitude, phase and
frequency. While this model can obviously generate any desired signal s(n), effective modeling and efficient encoding require that amplitude, frequency and phase are only slowly time-varying functions of n. In practice these parameters are therefore often considered constant for the duration of an analysis frame (depending on the signal the length of
HERMUS ET AL.
quasi-stationary segments can vary from a few ms to several hundreds of ms).
$$ s(n) \approx \sum_{k=1}^{K} a_k(n)\,\sin\bigl(2\pi f_k(n)\,n + \phi_k(n)\bigr) \qquad (1) $$
Audio signals often contain noise-like segments and transient sounds that are not efficiently modeled by eq. (1) [1]. In particular, transient sounds can cause pre-echo distortions. Pre-echo originates from the fact that, in order to reduce the number of modeling parameters, sinusoidal coders normally produce signals with a fairly high degree of pseudo-stationarity. As a result, a sharp transient attack is smeared: the modeled signal builds up gradually before the attack. As a possible way to overcome these problems, we consider modeling and encoding audio signals using a superposition of slowly time-varying exponentially weighted sinusoids and quasi-stationary noise:
$$ s(n) \approx \sum_{k=1}^{K} a_k\, e^{d_k n} \sin(2\pi f_k n + \phi_k) + \mathrm{noise}(n) \qquad (2) $$
The exponential amplitude terms in ESM provide the capability for modeling fast amplitude fluctuations. The outline of this paper is as follows. In the next section we briefly explain how TLS algorithms can be used to estimate the parameters of an exponential sinusoidal model (ESM). In section 3 the frequency domain aspects of TLS-ESM modeling in a fullband scheme are described. Section 4 discusses two strategies to distribute the modeling components based on information from a psycho-acoustic model. Experiments on some audio tracks are discussed in section 5. Finally, the conclusions can be found in section 6.
2 ESM AND TOTAL LEAST SQUARES
Total Least Squares (TLS) algorithms form a natural extension of the basic LS algorithms, deriving the parameters of an Auto-Regressive (AR) model that exactly matches a (slightly) perturbed version $\tilde{s}(n)$ of the input segment s(n), n = 1, ..., N. The TLS modeling problem of order L is to find the model parameters b(l), l = 1, ..., L and $\tilde{s}(n)$, n = 1, ..., N that minimize
$$ \sum_{n=1}^{N} \bigl(s(n) - \tilde{s}(n)\bigr)^2 \qquad (3) $$
subject to
$$ \tilde{s}(n) = \sum_{l=1}^{L} b(l)\, \tilde{s}(n-l) \qquad (4) $$
for n = (L + 1), ..., N. Since $\tilde{s}(n)$ is purely autoregressive, it can be written as a sum of exponentially damped sinusoids:
$$ \tilde{s}(n) = \sum_{k=1}^{K} a_k\, e^{d_k n} \sin(2\pi f_k n + \phi_k) \qquad (5) $$
The signal to modeling noise ratio (SNR) is defined as
$$ \mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=1}^{N} \bigl(s(n)\bigr)^2}{\sum_{n=1}^{N} \bigl(s(n) - \tilde{s}(n)\bigr)^2} \qquad (6) $$
This SNR can be used to measure the modeling accuracy. It can be noted from (3) that SNR is actually the quality measure that is maximized in TLS problems. Unfortunately the SNR measure is not always successful in minimizing the audibility of distortions.
3 FULLBAND ESM
Fig. 1 illustrates how TLS operates in the spectral domain. From an example audio frame (sampling frequency = 44.1 kHz), we plot the original spectrum (gray) and the spectra of the TLS modeled versions (black), for 64 (top) and 128 (bottom) complex exponentials in the model (see footnote 1). Observe that TLS models the original spectrum with spectrally localized contributions. There is an apparent contrast with traditional LS modeling that uses broad resonances to approximate the spectral envelope. TLS operates more locally, and keeps spending a lot of components to model a high-energy region, until this local fit is good enough, before it goes on to model another high-energy spectral region. It is clear that this modeling behavior is in accordance with the SNR optimization criterion. Up to a significant number of damped sinusoids, spectral gaps are present in the modeled audio and this is often incompatible with the criterion of optimal perceptual quality. As we already mentioned, the optimization of the SNR is clearly not a perceptual criterion.
[Figure 1: top and bottom panels, magnitude (dB) versus freq (kHz).]
Fig. 1: Illustration of TLS in the spectral domain. Original spectra (gray) and TLS modeled spectra (black) for an audio segment.
1 Throughout this paper we used an implementation of TLS analysis where the number of complex exponentials is specified, instead of the number of damped sinusoids. The number of sinusoids is about half the number of complex exponentials: $2\cos(\theta) = \exp(j\theta) + \exp(-j\theta)$, but $\cos(0) = \exp(j0)$.
4 PERCEPTUAL ESM
TLS is capable of closely fitting audio frames by optimizing the signal to modeling noise ratio (eq. 6). The original segment can always be approximated with an arbitrarily high precision, at the expense of a (very) high model order. However, in audio modeling applications we care less about the degree of time resemblance between the original and the TLS modeled time signals. What we want to optimize in the first place, given a certain model complexity, is the perceived quality of the modeled audio. Starting from the plain fullband TLS as described so far, we need the following extensions:
(1) the possibility to control the spectral distribution of the different damped sinusoids
(2) information about the optimal location, in terms of perceptual quality, of the damped sinusoids
In the next subsection we explain how the first feature can be achieved by implementing TLS in a subband scheme. Thereafter we will find that a psycho-acoustic model is appropriate to fulfill the second requirement.
4.1 Subband TLS-ESM
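Every subband problem in this scheme is an instance of the TLS-ESM fit of section 2, so a minimal sketch of one such fit on a single frame may help to make the model of eq. (5) concrete. The sketch assumes a Hankel/SVD subspace formulation in the spirit of HTLS [4]; this is an assumption about how the estimation can be carried out, not a transcription of the implementation used for the experiments. All function names are illustrative, the number of Hankel rows is an arbitrary choice, and the sample index n starts at 0 rather than 1.

```python
import numpy as np


def fit_damped_exponentials(x, K):
    """Fit K complex damped exponentials to a real frame x (subspace method)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    L = N // 2  # number of Hankel rows (illustrative choice)
    if 2 * K > L - 1:
        raise ValueError("model order too high for this frame length")
    # Hankel data matrix H[i, j] = x[i + j]
    H = np.array([x[i:i + N - L + 1] for i in range(L)])
    U, _, _ = np.linalg.svd(H, full_matrices=False)
    Uk = U[:, :K]  # rank-K signal subspace
    # Shift invariance Uk[:-1] @ Z ~ Uk[1:], solved in the total least squares sense
    _, _, Vh = np.linalg.svd(np.hstack([Uk[:-1], Uk[1:]]), full_matrices=False)
    V = Vh.conj().T
    Z = -V[:K, K:] @ np.linalg.inv(V[K:, K:])
    poles = np.linalg.eigvals(Z)  # z_k = exp(d_k + j*2*pi*f_k)
    # Complex amplitudes by ordinary least squares on x(n) ~ sum_k c_k * z_k**n
    basis = np.vander(poles, N, increasing=True).T
    amps, *_ = np.linalg.lstsq(basis, x.astype(complex), rcond=None)
    model = (basis @ amps).real
    return poles, amps, model


def pole_parameters(poles):
    """Damping d_k and normalized frequency f_k (cycles per sample) of each pole."""
    return np.log(np.abs(poles)), np.angle(poles) / (2.0 * np.pi)


def snr_db(s, s_model):
    """Signal to modeling noise ratio of eq. (6), in dB."""
    noise = np.sum((np.asarray(s) - np.asarray(s_model)) ** 2)
    noise = max(noise, np.finfo(float).tiny)  # guard against an exact zero
    return 10.0 * np.log10(np.sum(np.asarray(s) ** 2) / noise)


# Synthetic test frame: two damped sinusoids, i.e. four complex exponentials
n = np.arange(128)
frame = (1.0 * np.exp(-0.01 * n) * np.sin(2 * np.pi * 0.05 * n + 0.3)
         + 0.5 * np.exp(-0.03 * n) * np.sin(2 * np.pi * 0.12 * n + 1.0))
poles, amps, model = fit_damped_exponentials(frame, K=4)
print("SNR (dB):", snr_db(frame, model))
```

For a real-valued frame the estimated poles occur in complex conjugate pairs, so K complex exponentials correspond to roughly K/2 damped sinusoids, in line with footnote 1.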
Using subband signal representations enables us to control the distribution of the damped sinusoids over the whole frequency range. For every subband signal a separate TLS problem is solved, each with its own specific model order. Hence, the location of every damped sinusoid is known a priori with a resolution equal to the subband width. To illustrate the subband TLS scheme, we took the same audio frame as in figure 1, sent it through a fully decimated uniform QMF analysis filter bank (32 channels), and modeled each subband with the same number of complex exponentials. The modeled audio was then obtained as the output of the QMF synthesis filter bank. The impact on the frequency spectrum is illustrated in figure 2, for 2 complex exponentials per subband (top) and 4 complex exponentials per subband (bottom). The difference in modeling behavior compared to figure 1 is apparent: the subband approach leads to a more uniform coverage of the spectral range.
Fig. 2: Spectral representation of basic subband TLS-ESM. Original spectra (gray) and modeled spectra (black).
Besides the added flexibility of controlling the location of the components, subband TLS also leads to a substantial reduction in computational complexity and improves the numerical stability [6]. Indeed, if a fully decimated filter bank can be implemented, the length of the subband signals can be kept small. In our implementation, the computational complexity is $O(mn^2)$. Thus, for an M-channel fully decimated system, M TLS problems in N/M unknowns must be solved, which requires only
$$ M \cdot O\!\left(\frac{m}{M}\,\frac{n^2}{M^2}\right) = O\!\left(\frac{m n^2}{M^2}\right) \qquad (7) $$
calculations. This means a reduction by a factor $M^2$ compared to fullband TLS-ESM. Since the individual subband TLS problems cannot model components outside the frequency range of their own channel, more damped sinusoids (in total) are required to achieve the same SNR as a corresponding fullband model. While the total number of required model components increases with the number of channels in the filter bank, the overall decrease in computation time is very significant. With a given number of modeling components, the SNR will obviously be smaller than with fullband TLS, but this is not an issue here. For more information, see [6].
4.2 Psycho-acoustic model
Our second objective was to find information about how to distribute the available components over the different subbands. A fixed distribution of the components, as in the previous example, is far from optimal due to the changing spectral contents of consecutive frames: one could only rely on the (statistically modeled) average spectral contents of audio signals. This statistical average can of course be computed, but it will (largely) depend on the type of audio under consideration, which makes this approach inefficient. What we need is an automatically derived, frame-dependent distribution of the damped sinusoids that optimizes the perceived quality. Assigning components in accordance with the perceptual relevance of each subband is an obvious way to reach our goal.
The necessary information to quantify the perceptual relevance of the different subbands can be derived from a psycho-acoustic analysis. Based on masking phenomena of our hearing system, the psycho-acoustic model will calculate a frame-specific (and hence time-varying) frequency-dependent masking curve. Spectral components of modeling distortions that fall below a masking threshold are perceptually irrelevant. In our current implementation we use MPEG-1 psycho-acoustic model I [7], which updates its masking curve every 384 samples based on a 512-sample window. Figure 3 shows the frequency spectrum of an audio fragment together with its masking curve (solid black). Those parts of the spectrum that extend high above the masking curve are perceptually critical and should be modeled carefully. The spectrum of the highest subbands mostly falls below the masking curve (due to the high hearing threshold) and does not need to be assigned any components. Of course, the masking
[Figure 3: Spectrum with masking curve; magnitude (dB) versus freq (kHz).]
Experiments were carried out in order to obtain proof of concept for psycho-acoustic based ESM. As test material we took a selection of audio tracks from the EBU SQAM Compact Disc [8], together with a fragment from Joe Cocker's pop song Manhattan. All test material is in 44.1 kHz, 16 bit format. A QMF filterbank with 32 channels was used; for the HTLS modeling we opted for a frame size of 384 samples.
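As a quick check of what these settings imply, the sketch below works out the number of samples per subband frame and the complexity ratio that follows from eq. (7). It assumes that the 384-sample frame counts fullband samples, that the filterbank is fully decimated as in section 4.1, and that the $O(mn^2)$ cost is taken at face value with constant factors ignored. The function name is illustrative.

```python
def subband_bookkeeping(frame_len=384, channels=32):
    """Samples per subband frame and fullband/subband cost ratio (constants ignored)."""
    if frame_len % channels != 0:
        raise ValueError("frame length must be a multiple of the channel count")
    # Full decimation: each subband keeps 1/channels of the samples
    subband_len = frame_len // channels
    # Eq. (7): M subband problems cost M * (m/M) * (n/M)**2 = m*n**2 / M**2,
    # so the fullband cost m*n**2 is M**2 times larger
    cost_ratio = channels ** 2
    return subband_len, cost_ratio


subband_len, cost_ratio = subband_bookkeeping()
print(subband_len, cost_ratio)
```

With 384 samples and 32 channels this gives 12 samples per subband frame and a cost ratio of 32^2 = 1024.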
[Figure 4: two panels, magnitude (dB) versus freq (kHz).]
Fig. 4: Spectra of a psycho-acoustic based TLS modeled audio segment. Original spectra (gray) and modeled spectra (black).
5.1 Plain subband ESM reference
Our primary goal in introducing the psycho-acoustic model is to reduce the number of complex exponentials compared to the plain subband ESM. As a reference, we first checked the number of complex exponentials needed in a subband ESM to obtain transparent quality. Since each subband contains the same number of complex exponentials, the total number of components is a multiple of 32. Informal listening tests indicated that no audible distortions are heard from 192 components onwards (= 6 complex exponentials per subband). We found the same threshold of 192 components for every audio track. This means that every audio track contains one or more subbands that need at least 6 complex exponentials.
5.2 Threshold for the psycho-acoustic model
We already mentioned in section 4.2 that we can stop assigning components to a subband when the noise-to-mask ratio (NMR) in that subband has become smaller than a threshold. An NMR of 0 dB would be a logical choice, because in that case the modeling distortions fall below the masking threshold. However obvious this may seem, the strategy is only valid if the psycho-acoustic model is accurate in calculating the masking curve. In [9] we described the results of an A-B preference test with a panel of 10 listeners. In that experiment we found that modeling distortions are already inaudible before the NMR has become zero in all subbands, which indicates that one can safely rely on the psycho-acoustic model.
5.3 Results of psycho-acoustic ESM
For the illustration and comparison of the two methods for psycho-acoustic ESM, we will concentrate on two audio fragments, namely the popsong and the glockenspiel. For more detailed experimental results (violin, harp, trumpet, soprano, . . . ) and for HTLS audio samples, see our website [10]. Figure 5 gives the spectrogram of the popsong (left) and the
[Figure 5: spectrograms, freq (kHz) versus time (sec); surviving panel title: Popsong.]
[Figure 6: panels Popsong and Glockenspiel; horizontal axis: Frame Number.]
Fig. 6: Number of complex exponentials needed to obtain a NMR of zero dB in all subbands.
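The stopping rule of section 5.2, on which counts such as those in Fig. 6 rely, can be sketched as follows: in each subband, components are added until the NMR drops below the threshold. This is a minimal sketch under stated assumptions and does not reproduce the two assignment methods of the paper. The callback `noise_energy_fn`, the cap `max_order` and the toy error model in the usage lines are hypothetical, and the NMR is taken as the ratio of modeling-noise energy to masking-threshold energy within a subband, expressed in dB.

```python
import math


def nmr_db(noise_energy, mask_energy):
    """Noise-to-mask ratio of one subband, in dB."""
    return 10.0 * math.log10(noise_energy / mask_energy)


def orders_for_threshold(noise_energy_fn, mask_energies, threshold_db=0.0, max_order=16):
    """Smallest model order per subband whose NMR falls below threshold_db.

    noise_energy_fn(band, order) is a hypothetical callback that returns the
    modeling-error energy of subband `band` when `order` complex exponentials
    are used; order 0 means the whole subband signal counts as error.
    """
    orders = []
    for band, mask in enumerate(mask_energies):
        order = 0
        while order < max_order and nmr_db(noise_energy_fn(band, order), mask) >= threshold_db:
            order += 1
        orders.append(order)
    return orders


# Toy usage: three subbands whose error energy drops by a factor 4 per component
subband_energies = [100.0, 3.0, 0.5]
mask_energies = [1.0, 1.0, 1.0]


def toy_error(band, order):
    return subband_energies[band] * 0.25 ** order


orders = orders_for_threshold(toy_error, mask_energies)
print(orders, sum(orders))
```

In this toy case the third subband already lies below its masking threshold and receives no components at all, which mirrors the remark in section 4.2 about the highest subbands.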
of method 2 which makes method 1 a good choice in most situations.
6 CONCLUSION
The extension of traditional sinusoidal modeling to exponential sinusoidal modeling is especially valuable for the modeling of the transitional segments in audio. Basic Total Least Squares (TLS) algorithms, as they are readily found in other signal processing domains, are not efficient for extracting the model parameters in audio modeling due to their high computational load and the SNR optimization criterion. In our approach we optimize a perceptual criterion by integrating psycho-acoustic information in a subband TLS-ESM algorithm, with the added advantage of a reduced computational load. We presented two strategies for assigning damped sinusoids to the different subbands. Extensive tests on SQAM data showed the effectiveness of both strategies, and illustrated that our psycho-acoustic ESM can achieve the same audio quality as plain ESM but with significantly fewer modeling components.
7 ACKNOWLEDGMENT
Support is acknowledged from the Flemish Community - IWT and from the Research Fund K.U. Leuven.
REFERENCES
[1] J. Jensen, S.H. Jensen, and E. Hansen, "Exponential sinusoidal modeling of transitional speech segments," in Proc. International Conference on Acoustics, Speech and Signal Processing, Phoenix, U.S.A., Mar. 1999, vol. I, pp. 473-476.
[Figure 7: vertical axis: maximal NMR; surviving panel title: Popsong.]
Fig. 7: Maximal NMR as a function of the number of complex exponentials in the psycho-acoustic ESM-TLS modeling.
[Figure 8: panels Popsong and Glockenspiel; vertical axis: mean NMR.]
Fig. 8: Mean NMR as a function of the number of complex exponentials in the psycho-acoustic ESM-TLS modeling.
[2] P. Lemmerling, "Structured total least squares: analysis, algorithms and applications," Ph.D. thesis, Faculty of Applied Sciences, K.U. Leuven, Belgium, 1999.
[3] S. Van Huffel, H. Park, and J.B. Rosen, "Formulation and solution of structured total least norm problems for parameter estimation," IEEE Transactions on Signal Processing, vol. 44, no. 10, pp. 2464-2474, 1996.
[4] S. Van Huffel, H. Chen, C. Decanniere, and P. Van Hecke, "Algorithm for time-domain NMR data fitting based on total least squares," Journal of Magnetic Resonance, vol. A110, pp. 228-237, 1994.
[5] D. Calvetti, L. Reichel, and D.C. Sorensen, "An implicitly restarted Lanczos method for large symmetric eigenvalue problems," Electron. Trans. Numerical Analysis, vol. 2, pp. 1-21, Mar. 1994.
[6] K. Hermus, W. Verhelst, P. Wambacq, and P. Lemmerling, "Total least squares based subband modelling for scalable speech representations with damped sinusoids," in Proc. International Conference on Spoken Language Processing, Beijing, China, Oct. 2000, vol. III, pp. 1129-1132.
[7] ISO/IEC 11172-3, "Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 3: Audio," ISO/IEC, 1993.
[8] Sound Quality Assessment Material CD (1988), issued by the EBU, http://www.ebu.ch/ .
[9] K. Hermus, W. Verhelst, and P. Wambacq, "Psychoacoustic modeling of audio with exponentially damped sinusoids," in Proc. International Conference on Acoustics, Speech and Signal Processing, Orlando, U.S.A., May 2002.
[10] ESAT-PSI Speech Group website, http://www.esat.kuleuven.ac.be/~spch/ .