
Audio Engineering Society

Convention Paper 5571


Presented at the 112th Convention, 2002 May 10–13, Munich, Germany
This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Perceptual Audio Modeling based on Total Least Squares Algorithms


Kris Hermus¹, Werner Verhelst² and Patrick Wambacq¹

¹ Katholieke Universiteit Leuven, dept. ESAT - div. PSI, Leuven, BELGIUM
² Vrije Universiteit Brussel, dept. ETRO - div. DSSP, Brussels, BELGIUM

Correspondence should be addressed to Kris Hermus (hermus@esat.kuleuven.ac.be)

ABSTRACT

Total Least Squares (TLS) algorithms automatically decompose (audio) frames into a number of exponentially damped sinusoids. This can provide for more efficient modeling than plain sinusoidal modeling, especially in the case of transitional frames. Straightforward implementations of TLS optimize an SNR criterion. In our implementation we apply TLS in a subband scheme in which the number of damped sinusoids is both frame and subband dependent. This is made possible through the use of perceptual information provided by the MPEG-1 psycho-acoustic model I. Experiments on different audio tracks provide proof of concept for our perceptual ESM, and illustrate the significant reduction in modeling components compared to a non-perceptual ESM.

1 INTRODUCTION

A fairly general model that is often used to represent speech and audio signals is based on an AM/FM representation, as shown in eq. (1). In this model, the signal is represented as a sum of sinusoidal components with time-varying amplitude, phase and frequency. While this model can obviously generate any desired signal $s(n)$, effective modeling and efficient encoding require that amplitude, frequency and phase are only slowly time-varying functions of $n$. In practice these parameters are therefore often considered constant for the duration of an analysis frame (depending on the signal, the length of quasi-stationary segments can vary from a few ms to several hundreds of ms).

$$ s(n) \approx \sum_{k=1}^{K} a_k(n) \sin\bigl(2\pi f_k(n)\, n + \phi_k(n)\bigr) \tag{1} $$

Audio signals often contain noise-like segments and transient sounds that are not efficiently modeled by eq. (1) [1]. In particular, transient sounds can cause pre-echo distortions. Pre-echo originates from the fact that, in order to reduce the number of modeling parameters, sinusoidal coders normally produce signals with a fairly high degree of pseudo-stationarity. Thus, instead of a sharp onset, the modeled signal builds up gradually before the attack. As a possible way to overcome these problems, we consider modeling and encoding audio signals using a superposition of slowly time-varying exponentially weighted sinusoids and quasi-stationary noise:

$$ s(n) \approx \sum_{k=1}^{K} a_k(n)\, e^{d_k(n)\, n} \sin\bigl(2\pi f_k(n)\, n + \phi_k(n)\bigr) + \nu(n) \tag{2} $$

The exponential amplitude terms in ESM provide the capability for modeling fast amplitude fluctuations.

The outline of this paper is as follows. In the next section we briefly explain how TLS algorithms can be used to estimate the parameters of an exponential sinusoidal model (ESM). In section 3 the frequency domain aspects of TLS-ESM modeling in a fullband scheme are described. Section 4 discusses two strategies to distribute the modeling components based on information from a psycho-acoustic model. Experiments on some audio tracks are discussed in section 5. Finally, the conclusions can be found in section 6.

2 ESM AND TOTAL LEAST SQUARES

Total Least Squares (TLS) algorithms form a natural extension of the basic LS algorithms, deriving the parameters of an Auto-Regressive (AR) model that exactly matches a (slightly) perturbed version $\hat{s}(n)$ of the input segment $s(n)$, $n = 1, \ldots, N$. The TLS modeling problem of order $L$ is to find the model parameters $b(l)$, $l = 1, \ldots, L$ and $\Delta s(n)$, $n = 1, \ldots, N$ that minimize

$$ \sum_{n=1}^{N} \bigl(\Delta s(n)\bigr)^2 \tag{3} $$

subject to

$$ \hat{s}(n) = s(n) + \Delta s(n) = \sum_{l=1}^{L} b(l)\,\bigl(s(n-l) + \Delta s(n-l)\bigr) \tag{4} $$

for $n = (L+1), \ldots, N$. Since $\hat{s}(n)$ is purely autoregressive, it can be written as a sum of exponentially damped sinusoids:

$$ \hat{s}(n) = \sum_{k=1}^{K} a_k\, e^{d_k n} \sin(2\pi f_k n + \phi_k) \tag{5} $$

where the damping factors $d_k$ can be positive, negative, or zero.

Thus, TLS algorithms can be readily applied to audio frames in order to estimate the parameter values of the ESM model (2). The proper arrangement of the equations (4) into structured matrices (e.g. Hankel, Toeplitz) leads to a Structured TLS (STLS) problem [2, 3]. Several approaches exist for solving the STLS problem. Since the latter is a nonlinear optimization problem, all optimal approaches are implemented using iterative algorithms. As a consequence, it is hard to predict the CPU time necessary to process a particular speech frame. Therefore, the suboptimal non-iterative Hankel TLS (HTLS) [4] approach was used for the experiments that will be described in this paper. HTLS is a subspace based approach in which the $N$ data points are stored in an $m \times n$ data matrix with $m + n - 1 = N$. A straightforward implementation of HTLS requires $O(mn^2)$ flops. Several methods exist to reduce the complexity, see e.g. [5]. Using the notation from (4) we define the signal to modeling noise ratio (SNR) as

$$ \mathrm{SNR} = 10 \log_{10} \frac{\sum_{n=1}^{N} s^2(n)}{\sum_{n=1}^{N} \bigl(s(n) - \hat{s}(n)\bigr)^2} \tag{6} $$

This SNR can be used to measure the modeling accuracy. It can be noted from (3) that the SNR is actually the quality measure that is maximized in TLS problems. Unfortunately, maximizing the SNR does not always minimize the audibility of distortions.

3 FULLBAND ESM

Fig. 1 illustrates how TLS operates in the spectral domain. From an example audio frame (sampling frequency = 44.1 kHz), we plot the original spectrum (gray) and the spectra of the TLS modeled versions (black), for 64 (top) and 128 (bottom) complex exponentials in the model¹. Observe that TLS models the original spectrum with spectrally localized contributions. There is an apparent contrast with traditional LS modeling, which uses broad resonances to approximate the spectral envelope. TLS operates more locally: it keeps spending components on a high-energy region until the local fit is good enough, and only then moves on to another high-energy spectral region. It is clear that this modeling behavior is in accordance with the SNR optimization criterion. Until the number of damped sinusoids becomes fairly large, spectral gaps remain in the modeled audio, which is often incompatible with optimal perceptual quality. As we already mentioned, the optimization of the SNR is clearly not a perceptual criterion.

¹ Throughout this paper we used an implementation of TLS analysis where the number of complex exponentials is specified, instead of the number of damped sinusoids. The number of sinusoids is about half the number of complex exponentials: $2\cos(\omega) = \exp(j\omega) + \exp(-j\omega)$, but $\cos(0) = \exp(j0)$.

Fig. 1: Illustration of TLS in the spectral domain. Original spectra (gray) and TLS modeled spectra (black) for an audio segment. [Axes: freq (kHz) vs. magnitude (dB).]

4 PERCEPTUAL ESM

TLS is capable of closely fitting audio frames by optimizing the signal to modeling noise ratio (eq. 6). The original segment can always be approximated with an arbitrarily high precision, at the expense of a (very) high model order. However, in audio modeling applications we care less about the degree of time resemblance between the original and the TLS modeled time signals. What we want to optimize in the first place, given a certain model complexity, is the perceived quality of the modeled audio. Starting from the plain fullband TLS as described so far, we need the following extensions:
(1) the possibility to control the spectral distribution of the different damped sinusoids
(2) information about the optimal location, in terms of perceptual quality, of the damped sinusoids
In the next subsection we explain how the first feature can be achieved by implementing TLS in a subband scheme. Thereafter we will find that a psycho-acoustic model is appropriate to fulfill the second requirement.

4.1 Subband TLS-ESM

Using subband signal representations enables us to control the distribution of the damped sinusoids over the whole frequency range. For every subband signal a separate TLS problem is solved, each with its own specific model order. Hence, the location of every damped sinusoid is a priori known with a resolution equal to the subband width. To illustrate the subband TLS scheme, we took the same audio frame as in figure 1, sent it through a fully decimated uniform QMF analysis filter bank (32 channels), and modeled each subband with the same number of complex exponentials. The modeled audio was then obtained as the output of the QMF synthesis filter bank. The impact on the frequency spectrum is illustrated in figure 2, for 2 complex exponentials per subband (top) and 4 complex exponentials per subband (bottom). The difference in modeling behavior compared to figure 1 is apparent: the subband approach leads to a more uniform coverage of the spectral range.

Fig. 2: Spectral representation of basic subband TLS-ESM. Original spectra (gray) and modeled spectra (black). [Axes: freq (kHz) vs. magnitude (dB).]

Besides the added flexibility of controlling the location of the components, subband TLS also leads to a substantial reduction in computational complexity and improves the numerical stability [6]. Indeed, if a fully decimated filter bank can be implemented, the length of the subband signals can be kept small. In our implementation, the computational complexity is $O(mn^2)$. Thus, for an $M$-channel fully decimated system, $M$ TLS problems in $N/M$ unknowns must be solved, which requires only

$$ M \cdot O\!\left(\frac{m}{M} \cdot \frac{n^2}{M^2}\right) \approx O\!\left(\frac{mn^2}{M^2}\right) \tag{7} $$

calculations. This means a reduction by a factor $M^2$ compared to fullband TLS-ESM. Since each subband TLS problem can only model components within the frequency range of its own channel, more damped sinusoids (in total) are required to achieve the same SNR as a corresponding fullband model. While the total number of required model components increases with the number of channels in the filter bank, the overall decrease in computation time is very significant. With a given number of modeling components, the SNR will obviously be smaller than with fullband TLS, but this is not an issue here. For more information, see [6].

4.2 Psycho-acoustic model

Our second objective was to find information about how to distribute the available components over the different subbands. A fixed distribution of the components, as in the previous example, is far from optimal due to the changing spectral contents of consecutive frames: one could only rely on the (statistically modeled) average spectral contents of audio signals. This statistical average can of course be computed, but it will (largely) depend on the type of audio under consideration, which makes this approach inefficient. What we need is an automatically derived, frame-dependent distribution of the damped sinusoids that optimizes the perceived quality. Assigning components in accordance with the perceptual relevance of each subband is an obvious way to reach our goal.

The necessary information to quantify the perceptual relevance of the different subbands can be derived from a psycho-acoustic analysis. Based on masking phenomena of our hearing system, the psycho-acoustic model calculates a frame-specific (and hence time-varying), frequency-dependent masking curve. Spectral components of modeling distortions that fall below a masking threshold are perceptually irrelevant. In our current implementation we use MPEG-1 psycho-acoustic model I [7], which updates its masking curve every 384 samples based on a 512-sample window. Figure 3 shows the frequency spectrum of an audio fragment together with its masking curve (solid black). Those parts of the spectrum that extend high above the masking curve are perceptually critical and should be modeled carefully. The spectrum of the highest subbands mostly falls below the masking curve (due to the high hearing threshold), so these subbands do not need to be assigned any components.

Fig. 3: Frequency spectrum (gray) and corresponding masking curve (black). [Panel title: Spectrum with Masking Curve; axes: freq (kHz) vs. magnitude (dB).]

Of course, the masking curve by itself is not sufficient to decide a priori how many damped sinusoids are needed for each subband in order to obtain a desired perceptual quality per subband. It is clear that the number of required components will largely depend on the exact shape of the subband signal. Note that in every single subband we still optimize an SNR criterion. The higher the spectral dynamic range in a subband, the more components will be needed to cover the whole spectrum in that subband (if necessary).

To obtain perceptually motivated distributions of the damped sinusoids, we developed the following two iterative strategies. For both methods, we first calculate the masking curve of the original audio segment.
(1) Assign an extra component to the subband with the largest ratio of modeling error noise to masking level (NMR - Noise to Mask Ratio). During initialization every subband has zero components assigned to it and the noise is equal to the original subband signal. In each iteration step, the NMR in the subbands is updated before a new component can be assigned. The iteration stops when the total number of assigned components reaches a predefined number OR when the NMR is below a given threshold (e.g. 0 dB) in all subbands.
(2) This method follows a similar strategy as method 1, but in this case an extra component is assigned to the subband that can make the most effective use of it. The term effective should be interpreted as maximally lowering the NMR (contributions below threshold are ignored since these do not improve the overall perceptual quality). Since we cannot predict this best subband, we need an exhaustive search over all subbands with an NMR above threshold.

Although very similar at first sight, both methods have some clear differences. In method 1, the main objective is to bring down the maximal audible distortion (in the subband with the highest NMR) in every iteration step, regardless of the effort (i.e. extra components) this may take. In some cases we may continue to assign components to a subband that is difficult to model, while this may not improve the overall perceptual quality. For this reason one could constrain the maximum number of components per subband. Method 2 avoids getting stuck in one difficult subband and guarantees that the overall perceptual quality is maximally improved in every step. Spectral properties that are perceptually relevant and easy to model will be tackled first; properties that are relevant but hard to model are modeled last. In very rare cases, some subbands may only be modeled when all other subbands have an NMR below the threshold. This may lead to a (much) higher number of modeling components before the NMR is below the threshold in every subband. A more important drawback of this second method is the high computational load due to the exhaustive search in every iteration step. The experiments will indicate whether these extra calculations are worth it or not.

Note that in both methods we work with the masking curve of the original audio segment throughout the whole iteration procedure. The NMR update in each iteration is only based on the new modeling error signal. Working with a constant masking curve is justified when we keep on assigning components until we reach transparent perceptual quality. If not, e.g. in the case where HTLS modeling is combined with some kind of residual coding, we should ideally work with the masking curve of the HTLS signal that we will end with.

Figure 4 illustrates how the psycho-acoustic model leads the TLS algorithm to those spectral regions that are perceptually most relevant (= the parts of the spectrum in fig. 3 that extend most above the masking curve). We again used a total of 64 (top) and 128 (bottom) complex exponentials to model the audio frame. Method 1 was used to model the audio segment.

5 EXPERIMENTS

Experiments were carried out in order to obtain proof of concept for psycho-acoustic based ESM. As test material we took a selection of audio tracks from the EBU SQAM Compact Disc [8], together with a fragment from Joe Cocker's pop song "Manhattan". All test material is in a 44.1 kHz - 16 bit format. A QMF filter bank with 32 channels was used; for the HTLS modeling we opted for a frame size of 384 samples.
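To make the processing chain used in these experiments more concrete, the following is a minimal sketch that combines one HTLS fit per subband frame (section 2) with the allocation loop of method 1 (section 4.2). It is an illustration under stated assumptions, not the implementation used for the reported results: the function names `htls_fit` and `allocate_method1`, the choice of a near-square Hankel matrix, the `cap` argument and the synthetic test signals are all hypothetical. The QMF filter bank and the MPEG-1 psycho-acoustic model are left out, so the subband frames and the masking levels (`mask_power`, as linear power per subband) are simply passed in as given numbers.

```python
import numpy as np

def htls_fit(x, K):
    """Model a real frame x with K complex exponentials (textbook HTLS steps)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    m = N // 2 + 1                      # assumed near-square split, m + n - 1 = N
    n = N - m + 1
    H = x[np.arange(m)[:, None] + np.arange(n)[None, :]]    # Hankel data matrix
    U = np.linalg.svd(H, full_matrices=False)[0][:, :K]     # rank-K signal subspace
    # Shift invariance U[1:] ~ U[:-1] @ Z, solved in the total least squares sense
    V = np.linalg.svd(np.hstack([U[:-1], U[1:]]))[2].conj().T
    Z = -V[:K, K:] @ np.linalg.inv(V[K:, K:])
    poles = np.linalg.eigvals(Z)        # z_k = exp(d_k + j*2*pi*f_k)
    basis = poles[None, :] ** np.arange(N)[:, None]         # Vandermonde matrix
    amps = np.linalg.lstsq(basis, x.astype(complex), rcond=None)[0]
    return (basis @ amps).real          # modeled frame

def allocate_method1(subbands, mask_power, budget, thr_db=0.0, cap=6):
    """Method 1: give one more exponential to the subband with the largest NMR."""
    mask_power = np.asarray(mask_power, dtype=float)
    counts = np.zeros(len(subbands), dtype=int)
    # Initialization: zero components, so the noise equals the subband signal
    noise = np.array([np.mean(np.square(s)) for s in subbands])
    nmr = 10 * np.log10((noise + 1e-300) / mask_power)
    while counts.sum() < budget:
        candidates = (nmr > thr_db) & (counts < cap)
        if not candidates.any():
            break                       # all subbands below threshold (or capped)
        b = int(np.argmax(np.where(candidates, nmr, -np.inf)))
        counts[b] += 1
        err = subbands[b] - htls_fit(subbands[b], counts[b])
        nmr[b] = 10 * np.log10((np.mean(err ** 2) + 1e-300) / mask_power[b])
    return counts, nmr

# Synthetic stand-ins for two subband frames (not data from the paper)
t = np.arange(64)
band0 = np.exp(-0.01 * t) * np.cos(2 * np.pi * 0.05 * t)
band1 = (np.exp(-0.02 * t) * np.cos(2 * np.pi * 0.11 * t + 0.3)
         + 0.5 * np.exp(-0.005 * t) * np.cos(2 * np.pi * 0.27 * t))
counts, nmr = allocate_method1([band0, band1], [1e-6, 1e-6], budget=20)
print(counts, nmr)
```

For real-valued frames the estimated poles come in conjugate pairs, so one damped sinusoid costs two complex exponentials, in line with footnote 1. Method 2 would differ only in the selection step: instead of taking the subband with the largest NMR, it would tentatively refit every subband above threshold and keep the one whose NMR drops most, which is where its extra computational load comes from.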

Fig. 4: Spectra of a psycho-acoustic based TLS modeled audio segment. Original spectra (gray) and modeled spectra (black). [Axes: freq (kHz) vs. magnitude (dB).]

5.1 Plain subband ESM reference

Our primary goal of introducing the psycho-acoustic model is to reduce the number of complex exponentials compared to the plain subband ESM. As a reference, we first checked the number of complex exponentials needed in a subband ESM to obtain transparent quality. Since each subband contains the same number of complex exponentials, the total number of components is a multiple of 32. Informal listening tests pointed out that no audible distortions are heard from 192 components onwards (= 6 complex exponentials per subband). We found the same threshold of 192 components for every audio track. This means that every audio track contains one or more subbands that need at least 6 complex exponentials.

5.2 Threshold for the psycho-acoustic model

We already mentioned in section 4.2 that we can stop assigning components to a subband when the NMR in that subband has become smaller than a threshold. An NMR of 0 dB would be a logical choice, because in that case the modeling distortions fall below the masking threshold. However obvious this might be, this strategy is only valid if the psycho-acoustic model is accurate in calculating the masking curve. In [9] we described the results of an A-B preference test with a panel of 10 listeners. In that experiment we found that modeling distortions already become inaudible before the NMR has dropped to 0 dB in all subbands, which indicates that one can safely rely on the psycho-acoustic model.

5.3 Results of psycho-acoustic ESM

For the illustration and comparison of the two methods for psycho-acoustic ESM, we will concentrate on two audio fragments, namely the popsong and the glockenspiel. For more detailed experimental results (violin, harp, trumpet, soprano, ...) and for HTLS audio samples, see our website [10]. Figure 5 gives the spectrogram of the popsong (left) and the glockenspiel fragment (right). While the former has a rapidly changing rich spectral content, the latter has its spectral information much more localized in time and frequency. We know that using subbands is beneficial for modeling spectrally localized contributions; the psycho-acoustic model on the other hand will be effective in pointing out the time-varying location of the relevant spectral information. It is to be expected that the psycho-acoustic ESM will need fewer modeling components than a plain subband or fullband ESM scheme.

Fig. 5: Spectrogram of popsong and glockenspiel audio tracks. [Panels: Popsong, Glockenspiel; axes: time (sec) vs. freq (kHz).]

Figure 6 gives, as a function of the frame number, the number of complex exponentials that is needed to bring the NMR below threshold (in every subband). The data that were obtained with method 1 are shown in black; a gray line is used for method 2. The plots illustrate how the number of modeling components changes as a function of time depending on the spectral content. Some frames with a very sparse and/or low-energy spectrum need very few modeling components, whereas high-energy frames with a rich spectrum need more components. For all frames the number of modeling components is significantly smaller than the 192 components that were needed in a plain subband ESM scheme. In a fullband ESM scheme, this number would be even higher. From the plots we also notice that method 2 needs more components than method 1 to bring down the NMR in all subbands. One reason could be the tendency of the algorithm to postpone the modeling of difficult subbands. Another reason could be that in method 1 we limited the maximum number of components per subband to six.

Fig. 6: Number of complex exponentials needed to obtain an NMR of zero dB in all subbands. [Panels: Popsong, Glockenspiel; axes: Frame Number vs. Number of Compl. Exp.]

The above figure tells us how many components we need in order to have no audible modeling errors, but it does not reveal how the NMR, which is indicative of the perceptual quality, evolves as a function of the number of components. This information can be found in figures 7 and 8.

In figure 7 we plot the average (over all frames) of the maximum (over all subbands) of the NMR as a function of the number of complex exponentials in the TLS model. A black line is used for the method 1 data; a gray line for method 2. For small numbers of complex exponentials the NMR drops rapidly. In this region the ESM algorithm is very effective in capturing the tonal components that are present in the spectrum. With increasing number of components the perceptual quality continues to improve, but the slope of the curve is different. The more components in the model, the more noise-like the residual will be and the more difficult the modeling will be. Since method 1 concentrates on lowering the maximal NMR (i.e. it improves the HTLS model for the subband with the maximal audible distortion), its curve stays below that of method 2.

In figure 8 the mean NMR (over all subbands), averaged over all frames, is plotted as a function of the number of complex exponentials in the TLS model. A black line is used for the method 1 data; a gray line for method 2. The above remarks about the general form of the curves remain valid. We notice that in this case the curve corresponding to method 2 gives the best results. This is quite logical since method 2 always assigns an extra component to the subband that lowers the NMR most. The presented results suggest that the overall difference in performance between the methods is rather small. The main difference lies in the much higher computational load of method 2, which makes method 1 a good choice in most situations.

6 CONCLUSION

The extension of traditional sinusoidal modeling to exponential sinusoidal modeling is especially valuable for the modeling of the transitional segments in audio. Basic Total Least Squares (TLS) algorithms, as they are readily found in other signal processing domains, are not efficient for extracting the model parameters in audio modeling due to their high computational load and the SNR optimization criterion. In our approach we optimize a perceptual criterion by integrating psycho-acoustic information in a subband TLS-ESM algorithm, with the added advantage of a reduced computational load. We presented two strategies for assigning damped sinusoids to the different subbands. Extensive tests on SQAM data showed the effectiveness of both strategies, and illustrated that our psycho-acoustic ESM can achieve the same audio quality as plain ESM but with significantly fewer modeling components.

7 ACKNOWLEDGMENT

Support is acknowledged from the Flemish Community - IWT and from the Research Fund K.U. Leuven.

Fig. 7: Maximal NMR as a function of the number of complex exponentials in the psycho-acoustic ESM-TLS modeling. [Panels: Popsong, Glockenspiel; axes: Number of Compl. Exp. vs. maximal NMR.]

Fig. 8: Mean NMR as a function of the number of complex exponentials in the psycho-acoustic ESM-TLS modeling. [Panels: Popsong, Glockenspiel; axes: Number of Compl. Exp. vs. mean NMR.]

REFERENCES

[1] J. Jensen, S.H. Jensen, and E. Hansen, "Exponential sinusoidal modeling of transitional speech segments," in Proc. International Conference on Acoustics, Speech and Signal Processing, Phoenix, U.S.A., Mar. 1999, vol. I, pp. 473–476.

[2] P. Lemmerling, Structured total least squares: analysis, algorithms and applications, Ph.D. thesis, Faculty of Applied Sciences, K.U. Leuven, Belgium, 1999.

[3] S. Van Huffel, H. Park, and J.B. Rosen, "Formulation and solution of structured total least norm problems for parameter estimation," IEEE Transactions on Signal Processing, vol. 44, no. 10, pp. 2464–2474, 1996.

[4] S. Van Huffel, H. Chen, C. Decanniere, and P. Van Hecke, "Algorithm for time-domain NMR data fitting based on total least squares," Journal of Magnetic Resonance, vol. A110, pp. 228–237, 1994.

[5] D. Calvetti, L. Reichel, and D. C. Sorensen, "An implicitly restarted Lanczos method for large symmetric eigenvalue problems," Electron. Trans. Numerical Analysis, vol. 2, pp. 1–21, Mar. 1994.

[6] K. Hermus, W. Verhelst, P. Wambacq, and P. Lemmerling, "Total least squares based subband modelling for scalable speech representations with damped sinusoids," in Proc. International Conference on Spoken Language Processing, Beijing, China, Oct. 2000, vol. III, pp. 1129–1132.

[7] ISO/IEC 11172-3, Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 3: Audio, ISO/IEC, 1993.

[8] Sound Quality Assessment Material CD (1988), issued by the EBU, http://www.ebu.ch/.

[9] K. Hermus, W. Verhelst, and P. Wambacq, "Psycho-acoustic modeling of audio with exponentially damped sinusoids," in Proc. International Conference on Acoustics, Speech and Signal Processing, Orlando, U.S.A., May 2002.

[10] ESAT-PSI Speech Group website, http://www.esat.kuleuven.ac.be/~spch/.
