Workload Estimation For Low-Delay Segmented Convolution

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

Audio Engineering Society

Convention e-Brief
Presented at the 134th Convention
2013 May 4–7 Rome, Italy

This Engineering Brief was selected on the basis of a submitted synopsis. The author is solely responsible for its presentation,
and the AES takes no responsibility for the contents. All rights reserved. Reproduction of this paper, or any portion thereof, is not
permitted without direct permission from the Audio Engineering Society.

Workload estimation for low-delay


segmented convolution
Malte Spiegelberg, Robert Mores

HAW Hamburg, Hamburg, 22081, Germany


Robert.Mores@haw-hamburg.de

ABSTRACT

Zero-delay convolution usually follows a hybrid approach with convolution processing steps in both, the time and
the frequency domain (Gardner, J. AES 43 (1995) [1]). Implementations are likely to ask for dynamic coding, and
related workload estimations are focused on efficiency and are limited to the hybrid approach. This paper considers
simpler implementations of segmented convolution that work in the frequency domain only and that achieve
acceptable low delay for real-time applications when processing several seconds of impulse-response in FIR mode.
Workload and memory demand are estimated for this approach in the context of likely application parameters.

h[n] Impulse response


1. INTRODUCTION
Lh Length of h[n]
This paper considers the task of convolving an input P Length of a segment hi[n]
signal with another signal, say a room impulse response
y[n] Output signal
of considerable length. To solve this computational task,
here, the input and the impulse response are divided into Ly Length of y[n]
segments and are zero-padded to prepare for
convolution. The segment size is chosen in context of Capital letters are used for signals in the frequency
application. domain, so X[f] is the Fourier transformed input signal
x[n].
2. NOTATION
3. SEGMENTED CONVOLUTION
The following variables are assigned:
In signal processing tasks impulse responses of
x[n] Input signal
considerable length are occasionally convolved with
Lx Length of x[n] other signals. The related computational effort can be
lowered when processing large signal segments in the
N Length of a segment xk[n]
frequency domain. Real-time applications, however,
Spiegelberg, Mores <Workload Est. for Segmented Convolution>

may dictate use of small segments. Therefore, input P −1


6474 8
y11 [ n ] 1 3 7 13 14 14 8
P −1
64748 P −1
6474 8
signal and impulse response are divided into segments. y12 [ n ] 4 12 22 33 24 11 4
In this way, the output can start before the whole input y 21 [ n ] 5 11 23 37 34 30 16
signal is known. y 22 [ n ] 20 44 62 77 52 23 8
y31 [ n ] 9 19 39 61 54 46 24
When a segment xk[n] is loaded, it is transformed into y32 [ n ] 36 76 102 121 80 35 12
the frequency domain by FFT. The transformed segment y [ n ] 1 3 7 13 23 37 53 70 87 104 121 138 ...
Xk[f] is then multiplied with every segment Hi[f] of the
transformed impulse response. h[n] can be segmented
and transformed during system boot so that transformed 4. WORKLOAD ESTIMATION
segments Hi[f] are present in memory. The result of the
multiplication Yki[f] is transformed back into time This section models computation effort to finally
domain by the inverse FFT, IFFT. Finally, an overlap- suggest optimal N = P in the context of application
add of all segments yki[n] results in the output signal demand. The process of segmented convolution takes
y[n] [Fig. 1][2]. four steps of processing:

• FFT of an input segment xk[n]

• Multiplication with the segments Hi[f] of the


transformed impulse response

• IFFT of the resulting segments Yki[f]

• Overlap-add of the segments yki[n]

4.1. FFT of input segments xk[n]

Every input signal xk[n] of length N has to be


Figure 1 The process of segmented convolution transformed with an FFT of length N + P - 1. For this
workload estimation, the Radix-2 Algorithm will be
The size P of the segments hi[n] is an important
used. The effort for one segment xk[n] will be
parameter in the calculation. When P is smaller than N,
the number of terms in the final overlap-add calculation
N + P −1
increases unfortunately, computational effort will WM 1 = ⋅ log 2 ( N + P − 1) (2 )
increase. When P is larger than N, a segment xk[n] and 2
its resulting segments will stay in the process longer
than necessary, memory demand will increase. Hence, Multiplications and
the most efficient method is using N = P.
A tutorial example with very short segment length
illustrates input, segmentation and overlap-add
calculation of convolved segments (1).
WA1 = ( N + P − 1) ⋅ log 2 ( N + P − 1) (3 )
x [ n ] = [1, 2, 3, ..., 31, 32 ]
Additions [3].
h [ n ] = [1,1, 2, 2, 4, 4, 2,1]
4.2. Multiplication with the segments Hi[f]
N =P=4 (1 ) The impulse response of length Lh is split into Lh / P
 x [n] if 4 ( k − 1) ≤ n ≤ 3k segments with the same amount of segments after
xk [ n ] =  transformation. The result of every of the FFTs
 0 else
calculated before is multiplied with every segment Hi[f].
 h [ n ] , if 4 ( i − 1) ≤ n ≤ 4i − 1 This will need Lh / P Multiplications for one segment
hi [ n − ( i − 1) 4 ] = 
Xk[f] to obtain the result Yki[n].
 0 else

AES 134th Convention, Rome, Italy, 2013 May 4–7


Page 2 of 4
Spiegelberg, Mores <Workload Est. for Segmented Convolution>

4.3. IFFT of the resulting segments Yki[n] 4.5. Estimation Results


All segments Yki[n] have to be transformed back into The following table provides data for different sample
time domain with the IFFT operation. IFFT and FFT rates and impulse response lengths see Table 1. Lx is set
have the same effort. So for every segment Yki[n] there to one second according to the sample rate. Therefore,
will be: the numbers represent operations per second. The line
with N = P = 1 represents discrete convolution, the last
Lh  N + P − 1 two lines represent the convolution in the spectral
WM 2 = ⋅

⋅ log 2 ( N + P − 1)  (4 ) domain with Fourier transformations across the entire
P  2  sound.

Multiplications and Lh = 2 Sec. @ 44.1 kHz Lh = 0.5 Sec. @ 48 kHz


N=P WM WA + Wover WM WA + Wover
Lh
WA 2 = ⋅ ( ( N + P − 1) ⋅ log 2 ( N + P − 1) ) (5 ) 1 5.83 * 109 5.83 * 109 3.46 * 109 3.46 * 109
P 8 1.84 * 109 6.30 * 109 5.46 * 108 1.46 * 109
9 9 8
16 1.18 * 10 3.75 * 10 3.50 * 10 9.01 * 108
Additions. 32 7.19 * 10 8
2.15 * 10 9
2.13 * 10 8
5.30 * 108
8 9 8
64 4.23 * 10 1.20 * 10 1.25 * 10 3.04 * 108
The calculations shown have to be done for each 8 8 7
128 2.43 * 10 6.66 * 10 7.21 * 10 1.71 * 108
segment of x[n]. When x[n] is split into segments of 8 8 7
length N, there are Lx / N segments. Adding this factor 256 1.37 * 10 3.64 * 10 4.09 * 10 9.50 * 107
6 6
and taking the estimations before, the workload 44100 2.17 * 10 4.48 * 10
estimation before the overlap-add can be calculated. It 48000 1.19 * 106 2.38 * 106
will be:
Table 1 Workload estimation for two commonly
used sample rates, with the number of multiplications
WM = x ⋅  h +  h + 1 ⋅ 
L L L N + P −1
⋅ log 2 ( N + P − 1)  (6 )
WM, additions WA and additions for overlap Wover,
N P P  2 
N = P = 1 represents convolution in the time domain
Multiplications and 1 sek. @ 22 kHz
2 sek. @ 22 kHz
10
10 1 sek. @ 44.1 kHz
2 sek. @ 44.1 kHz
Lx   Lh  
+ 1  ⋅ ( N + P − 1) ⋅ log 2 ( N + P − 1)  (7 )
1 sek. @ 48 kHz
WA = ⋅
Workload

2 sek. @ 48 kHz

N  P   9
10

Additions. 8
10
Segment Length (N = P)
1 2 4 8 16 32 64 128

4.54e−5 9.09e−5 1.82e−4 3.63e−4 7.27e−4 1.46e−3 2.91e−3 5.81e−3 Sec. (Fs = 22 kHz)
4.4. Overlap-add of the segments yki[n] 2.27e−5 4.54e−5 9.07e−5 1.81e−4 3.63e−4 7.26e−4 1.45e−3 2.9e−3 Sec. (Fs = 44.1 kHz)

2.08e−5 4.17e−5 8.33e−5 1.67e−4 3.33e−4 6.67e−4 1.33e−3 2.67e−3 Sec. (Fs = 48 kHz)
An overlap-add calculation only requires additions and
the amount of terms depends on the number of segments Figure 2 Workload estimation versus segmentation
in x[n] and h[n]. This number grows from the beginning length N=P, workload represents the sum of all
of the operation to a maximum of Lx + Lh – 1 - 2P terms multiplications and all additions, amended scales
for each segment y[n], see examples. The workload of represent processing delay in relation to segmentation
the overlap-add calculation can be described for the length and sample rate.
whole calculation. For the estimation, the whole
calculation is considered to be at maximum overlap.
The total amount of additions for the overlap therefore 5. MEMORY ESTIMATION
is:
Implementation of segmented convolution will require
memory in a predictable way. Memory estimation takes
Wover =  x + h − 2  ⋅   
L L Lh Lh
 2 ⋅ − 1  ⋅ ( P − 1) + − 1  (8 ) into account that memory can be overwritten as soon as
N P   P  P  a segment xk[n] is no longer part of the calculation. The
transformed segments Hi[n] of the impulse response will

AES 134th Convention, Rome, Italy, 2013 May 4–7


Page 3 of 4
Spiegelberg, Mores <Workload Est. for Segmented Convolution>

be needed for the whole calculation so these cannot be examples of common FFT-Algorithms that can reduce
overwritten. workload. An optimized FFT method for low latency
convolution is presented in [5].
It is possible to estimate the memory needed for the
whole calculation. As maximum memory should be Concerning data format, working with floating point is
estimated, the sectors with maximum overlap are recommended as coding will be simpler compared to
considered. At this point of the calculation 2 * Lh / P fixed point usage.
segments are active in the calculation. Another Lh / P
segments have to be prepared for the next sector, so
7. SUMMARY
memory is also required for these segments. The total
amount of samples to be stored in memory is:
Workload and memory demand are estimated for the
segmented convolution in the spectral domain. The
  approach translates parameter size such as sample size
 
 2 ⋅ ( 2 P − 1) ⋅ Lh  +  2 ⋅ 2 N − 1 ⋅ Lh  and sample rate to suggest window size and related
  1424 ( ) delays for individual applications. Feasibility can be
144244 P
3  segment X [ n]
3 { P  evaluated before implementation on specific processors.
H [n]  k
next 
 
loading
i
segments (9 )
  8. REFERENCES
 2 ⋅ Lh  ⋅ S Bit
+ 2 ⋅ ( 2 N − 1) ⋅ [1] William G. Gardner, “Efficient Convolution
 1424 3 P 
 Segment Y [n ] maximum{
j 
overlap segments 
without Input-Output Delay” in Journal of the
Audio Engineering Society, Vol. 43 (3), 127-136
(1995)
And for N = P:
[2] Steven W. Smith, The Scientist and Engineer's
8 Lh Guide to Digital Signal Processing, California
( 2 P − 1) ⋅ ⋅ S Bit (10 ) Technical Publishing, 1997
P
[3] Eleanor Chu and Alan George, Inside the FFT
For instance, with the examples from section 4 [Tab. 1,
Black Box: Serial and Parallel Fast Fourier
Fig. 2], a second of impulse response at 44,1 kHz and N
Transform Algorithms, CRC Press, 2010
= P = 64 needs 7 · 105 samples in memory, and an
extreme case, two seconds at 48 kHz and N = P = 2
[4] Udo Zölzer, Digitale Audiosignalverarbeitung,
needs 1,15 · 106 samples.
Vieweg+Teubner Verlag, 2005

6. OPTIMIZATION [5] Jeffrey Hurchalla, “A time distributed FFT for


efficient low latency convolution,” Audio
There are further options to reduce processing power or
Engineering Society Convention 129, 2010
memory demand by proposed algorithms, even before
processor-specific optimization.

For instance, the calculation can optionally be optimized


by a factor of two when using an interleaving of the
input signal segments. Two segments x1[n] and x2[n]
form a complex value z1[n] that is transformed and
multiplied with all transformed segments of h[n]. The
result ZE1[f] is then transformed back into the time
domain. The segments yki[n] are the real and imaginary
part of the results zej[n] [4].

Or alternatively, the process can be improved by using


other FFT algorithms. Radix 4 and Split-Radix are two

AES 134th Convention, Rome, Italy, 2013 May 4–7


Page 4 of 4

You might also like