Ellis, Mandel - 2009 - Lecture 7 Audio Compression and Coding

EE E6820: Speech & Audio Processing & Recognition
Lecture 7:
Audio compression and coding
Dan Ellis <dpwe@ee.columbia.edu>

Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/∼dpwe/e6820
March 31, 2009
1 Information, Compression & Quantization

2 Speech coding
3 Wide-Bandwidth Audio Coding
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 1 / 37
Outline
2 Speech coding
Compression & Quantization
How big is audio data? What is the bitrate?
Fs frames/second (e.g. 8000 or 44100)
I
x C samples/frame (e.g. 1 or 2 channels)

x B bits/sample (e.g. 8 or 16)
→ Fs · C · B bits/second (e.g. 64 Kbps or 1.4 Mbps)
bits / frame
CD Audio 1.4 Mbps
32
8
Mobile 8000 44100
!13 Kbps frames / sec
Telephony 64 Kbps
How to reduce?
I lower sampling rate → less bandwidth (muffled)
I lower channel count → no stereo image
I lower sample size → quantization noise
Or: use data compression
Data compression:
Redundancy vs. Irrelevance
Two main principles in compression:
I remove redundant information
I remove irrelevant information
Redundant information is implicit in remainder
e.g. signal bandlimited to 20kHz,
I
but sample at 80kHz

→ can recover every other sample by interpolation:
In a bandlimited signal, the red samples
can be exactly recovered by interpolating
the blue samples
sample
time
Irrelevant info is unique but unnecessary

I e.g. recording a microphone signal at 80 kHz sampling rate
Irrelevant data in audio coding
For coding of audio signals,

irrelevant means perceptually insignificant
I an empirical property
Compact Disc standard is adequate:
I 44 kHz sampling for 20 kHz bandwidth
I 16 bit linear samples for ∼ 96 dB peak SNR
Reflect limits of human sensitivity:
I 20 kHz bandwidth, 100 dB intensity
I sinusoid phase, detail of noise structure
I dynamic properties - hard to characterize
Problem: separating salient & irrelevant
Quantization
Represent waveform with discrete levels

5
4 6
3 Q[x[n]] x[n]
2 4
1
2
Q[x] 0
-1 0
-2
-3 -2
-4 error e[n] = x[n] - Q[x[n]]
-5 0 5 10 15 20 25 30 35 40
-5 -4 -3 -2 -1 0 1 2 3 4 5
x
Equivalent to adding error e[n]:

x[n] = Q [x[n]] + e[n]
e[n] ∼ uncorrelated, uniform white noise
p(e[n])
D2
→ variance σe2 = 12
-D/ +D/
2 2
Quantization noise (Q-noise)
Uncorrelated noise has flat spectrum
With a B bit word and a quantization step D
I max signal range (x) = −(2B−1 ) · D . . . (2B−1 − 1) · D
I quantization noise (e) = −D/2 . . . D/2
→ Best signal-to-noise ratio (power)
SNR = E [x 2 ]/E [e 2 ]
= (2B )2
.. or, in dB, 20 · log10 2 · B ≈ 6 · B dB

0
Quantized at 7 bits
level / dB
-20
-40
-60
-80
0 1000 2000 3000 4000 5000 6000 7000 freq / Hz
Redundant information
Redundancy removal is lossless

Signal correlation implies redundant information
I e.g. if x[n] = x[n − 1] + v [n]
x[n] has a greater amplitude range
→ uses more bits than v [n]
I sending v [n] = x[n] − x[n − 1] can reduce amplitude, hence
bitrate
x[n] - x[n-1]
I but: ‘white noise’ sequence has no redundancy ...

Problem: separating unique & redundant
Optimal coding
Shannon information:
An unlikely occurrence is more ‘informative’
p(A) = 0.5 p(B) = 0.5 p(A) = 0.9 p(B) = 0.1
ABBBBAAABBABBABBABB AAAAABBAAAAAABAAAAB
A, B equiprobable A is expected;
B is ‘big news’
Information in bits I = − log2 (probability )
I clearly works when all possibilities equiprobable
Optimal bitrate → av.token length = entropy H = E [I ]
I .. equal-length tokens are equally likely
How to achieve this?
I transform signal to have uniform pdf
I nonuniform quantization for equiprobable tokens
I variable-length tokens → Huffman coding
Quantization for optimum bitrate
Quantization should reflect pdf of signal:
p(x = x0) p(x < x0) x'
1.0
0.8
0.6
0.4
0.2
0
-0.02 -0.015 -0.01 -0.005 0 0.005 0.01 0.015 0.02 0.025 x
I cumulative pdf p(x < x0 ) maps to uniform x 0

Or, codeword length per Shannon log2 (p(x)):
p(x) Shannon info / bits Codewords
-0.02
111111111xx
-0.01 111101xx
111100xx
101xx
0 100xx
0xx
110xx
0.01 1110xx
111110xx
1111110xx
0.02 111111100xx
111111101xx
0.03
111111110xx
0 2 4 6 8
I Huffman coding: tree-structured decoder

Vector Quantization
Quantize mutually dependent values in joint space:
3
x1
2
-1
-2
x2
-6 -4 -2 0 2 4
May help even if values are largely independent

I larger space x1,x2 is easier for Huffman
Compression & Representation
As always, success depends on representation

Appropriate domain may be ‘naturally’ bandlimited
I e.g. vocal-tract-shape coefficients
I can reduce sampling rate without data loss
In right domain, irrelevance is easier to ‘get at’
I e.g. STFT to separate magnitude and phase
Aside: Coding standards
Coding only useful if recipient knows the code!
Standardization efforts are important
Federal Standards: Low bit-rate secure voice:
I FS1015e: LPC-10 2.4 Kbps
I FS1016: 4.8 Kbps CELP
ITU G.x series (also H.x for video)
I G.726 ADPCM
I G.729 Low delay CELP
MPEG
I MPEG-Audio layers 1,2,3 (mp3)
I MPEG 2 Advanced Audio Codec (AAC)
I MPEG 4 Synthetic-Natural Hybrid Codec
More recent ‘standards’
I proprietary: WMA, Skype...
I Speex ...
Outline
2 Speech coding
Speech coding
Standard voice channel:
I analog: 4 kHz slot (∼ 40 dB SNR)
I digital: 64 Kbps = 8 bit µ-law x 8 kHz
How to compress?
Redundant
I signal assumed to be a single voice,
not any possible waveform
Irrelevant
I need code only enough for intelligibility, speaker identification
(c/w analog channel)
Specifically, source-filter decomposition
I vocal tract & f0 change slowly
Applications:
I live communications
I offline storage
Channel Vocoder (1940s-1960s)
Basic source-filter decomposition
I filterbank breaks into spectral bands
I transmit slowly-changing energy in each band
Encoder Decoder
Bandpass Smoothed Downsample
filter 1 energy & encode E1
Bandpass
filter 1
Output
Input
Bandpass Smoothed Downsample EN
filter N energy & encode Bandpass
filter 1
Voicing V/UV
analysis Pulse Noise
Pitch generator source
I 10-20 bands, perceptually spaced

Downsampling?
Excitation?
I pitch / noise model
I or: baseband + ‘flattening’...
LPC encoding
The classic source-filter model

Encoder Filter coefficients {ai} Decoder
|1/A(ej!)|
Input Represent
f
s[n] & encode
LPC Output
analysis ^
e[n] ^
s[n]
Represent Excitation All-pole
Residual & encode generator filter
e[n]
H(z) = 1
t 1 - "aiz-i
Compression gains:
I filter parameters are ∼slowly changing
I excitation can be represented many ways
20 ms
Filter
parameters
Excitation/pitch
parameters
5 ms
Encoding LPC filter parameters
For ‘communications quality’:
I 8 kHz sampling (4 kHz bandwidth)
I ∼10th order LPC (up to 5 pole pairs)
I update every 20-30 ms → 300 - 500 param/s
Representation & quantization
ai
I {ai } - poor distribution,
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
can’t interpolate
ki
I reflection coefficients {ki }:
guaranteed stable -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
fLi
I LSPs - lovely!
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
Bit allocation (filter):

I GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps
I FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps
Line Spectral Pairs (LSPs)
LSPs encode LPC filter by a set of frequencies
Excellent for quantization & interpolation
Definition: zeros of
P(z) = A(z) + z −p−1 · A(z −1 )
Q(z) = A(z) − z −p−1 · A(z −1 )
I z = e jω → z −1 = e −jω → |A(z)| = |A(z −1 )| on u.circ.

I P(z), Q(z) have (interleaved) zeros when
∠{A(z)} = ±∠{z −p−1 A(zQ −1
)}
I reconstruct P(z), Q(z) = i (1 − ζi z −1 ) etc.
I A(z) = [P(z) + Q(z)]/2
A(z-1) = 0
Q(z) = 0
A(z) = 0
P(z) = 0
Encoding LPC excitation
Excitation already better than raw signal:
5000
Original signal
-5000
100
-100 LPC residual

1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 time / s
I save several bits/sample, but still > 32 Kbps

Crude model: U/V flag + pitch period
I ∼ 7 bits / 5 ms = 1.4 Kbps → LPC10 @ 2.4 Kbps
Pitch period values 16 ms frame boundaries
100
50
0
-50
1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 time / s
Band-limit then re-extend (RELP)
Encoding excitation
Something between full-quality residual (32 Kbps)
and pitch parameters (1.4 kbps)?
‘Analysis by synthesis’ loop:
Input LPC Filter coefficients
s[n] analysis A(z)
Excitation code
Pitch-cycle LPC filter

Synthetic predictor –
excitation + 1 +
c[n] b·z–NL
^
x[n] A(z)
control
‘Perceptual’
MSError weighting
minimization
W(z)= A(z) ^
x[n] *ha[n] - s[n]
A(z/!)
Excitation encoding
‘Perceptual’ weighting discounts peaks:

20
1/A(z) W(z)= A(z)
gain / dB
10 A(z/!)
0 1/A(z/!)
-10
-20
0 500 1000 1500 2000 2500 3000 3500 freq / Hz
Multi-Pulse Excitation (MPE-LPC)
Stylize excitation as N discrete pulses
15
10 original LPC
5 residual
0
-5
mi multi-pulse
5
excitation
0
-5
0 20 40 60 80 100 120 time / samps
ti
I encode as N × (ti , mi ) pairs

Greedy algorithm places one pulse at a time:

A(z) X (z)
Epcp = − S(z)
A(z/γ) A(z)
X (z) R(z)
= −
A(z/γ) A(z/γ)
I R(z) is residual of target waveform after inverse-filtering
I cross-correlate hγ and r ∗ hγ , iterate
CELP
Represent excitation with codebook
e.g. 512 sparse excitation vectors
Codebook
000 001 002 003
004 005 006 007
Search
index 008 009 010 011 excitation
012 013 014 015
0 60 time / samples
I linear search for minimum weighted error?

FS1016 4.8 Kbps CELP (30ms frame = 144 bits):
10 LSPs 4x4 + 6x3 bits = 34 bits
Pitch delay 4 x 7 bits = 28 bits
Pitch gain 4 x 5 bits = 20 bits I 138 bits
Codebk index 4 x 9 bits = 36 bits
Codebk gain 4 x 5 bits = 20 bits
Aside: CELP for nonspeech?
CELP is sometimes called a ‘hybrid’ coder:
I originally based on source-filter voice model
I CELP residual is waveform coding (no model)
Original (mrZebra-8k)
4
freq / kHz
3
CELP does not 1
break with multiple 4

5 kbps buzz-hiss
voices etc. 3
I just does the 1
best it can 4
4.8 kbps CELP
0
0 1 2 3 4 5 6 7 8 time / sec
LPC filter models vocal tract;

also matches auditory system?
I i.e. the ‘source-filter’ separation is good for
relevance as well as redundancy?
Outline
2 Speech coding
Wide-Bandwidth Audio Coding
Goals:
I transparent coding i.e. no perceptible effect
I general purpose - handles any signal
Simple approaches (redundancy removal)
I Adaptive Differential PCM (ADPCM)
X[n] + D[n] Adaptive C[n]
+ quantizer
– C[n] = Q[D[n]]
Predictor X'[n] + Dequantizer

Xp[n] +
Xp[n] = F[X'[n-i]] D'[n] D'[n] = Q-1[C[n]]
+
I as prediction gets smarter, becomes LPC

e.g. shorten - lossless LPC encoding
Larger compression gains needs irrelevance
I hide quantization noise with psychoacoustic masking
Noise shaping
Plain Q-noise sounds like added white noise
I actually, not all that disturbing
I .. but worst-case for exploiting masking
1
mu-law
quantization
Have Q-noise scale with 0.8
signal level
0.6
mu(x)
I i.e. quantizer step gets
larger with amplitude 0.4
I minimum distortion for 0.2

x - mu(x)
some center-heavy pdf
0
0 0.2 0.4 0.6 0.8 1
Or: put Q-noise around peaks in spectrum
I key to getting benefit of perceptual masking
20
level / dB
0
Signal
Linear Q-noise
-20
-40
-60
0 Transform Q-noise 1500 2000 2500 3000 3500 freq / Hz
Subband coding
Idea: Quantize separately in separate bands
Encoder Decoder
Bandpass Downsample Quantize
Q[•] Q-1[•] M f
f M
Input Analysis Reconstruction Output
(M channels) +
filters filters
M Q[•] Q-1[•] M
f
I Q-noise stays within band, gets masked

‘Critical sampling’ → 1/M of spectrum per band
0
gain / dB
-10
alias energy
-20
-30
-40
-50
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 normalized freq 1
I some aliasing inevitable

Trick is to cancel with alias of adjacent band
→ ‘quadrature-mirror’ filters
MPEG-Audio (layer I, II)
Basic idea: Subband coding
plus psychoacoustic masking model
to choose dynamic Q-levels in subbands
Scale indices
Scale Quantize
normalize
Input 32 band Format Bitstream
polyphase & pack
filterbank bitstream
Data frame
Scale Control & scales
normalize Quantize
32 chans
x 36 samples
Psychoacoustic
masking Per-band
model masking margins
24 ms
I 22 kHz ÷ 32 equal bands = 690 Hz bandwidth

I 8 / 24 ms frames = 12 / 36 subband samples
I fixed bitrates 32 - 256 Kbps/chan (1-6 bits/samp)
I scale factors are like LPC envelope?
MPEG Psychoacoustic model
Based on simultaneous masking experiments
Difficulties:
I noise energy masks ∼10 dB better than tones
I masking level nonlinear in frequency & intensity
I complex, dynamic sounds not well understood
60 Tonal
50 peaks
Procedure Non-tonal
SPL / dB
40 energy
30
20
Signal
spectrum
I pick ‘tonal peaks’ in NB FFT 10
0
-10
spectrum 1 3 5 7 9 11 13 15 17 19 21 23 25
60
50
Masking
I remaining energy → ‘noisy’ spread
SPL / dB
40
30
peaks 20
10
0
-10
I apply nonlinear ‘spreading 1 3 5 7 9 11 13 15 17 19 21 23 25
function’ 60
50
Resultant
masking
SPL / dB
40
30
I sum all masking & threshold in 20
10
0
power domain -10
1 3 5 7 9 11 13 15 17 19 21 23 25
freq / Bark
MPEG Bit allocation
Result of psychoacoustic model is
maximum tolerable noise per subband
Masking tone
level
SNR ~ 6·B
Masked
threshold
Safe noise level
Quantization noise freq

Subband N
I safe noise level → required SNR → bits B

Bit allocation procedure (fixed bit rate):
I pick channel with worst noise-masker ratio
I improve its quantization by one step
I repeat while more bits available for this frame
Bands with no signal above masking curve can be skipped
MPEG Audio Layer III
‘Transform coder’ on top of ‘subband coder’
Digital Audio Coded
Signal (PCM) Subband Line
Distortion Audio Signal
(768 kbit/s) 31 575
Control Loop 192 kbit/s
Filterbank
Huffman
MDCT
Bitstream Formatting
32 Subbands Nonuniform Encoding 32 kbit/s
Quantization
CRC-Check
0 0
Rate
Control Loop
Window
Switching
Coding of
Side-
information
Psycho-
FFT acoustic
1024 Points
Model
External Control
Blocks of 36 subband time-domain samples become 18 pairs

of frequency-domain samples
I more redundancy in spectral domain
I finer control e.g. of aliasing, masking
I scale factors now in band-blocks
Fixed Huffman tables optimized for audio data
Power-law quantizer
Adaptive time window
Time window relies on temporal masking
I single quantization level over 8-24 ms window
‘Nightmare’ scenario:
Pre-echo distortion
I ‘backward masking’ saves in most cases

Adaptive switching of time window:
normal transition short
1
window level
0.8
0.6
0.4
0.2
0
0 20 40 60 80 100 time / ms
The effects of MP3
Before & after:
Josie - direct from CD
freq / kHz
20
15
10
0
After MP3 encode (160 kbps) and decode
freq / kHz
20
15
10
0
Residual (after aligning 1148 sample delay)
freq / kHz
20
15
10
0
0 2 4 6 8 10 time / sec
I chop off high frequency (above 16 kHz)

I occasional other time-frequency ‘holes’
I quantization noise under signal
MP3 & Beyond
MP3 is ‘transparent’ at ∼ 128 Kbps for stereo

(11x smaller than 1.4 Mbps CD rate)
I only decoder is standardized:
better psychological models → better encoders
MPEG2 AAC
I rebuild of MP3 without backwards compatibility
I 30% better (stereo at 96 Kbps?)
I multichannel etc.
MPEG4-Audio
I wide range of component encodings
I MPEG Audio, LSPs, ...
SAOL
I ‘synthetic’ component of MPEG-4 Audio
I complete DSP/computer music language!
I how to encode into it?
Summary
For coding, every bit counts

I take care over quantization domain & effects
I Shannon limits...
Speech coding
I LPC modeling is old but good
I CELP residual modeling can go beyond speech
Wide-band coding
I noise shaping ‘hides’ quantization noise
I detailed psychoacoustic models are key
References
Ben Gold and Nelson Morgan. Speech and Audio Signal Processing: Processing and
Perception of Speech and Music. Wiley, July 1999. ISBN 0471351547.
D. Pan. A tutorial on MPEG/audio compression. IEEE Multimedia, 2(2):60–74, 1995.
Karlheinz Brandenburg. MP3 and AAC explained. In Proc. AES 17th International
Conference on High Quality Audio Coding, pages 1–12, 1999.
T. Painter and A. Spanias. Perceptual coding of digital audio. Proc. IEEE, 80(4):
451–513, Apr. 2000.

Ellis, Mandel - 2009 - Lecture 7 Audio Compression and Coding

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Ellis, Mandel - 2009 - Lecture 7 Audio Compression and Coding

Uploaded by

Copyright:

Available Formats

EE E6820: Speech & Audio Processing & Recognition

Dan Ellis <dpwe@ee.columbia.edu>

March 31, 2009

1 Information, Compression & Quantization

1 Information, Compression & Quantization

3 Wide-Bandwidth Audio Coding

x C samples/frame (e.g. 1 or 2 channels)

but sample at 80kHz

Irrelevant info is unique but unnecessary

For coding of audio signals,

Represent waveform with discrete levels

Equivalent to adding error e[n]:

.. or, in dB, 20 · log10 2 · B ≈ 6 · B dB

Redundancy removal is lossless

I but: ‘white noise’ sequence has no redundancy ...

I cumulative pdf p(x < x0 ) maps to uniform x 0

I Huffman coding: tree-structured decoder

May help even if values are largely independent

As always, success depends on representation

1 Information, Compression & Quantization

3 Wide-Bandwidth Audio Coding

I 10-20 bands, perceptually spaced

The classic source-filter model

Bit allocation (filter):

I z = e jω → z −1 = e −jω → |A(z)| = |A(z −1 )| on u.circ.

-100 LPC residual

I save several bits/sample, but still > 32 Kbps

Band-limit then re-extend (RELP)

Pitch-cycle LPC filter

‘Perceptual’ weighting discounts peaks:

I encode as N × (ti , mi ) pairs

004 005 006 007

012 013 014 015

I linear search for minimum weighted error?

CELP does not 1

break with multiple 4

I just does the 1

LPC filter models vocal tract;

1 Information, Compression & Quantization

3 Wide-Bandwidth Audio Coding

Predictor X'[n] + Dequantizer

I as prediction gets smarter, becomes LPC

I minimum distortion for 0.2

I Q-noise stays within band, gets masked

I some aliasing inevitable

I 22 kHz ÷ 32 equal bands = 690 Hz bandwidth

Safe noise level

Quantization noise freq

I safe noise level → required SNR → bits B

Blocks of 36 subband time-domain samples become 18 pairs

I ‘backward masking’ saves in most cases

I chop off high frequency (above 16 kHz)

MP3 is ‘transparent’ at ∼ 128 Kbps for stereo

For coding, every bit counts

You might also like