Download as pdf or txt
Download as pdf or txt
You are on page 1of 37

EE E6820: Speech & Audio Processing & Recognition

Lecture 7:
Audio compression and coding

Dan Ellis <dpwe@ee.columbia.edu>


Michael Mandel <mim@ee.columbia.edu>
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/∼dpwe/e6820

March 31, 2009

1 Information, Compression & Quantization


2 Speech coding
3 Wide-Bandwidth Audio Coding

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 1 / 37
Outline

1 Information, Compression & Quantization

2 Speech coding

3 Wide-Bandwidth Audio Coding

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 2 / 37
Compression & Quantization
How big is audio data? What is the bitrate?
Fs frames/second (e.g. 8000 or 44100)
I

x C samples/frame (e.g. 1 or 2 channels)


x B bits/sample (e.g. 8 or 16)
→ Fs · C · B bits/second (e.g. 64 Kbps or 1.4 Mbps)

bits / frame
CD Audio 1.4 Mbps
32

8
Mobile 8000 44100
!13 Kbps frames / sec
Telephony 64 Kbps

How to reduce?
I lower sampling rate → less bandwidth (muffled)
I lower channel count → no stereo image
I lower sample size → quantization noise
Or: use data compression
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 3 / 37
Data compression:
Redundancy vs. Irrelevance
Two main principles in compression:
I remove redundant information
I remove irrelevant information
Redundant information is implicit in remainder
e.g. signal bandlimited to 20kHz,
I

but sample at 80kHz


→ can recover every other sample by interpolation:
In a bandlimited signal, the red samples
can be exactly recovered by interpolating
the blue samples
sample

time

Irrelevant info is unique but unnecessary


I e.g. recording a microphone signal at 80 kHz sampling rate

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 4 / 37
Irrelevant data in audio coding

For coding of audio signals,


irrelevant means perceptually insignificant
I an empirical property
Compact Disc standard is adequate:
I 44 kHz sampling for 20 kHz bandwidth
I 16 bit linear samples for ∼ 96 dB peak SNR
Reflect limits of human sensitivity:
I 20 kHz bandwidth, 100 dB intensity
I sinusoid phase, detail of noise structure
I dynamic properties - hard to characterize
Problem: separating salient & irrelevant

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 5 / 37
Quantization

Represent waveform with discrete levels


5
4 6
3 Q[x[n]] x[n]
2 4
1
2
Q[x] 0
-1 0
-2
-3 -2
-4 error e[n] = x[n] - Q[x[n]]
-5 0 5 10 15 20 25 30 35 40
-5 -4 -3 -2 -1 0 1 2 3 4 5
x

Equivalent to adding error e[n]:


x[n] = Q [x[n]] + e[n]
e[n] ∼ uncorrelated, uniform white noise
p(e[n])
D2
→ variance σe2 = 12
-D/ +D/
2 2

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 6 / 37
Quantization noise (Q-noise)
Uncorrelated noise has flat spectrum
With a B bit word and a quantization step D
I max signal range (x) = −(2B−1 ) · D . . . (2B−1 − 1) · D
I quantization noise (e) = −D/2 . . . D/2
→ Best signal-to-noise ratio (power)

SNR = E [x 2 ]/E [e 2 ]
= (2B )2

.. or, in dB, 20 · log10 2 · B ≈ 6 · B dB


0
Quantized at 7 bits
level / dB

-20

-40

-60

-80
0 1000 2000 3000 4000 5000 6000 7000 freq / Hz

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 7 / 37
Redundant information

Redundancy removal is lossless


Signal correlation implies redundant information
I e.g. if x[n] = x[n − 1] + v [n]
x[n] has a greater amplitude range
→ uses more bits than v [n]
I sending v [n] = x[n] − x[n − 1] can reduce amplitude, hence
bitrate
x[n] - x[n-1]

I but: ‘white noise’ sequence has no redundancy ...


Problem: separating unique & redundant

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 8 / 37
Optimal coding

Shannon information:
An unlikely occurrence is more ‘informative’
p(A) = 0.5 p(B) = 0.5 p(A) = 0.9 p(B) = 0.1
ABBBBAAABBABBABBABB AAAAABBAAAAAABAAAAB

A, B equiprobable A is expected;
B is ‘big news’
Information in bits I = − log2 (probability )
I clearly works when all possibilities equiprobable
Optimal bitrate → av.token length = entropy H = E [I ]
I .. equal-length tokens are equally likely
How to achieve this?
I transform signal to have uniform pdf
I nonuniform quantization for equiprobable tokens
I variable-length tokens → Huffman coding

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 9 / 37
Quantization for optimum bitrate
Quantization should reflect pdf of signal:
p(x = x0) p(x < x0) x'
1.0

0.8

0.6

0.4

0.2

0
-0.02 -0.015 -0.01 -0.005 0 0.005 0.01 0.015 0.02 0.025 x

I cumulative pdf p(x < x0 ) maps to uniform x 0


Or, codeword length per Shannon log2 (p(x)):
p(x) Shannon info / bits Codewords
-0.02
111111111xx
-0.01 111101xx
111100xx
101xx
0 100xx
0xx
110xx
0.01 1110xx
111110xx
1111110xx
0.02 111111100xx
111111101xx
0.03
111111110xx
0 2 4 6 8

I Huffman coding: tree-structured decoder


E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 10 / 37
Vector Quantization
Quantize mutually dependent values in joint space:
3
x1
2

-1

-2

x2

-6 -4 -2 0 2 4

May help even if values are largely independent


I larger space x1,x2 is easier for Huffman

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 11 / 37
Compression & Representation

As always, success depends on representation


Appropriate domain may be ‘naturally’ bandlimited
I e.g. vocal-tract-shape coefficients
I can reduce sampling rate without data loss
In right domain, irrelevance is easier to ‘get at’
I e.g. STFT to separate magnitude and phase

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 12 / 37
Aside: Coding standards
Coding only useful if recipient knows the code!
Standardization efforts are important
Federal Standards: Low bit-rate secure voice:
I FS1015e: LPC-10 2.4 Kbps
I FS1016: 4.8 Kbps CELP
ITU G.x series (also H.x for video)
I G.726 ADPCM
I G.729 Low delay CELP
MPEG
I MPEG-Audio layers 1,2,3 (mp3)
I MPEG 2 Advanced Audio Codec (AAC)
I MPEG 4 Synthetic-Natural Hybrid Codec
More recent ‘standards’
I proprietary: WMA, Skype...
I Speex ...

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 13 / 37
Outline

1 Information, Compression & Quantization

2 Speech coding

3 Wide-Bandwidth Audio Coding

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 14 / 37
Speech coding
Standard voice channel:
I analog: 4 kHz slot (∼ 40 dB SNR)
I digital: 64 Kbps = 8 bit µ-law x 8 kHz
How to compress?
Redundant
I signal assumed to be a single voice,
not any possible waveform
Irrelevant
I need code only enough for intelligibility, speaker identification
(c/w analog channel)
Specifically, source-filter decomposition
I vocal tract & f0 change slowly
Applications:
I live communications
I offline storage

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 15 / 37
Channel Vocoder (1940s-1960s)
Basic source-filter decomposition
I filterbank breaks into spectral bands
I transmit slowly-changing energy in each band
Encoder Decoder
Bandpass Smoothed Downsample
filter 1 energy & encode E1
Bandpass
filter 1

Output
Input
Bandpass Smoothed Downsample EN
filter N energy & encode Bandpass
filter 1

Voicing V/UV
analysis Pulse Noise
Pitch generator source

I 10-20 bands, perceptually spaced


Downsampling?
Excitation?
I pitch / noise model
I or: baseband + ‘flattening’...
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 16 / 37
LPC encoding

The classic source-filter model


Encoder Filter coefficients {ai} Decoder
|1/A(ej!)|

Input Represent
f
s[n] & encode
LPC Output
analysis ^
e[n] ^
s[n]
Represent Excitation All-pole
Residual & encode generator filter
e[n]
H(z) = 1
t 1 - "aiz-i

Compression gains:
I filter parameters are ∼slowly changing
I excitation can be represented many ways
20 ms

Filter
parameters
Excitation/pitch
parameters

5 ms

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 17 / 37
Encoding LPC filter parameters
For ‘communications quality’:
I 8 kHz sampling (4 kHz bandwidth)
I ∼10th order LPC (up to 5 pole pairs)
I update every 20-30 ms → 300 - 500 param/s
Representation & quantization
ai
I {ai } - poor distribution,
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
can’t interpolate
ki
I reflection coefficients {ki }:
guaranteed stable -2 -1.5 -1 -0.5 0 0.5 1 1.5 2

fLi
I LSPs - lovely!
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5

Bit allocation (filter):


I GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps
I FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 18 / 37
Line Spectral Pairs (LSPs)
LSPs encode LPC filter by a set of frequencies
Excellent for quantization & interpolation
Definition: zeros of
P(z) = A(z) + z −p−1 · A(z −1 )
Q(z) = A(z) − z −p−1 · A(z −1 )

I z = e jω → z −1 = e −jω → |A(z)| = |A(z −1 )| on u.circ.


I P(z), Q(z) have (interleaved) zeros when
∠{A(z)} = ±∠{z −p−1 A(zQ −1
)}
I reconstruct P(z), Q(z) = i (1 − ζi z −1 ) etc.
I A(z) = [P(z) + Q(z)]/2
A(z-1) = 0

Q(z) = 0
A(z) = 0
P(z) = 0

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 19 / 37
Encoding LPC excitation
Excitation already better than raw signal:
5000

Original signal
-5000
100

-100 LPC residual


1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 time / s

I save several bits/sample, but still > 32 Kbps


Crude model: U/V flag + pitch period
I ∼ 7 bits / 5 ms = 1.4 Kbps → LPC10 @ 2.4 Kbps
Pitch period values 16 ms frame boundaries
100
50
0
-50
1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 time / s

Band-limit then re-extend (RELP)

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 20 / 37
Encoding excitation
Something between full-quality residual (32 Kbps)
and pitch parameters (1.4 kbps)?
‘Analysis by synthesis’ loop:
Input LPC Filter coefficients
s[n] analysis A(z)
Excitation code

Pitch-cycle LPC filter


Synthetic predictor –
excitation + 1 +
c[n] b·z–NL
^
x[n] A(z)
control

‘Perceptual’
MSError weighting
minimization
W(z)= A(z) ^
x[n] *ha[n] - s[n]
A(z/!)
Excitation encoding

‘Perceptual’ weighting discounts peaks:


20
1/A(z) W(z)= A(z)
gain / dB

10 A(z/!)
0 1/A(z/!)
-10
-20
0 500 1000 1500 2000 2500 3000 3500 freq / Hz

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 21 / 37
Multi-Pulse Excitation (MPE-LPC)
Stylize excitation as N discrete pulses
15
10 original LPC
5 residual
0
-5
mi multi-pulse
5
excitation
0
-5
0 20 40 60 80 100 120 time / samps
ti

I encode as N × (ti , mi ) pairs


Greedy algorithm places one pulse at a time:
 
A(z) X (z)
Epcp = − S(z)
A(z/γ) A(z)
X (z) R(z)
= −
A(z/γ) A(z/γ)
I R(z) is residual of target waveform after inverse-filtering
I cross-correlate hγ and r ∗ hγ , iterate
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 22 / 37
CELP
Represent excitation with codebook
e.g. 512 sparse excitation vectors
Codebook
000 001 002 003

004 005 006 007

Search
index 008 009 010 011 excitation

012 013 014 015

0 60 time / samples

I linear search for minimum weighted error?


FS1016 4.8 Kbps CELP (30ms frame = 144 bits):
10 LSPs 4x4 + 6x3 bits = 34 bits
Pitch delay 4 x 7 bits = 28 bits
Pitch gain 4 x 5 bits = 20 bits I 138 bits
Codebk index 4 x 9 bits = 36 bits
Codebk gain 4 x 5 bits = 20 bits
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 23 / 37
Aside: CELP for nonspeech?
CELP is sometimes called a ‘hybrid’ coder:
I originally based on source-filter voice model
I CELP residual is waveform coding (no model)

Original (mrZebra-8k)
4

freq / kHz
3

CELP does not 1

break with multiple 4


5 kbps buzz-hiss

voices etc. 3

I just does the 1

best it can 4
4.8 kbps CELP

0
0 1 2 3 4 5 6 7 8 time / sec

LPC filter models vocal tract;


also matches auditory system?
I i.e. the ‘source-filter’ separation is good for
relevance as well as redundancy?
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 24 / 37
Outline

1 Information, Compression & Quantization

2 Speech coding

3 Wide-Bandwidth Audio Coding

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 25 / 37
Wide-Bandwidth Audio Coding
Goals:
I transparent coding i.e. no perceptible effect
I general purpose - handles any signal
Simple approaches (redundancy removal)
I Adaptive Differential PCM (ADPCM)
X[n] + D[n] Adaptive C[n]
+ quantizer
– C[n] = Q[D[n]]

Predictor X'[n] + Dequantizer


Xp[n] +
Xp[n] = F[X'[n-i]] D'[n] D'[n] = Q-1[C[n]]
+

I as prediction gets smarter, becomes LPC


e.g. shorten - lossless LPC encoding
Larger compression gains needs irrelevance
I hide quantization noise with psychoacoustic masking

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 26 / 37
Noise shaping
Plain Q-noise sounds like added white noise
I actually, not all that disturbing
I .. but worst-case for exploiting masking
1
mu-law
quantization
Have Q-noise scale with 0.8

signal level
0.6
mu(x)
I i.e. quantizer step gets
larger with amplitude 0.4

I minimum distortion for 0.2


x - mu(x)
some center-heavy pdf
0
0 0.2 0.4 0.6 0.8 1
Or: put Q-noise around peaks in spectrum
I key to getting benefit of perceptual masking
20
level / dB

0
Signal
Linear Q-noise
-20

-40

-60
0 Transform Q-noise 1500 2000 2500 3000 3500 freq / Hz

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 27 / 37
Subband coding
Idea: Quantize separately in separate bands

Encoder Decoder
Bandpass Downsample Quantize
Q[•] Q-1[•] M f
f M
Input Analysis Reconstruction Output
(M channels) +
filters filters

M Q[•] Q-1[•] M
f

I Q-noise stays within band, gets masked


‘Critical sampling’ → 1/M of spectrum per band
0
gain / dB

-10
alias energy
-20
-30
-40
-50
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 normalized freq 1

I some aliasing inevitable


Trick is to cancel with alias of adjacent band
→ ‘quadrature-mirror’ filters
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 28 / 37
MPEG-Audio (layer I, II)
Basic idea: Subband coding
plus psychoacoustic masking model
to choose dynamic Q-levels in subbands
Scale indices
Scale Quantize
normalize
Input 32 band Format Bitstream
polyphase & pack
filterbank bitstream
Data frame
Scale Control & scales
normalize Quantize

32 chans
x 36 samples
Psychoacoustic
masking Per-band
model masking margins
24 ms

I 22 kHz ÷ 32 equal bands = 690 Hz bandwidth


I 8 / 24 ms frames = 12 / 36 subband samples
I fixed bitrates 32 - 256 Kbps/chan (1-6 bits/samp)
I scale factors are like LPC envelope?

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 29 / 37
MPEG Psychoacoustic model
Based on simultaneous masking experiments
Difficulties:
I noise energy masks ∼10 dB better than tones
I masking level nonlinear in frequency & intensity
I complex, dynamic sounds not well understood
60 Tonal
50 peaks
Procedure Non-tonal

SPL / dB
40 energy
30
20
Signal
spectrum
I pick ‘tonal peaks’ in NB FFT 10
0
-10
spectrum 1 3 5 7 9 11 13 15 17 19 21 23 25

60
50
Masking
I remaining energy → ‘noisy’ spread

SPL / dB
40
30
peaks 20
10
0
-10
I apply nonlinear ‘spreading 1 3 5 7 9 11 13 15 17 19 21 23 25

function’ 60
50
Resultant
masking
SPL / dB

40
30
I sum all masking & threshold in 20
10
0
power domain -10
1 3 5 7 9 11 13 15 17 19 21 23 25
freq / Bark

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 30 / 37
MPEG Bit allocation
Result of psychoacoustic model is
maximum tolerable noise per subband
Masking tone

level

SNR ~ 6·B
Masked
threshold

Safe noise level

Quantization noise freq


Subband N

I safe noise level → required SNR → bits B


Bit allocation procedure (fixed bit rate):
I pick channel with worst noise-masker ratio
I improve its quantization by one step
I repeat while more bits available for this frame
Bands with no signal above masking curve can be skipped

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 31 / 37
MPEG Audio Layer III
‘Transform coder’ on top of ‘subband coder’
Digital Audio Coded
Signal (PCM) Subband Line
Distortion Audio Signal
(768 kbit/s) 31 575
Control Loop 192 kbit/s
Filterbank
Huffman
MDCT

Bitstream Formatting
32 Subbands Nonuniform Encoding 32 kbit/s
Quantization

CRC-Check
0 0
Rate
Control Loop
Window
Switching
Coding of
Side-
information
Psycho-
FFT acoustic
1024 Points
Model
External Control

Blocks of 36 subband time-domain samples become 18 pairs


of frequency-domain samples
I more redundancy in spectral domain
I finer control e.g. of aliasing, masking
I scale factors now in band-blocks
Fixed Huffman tables optimized for audio data
Power-law quantizer
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 32 / 37
Adaptive time window
Time window relies on temporal masking
I single quantization level over 8-24 ms window
‘Nightmare’ scenario:

Pre-echo distortion

I ‘backward masking’ saves in most cases


Adaptive switching of time window:
normal transition short
1
window level

0.8
0.6
0.4
0.2
0
0 20 40 60 80 100 time / ms

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 33 / 37
The effects of MP3
Before & after:
Josie - direct from CD
freq / kHz

20

15

10

0
After MP3 encode (160 kbps) and decode
freq / kHz

20

15

10

0
Residual (after aligning 1148 sample delay)
freq / kHz

20

15

10

0
0 2 4 6 8 10 time / sec

I chop off high frequency (above 16 kHz)


I occasional other time-frequency ‘holes’
I quantization noise under signal

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 34 / 37
MP3 & Beyond

MP3 is ‘transparent’ at ∼ 128 Kbps for stereo


(11x smaller than 1.4 Mbps CD rate)
I only decoder is standardized:
better psychological models → better encoders
MPEG2 AAC
I rebuild of MP3 without backwards compatibility
I 30% better (stereo at 96 Kbps?)
I multichannel etc.
MPEG4-Audio
I wide range of component encodings
I MPEG Audio, LSPs, ...
SAOL
I ‘synthetic’ component of MPEG-4 Audio
I complete DSP/computer music language!
I how to encode into it?

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 35 / 37
Summary

For coding, every bit counts


I take care over quantization domain & effects
I Shannon limits...
Speech coding
I LPC modeling is old but good
I CELP residual modeling can go beyond speech
Wide-band coding
I noise shaping ‘hides’ quantization noise
I detailed psychoacoustic models are key

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 36 / 37
References

Ben Gold and Nelson Morgan. Speech and Audio Signal Processing: Processing and
Perception of Speech and Music. Wiley, July 1999. ISBN 0471351547.
D. Pan. A tutorial on MPEG/audio compression. IEEE Multimedia, 2(2):60–74, 1995.
Karlheinz Brandenburg. MP3 and AAC explained. In Proc. AES 17th International
Conference on High Quality Audio Coding, pages 1–12, 1999.
T. Painter and A. Spanias. Perceptual coding of digital audio. Proc. IEEE, 80(4):
451–513, Apr. 2000.

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 37 / 37

You might also like