Professional Documents
Culture Documents
Ellis, Mandel - 2009 - Lecture 7 Audio Compression and Coding
Ellis, Mandel - 2009 - Lecture 7 Audio Compression and Coding
Lecture 7:
Audio compression and coding
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 1 / 37
Outline
2 Speech coding
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 2 / 37
Compression & Quantization
How big is audio data? What is the bitrate?
Fs frames/second (e.g. 8000 or 44100)
I
bits / frame
CD Audio 1.4 Mbps
32
8
Mobile 8000 44100
!13 Kbps frames / sec
Telephony 64 Kbps
How to reduce?
I lower sampling rate → less bandwidth (muffled)
I lower channel count → no stereo image
I lower sample size → quantization noise
Or: use data compression
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 3 / 37
Data compression:
Redundancy vs. Irrelevance
Two main principles in compression:
I remove redundant information
I remove irrelevant information
Redundant information is implicit in remainder
e.g. signal bandlimited to 20kHz,
I
time
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 4 / 37
Irrelevant data in audio coding
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 5 / 37
Quantization
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 6 / 37
Quantization noise (Q-noise)
Uncorrelated noise has flat spectrum
With a B bit word and a quantization step D
I max signal range (x) = −(2B−1 ) · D . . . (2B−1 − 1) · D
I quantization noise (e) = −D/2 . . . D/2
→ Best signal-to-noise ratio (power)
SNR = E [x 2 ]/E [e 2 ]
= (2B )2
-20
-40
-60
-80
0 1000 2000 3000 4000 5000 6000 7000 freq / Hz
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 7 / 37
Redundant information
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 8 / 37
Optimal coding
Shannon information:
An unlikely occurrence is more ‘informative’
p(A) = 0.5 p(B) = 0.5 p(A) = 0.9 p(B) = 0.1
ABBBBAAABBABBABBABB AAAAABBAAAAAABAAAAB
A, B equiprobable A is expected;
B is ‘big news’
Information in bits I = − log2 (probability )
I clearly works when all possibilities equiprobable
Optimal bitrate → av.token length = entropy H = E [I ]
I .. equal-length tokens are equally likely
How to achieve this?
I transform signal to have uniform pdf
I nonuniform quantization for equiprobable tokens
I variable-length tokens → Huffman coding
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 9 / 37
Quantization for optimum bitrate
Quantization should reflect pdf of signal:
p(x = x0) p(x < x0) x'
1.0
0.8
0.6
0.4
0.2
0
-0.02 -0.015 -0.01 -0.005 0 0.005 0.01 0.015 0.02 0.025 x
-1
-2
x2
-6 -4 -2 0 2 4
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 11 / 37
Compression & Representation
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 12 / 37
Aside: Coding standards
Coding only useful if recipient knows the code!
Standardization efforts are important
Federal Standards: Low bit-rate secure voice:
I FS1015e: LPC-10 2.4 Kbps
I FS1016: 4.8 Kbps CELP
ITU G.x series (also H.x for video)
I G.726 ADPCM
I G.729 Low delay CELP
MPEG
I MPEG-Audio layers 1,2,3 (mp3)
I MPEG 2 Advanced Audio Codec (AAC)
I MPEG 4 Synthetic-Natural Hybrid Codec
More recent ‘standards’
I proprietary: WMA, Skype...
I Speex ...
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 13 / 37
Outline
2 Speech coding
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 14 / 37
Speech coding
Standard voice channel:
I analog: 4 kHz slot (∼ 40 dB SNR)
I digital: 64 Kbps = 8 bit µ-law x 8 kHz
How to compress?
Redundant
I signal assumed to be a single voice,
not any possible waveform
Irrelevant
I need code only enough for intelligibility, speaker identification
(c/w analog channel)
Specifically, source-filter decomposition
I vocal tract & f0 change slowly
Applications:
I live communications
I offline storage
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 15 / 37
Channel Vocoder (1940s-1960s)
Basic source-filter decomposition
I filterbank breaks into spectral bands
I transmit slowly-changing energy in each band
Encoder Decoder
Bandpass Smoothed Downsample
filter 1 energy & encode E1
Bandpass
filter 1
Output
Input
Bandpass Smoothed Downsample EN
filter N energy & encode Bandpass
filter 1
Voicing V/UV
analysis Pulse Noise
Pitch generator source
Input Represent
f
s[n] & encode
LPC Output
analysis ^
e[n] ^
s[n]
Represent Excitation All-pole
Residual & encode generator filter
e[n]
H(z) = 1
t 1 - "aiz-i
Compression gains:
I filter parameters are ∼slowly changing
I excitation can be represented many ways
20 ms
Filter
parameters
Excitation/pitch
parameters
5 ms
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 17 / 37
Encoding LPC filter parameters
For ‘communications quality’:
I 8 kHz sampling (4 kHz bandwidth)
I ∼10th order LPC (up to 5 pole pairs)
I update every 20-30 ms → 300 - 500 param/s
Representation & quantization
ai
I {ai } - poor distribution,
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2
can’t interpolate
ki
I reflection coefficients {ki }:
guaranteed stable -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
fLi
I LSPs - lovely!
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 18 / 37
Line Spectral Pairs (LSPs)
LSPs encode LPC filter by a set of frequencies
Excellent for quantization & interpolation
Definition: zeros of
P(z) = A(z) + z −p−1 · A(z −1 )
Q(z) = A(z) − z −p−1 · A(z −1 )
Q(z) = 0
A(z) = 0
P(z) = 0
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 19 / 37
Encoding LPC excitation
Excitation already better than raw signal:
5000
Original signal
-5000
100
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 20 / 37
Encoding excitation
Something between full-quality residual (32 Kbps)
and pitch parameters (1.4 kbps)?
‘Analysis by synthesis’ loop:
Input LPC Filter coefficients
s[n] analysis A(z)
Excitation code
‘Perceptual’
MSError weighting
minimization
W(z)= A(z) ^
x[n] *ha[n] - s[n]
A(z/!)
Excitation encoding
10 A(z/!)
0 1/A(z/!)
-10
-20
0 500 1000 1500 2000 2500 3000 3500 freq / Hz
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 21 / 37
Multi-Pulse Excitation (MPE-LPC)
Stylize excitation as N discrete pulses
15
10 original LPC
5 residual
0
-5
mi multi-pulse
5
excitation
0
-5
0 20 40 60 80 100 120 time / samps
ti
Search
index 008 009 010 011 excitation
0 60 time / samples
Original (mrZebra-8k)
4
freq / kHz
3
voices etc. 3
best it can 4
4.8 kbps CELP
0
0 1 2 3 4 5 6 7 8 time / sec
2 Speech coding
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 25 / 37
Wide-Bandwidth Audio Coding
Goals:
I transparent coding i.e. no perceptible effect
I general purpose - handles any signal
Simple approaches (redundancy removal)
I Adaptive Differential PCM (ADPCM)
X[n] + D[n] Adaptive C[n]
+ quantizer
– C[n] = Q[D[n]]
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 26 / 37
Noise shaping
Plain Q-noise sounds like added white noise
I actually, not all that disturbing
I .. but worst-case for exploiting masking
1
mu-law
quantization
Have Q-noise scale with 0.8
signal level
0.6
mu(x)
I i.e. quantizer step gets
larger with amplitude 0.4
0
Signal
Linear Q-noise
-20
-40
-60
0 Transform Q-noise 1500 2000 2500 3000 3500 freq / Hz
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 27 / 37
Subband coding
Idea: Quantize separately in separate bands
Encoder Decoder
Bandpass Downsample Quantize
Q[•] Q-1[•] M f
f M
Input Analysis Reconstruction Output
(M channels) +
filters filters
M Q[•] Q-1[•] M
f
-10
alias energy
-20
-30
-40
-50
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 normalized freq 1
32 chans
x 36 samples
Psychoacoustic
masking Per-band
model masking margins
24 ms
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 29 / 37
MPEG Psychoacoustic model
Based on simultaneous masking experiments
Difficulties:
I noise energy masks ∼10 dB better than tones
I masking level nonlinear in frequency & intensity
I complex, dynamic sounds not well understood
60 Tonal
50 peaks
Procedure Non-tonal
SPL / dB
40 energy
30
20
Signal
spectrum
I pick ‘tonal peaks’ in NB FFT 10
0
-10
spectrum 1 3 5 7 9 11 13 15 17 19 21 23 25
60
50
Masking
I remaining energy → ‘noisy’ spread
SPL / dB
40
30
peaks 20
10
0
-10
I apply nonlinear ‘spreading 1 3 5 7 9 11 13 15 17 19 21 23 25
function’ 60
50
Resultant
masking
SPL / dB
40
30
I sum all masking & threshold in 20
10
0
power domain -10
1 3 5 7 9 11 13 15 17 19 21 23 25
freq / Bark
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 30 / 37
MPEG Bit allocation
Result of psychoacoustic model is
maximum tolerable noise per subband
Masking tone
level
SNR ~ 6·B
Masked
threshold
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 31 / 37
MPEG Audio Layer III
‘Transform coder’ on top of ‘subband coder’
Digital Audio Coded
Signal (PCM) Subband Line
Distortion Audio Signal
(768 kbit/s) 31 575
Control Loop 192 kbit/s
Filterbank
Huffman
MDCT
Bitstream Formatting
32 Subbands Nonuniform Encoding 32 kbit/s
Quantization
CRC-Check
0 0
Rate
Control Loop
Window
Switching
Coding of
Side-
information
Psycho-
FFT acoustic
1024 Points
Model
External Control
Pre-echo distortion
0.8
0.6
0.4
0.2
0
0 20 40 60 80 100 time / ms
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 33 / 37
The effects of MP3
Before & after:
Josie - direct from CD
freq / kHz
20
15
10
0
After MP3 encode (160 kbps) and decode
freq / kHz
20
15
10
0
Residual (after aligning 1148 sample delay)
freq / kHz
20
15
10
0
0 2 4 6 8 10 time / sec
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 34 / 37
MP3 & Beyond
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 35 / 37
Summary
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 36 / 37
References
Ben Gold and Nelson Morgan. Speech and Audio Signal Processing: Processing and
Perception of Speech and Music. Wiley, July 1999. ISBN 0471351547.
D. Pan. A tutorial on MPEG/audio compression. IEEE Multimedia, 2(2):60–74, 1995.
Karlheinz Brandenburg. MP3 and AAC explained. In Proc. AES 17th International
Conference on High Quality Audio Coding, pages 1–12, 1999.
T. Painter and A. Spanias. Perceptual coding of digital audio. Proc. IEEE, 80(4):
451–513, Apr. 2000.
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 37 / 37