IC23 Unit 06 Script
Recently on Image Compression ...
What did we learn in the last learning unit?
Dictionary coding: Assign codes to large words.
Open Questions:
Can we combine the benefits of adaptive and higher-order coding?
Outline
Learning Unit 06:
PPM and PAQ
Contents
1. Prediction by Partial Matching
2. PAQ
3. Comparison of Compression Algorithms
© 2023 Christian Schmaltz, Pascal Peter
Shannon’s Prediction Experiments
Goal: predicting English sentences with human predictors (his wife Mary)
• emitter and receiver both have someone who can make the same guesses
• experimental limits for coding efficiency based on predictions
We will repeat some of their experiments in the live meetings.
Result: Humans are surprisingly good at predictions!
• Coding costs between 0.6 and 1.3 bits per letter on average.
• They can use grammar, semantic context, idioms, ...
• Not easily reproducible with a computer.
Towards a more General Prediction
Idea: We have already defined higher order coding!
• lower entropy due to conditional probabilities for already encoded symbols
Problems: |S|^k contexts of order k (≈ 1.1 · 10^12, e.g. for |S| = 256 and k = 5), huge overhead.
Idea: If the next symbol was not seen before in the current context, reduce the order.
Problem: The next symbol might not have occurred in the current context before.
Idea: Replace the NYT symbol by an escape symbol indicating that we should consider a smaller context.
For now, we fix the counter of the escape symbol to 1. (This will change later!)
Problem: A symbol that was never seen cannot be encoded at all.
Idea: Add a −1-th order context in which each symbol occurs with equal probability.
→ Combining all these ideas results in the prediction by partial matching (PPM) coding scheme.
Prediction by Partial Matching
Algorithm: k-th order PPM
S_C contains all symbols that have occurred in context C, plus the escape symbol ESC.
S_{−1} contains the symbols occurring in the −1-th order context.
For the next symbol X of the source word:
1. Set ℓ ← k.
2. If ℓ = −1, set P(X) = 1/|S_{−1}| and go to 4.
3. Check if X has been seen before in the ℓ-th order context C:
   • YES: Set P(X) = c_{X,C} / Σ_{Y∈S_C} c_{Y,C}.
   • NO: Encode ESC with P(ESC) = c_{ESC,C} / Σ_{Y∈S_C} c_{Y,C}, set ℓ ← ℓ − 1 and go to 2.
4. Encode X with P(X), update the counters for all relevant contexts.
5. Encode the next symbol (or terminate if none are left).
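To make the procedure concrete, here is a minimal Python sketch of the probability assignment (PPMa variant with the escape counter fixed to 1, including the exclusion principle discussed on the following slides). The function name, the Fraction-based interface and the plain string contexts are illustrative assumptions; a real implementation would feed these probabilities into an arithmetic coder.

```python
from collections import defaultdict
from fractions import Fraction

def ppm_steps(message, alphabet, k):
    """Return, for every symbol of `message`, the list of (code, probability)
    pairs that an arithmetic coder would have to encode."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[context][symbol]
    all_steps = []
    for i, x in enumerate(message):
        steps = []
        excluded = set()           # exclusion principle: symbols ruled out by escapes
        order = min(k, i)          # a context cannot be longer than the encoded prefix
        while True:
            if order == -1:
                # -1-th order context: uniform over the remaining symbols
                candidates = [s for s in alphabet if s not in excluded]
                steps.append((x, Fraction(1, len(candidates))))
                break
            ctx = message[i - order:i]
            if ctx not in counts:              # context never seen: nothing to encode
                order -= 1
                continue
            usable = {s: c for s, c in counts[ctx].items() if s not in excluded}
            if not usable:                     # only excluded predictions: escape is certain
                order -= 1
                continue
            total = sum(usable.values()) + 1   # +1 is the fixed ESC counter (PPMa)
            if x in usable:
                steps.append((x, Fraction(usable[x], total)))
                break
            steps.append(("ESC", Fraction(1, total)))
            excluded.update(usable)            # exclude the predictions just ruled out
            order -= 1
        for o in range(min(k, i) + 1):         # update counters of all relevant contexts
            counts[message[i - o:i]][x] += 1
        all_steps.append(steps)
    return all_steps

# example from the following slides: encoding B after BANANA with 5th order PPM
print(ppm_steps("BANANAB", alphabet="ABNOT", k=5)[-1])
# [('ESC', Fraction(1, 2)), ('B', Fraction(1, 5))]
```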
Prediction by Partial Matching
Example – 5th order PPM
Input word: BANANABOAT. Already encoded: BANAN. Next symbol: A
The order 5 context BANAN has not been seen previously, so nothing needs to be encoded.
The order 4 context ANAN has not been seen previously.
The order 3 context NAN has not been seen previously.
The order 2 context AN was seen once, followed by A. Thus, the counter for the symbol A (and for the escape symbol) is 1, and all other counters are zero. The symbol “A” can therefore be encoded with P(A) = 1/2 (using arithmetic coding).
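For comparison, the ppm_steps sketch from the algorithm slide reproduces exactly this step:

```python
print(ppm_steps("BANANA", alphabet="ABNOT", k=5)[-1])   # [('A', Fraction(1, 2))]
```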
Prediction by Partial Matching
Example – 5th order PPM
Input word: BANANABOAT. Already encoded: BANANA. Next symbol: B
The order 5 context ANANA has not been seen previously.
The order 4 context NANA has not been seen previously.
The order 3 context ANA has been seen once before, followed by N. The symbol we want to encode is different, so we encode an “escape” symbol with P(ESC) = 1/2.
The order 2 context NA was seen once, followed by N. The symbol we want to encode is different, so we encode an “escape” symbol with P(ESC) = 1/2.
The order 1 context is A. This was seen twice, in both cases followed by N. The symbol we want to encode is different, so we encode an “escape” symbol with P(ESC) = 1/3.
The order 0 context was seen 6 times with values B (once), A (3 times) and N (twice). Thus, the symbol B can be encoded with P(B) = 1/7.
Input word: BANANABOAT. Already encoded: BANANAB. Next symbol: O
The order 5-2 contexts have not been seen previously.
The order 1 context is B. This was seen once, followed by an “A”. The symbol we want to encode is different, so we encode an “escape” symbol with P(ESC) = 1/2.
The order 0 context was seen 7 times with values B (twice), A (3 times) and N (twice). The symbol we want to encode is different, so we encode an “escape” symbol with P(ESC) = 1/8.
In the order −1 context, as all symbols are equally likely, there is no escape symbol. The symbol O can be encoded with P(O) = 1/5.
• Input word: BANANABOAT. Already encoded: BANANA. Next symbol: B
• The order 3 context ANA has been seen once before, followed by N. The symbol we want to encode is different, so we encode an “escape” symbol with P(ESC) = 1/2.
• The order 2 context NA was seen once, followed by N. The symbol we want to encode is different, so we encode an “escape” symbol with P(ESC) = 1/2.
By sending the first escape symbol, the decoder already knows that the next symbol is not an N. Thus, N no longer needs to be considered in the lower-order contexts.
This exclusion principle is used by PPM to improve the compression ratios.
Input word: BANANABOAT. Already encoded: BANANA. Next symbol: B (this time with the exclusion principle)
The order 5-4 contexts have not been seen previously.
The order 3 context ANA has been seen once before, followed by N. The symbol we want to encode is different, so we encode an “escape” symbol with P(ESC) = 1/2.
The order 2 context NA was seen once, followed by N. Since N no longer needs to be considered, we exclude it. Since there are no other predictions, there is nothing to encode.
The order 1 context is A. This was seen twice, in both cases followed by N, which is excluded. Again, there is nothing to encode.
The order 0 context was seen 6 times with values B (once), A (3 times) and N (twice). Excluding N, the symbol B can be encoded with P(B) = 1/5.
Input word: BANANABOAT. Already encoded: BANANAB. Next symbol: O (with the exclusion principle)
The order 5-2 contexts have not been seen previously.
The order 1 context is B. This was seen once, followed by an A. The symbol we want to encode is different, so we encode an “escape” symbol with P(ESC) = 1/2. A is now excluded.
The order 0 context was seen 7 times with values B (twice), A (3 times) and N (twice). Excluding A, the symbol we want to encode is still different, so we encode an “escape” symbol with P(ESC) = 1/5. B and N are now excluded as well.
In the order −1 context, as symbols are equally likely, there is no escape symbol. Since A, B, and N are excluded, we can encode the symbol O with P(O) = 1/2, assuming S = {A, B, N, O, T} is known.
Total probability of symbol B after encoding BANANA:
• with exclusion principle: 1/2 · 1/5 = 1/10
• without exclusion principle: 1/2 · 1/2 · 1/3 · 1/7 = 1/84
⇒ log2(84/10) ≈ 3.07 bits saved through the exclusion principle.
Total probability of symbol O after encoding BANANAB:
• with exclusion principle: 1/2 · 1/5 · 1/2 = 1/20
• without exclusion principle: 1/2 · 1/8 · 1/5 = 1/80
⇒ log2(80/20) = 2 bits saved through the exclusion principle.
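These savings can also be checked with a few lines of Python, using the step probabilities derived above:

```python
import math
from fractions import Fraction

# coding steps for B after BANANA
with_exclusion = Fraction(1, 2) * Fraction(1, 5)                                        # 1/10
without_exclusion = Fraction(1, 2) * Fraction(1, 2) * Fraction(1, 3) * Fraction(1, 7)   # 1/84
print(math.log2(with_exclusion / without_exclusion))    # ≈ 3.07 bits saved

# coding steps for O after BANANAB
print(math.log2((Fraction(1, 2) * Fraction(1, 5) * Fraction(1, 2)) /
                (Fraction(1, 2) * Fraction(1, 8) * Fraction(1, 5))))   # 2.0 bits saved
```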
Prediction by Partial Matching
Estimating the Probability of the Escape Symbol
So far, we used the PPMa algorithm, which fixes the ESC counter to 1.
However, many different approaches have been proposed.
Remark: In the literature, the variants might be named differently.
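As an illustration of such an alternative, the sketch below assumes the convention often called PPMC, in which the escape count equals the number of distinct symbols seen in the context; the function name and interface are illustrative, not the variant definition from the slides:

```python
from fractions import Fraction

def ppmc_probability(ctx_counts, symbol):
    """PPMC-style estimate (assumed convention: escape count = number of
    distinct symbols seen in the context, instead of the fixed 1 of PPMa)."""
    distinct = len(ctx_counts)                  # escape counter
    total = sum(ctx_counts.values()) + distinct
    if symbol in ctx_counts:
        return Fraction(ctx_counts[symbol], total)
    return Fraction(distinct, total)            # probability of the escape symbol

# order-0 counts after BANANA: with PPMC, P(ESC) = 3/9 = 1/3 instead of 1/7 with PPMa
print(ppmc_probability({"B": 1, "A": 3, "N": 2}, "ESC"))
```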
In general, larger contexts allow better predictions. However, this only holds if the context was seen sufficiently often.
Thus, choosing a context that is too large will result in many escape symbols.
[Figure: Bits required per symbol depending on the maximal context length. Source: J. Cleary and W. Teahan: Unbounded Length Contexts for PPM.]
Prediction by Partial Matching
PPM*
A context is called deterministic if it was always followed by the same symbol and seen at least once, i.e. if there is exactly one prediction.
PPM* algorithm:
• Look for the shortest matching deterministic context and try to encode the next symbol using this context.
• If that did not work, continue with a “normal” PPM algorithm, i.e. with a “standard” maximal context length.
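A minimal sketch of the first step, assuming a successors dictionary (context → set of symbols seen after it) is maintained while encoding; the names and the toy data are illustrative:

```python
def shortest_deterministic_context(history, successors):
    """Return (context, predicted symbol) for the shortest suffix of `history`
    that is a deterministic context, or None if no such context exists."""
    for length in range(1, len(history) + 1):
        ctx = history[len(history) - length:]
        seen = successors.get(ctx)
        if seen is not None and len(seen) == 1:   # seen before, exactly one prediction
            return ctx, next(iter(seen))
    return None

# toy usage with a few of the successor sets after encoding BANANA:
# the shortest deterministic suffix is "A", which was always followed by N
successors = {"B": {"A"}, "A": {"N"}, "N": {"A"}, "NA": {"N"}, "AN": {"A"}}
print(shortest_deterministic_context("BANANA", successors))   # ('A', 'N')
```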
Remarks
PPM algorithms are quite good, but also slow and memory consuming.
Often, PPM algorithms are used to estimate the next bit (instead of the complete next symbol). Then, the contexts are the last n bits.
PPM algorithms combining SEE (secondary escape estimation) and deterministic contexts (such as PPMz) achieve particularly good compression ratios.
Outline
Learning Unit 06:
PPM and PAQ
Contents
1. Prediction by Partial Matching
2. PAQ
3. Comparison of Compression Algorithms
© 2023 Christian Schmaltz, Pascal Peter
PAQ
Matt Mahoney, an IBM researcher, started examining the relations between machine learning and compression in 2002.
The result was the open source project PAQ.
PAQ is a highly evolved version of PPM with two core differences:
• predictions are made for single bits instead of complete symbols
• predictions from many different contexts are combined (context mixing)
To this day, PAQ is one of the top performers in the Hutter challenge.
PAQ
The Hutter Challenge
Prize money for a new record: Z · (L − S) / L
(S: new record, L: old record, Z: prize fund)
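As a worked example with purely hypothetical numbers: for a prize fund Z = 50,000 €, an old record of L = 16,000,000 bytes and a new record of S = 15,200,000 bytes, the payout would be Z · (L − S)/L = 50,000 · 800,000/16,000,000 = 2,500 €.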
Test set: 100 MB from Wikipedia
Rules
• memory limit: 1GB RAM, 10GB HDD during compression
• time limit: ≈ 8 hours on test machine, no GPU, single core
• standalone program with no additional input
You can still take part, but competition is fierce!
Context Mixing
The most important feature of PAQ is context mixing.
Motivation:
For the PPM algorithms, we always use consecutive previous symbols as contexts.
In some situations, other contexts are better:
• When storing an image, a good context is given by the neighbouring pixels that are already stored. These are not always the last encoded symbols.
• When storing the channels of an RGB image as R1G1B1 R2G2B2 . . ., we might want to use only values from the same colour channel as context.
• For text files, one might consider case-insensitive words as contexts.
Context Mixing
Idea: Allow non-consecutive contexts as arbitrary functions of the known data.
Problem: Different contexts can give different, possibly contradicting predictions.
Idea: Get a prediction from each context, and combine them into a single prediction.
→ The resulting encoding schemes are called context mixing algorithms.
Context Mixing
How to Combine Predictions?
Assume P(y = 0|A) and P(y = 0|B) are known. What is P(y = 0|A ∩ B)?
Problem: Mathematics alone does not answer this question.
Moreover, we might trust P(y = 0|A) more than P(y = 0|B), which should be taken into account.
Idea: Make an educated guess.
Context Mixing
Context Mixing by Weighted Averaging
Use variables n_{0,i} and n_{1,i} to count how often a 0 and a 1 have been encountered in the i-th context.
Furthermore, since some contexts are more useful than others, each context is assigned a weight w_i.
Then, the estimated probability p_0 that the next symbol is a zero is given by:
p_0 = S_0 / (S_0 + S_1),   S_j = ε + Σ_i w_i · n_{j,i},   j ∈ {0, 1},
where the sum S_j accumulates the weighted counts over all currently occurring contexts, and ε is a small number to prevent divisions by zero.
• S_0 is the weighted evidence for 0.
• S_1 is the weighted evidence for 1.
• S_0 + S_1 is the total evidence.
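A compact sketch of this estimate (the function name and the list-based interface are illustrative assumptions; PAQ itself works on scaled integer counts):

```python
def mix_predictions(weights, n0, n1, eps=1e-6):
    """Estimate P(next bit = 0) by weighted averaging of per-context counts."""
    s0 = eps + sum(w * c for w, c in zip(weights, n0))   # weighted evidence for 0
    s1 = eps + sum(w * c for w, c in zip(weights, n1))   # weighted evidence for 1
    return s0 / (s0 + s1)
```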
Context Mixing
Example
We consider two contexts: Context A was seen 9 times (3 times followed by a 0, 6 times by a 1), while context B was seen 3 times (always followed by a 0). Furthermore, we assume that w_1 = 1/3 and w_2 = 2/3.
Then, we estimate the probability that a 0 follows as:
p_0 = (ε + 1/3 · 3 + 2/3 · 3) / ((ε + 1/3 · 3 + 2/3 · 3) + (ε + 1/3 · 6 + 2/3 · 0))
    = (ε + 1/3 · 3 + 2/3 · 3) / (2ε + 1/3 · 9 + 2/3 · 3)
    = (ε + 3) / (2ε + 5) ≈ 3/5 = 60%
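Plugging the numbers of this example into the mix_predictions sketch from above reproduces the result:

```python
# context A: 3 zeros and 6 ones with weight 1/3; context B: 3 zeros and 0 ones with weight 2/3
p0 = mix_predictions(weights=[1/3, 2/3], n0=[3, 3], n1=[6, 0])
print(round(p0, 3))   # 0.6, i.e. the 60% computed above
```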
PAQ
Context Mixing in PAQ
Up to version 3, the PAQ compression algorithm uses a context mixing algorithm as described before to estimate the next bit.
Problem: The importance of different contexts might vary in different parts of a file, but the weights w_i are static.
Solution: In PAQ4 to 6, the weights are adapted dynamically during the coding process.
• The weights are updated by moving along the gradient of the coding cost in weight space:
  w_i ← max( 0, w_i + (x − p_1) · ((S_0 + S_1) · n_{1,i} − S_1 · n_i) / (S_0 · S_1) )
  where n_i = n_{0,i} + n_{1,i}, x is the bit that was actually observed, and p_1 = S_1 / (S_0 + S_1).
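A sketch of this adaptive update, using the same illustrative interface as the mixing sketch above (real PAQ uses fixed-point arithmetic):

```python
def update_weights(weights, n0, n1, x, eps=1e-6):
    """Adapt the context weights after the bit x (0 or 1) has been observed."""
    s0 = eps + sum(w * c for w, c in zip(weights, n0))
    s1 = eps + sum(w * c for w, c in zip(weights, n1))
    p1 = s1 / (s0 + s1)                       # predicted probability of a one
    for i in range(len(weights)):
        n_i = n0[i] + n1[i]
        grad = ((s0 + s1) * n1[i] - s1 * n_i) / (s0 * s1)
        weights[i] = max(0.0, weights[i] + (x - p1) * grad)
    return weights
```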
PAQ
PAQ 4-6
Additionally, whenever a bit is observed and the count for the opposite bit is more than 2, the excess is halved (to adapt to changing probabilities).
Example
If n_0 = 0, n_1 = 10 and several zeros are observed next, the following counters are used:
• After 1 zero:  n_0 = 1, n_1 = 10 − (10 − 2)/2 = 6
• After 2 zeros: n_0 = 2, n_1 = 6 − (6 − 2)/2 = 4
• After 3 zeros: n_0 = 3, n_1 = 4 − (4 − 2)/2 = 3
• After 4 zeros: n_0 = 4, n_1 = 3 − (3 − 2)/2 = 2 (rounded down)
• After 5 zeros: n_0 = 5, n_1 = 2
• After 6 zeros: n_0 = 6, n_1 = 2
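A small sketch of this counter update; the exact integer rounding is an assumption chosen such that it reproduces the table above:

```python
def observe_bit(n0, n1, bit):
    """Update the (n0, n1) counters of one context after observing `bit`."""
    if bit == 0:
        n0 += 1
        if n1 > 2:                    # halve the excess of the opposite counter
            n1 = 2 + (n1 - 2) // 2
    else:
        n1 += 1
        if n0 > 2:
            n0 = 2 + (n0 - 2) // 2
    return n0, n1

# reproduces the example: n1 goes 10 -> 6 -> 4 -> 3 -> 2 -> 2
n0, n1 = 0, 10
for _ in range(6):
    n0, n1 = observe_bit(n0, n1, 0)
    print(n0, n1)
```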
PAQ
PAQ - Version 7
Starting from version 7, neural networks are responsible for estimating weights.
Basic idea of simple neural networks with backpropagation: inputs are combined with learned weights to produce an output, and the weights are adjusted according to the error of that output.
The output is the prediction, which we can check against the actual next symbol.
As in previous versions of PAQ, we want to minimise the coding cost, which is done by adjusting the weights with gradient descent.
• We define the operations stretch and squash as:
  stretch(p) = ln( p / (1 − p) ),   squash(x) = stretch^{−1}(x) = 1 / (1 + e^{−x})
• The network inputs are the stretched probabilities t_i := stretch(p_{1,i}).
• The output probability is computed according to
  p_1 = squash( Σ_i w_i · t_i )
The gradient descent step with learning rate η and next bit x becomes
  w_i ← w_i + η · t_i · (x − p_1)
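A sketch of this logistic mixing step; the function names, the learning rate value and the toy inputs are illustrative assumptions:

```python
import math

def stretch(p):
    return math.log(p / (1.0 - p))

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

def mix_and_learn(probs, weights, x, eta=0.02):
    """probs: per-model estimates of P(bit = 1); x: the bit actually observed.
    Returns the mixed prediction and adapts the weights in place."""
    t = [stretch(p) for p in probs]                        # stretched inputs t_i
    p1 = squash(sum(w * ti for w, ti in zip(weights, t)))
    for i, ti in enumerate(t):                             # w_i += eta * t_i * (x - p1)
        weights[i] += eta * ti * (x - p1)
    return p1

# toy usage with two hypothetical model predictions
weights = [0.0, 0.0]
print(mix_and_learn([0.9, 0.3], weights, x=1), weights)
```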
PAQ
Types of Contexts in PAQ
k-th order contexts as in PPM: Consider the last k bytes.
sparse contexts: Do not consider continuous sequences of previous bytes, but only a selection of them.
word contexts for text: Consider whole words and convert them to lowercase.
2D-contexts for images and tables: Search for repeating byte patterns to convert the 1-D sequence into multiple rows. Also consider colour channels if available.
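A few such context functions as a small sketch; which bytes are selected and the assumed row width are illustrative choices:

```python
def order_k_context(history, k):
    """k-th order context: simply the last k bytes, as in PPM."""
    return tuple(history[-k:])

def sparse_context(history):
    """A sparse context: e.g. only the 2nd and 4th most recent bytes."""
    return (history[-2] if len(history) >= 2 else None,
            history[-4] if len(history) >= 4 else None)

def image_context(history, row_width):
    """2D context: the byte to the left and the byte one row above."""
    left = history[-1] if history else None
    above = history[-row_width] if len(history) >= row_width else None
    return (left, above)
```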
Specialised models exist for specific file formats, e.g.:
• x86 executables
• BMP, TIFF, or JPEG images
• WAV audio files
• ...
and many more.
PAQ
Dirty Tricks in PAQ
PAQ does not just use one neural network. Lower-order contexts determine which neural network out of hundreds should be used.
For the Hutter challenge (text-based):
• Dictionary-based preprocessing
• Replace common words with codes.
JPEG:
• e.g. undo the Huffman coding and model the quantised DCT coefficients directly
x86 executables:
• convert relative to absolute memory addresses
PAQ
Advantages of PAQ
(Currently) best compression ratios
Believed to be patent-free
Disadvantages of PAQ
Very slow
Outline
Learning Unit 06:
PPM and PAQ
Contents
1. Prediction by Partial Matching
2. PAQ
3. Comparison of Compression Algorithms
© 2023 Christian Schmaltz, Pascal Peter
Test files of many different types: images, log files, HTML files, MS Word files, source code, databases, Windows help files, precompressed chess databases, . . . (46 in total)
Allowed memory consumption: 800 MB
Comparison - Lossless Entropy Coding
The Competition
WinRK: commercial, discontinued, PPMd + PPMz + context modeling
NanoZip: free, discontinued, LZ + BWT + context modeling
GZIP: free, Deflate (LZ77 + Huffman)
ARJ32: commercial, now free under GPL, modified LZ77
Test                 Original size (bytes)   PAQ8PX size (bytes)   WinRK 3.1.2 size (bytes)
logfile                         20.617.071               257.193                   271.628
English text                     2.988.578               352.722                   330.571
sorted wordlist                  4.067.439               386.032                   393.704
OCX help file                    4.121.418               400.112                   415.522
MS-Word doc file                 4.168.192               482.864                   688.237
bitmap                           4.149.414               539.003                   569.053
jpg/jpeg                           842.468               637.124                   812.700
executable                       3.870.784               909.161                   896.365
dll (executable)                 3.782.416             1.292.869                 1.236.643
pdf                              4.526.946             3.556.044                 3.549.197
Comparison of the two best compression programs on different file types. Source: http://www.maximumcompression.com/data/summary sf.php (from 11.12.2012)
Summary
PPM combines the best of higher-order and adaptive coding.
PAQ goes even further:
• more complex contexts than PPM
• combines many contexts with context mixing
• drawback: very slow and complex
Outlook
So far, we have only looked at generic input data.
How to address the specific task of image compression?
References
M. Mahoney, Data Compression Explained, 2012. Available at http://mattmahoney.net/dc/dce.html (Explains PPM, context mixing, and PAQ)
K. Sayood, Introduction to Data Compression. Morgan Kaufmann, 2006. (Explains PPM)
J. Cleary and I. Witten, Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications, 1984. (Introduced PPM)
J. Cleary and W. Teahan, Unbounded length contexts for PPM. The Computer Journal, 1997. (Introduced PPM*)