Unit9 - Huffman Code With Exercises Reported Speech
This will be a somewhat long unit for those who have never worked with
Huffman codes. While static Huffman and Hamming Codes will be taught, this
unit will also be a good reference for those who wish to study more on their own.
The two most common codes for representing characters on computers are ASCII
and EBCDIC. ASCII is a 7-bit pattern used to represent the characters in a
computer. It is the most common code. EBCDIC is an 8-bit pattern that is
used primarily by IBM.
Research has shown that the majority of the data in databases and
messages is spaces. In addition, certain letters, such as ‘a’ and ‘t’, occur
more often than others, such as ‘q’ and ‘x’. Storing data with so many
repeated characters is extremely inefficient. Programs for compressing the
data before it is stored or transferred (called zipping) and decompressing it
afterwards (called unzipping) were developed using various techniques. The
most common is called Huffman Code, developed by David Huffman in 1951 while
he was a graduate student at MIT, and published in 1952.
Huffman Code
Huffman Code is a character frequency code. That is, it is based upon how
often a particular character occurs in a selected text. If there is a wide
variance in the frequency of characters, compression can reach up to 75%. If
not, there will be very little compression. If you have ever “zipped” a file,
especially a JPEG, and noticed that there is little difference between the pre-
and post-zip sizes of the file, that is because a JPEG is already compressed,
so its byte patterns all occur about equally often and leave a frequency code
little to work with.
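To see what “character frequency” means in practice, the short Python sketch below counts how often each character occurs in a text and expresses it as a percentage. The function name char_frequencies and the sample sentence are illustrative assumptions, not part of the unit.

```python
from collections import Counter

def char_frequencies(text):
    """Return each character's share of the text, in percent, most common first."""
    counts = Counter(text)
    total = len(text)
    return {ch: 100 * n / total for ch, n in counts.most_common()}

# Hypothetical sample text, chosen only for illustration.
sample = "A CAT EATS AT A SEAT"
for ch, pct in char_frequencies(sample).items():
    print(repr(ch), f"{pct:.1f}")
```

In this sample, ‘A’ and the space together make up more than half of the text, which is the kind of uneven distribution a frequency code can exploit.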
Unit 9 – Huffman Code/Climbing the Tree
Fine, you say, but what is a “sub-tree”? Let’s look at the following Huffman
Tree that has no frequency values (leaves up, root down):
A B C D
The two branches containing ‘A’ and ‘B’ form a “sub-tree”, as do the branches
containing ‘C’ and ‘D’. In addition, the branches containing ‘C’, ‘D’, and ‘E’
form a “sub-tree”.
Character Frequency (%)
STX 1.1
ETX 1.2
C 7.4
I 8.3
S 10.6
E 10.7
T 12.9
A 22.5
Space 25.3
Looking at my list, the two lowest values are STX and ETX. Following the
rules, you get the following:
[Figure: ETX (left, branch 0) and STX (right, branch 1) joined in a sub-tree; new value to use: 2.3]
We take the two lowest values, which are the sub-tree with 2.3 and C with 7.4,
combine them in a sub-tree (high value left, low value right; high value 0, low
value 1), and add the frequencies for a new value to use:
[Figure: C (left, branch 0) and the 2.3 sub-tree of ETX and STX (right, branch 1) joined in a sub-tree; new value to use: 9.7]
We take the two lowest values, the sub-tree with 9.7 and I with 8.3, and join
them in a sub-tree (high value left, low value right; high value 0, low value
1), adding the frequencies for a new value to use:
[Figure: the 9.7 sub-tree (left, branch 0) and I (right, branch 1) joined in a sub-tree; new value to use: 18.0]
When we get ready to use our next two lowest values, we find that they are S
with 10.6 and E with 10.7, neither of which is part of the tree built so far.
However, all we do is follow our rules. We join them in a new sub-tree (high
value left, low value right; high value 0, low value 1) and add the
frequencies for a new value to use:
[Figure: the 18.0 sub-tree as before, plus a separate sub-tree joining E (left, branch 0) and S (right, branch 1); new value to use: 21.3]
Continue to build the tree and then look at the Huffman Solutions in this
Module.
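The joining steps worked by hand above can be sketched in Python as follows. This is a minimal sketch rather than the unit's official solution: the names FREQ and build_tree are illustrative, the frequencies are stored in tenths of a percent so that the sums stay exact, and ties are not handled because none occur in this table.

```python
# Frequencies from the unit's table, in tenths of a percent (1.1 -> 11)
# so that every sum is an exact integer.
FREQ = {"STX": 11, "ETX": 12, "C": 74, "I": 83, "S": 106,
        "E": 107, "T": 129, "A": 225, "Space": 253}

def build_tree(freq):
    """Repeatedly join the two lowest values until a single tree remains.

    A node is a character name or a (zero_branch, one_branch) pair.
    Returns the root node and the new values in the order they were created.
    """
    items = [(value, name) for name, value in freq.items()]
    new_values = []
    while len(items) > 1:
        items.sort(key=lambda item: item[0])   # lowest values first
        low, high = items[0], items[1]
        total = low[0] + high[0]
        # High value goes left on branch 0, low value goes right on branch 1.
        items = items[2:] + [(total, (high[1], low[1]))]
        new_values.append(total / 10)
    return items[0][1], new_values

root, new_values = build_tree(FREQ)
print(new_values)
```

The first four values printed are 2.3, 9.7, 18.0 and 21.3, matching the steps shown above; the last value is 100.0, the root of the finished tree.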
A problem occurs when the two lowest values are equal. To handle these ties,
we use the following conventions:
• If the two lowest values are both in the character set (they haven’t been
used yet), then we compare their ASCII values and join them in a sub-tree
as with the regular rules.
• If the two lowest values are a sub-tree and a character, the character is
always considered the HIGHER of the two values, so when the new sub-tree
is formed, the character branch is assigned the 0 and the original
sub-tree branch the 1.
• If the two lowest values are both sub-trees, you go up the zero branches
until you reach a character in each sub-tree. You compare the ASCII
values of those characters and join the sub-trees as with the regular rules.
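The three tie conventions can be sketched as a small Python helper. This is only an illustration under stated assumptions: the names first_char and order_tie are hypothetical, characters are single-character strings, sub-trees are (zero_branch, one_branch) pairs, and the character with the larger ASCII value is assumed to count as the high value, which the unit does not spell out.

```python
def first_char(node):
    """Follow the 0 branches of a sub-tree until a character is reached."""
    while isinstance(node, tuple):
        node = node[0]              # index 0 holds the branch labelled 0
    return node

def order_tie(a, b):
    """Return (high, low) for two nodes whose frequency values are equal."""
    a_is_char, b_is_char = isinstance(a, str), isinstance(b, str)
    if a_is_char != b_is_char:
        # One character and one sub-tree: the character is always the high value.
        return (a, b) if a_is_char else (b, a)
    # Two characters, or two sub-trees: compare the ASCII values of the
    # characters themselves, or of the characters reached via the 0 branches.
    # ASSUMPTION: the larger ASCII value is treated as the high value.
    if ord(first_char(a)) >= ord(first_char(b)):
        return (a, b)
    return (b, a)

print(order_tie("a", "t"))            # ('t', 'a')
print(order_tie(("x", "y"), "b"))     # ('b', ('x', 'y'))
```

The pair returned is already in joining order: the first element takes branch 0 and the second takes branch 1.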
Once we have the tree built, we are ready to assign Huffman patterns to each
of the characters. To assign patterns (as well as read existing patterns), you
ALWAYS start at the ROOT and read the 0 or 1 off the branch until you reach
a leaf, which is a character.
STX 000011
ETX 000010
Make sure that you see how those codes are made. Once you have done so,
make the codes for the other characters.
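Reading patterns off the tree can be sketched in Python as a walk from the root to every leaf. The nested TREE below is an assumption: it is the tree obtained by applying the unit's joining rules to the frequency table, written as (zero_branch, one_branch) pairs, and the name assign_codes is illustrative.

```python
# ASSUMPTION: the finished tree, written as nested (zero_branch, one_branch)
# pairs, as obtained by applying the joining rules to the frequency table.
TREE = (
    (((("C", ("ETX", "STX")), "I"), "T"), "Space"),   # branch 0 of the root
    ("A", ("E", "S")),                                 # branch 1 of the root
)

def assign_codes(node, prefix="", table=None):
    """Walk from the root, collecting the 0s and 1s on the way to each leaf."""
    if table is None:
        table = {}
    if isinstance(node, str):        # a leaf: the path so far is its pattern
        table[node] = prefix
    else:
        zero_branch, one_branch = node
        assign_codes(zero_branch, prefix + "0", table)
        assign_codes(one_branch, prefix + "1", table)
    return table

CODES = assign_codes(TREE)
print(CODES["STX"], CODES["ETX"])   # 000011 000010
```

The two patterns printed agree with the ones given above for STX and ETX; the remaining entries of CODES can be used to check your own answers.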
Let’s take a sentence using our character set and translate it into our Huffman
Code:
If we were using plain ASCII to represent this message, it would have taken
56 bits (7 characters at 8 bits per character). It has taken us only 26, a
compression of 54%.
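The compression figure is simple arithmetic on the two bit counts, sketched here in Python. The variable names are illustrative; the numbers are the ones quoted above.

```python
characters = 7                      # message length, including STX and ETX
plain_bits = characters * 8         # one 8-bit byte per character
huffman_bits = 26                   # length of the Huffman pattern

saving = 1 - huffman_bits / plain_bits
print(f"{plain_bits} bits -> {huffman_bits} bits, saving {saving:.0%}")
# 56 bits -> 26 bits, saving 54%

# For comparison only: against 7-bit ASCII the message would need 49 bits.
saving_7bit = 1 - huffman_bits / (characters * 7)
print(f"{saving_7bit:.0%}")         # 47%
```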
However, we are not quite finished. The problem with the pattern is that
binary is very difficult to read, and even more difficult to check for errors
and correct. To make it easier for us, we convert the binary Huffman pattern
into hexadecimal, something called Huffman Couplets, Huffman Doublets, or
just Huffman Hex. You will need to know Binary to Hexadecimal
conversions, and vice versa.
First, we take the pattern and group the binary numbers by four bits (half a
byte, called a ‘nibble’):
We notice that the last set has only two bits, so we PAD WITH ZEROES to
bring it up to 4 bits. A further nibble of zeroes then completes the last byte:
0E 41x 10x 80x
If you notice, some of the numbers are subscripted with an ‘X’, while 0E is
not. 41, 10, and 80 all look like decimal numbers; therefore they must be
subscripted with an ‘X’ to indicate that they are hexadecimal numbers, not
decimal numbers. 0E does not need one since the ‘E’ shows that it is hex.
You can, however, subscript all of them if you prefer.
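The padding and conversion can be sketched in Python. The function name to_huffman_hex is illustrative, and the 26-bit pattern is taken from the 32-bit string 00001110010000010001000010000000 that appears in this unit, with its six padding zeroes removed.

```python
def to_huffman_hex(bits):
    """Pad a bit string with zeroes to whole bytes and return hex pairs."""
    padding = (-len(bits)) % 8              # zeroes needed to fill the last byte
    bits = bits + "0" * padding
    pairs = [f"{int(bits[i:i + 8], 2):02X}" for i in range(0, len(bits), 8)]
    return " ".join(pairs)

# The 26-bit Huffman pattern of the unit's message, before padding.
pattern = "00001110010000010001000010"
print(to_huffman_hex(pattern))   # 0E 41 10 80
```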
00001110010000010001000010000000
Starting at the ROOT, you go up the branch indicated by the 0 or 1 until you
reach a character. That takes care of that section of the binary code.
Beginning with the NEXT number, you start again at the ROOT, and go up the
branch indicated by the 0 or 1 until you reach a character. As for the 0’s after
you have reached the ETX, the machine discards them.
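The decoding procedure can be sketched in Python with a lookup table in place of the tree. The CODES table is an assumption: only the STX and ETX patterns are given in the text, and the others are what the unit's joining rules produce from its frequency table. The function name decode is illustrative.

```python
# ASSUMPTION: code table obtained by building the tree from the unit's
# frequency table with the unit's rules; only STX and ETX appear in the text.
CODES = {"Space": "01", "A": "10", "T": "001", "E": "110", "S": "111",
         "I": "0001", "C": "00000", "ETX": "000010", "STX": "000011"}

def decode(bits, codes=CODES):
    """Read bits until they match a pattern, emit the character, start again."""
    by_pattern = {pattern: char for char, pattern in codes.items()}
    message, current = [], ""
    for bit in bits:
        current += bit
        if current in by_pattern:       # reached a leaf
            char = by_pattern[current]
            message.append(char)
            current = ""                # back to the root
            if char == "ETX":
                break                   # anything after ETX is padding
    return message

print(decode("00001110010000010001000010000000"))
# ['STX', 'A', 'Space', 'C', 'A', 'T', 'ETX']
```

Matching bit by bit works because no pattern is the beginning of another pattern, so the first match is always the right one.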
Please note that in reality, the computer doesn’t have a Huffman Tree; it uses
a Huffman Sieve that works in the same way as the tree.
If you are not using a standard Huffman, but have developed your own, then
your Huffman can be used as a form of encryption.
As you can see, all it takes is for one bit to be wrong and you have garbage.
To help counteract that problem, you need some way to detect that there is
something wrong. The most common is to use a parity bit. A parity bit is a
single bit that is added to the data stream to let you know if there has been an
odd bit error, that is, an error has occurred in an ODD number of bits. If the
error has occurred in an EVEN number of bits, the parity bit will not catch it.
Let’s look at an example:
Parity, in itself, is either ODD or EVEN. What that means is that if I’m using
odd parity, there are an ODD number of one’s in the entire data stream,
including the parity bit:
101111011011
Parity bit
If I’m using even parity, there are an EVEN number of one’s in the entire data
stream, including the parity bit.
001111011011
Parity bit
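Choosing the parity bit can be sketched in Python. The function name add_parity is illustrative, and the sketch assumes the parity bit is the leftmost bit, since that is the only bit in which the two example streams differ.

```python
def add_parity(data, odd=True):
    """Prefix a parity bit so the whole stream has an odd (or even) count of ones."""
    ones = data.count("1")
    if odd:
        bit = "0" if ones % 2 == 1 else "1"   # make the total odd
    else:
        bit = "1" if ones % 2 == 1 else "0"   # make the total even
    return bit + data

data = "01111011011"                 # the example stream without its parity bit
print(add_parity(data, odd=True))    # 101111011011
print(add_parity(data, odd=False))   # 001111011011
```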
I’m going to take the ODD PARITY data stream and introduce an odd
number of errors (3):
Correct Stream:
101111011011
Parity bit
Stream with 3 errors:
101110101011
Parity bit
If you notice, there are now an EVEN number of one’s. Since I’m using ODD
parity, I know there is an error in my data stream.
Let’s take the same data stream and introduce an even number of errors (2):
Correct Stream:
101111011011
Parity bit
Stream with 2 errors:
101111101011
Parity bit
As you can see, there are still an odd number of one’s in the data stream, so
the error is not picked up by the simple parity bit. What’s more, even if we
knew there was an error, there is no way to correct it and thus we would be
getting garbage.
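Both cases can be checked with a short Python sketch. The function names parity_ok and bit_errors are illustrative; the three streams are the ones used in the examples above.

```python
def parity_ok(stream, odd=True):
    """True if the count of ones matches the agreed parity."""
    ones = stream.count("1")
    return ones % 2 == (1 if odd else 0)

def bit_errors(sent, received):
    """Number of positions in which two streams of equal length differ."""
    return sum(a != b for a, b in zip(sent, received))

correct = "101111011011"
three_errors = "101110101011"
two_errors = "101111101011"

for received in (three_errors, two_errors):
    print(bit_errors(correct, received), parity_ok(received, odd=True))
# 3 False   (odd number of errors: detected)
# 2 True    (even number of errors: missed)
```

Note that parity_ok only reports that something is wrong; it gives no hint about which bits to fix.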
To take care of that problem, error detection and correction techniques have
been developed. One of the most common, and the one we will use here, is
Hamming Code, developed by Richard Hamming. While used primarily in
telecommunications, it is also used in cases where accuracy of the data is of
utmost importance. We will cover Hamming Codes in the next Unit.
VOCABULARY
TECHNICAL
code/coding (n, v) – код, кодування / код, кодирование
compress (v) – change plaintext using Huffman Code – стиснути - змінити відкритий текст за допомогою Huffman Code / сжимать – изменять открытый текст с помощью кода сжатия данных
decompress (v) – change Huffman Code to plaintext – розпаковувати - змінити код Huffman на відкритий текст / разворачивать – изменить код Хаффмана на открытый текст
encode/decode (v) – кодувати / розшифровувати // шифровать / расшифровывать
encrypt/decrypt – шифрувати / дешифрувати // шифровать / расшифровывать
encryption (n) – шифрування / зашифровывание
even parity (n) – even number of ones in the binary pattern – парність - парне число в двійковому шаблоні / чётное число; контроль на чётность
key (n) – ключ / ключ
odd parity (n) – odd number of ones in the binary pattern - непарне співвідношення - непарне число в двійковому шаблоні / нечётность; контроль по нечётности
pad (v) – add zeroes or ones to complete a byte – заповнювати; додавати нулі або одиниці, щоб завершити байт / заполнять; прибавлять нули или единицы для завершения байта
NON-TECHNICAL
assumption (n) – припущення / допущение
convert (v) – конвертувати / конвертировать
even (adj) – навіть; парне число / даже; чётное число
inefficient (adj) – неефективний / неэффективный; малопроизводительный
introduce (v) – put in – представлятися / представлять - вводить в эксплуатацию
odd (adj) – незвичайний; непарний / нечётное число
represent (v) – представляти / изображать или представлять
subscript (v) – записаний під рядком, нижній індекс / подстрочный; с нижним индексом
technique (n) – спосіб виконання; технічний прийом / техника действий; способ выполнения
transfer (n, v) - трансферт / трансферт
ACTIVITIES:
1. What do ASCII and EBCDIC stand for? What can the eighth bit in
ASCII be used for?
2. What would it mean if only Check Bit 2’s pattern was odd parity?
What would it mean if Check Bit 2’s, 4’s, and 8’s patterns were odd
parity?
Vocabulary exercises
Exercise 2. Fill in the blanks with the words/phrases from the Unit.
1) ASCII
2) EBCDIC
3) IBM
4) MIT
5) JPEG
6) STX
7) ETX
Grammar
Direct Speech | Reported Speech | Your own
“I study a lot,” she said. | She said (that) she studied a lot. |
“I am studying a lot,” she said. | She said (that) she was studying a lot. |
“I have studied a lot,” she said. | She said (that) she had studied a lot. |
“I studied a lot,” she said. | She said (that) she had studied a lot. |
“I will study a lot,” she said. | She said (that) she would study a lot. |
“I have been studying a lot,” she said. | She said (that) she had been studying a lot. |
“I am going to study more,” she said. | She said (that) she was going to study more. |
“I can study more,” she said. | She said (that) she could study more. |
“I may study more,” she said. | She said (that) she might study more. |
“I must study more,” she said. | She said (that) she had to / must study more. |
“I should study more,” she said. | She said (that) she should study more. |
“I ought to study a lot,” she said. | She said (that) she ought to study a lot. |
“Do you study a lot?” he said to her. | He asked her if she studied a lot. |
“Study more,” he said to her. | He told her to study more. |