Unit9 - Huffman Code With Exercises Reported Speech


Unit 9 – Huffman Code/Climbing the Tree

This will be a somewhat long unit for those who have never worked with
Huffman codes. While static Huffman and Hamming Codes will be taught, this
unit will be a good reference for those who wish to study more on their own.

When we speak of codes, we differentiate between Coding and Encryption.


When we encode and decode something, the assumption is that the code
that we use is common knowledge and is available to whoever wants it (e.g.
ASCII, EBCDIC, Morse Code, Huffman Code, etc). When we encrypt and
decrypt something, the assumption is that the encryption that we use is NOT
common knowledge and is available ONLY to certain people with a key.
Coding is used to facilitate communication while encryption is used to hide
communication.

The two most common codes for representation on computers are ASCII and
EBCDIC. ASCII is a 7-bit pattern used to represent the characters in a
computer. It is the most common code. EBCDIC is an 8-bit pattern that is
used primarily by IBM.
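
As a small illustration, the two bit patterns for one character can be printed with a few lines of Python. This is only a sketch: the helper names are made up for this example, and the code page `cp037` (one common EBCDIC variant) is an assumption, since the unit does not name a specific EBCDIC table.

```python
# Print the 7-bit ASCII pattern and an 8-bit EBCDIC pattern for a character.
# Assumption: code page "cp037" stands in for "EBCDIC" in general.

def ascii_bits(ch: str) -> str:
    """Return the 7-bit ASCII pattern of a single character."""
    return format(ch.encode("ascii")[0], "07b")

def ebcdic_bits(ch: str) -> str:
    """Return the 8-bit EBCDIC (cp037) pattern of a single character."""
    return format(ch.encode("cp037")[0], "08b")

print(ascii_bits("A"))   # 1000001
print(ebcdic_bits("A"))  # 11000001
```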

Research has shown that the majority of the data in a database and
messages are spaces. In addition, certain letters, such as ‘a’ and ‘t’, occur
more often than others, such as ‘q’ and ‘x’. To store data with so many
repeated characters is extremely inefficient. Programs for compressing the
data before it is stored or transferred (called zipping) and decompressing it
afterwards (called unzipping) were developed using various techniques. The
most common is called Huffman Code, developed by David Huffman in 1951
while he was a graduate student at MIT and published in 1952.

Huffman Code

Huffman Code is a character frequency code. That is, it is based upon how
often a particular character occurs in a selected text. If there is a wide
variance in the frequency of characters, compression can reach up to 75%. If
not, there will be very little compression. If you have ever “zipped” a file,
especially a JPEG, and noticed that there is little difference between the pre-
and post-zip sizes of the file, that is because a JPEG is already compressed,
so the bit patterns left in the file occur with nearly equal frequency.
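
To see what "character frequency" means in practice, the following sketch counts how often each character occurs in a text and reports it as a percentage. The function name and the sample text are hypothetical and chosen only for illustration.

```python
from collections import Counter

def frequency_table(text: str) -> dict:
    """Return each character's share of the text, in percent."""
    counts = Counter(text)
    total = len(text)
    return {ch: 100 * n / total for ch, n in counts.items()}

sample = "A CAT SAT"  # hypothetical sample text
table = frequency_table(sample)
# List the characters in ascending order of frequency.
for ch, percent in sorted(table.items(), key=lambda item: item[1]):
    print(repr(ch), round(percent, 1))
```

The wider the spread between the smallest and largest percentages, the more a frequency-based code can save.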

There are standardized Character Frequency Codes already developed for
most languages. Those codes underlie most “zip” programs. There are both
static and dynamic Huffman Codes. For our purposes, we will use static.
To see how these codes work, we will use Huffman on a simple character set.
But first we must look at the rules that will govern the development of the
Huffman Tree from which we will build our Huffman codes.


HUFFMAN TREE RULES:


1. The sum of the frequencies of ALL the characters in the character set
MUST add up to 100 (100% of the characters)
2. Always take the two LOWEST values (frequencies) and combine
those two characters, character and sub-tree, or two sub-trees into a
sub-tree.
3. For clarity, place the HIGH value on the LEFT, and the LOW value on
the RIGHT
4. Assign the HIGH value branch a ZERO (0) and the LOW value branch
a ONE (1). THIS IS WHAT IS IMPORTANT!

Fine, you say, but what is a “sub-tree”? Let’s look at the following Huffman
Tree that has no frequency values (leaves up, root down):

A B C D

The two branches containing ‘A’ and ‘B’ form a “sub-tree”, as do the branches
containing ‘C’ and ‘D’. In addition, the branches containing ‘C’, ‘D’, and ‘E’
form a “sub-tree”.

Here is a small character set with frequencies (that we have assigned to
them). STX (Start of TeXt) and ETX (End of TeXt) are characters that are
placed in a complete text to indicate the beginning and end of the text.
Although they both occur ONCE in a text, we are giving them different
frequencies to make it easier to make the tree. The characters are listed in
ascending order of frequency to also help make the Huffman Tree. If your
character set is NOT thus organized, organize it thusly before starting to build
the tree.

Character Frequency
STX 1.1
ETX 1.2
C 7.4
I 8.3
S 10.6
E 10.7
T 12.9
A 22.5
Space 25.3
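
Rule 1 and the advice to list the characters in ascending order can be checked mechanically. The sketch below uses the frequencies from the table above; the function name `check_and_sort` is hypothetical.

```python
# Frequencies copied from the table above.
FREQUENCIES = {
    "STX": 1.1, "ETX": 1.2, "C": 7.4, "I": 8.3, "S": 10.6,
    "E": 10.7, "T": 12.9, "A": 22.5, "Space": 25.3,
}

def check_and_sort(freqs):
    """Check that the frequencies add up to 100, then sort them ascending."""
    total = sum(freqs.values())
    if abs(total - 100.0) > 1e-6:  # small tolerance for floating-point sums
        raise ValueError(f"frequencies add up to {total}, not 100")
    return sorted(freqs.items(), key=lambda item: item[1])

for name, value in check_and_sort(FREQUENCIES):
    print(name, value)
```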


To review, the rules are:


1. Take two lowest values and place them in a sub-tree, high value left,
low value right
2. Assign the high value branch a zero (0), the low value branch a one (1)
3. Add the two values together and assign it to the sub-tree. You now
have a new value to work with

Looking at my list, the two lowest values are STX and ETX. Following the
rules, you get the following:

ETX   STX
 0\   /1
   2.3   (new value to use)

We take the two lowest values, which is the sub-tree with 2.3 and C with 7.4,
combine them in a sub-tree, high value left, low value right, high value 0, low
value 1, add the frequencies for a new value to use:

          ETX   STX
           0\   /1
     C       2.3
     0\     /1
        9.7   (new value to use)


We take the two lowest values, the sub-tree with 9.7 and I with 8.3, join them
in a sub-tree, high value left, low value right, high value 0, low value 1, add
the frequencies for a new value to use:

               ETX   STX
                0\   /1
          C       2.3
          0\     /1
             9.7       I
              0\     /1
                 18.0   (new value to use)


When we get ready to use our next two lowest values, we find that they are S
with 10.6 and E with 10.7. Neither of them belongs to the sub-tree we have
been building, but all we do is follow our rules. We join them in a sub-tree,
high value left, low value right, high value 0, low value 1, and add the
frequencies for a new value to use:

               ETX   STX
                0\   /1
          C       2.3
          0\     /1
             9.7       I       E     S
              0\     /1        0\   /1
                 18.0            21.3

Continue to build the tree and then look at the Huffman Solutions in this
Module.
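
The joining steps can also be sketched in code, which is handy for checking a hand-built tree. This is a minimal sketch under stated assumptions: the function name `build_tree` and the tuple layout of a node are made up for this example, sums are rounded to one decimal place, and ties between equal values are not handled.

```python
# Frequencies copied from the table in this unit.
FREQUENCIES = {
    "STX": 1.1, "ETX": 1.2, "C": 7.4, "I": 8.3, "S": 10.6,
    "E": 10.7, "T": 12.9, "A": 22.5, "Space": 25.3,
}

def build_tree(freqs):
    """Build the tree by repeatedly joining the two lowest values.

    A node is a tuple (value, label, zero_branch, one_branch);
    characters (leaves) have None for both branches.
    Ties between equal values are NOT handled in this sketch.
    """
    nodes = [(value, name, None, None) for name, value in freqs.items()]
    steps = []
    while len(nodes) > 1:
        nodes.sort(key=lambda node: node[0])  # lowest values first
        low, high = nodes[0], nodes[1]
        total = round(low[0] + high[0], 1)    # new value to use
        # HIGH value goes on the 0 branch, LOW value on the 1 branch.
        merged = (total, f"({high[1]} {low[1]})", high, low)
        steps.append((high[1], low[1], total))
        nodes = nodes[2:] + [merged]
    return nodes[0], steps

root, steps = build_tree(FREQUENCIES)
for zero_side, one_side, total in steps:
    print(f"{total:6}  0: {zero_side}  1: {one_side}")
```

Each printed line is one joining step: the new value, the branch that gets the 0, and the branch that gets the 1. The first four lines should match the four steps worked through above.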

A problem occurs when the frequencies that are the two lowest values are the
same value. To handle these problems, we use the following conventions:
• If the two lowest values are in the character set (they haven’t been
used yet), then we compare ASCII values, and join them in a sub-tree
as with the regular rules.
• If the two lowest values are a sub-tree and a character, the character is
always considered the HIGHEST of the two values, so when the sub-
tree is formed, the character branch is assigned the 0, the original sub-
tree branch a 1.
• If the two lowest values are sub-trees, you go up the zero branches
until you reach a character in each sub-tree. You compare ASCII
values, and join them in a sub-tree as with the regular rules.
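
One way to express these three conventions is as a single sort key. The sketch below is only an interpretation: the names `rank`, `join`, and `zero_branch_leaf` are hypothetical, a sub-tree is written as a pair `(zero_branch, one_branch)`, characters are single letters, and it ASSUMES that the character with the larger ASCII value counts as the HIGH value, which the conventions above do not state explicitly.

```python
def zero_branch_leaf(node):
    """Follow the 0 branches until a character (leaf) is reached."""
    while isinstance(node, tuple):
        node = node[0]  # index 0 holds the zero branch
    return node

def rank(freq, node):
    """Sort key: the node with the larger key is treated as the HIGH value.

    Order of comparison: frequency first, then "is a character"
    (a character beats a sub-tree), then ASCII value.
    Assumption: a larger ASCII value means HIGH.
    """
    is_character = not isinstance(node, tuple)
    return (freq, is_character, ord(zero_branch_leaf(node)))

def join(freq_a, a, freq_b, b):
    """Join two nodes into a sub-tree: HIGH on branch 0, LOW on branch 1."""
    if rank(freq_a, a) > rank(freq_b, b):
        high, low = a, b
    else:
        high, low = b, a
    return round(freq_a + freq_b, 1), (high, low)

print(join(5.0, "A", 5.0, "B"))                # two characters
print(join(5.0, ("Y", "X"), 5.0, "C"))         # sub-tree and character
print(join(4.0, ("B", "A"), 4.0, ("D", "C")))  # two sub-trees
```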

Once we have the tree built, we are ready to assign Huffman patterns to each
of the characters. To assign patterns (as well as read existing patterns), you
ALWAYS start at the ROOT and read the 0 or 1 off the branch until you reach
a leaf, which is a character.

Let’s look at STX and ETX.

Their Huffman Codes, based on our tree, are

STX 000011
ETX 000010

Make sure that you see how those codes are made. Once you have done so,
make the codes for the other characters.
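
Reading codes off the tree can be sketched as a walk from the root that collects a 0 or a 1 at every branch. The nested pairs in `TREE` are not printed in this unit; they come from continuing the joining rules on the unit's character set, with each sub-tree written as `(zero_branch, one_branch)`. The names `TREE` and `assign_codes` are hypothetical.

```python
# Each sub-tree is a pair (zero_branch, one_branch); a string is a character.
# Assumption: this is the finished tree obtained by following the unit's rules.
TREE = (
    (((("C", ("ETX", "STX")), "I"), "T"), "Space"),  # root's 0 branch
    ("A", ("E", "S")),                               # root's 1 branch
)

def assign_codes(node, prefix=""):
    """Walk from the ROOT, adding 0 or 1 per branch, until a leaf is reached."""
    if isinstance(node, str):
        return {node: prefix}
    zero_branch, one_branch = node
    codes = assign_codes(zero_branch, prefix + "0")
    codes.update(assign_codes(one_branch, prefix + "1"))
    return codes

codes = assign_codes(TREE)
print("STX", codes["STX"])  # STX 000011
print("ETX", codes["ETX"])  # ETX 000010
```

The remaining entries of `codes` can be compared with the patterns you worked out by hand.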

Let’s take a sentence using our character set and translate it into our Huffman
Code:

STX A space CAT ETX

would translate into

000011 10 01 00000 10 001 000010

If we were using plain ASCII to represent this message, it would have taken
56 bits (7 characters at 8 bits per character, since each 7-bit ASCII character
is normally stored in an 8-bit byte). It has taken us only 26 bits, a saving of
about 54%.
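
The bit count and the saving can be verified with a short calculation. The code table below contains only the patterns already given in this unit; the variable and function names are hypothetical.

```python
# Huffman patterns taken from this unit.
CODES = {
    "STX": "000011", "ETX": "000010", "A": "10",
    "space": "01", "C": "00000", "T": "001",
}

def encode(symbols):
    """Join the Huffman patterns of the symbols into one bit string."""
    return "".join(CODES[symbol] for symbol in symbols)

message = ["STX", "A", "space", "C", "A", "T", "ETX"]
bits = encode(message)
plain_bits = 8 * len(message)           # one 8-bit byte per character
saving = 1 - len(bits) / plain_bits

print(bits)                              # 00001110010000010001000010
print(len(bits), "bits instead of", plain_bits)
print(f"saving: {saving:.0%}")           # saving: 54%
```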

However, we are not quite finished. The problem with the pattern is that it is
very difficult to read binary, and even more so to find and correct a problem.
To make it easier for us, we convert that binary Huffman pattern into
hexadecimal, producing something called Huffman Couplets or Huffman
Doublets or just Huffman Hex. You will need to know Binary to Hexadecimal
conversions, and vice versa.

First, we take the pattern and group the binary numbers by four bits (half a
byte, called a ‘nibble’):

0000 1110 0100 0001 0001 0000 10


We notice that the last set has only two bits, so we PAD WITH ZEROES to
bring it up to 4 bits

0000 1110 0100 0001 0001 0000 1000


We now take those sets and organize them into sets of eight bits (a byte)
retaining the “set of four” organization. Once again, we notice that the last set
only has one set of four, so we pad with four zeroes to bring it up to eight.

0000 1110 0100 0001 0001 0000 1000 0000


We are now ready to convert them to Huffman Hex:

0E   41ₓ   10ₓ   80ₓ

If you notice, some of the numbers are subscripted with an ‘X’, while 0E is
not. 41, 10, and 80 all look like decimal numbers therefore they must be
subscripted with an ‘X’ to indicate that they are hexadecimal numbers, not
decimal numbers. 0E does not need one since the ‘E’ shows that it is hex.
You can, however, subscript all of them if you prefer.
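
The padding and conversion steps can be sketched as follows. The function name `huffman_hex` is hypothetical. The code pads straight to a whole number of bytes, which gives the same result as padding first to four bits and then to eight.

```python
def huffman_hex(bits: str) -> list:
    """Pad the bit string with zeroes to whole bytes and convert to hex pairs."""
    padded = bits + "0" * (-len(bits) % 8)   # pad on the right with zeroes
    return [
        format(int(padded[i:i + 8], 2), "02X")  # one byte -> two hex digits
        for i in range(0, len(padded), 8)
    ]

pattern = "00001110010000010001000010"   # STX A space CAT ETX
print(" ".join(huffman_hex(pattern)))    # 0E 41 10 80
```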

Problem: Using the Huffman patterns we have developed, convert the
following message into Huffman Hex:

STX A space CAT space SAT ETX

The next step is to DECODE the Huffman CODE.

First, you start with the binary data stream:

00001110010000010001000010000000

Starting at the ROOT, you go up the branch indicated by the 0 or 1 until you
reach a character. That takes care of that section of the binary code.
Beginning with the NEXT number, you start again at the ROOT, and go up the
branch indicated by the 0 or 1 until you reach a character. As for the 0’s after
you have reached the ETX, the machine discards them:

000011  10  01  00000  10  001  000010  000000
STX     A   sp  C      A   T    ETX     discard
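
Decoding can be sketched without drawing the tree: because no Huffman pattern is the beginning of another, reading bits until they match a pattern is equivalent to climbing from the root to a leaf. The code table holds only the patterns given in this unit, and the function name `decode` is hypothetical.

```python
# Huffman patterns taken from this unit.
CODES = {
    "STX": "000011", "ETX": "000010", "A": "10",
    "space": "01", "C": "00000", "T": "001",
}

def decode(stream: str) -> list:
    """Read bits until they match a pattern, then start again at the root."""
    lookup = {pattern: symbol for symbol, pattern in CODES.items()}
    symbols, current = [], ""
    for bit in stream:
        current += bit
        if current in lookup:
            symbols.append(lookup[current])
            current = ""
            if symbols[-1] == "ETX":   # everything after ETX is padding
                break
    return symbols

# Turn the Huffman Hex back into the binary data stream, then decode it.
stream = "".join(format(byte, "08b") for byte in bytes.fromhex("0E411080"))
print(stream)          # 00001110010000010001000010000000
print(decode(stream))  # ['STX', 'A', 'space', 'C', 'A', 'T', 'ETX']
```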

Please note that in reality, the computer doesn’t have a Huffman Tree, it uses
a Huffman Sieve that works in the same way as the tree.

If you are not using a standard Huffman, but have developed your own, then
your Huffman can be used as a form of encryption.


As you can see, all it takes is for one bit to be wrong and you have garbage.
To help counteract that problem, you need some way to detect that there is
something wrong. The most common is to use a parity bit. A parity bit is a
single bit that is added to the data stream to let you know if there has been an
odd bit error, that is, an error has occurred in an ODD number of bits. If the
error has occurred in an EVEN number of bits, the parity bit will not catch it.
Let’s look at an example:

Parity, in itself, is either ODD or EVEN. What that means is that if I’m using
odd parity, there are an ODD number of one’s in the entire data stream,
including the parity bit:

101111011011
(parity bit: the leftmost bit)

If I’m using even parity, there are an EVEN number of one’s in the entire data
stream, including the parity bit.

001111011011
(parity bit: the leftmost bit)
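
Choosing the parity bit amounts to counting the ones in the data. In the sketch below the function name `add_parity` is hypothetical, and the parity bit is ASSUMED to be the leftmost bit, which is what the two example streams suggest since they differ only in that position.

```python
def add_parity(data_bits: str, parity: str = "odd") -> str:
    """Put a parity bit in front of the data bits.

    Assumption: the parity bit is the leftmost bit of the stream.
    """
    ones = data_bits.count("1")
    if parity == "odd":
        bit = "0" if ones % 2 == 1 else "1"  # make the total number of ones odd
    else:
        bit = "1" if ones % 2 == 1 else "0"  # make the total number of ones even
    return bit + data_bits

data = "01111011011"              # the 11 data bits of the examples above
print(add_parity(data, "odd"))    # 101111011011
print(add_parity(data, "even"))   # 001111011011
```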

I’m going to take the ODD PARITY data stream and introduce an odd
number of errors (3):

Correct Stream:
101111011011
(parity bit: the leftmost bit)

Stream with Errors (bits 6, 7, and 8 from the left are in error):
101110101011

If you notice, there are now an EVEN number of one’s. Since I’m using ODD
parity, I know there is an error in my data stream.

Let’s take the same data stream and introduce an even number of errors (2):

Correct Stream:
101111011011
(parity bit: the leftmost bit)

Stream with Errors (bits 7 and 8 from the left are in error):
101111101011

As you can see, there are still an odd number of one’s in the data stream, so
the error is not picked up by the simple parity bit. What’s more, even if we
knew there was an error, there is no way to correct it and thus we would be
getting garbage.
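
Both error examples can be replayed in a few lines. The streams are the ones used above; the function names `passes_odd_parity` and `bits_in_error` are hypothetical.

```python
def passes_odd_parity(stream: str) -> bool:
    """True if the stream (parity bit included) has an odd number of ones."""
    return stream.count("1") % 2 == 1

def bits_in_error(sent: str, received: str) -> list:
    """Positions (counted from 1 at the left) where the streams differ."""
    return [i + 1 for i, (a, b) in enumerate(zip(sent, received)) if a != b]

correct = "101111011011"
three_errors = "101110101011"
two_errors = "101111101011"

for received in (three_errors, two_errors):
    wrong = bits_in_error(correct, received)
    verdict = "no error detected" if passes_odd_parity(received) else "error detected"
    print(received, "bits in error:", wrong, "->", verdict)
```

The stream with three errors fails the check, while the stream with two errors passes it even though it is wrong.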

To take care of that problem, error detection and correction techniques have
been developed. One of the most common, and the one we will use here, is
Hamming Code, developed by Richard Hamming. While used primarily in
telecommunications, it is also used in those cases where accuracy of the data
is of utmost importance. We will cover Hamming Codes in the next Unit.


VOCABULARY

TECHNICAL

code/coding (n, v) – код, кодування / код, кодирование
compress (v) – change plaintext using Huffman Code – стиснути - змінити відкритий текст за допомогою Huffman Code / сжимать – изменять открытый текст с помощью кода сжатия данных
decompress (v) – change Huffman Code to plaintext – розпаковувати - змінити код Huffman на відкритий текст / разворачивать – изменить код Хаффмана на открытый текст
encode/decode (v) – кодувати / розшифровувати // шифровать / расшифровывать
encrypt/decrypt (v) – шифрувати / дешифрувати // шифровать / расшифровывать
encryption (n) – шифрування / зашифровывание
even parity (n) – even number of ones in the binary pattern – парність - парне число в двійковому шаблоні / чётное число; контроль на чётность
key (n) – ключ / ключ
odd parity (n) – odd number of ones in the binary pattern – непарне співвідношення - непарне число в двійковому шаблоні / нечётность; контроль по нечётности
pad (v) – add zeroes or ones to complete a byte – заповнювати; додавати нулі або одиниці, щоб завершити байт / заполнять; прибавлять нули или единицы для завершения байта
parity bit (n) – біт парності / бит чётности; бит проверки на чётность; бит проверки по чётности
pattern (n) – the zeroes and ones in a byte or set of bits – шаблон - нулі та одиниці в байті або у наборі бітів / шаблон - нули и единицы в байте или наборе битов
plaintext (n) – a message that is written in a normal alphabet in a readable common language – простий текст - повідомлення, яке написане за допомогою літер звичайного алфавіту і яке може прочитати будь-хто / незашифрованный текст - сообщение, написанное обычным алфавитом
zip/unzip (v) – see compress/decompress – заархівувати / розпакувати; див. компресувати/декомпресувати / архивировать / разархивировать; ‘раззиповать’, распаковывать

NON-TECHNICAL

assumption (n) – припущення / допущение
convert (v) – конвертувати / конвертировать
even (adj) – навіть; парне число / даже; чётное число
inefficient (adj) – неефективний / неэффективный; малопроизводительный
introduce (v) – put in – представлятися / представлять - вводить в эксплуатацию
odd (adj) – незвичайний; непарний / нечётное число
represent (v) – представляти / изображать или представлять
subscript (v) – записаний під рядком, нижній індекс / подстрочный; с нижним индексом
technique (n) – спосіб виконання; технічний прийом / техника действий; способ выполнения
transfer (n, v) – трансферт / трансферт
utmost (adj) – максимально; найбільше / высочайшая степень возможного

ACTIVITIES:
1. What do ASCII and EBCDIC stand for? What can the eighth bit in
ASCII be used for?

2. What would it mean if only Check Bit 2’s pattern was odd parity?
What would it mean if Check Bit 2’s, 4’s, and 8’s patterns were odd
parity?

3. Develop a Huffman Tree with a Character Set of at least 10 characters
(plus SPACE, ETX, and STX) that will be used in exercises for this unit.

4. Give your CHARACTER SET WITH FREQUENCIES to other group(s) to
use for activities #5 and #6.

5. Come up with a message in plaintext, convert to Huffman Code. Give
your PLAINTEXT to another group to convert to Huffman Code.

6. Come up with a message in plaintext, convert to Huffman Code. Give
your HUFFMAN CODE to another group to convert back to plaintext.


Vocabulary exercises

Exercise 1. Match the word / phrase with its definition

1) compress a) to return to the original size, or to cause something to do
this
2) decompress b) a word, letter, number, or symbol written or printed just
below another word, letter, number, or symbol, usually in
a smaller size
3) encode c) to fix a machine, piece of equipment, or system in a
place and make it ready to use
4) decode d) to move from one place, person, or position to another,
or to cause someone or something to move
5) encryption e) If you pad a document or report, you add something
extra that is unnecessary or not correct
6) key f) to return a computer file to its original size after it has
been zipped (= reduced in size so that it can be easily
sent or stored)
7) parity g) to compress (= reduce the size of) a computer file so
that it uses less space, and can be more easily sent or
stored
8) pattern h) a particular way in which something is done, is
organized, or happens
9) zip i) equality, especially of pay or position
10) unzip j) a list of the symbols used in a map, chart,
or book with explanations of what they mean
11) pad k) the process of putting information into a special form so
that most people cannot read it
12) transfer l) to discover the meaning of information given in a
secret or complicated way
13) subscript m) to put information into a form in which it can be stored,
and which can only be read using special
technology or knowledge
14) put in n) to make a computer file use less space when it is stored
in the memory of a computer or on a disk, by using a
special program


Exercise 2. Fill in the blanks with the words/phrases from the Unit.

code odd even introduce represent utmost technique

1) We have been learning to _____________ programs.
2) The houses on this side of the street all have _________ numbers.
3) It's a very challenging project - it might _________ take a year to finish it.
4) The company ______________ a job share scheme last year.
5) Students were well ________________ at the conference.
6) The situation needs to be handled with the ____________ care.
7) Yoga is a very effective ________________ for combating stress.

Exercise 3. What does the abbreviation/acronym stand for?

1) ASCII
2) EBCDIC
3) IBM
4) MIT
5) JPEG
6) STX
7) ETX


Grammar

Exercise 4. Focus on the Reported Speech. Study the tables below and
provide your own examples.

Direct Speech
- is the exact words someone said.
- Quotation marks are used in Direct Speech.
Example: “Providing info security is vital nowadays,” our professor says.
Your own: _____________

Reported Speech
- is the exact meaning of what someone said but not the exact words.
- Quotation marks are not used in Reported Speech.
Example: He said that providing info security is/was vital nowadays.
Your own: _____________

Direct Speech | Reported Speech | Your own
“I study a lot,” she said. | She said (that) she studied a lot. |
“I am studying a lot,” she said. | She said (that) she was studying a lot. |
“I have studied a lot,” she said. | She said (that) she had studied a lot. |
“I studied a lot,” she said. | She said (that) she had studied a lot. |
“I will study a lot,” she said. | She said (that) she would study a lot. |
“I have been studying a lot,” she said. | She said (that) she had been studying a lot. |
“I am going to study more,” she said. | She said (that) she was going to study more. |
“I can study more,” she said. | She said (that) she could study more. |
“I may study more,” she said. | She said (that) she might study more. |
“I must study more,” she said. | She said (that) she had to / must study more. |
“I should study more,” she said. | She said (that) she should study more. |
“I ought to study a lot,” she said. | She said (that) she ought to study a lot. |
“Do you study a lot?” he said to her. | He asked her if she studied a lot. |
“Study more,” he said to her. | He told her to study more. |
