Lecture 05

THE BUILDING BLOCKS:
ENCODING OTHER DATA
https://xkcd.com/1953/
LEARNING OBJECTIVES
• How do we encode text?
• ASCII and Unicode (UTF-8)
• Text compression
• Huffman encoding
• Run-length encoding
• Representing sounds and images
• Lossy compression
CHARACTER ENCODING
• Originally there were many competing ways to encode the characters of human languages into binary.
• Now we map characters onto binary numbers in a standard way (sort of)
• ASCII (a 7-bit code for each character, usually stored in an 8-bit byte)
• American Standard Code for Information Interchange
• Unicode (different related standards, e.g. UTF-8, UTF-16, UTF-32; UTF stands for Unicode Transformation Format)
• Universal Coded Character Set (UCS)
ASCII
• Developed from earlier codes
• 7-bit teleprinter code (there are 8-bit extensions); it is common for people to call ASCII an 8-bit encoding
• 128 different values
• the first 32 are control codes (to control printers, and they can be used for other purposes), as is 0x7F (DEL)
• all UPPERCASE letters come before lowercase
Public Domain, https://commons.wikimedia.org/w/index.php?curid=225986
THINGS TO NOTE ABOUT ASCII
• Designed to help with sorting (unlike earlier codes)
• First 32 characters - control codes
• Digits '0' to '9' have hexadecimal encodings 0x30 to 0x39
• Uppercase letters 'A' to 'Z' are in order with encodings 0x41 to 0x5A
• Lowercase letters 'a' to 'z' have encodings 0x61 to 0x7A (hence only one bit of difference, 0x20, between an uppercase letter and its lowercase form)
• Symbols are scattered around (for reasons - see https://en.wikipedia.org/wiki/ASCII#Internal_organization), and the keyboard you use on your computer has all of those symbols (and possibly no others)
• ASCII has been superseded, mostly by UTF-8, which keeps the ASCII codes for the byte values 0x00 to 0x7F
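The ordering properties listed above can be checked with Python's built-in ord() and chr(). This is only a small sketch; the helper name toggle_case is made up for illustration and it deliberately handles ASCII letters only.

```python
# ASCII properties checked with ord() and chr().

def toggle_case(ch: str) -> str:
    """Flip bit 0x20 to switch an ASCII letter between upper and lower case."""
    if not (ch.isascii() and ch.isalpha()):
        return ch  # digits, symbols and non-ASCII characters are left alone
    return chr(ord(ch) ^ 0x20)

print(hex(ord("0")), hex(ord("9")))        # 0x30 0x39
print(hex(ord("A")), hex(ord("Z")))        # 0x41 0x5a
print(hex(ord("a")), hex(ord("z")))        # 0x61 0x7a
print(toggle_case("A"), toggle_case("z"))  # a Z

# Sorting by code value puts digits first, then uppercase, then lowercase.
print(sorted("bAa1B"))                     # ['1', 'A', 'B', 'a', 'b']
```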
UTF-8
• The most commonly used Unicode encoding: Unicode Transformation Format, 8-bit.
• Variable width - 1 to 4 bytes
• 1,114,112 "code points": numerical values corresponding to characters or formatting
• Lower-numbered code points, which include the most frequently occurring characters such as ASCII, are encoded in fewer bytes.
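The variable width can be seen by encoding a few characters with Python's built-in str.encode. The four sample characters are chosen here purely for illustration, and utf8_length is a hypothetical helper name. bytes.hex with a separator argument needs Python 3.8 or later.

```python
# How many bytes UTF-8 needs for code points of increasing value.
# Samples: A, e with acute accent, euro sign, mobile phone emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F4F1"]

def utf8_length(ch: str) -> int:
    """Number of bytes in the UTF-8 encoding of a single character."""
    return len(ch.encode("utf-8"))

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ').upper()} ({len(encoded)} bytes)")

# Code points run from U+0000 to U+10FFFF, which gives the total on the slide.
print(0x10FFFF + 1)  # 1114112
```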
EMOJIS HAVE UTF-8 ENCODINGS TOO
• 📱
• mobile phone
• Unicode: U+1F4F1, UTF-8: F0 9F 93 B1
• 🇳🇿
• flag of New Zealand
• Unicode: U+1F1F3 U+1F1FF, UTF-8: F0 9F 87 B3 F0 9F 87 BF
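• this uses 8 bytes: the flag is two code points (the regional indicators N and Z), each a 4-byte supplementary character, displayed together as one symbol (don’t worry)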
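The byte values quoted for the two emojis can be reproduced in Python. A short sketch using only built-in string and bytes methods; the escape sequences are used so the code does not depend on the editor's font.

```python
phone = "\U0001F4F1"            # mobile phone
flag = "\U0001F1F3\U0001F1FF"   # regional indicators N and Z

# Decoding the bytes from the slide gives back the phone character.
print(bytes.fromhex("F0 9F 93 B1").decode("utf-8") == phone)  # True

# The flag is two code points, each encoded in 4 bytes.
code_points = [f"U+{ord(ch):X}" for ch in flag]
utf8_bytes = flag.encode("utf-8")
print(len(flag), code_points)       # 2 ['U+1F1F3', 'U+1F1FF']
print(utf8_bytes.hex(" ").upper())  # F0 9F 87 B3 F0 9F 87 BF
print(len(utf8_bytes))              # 8
```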
COMPRESSION
• As we just saw, UTF-8 doesn’t need to use 4 bytes for each character. If the characters are ASCII they only take up 1 byte each.
• We can try to do better by giving the shortest encodings to the most common characters - this is the basis of Huffman coding.
• https://www.youtube.com/watch?v=JsTptu56GM8
HUFFMAN TREE
• This tree from Wikipedia is the Huffman tree for the text "this is an example of a huffman tree".
• It can also be represented as a table:
Letter   Code
e        000
a        010
space    111
n        0010
t        0110
m        0111
i        1000
h        1010
s        1011
f        1101
o        00110
u        00111
x        10010
p        10011
r        11000
l        11001
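One way such a table can be built is sketched below, using the standard heapq module: repeatedly merge the two least frequent subtrees, prefixing "0" to the codes on one side and "1" on the other. The function name huffman_codes and the tie-breaking order are choices made here, so the individual codes may differ from the Wikipedia table, although the total encoded length comes out the same.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict[str, str]:
    """Build a Huffman code table for the characters in text."""
    if not text:
        return {}
    freq = Counter(text)
    # Heap entries: (frequency, unique tie-breaker, {char: code so far}).
    heap = [(n, i, {ch: ""}) for i, (ch, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # only one distinct character: give it a 1-bit code
        return {ch: "0" for ch in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)  # least frequent subtree
        n2, _, t2 = heapq.heappop(heap)  # second least frequent subtree
        merged = {ch: "0" + code for ch, code in t1.items()}
        merged.update({ch: "1" + code for ch, code in t2.items()})
        heapq.heappush(heap, (n1 + n2, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "this is an example of a huffman tree"
codes = huffman_codes(text)
total_bits = sum(len(codes[ch]) for ch in text)
print(len(codes), total_bits)  # 16 135
```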
THE ENCODING
• Using the sentence: "this is an example of a huffman tree"
• This gets encoded as:
• 0110 1010 1000 1011 111 1000 1011 111 010 0010 111 000 10010
010 0111 10011 11001 000 111 00110 1101 111 010 111 1010
00111 1101 1101 0111 010 0010 111 0110 11000 000 000
• except there are no spaces between each pair of character encodings
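The encoding step can be sketched in Python using the code table for this sentence. The names CODES, encode and decode are illustrative. Decoding works without separators because no code is a prefix of another.

```python
# Code table for "this is an example of a huffman tree".
CODES = {
    "e": "000", "a": "010", " ": "111", "n": "0010", "t": "0110",
    "m": "0111", "i": "1000", "h": "1010", "s": "1011", "f": "1101",
    "o": "00110", "u": "00111", "x": "10010", "p": "10011",
    "r": "11000", "l": "11001",
}

def encode(text: str, codes: dict = CODES) -> str:
    """Concatenate the code of each character, with no separators."""
    return "".join(codes[ch] for ch in text)

def decode(bits: str, codes: dict = CODES) -> str:
    """Read bits one at a time until they match a complete code."""
    lookup = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    return "".join(out)

message = "this is an example of a huffman tree"
bits = encode(message)
print(bits[:16])               # 0110101010001011  (the word "this")
print(len(bits))               # 135
print(decode(bits) == message) # True
```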
COMPRESSION RATIO
• How good is our compression?
• The compression ratio = uncompressed size / compressed size
• With the example on the previous slide the number of bits to represent the message is 135
and the uncompressed ASCII equivalent (one char per byte) is 36 × 8 = 288. This assumes
the tree or table doesn’t have to be transmitted.
• So the compression ratio is 288/135 ≅ 2.13
• But if we only had the 16 characters in the message to encode in a fixed length encoding
we could do that with 4 bits per character.
• This would give a compression ratio of (36 × 4) / 135 = 144/135 ≅ 1.07, which is pretty close to not
being compressed at all.
Something to do - check the value of 135 above.
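The two ratios can be recomputed with a few lines of Python. This sketch takes the figure of 135 bits as given and only uses the standard math module; the variable names are illustrative.

```python
import math

message = "this is an example of a huffman tree"
huffman_bits = 135                             # value quoted on the slide

ascii_bits = len(message) * 8                  # one byte per character
distinct = len(set(message))                   # different characters in the message
fixed_width = math.ceil(math.log2(distinct))   # bits per character, fixed-length code
fixed_bits = len(message) * fixed_width

print(ascii_bits, round(ascii_bits / huffman_bits, 2))  # 288 2.13
print(distinct, fixed_width)                            # 16 4
print(fixed_bits, round(fixed_bits / huffman_bits, 2))  # 144 1.07
```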
COMPRESSION WHY BOTHER?
• With so much memory available today, why do we bother with compression?
https://xkcd.com/1381/
RUN LENGTH ENCODING (RLE)
• A totally different form of compression.
• Not really useful for text, but useful for certain types of graphics.
• Values repeated in a row are counted and we store the value and the count
• e.g. AAABBCCCC could be represented as A3B2C4
• a compression ratio of 9/6 = 1.5
• Does RLE help with the sentence "this is an example of a huffman tree"?
• Other forms of compression look at groupings of characters (this can give much better compression ratios) https://en.wikipedia.org/wiki/Lempel–Ziv–Welch
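A minimal run-length encoder in Python, using itertools.groupby to find the runs, shows both the example and what happens with the Huffman sentence. The function name rle_encode is illustrative, and this simple value-then-count format assumes the input contains no digits.

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    """Store each run as the value followed by its count, e.g. AAAB -> A3B1."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))

print(rle_encode("AAABBCCCC"))  # A3B2C4

sentence = "this is an example of a huffman tree"
encoded = rle_encode(sentence)
print(len(sentence), len(encoded))  # 36 68
```

The sentence has only two runs longer than one character ("ff" and "ee"), so storing a count for every run makes it longer, not shorter.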
SOUNDS AND IMAGES
• Sounds and images require converting naturally analogue representations to digital representations
• Sound waves are characterized by:
• Amplitude: height of the wave at a moment in time
• Period: length of time until wave pattern repeats
• Frequency: number of cycles per unit time
SOUND WAVE
ANALOGUE TO DIGITAL
• Digitize: to convert to a digital form
• Sampling: record sound wave values at fixed, discrete intervals
• To reproduce sound, approximate using samples
• Quality is determined by
• Sampling rate: number of samples per second
• More samples ▶ more accurate waveform
• Bit depth: number of bits per sample
• More bits ▶ more accurate amplitude
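Sampling and bit depth can be illustrated by digitizing a sine wave in Python. This is a sketch under assumptions: the 440 Hz tone, the two sample rates and the two bit depths are example values chosen here, and sample_sine is a hypothetical helper.

```python
import math

def sample_sine(frequency_hz, duration_s, sample_rate, bit_depth):
    """Sample a sine wave and quantize each sample to a signed integer."""
    max_level = 2 ** (bit_depth - 1) - 1      # e.g. 7 for 4 bits, 32767 for 16 bits
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                   # time of this sample in seconds
        amplitude = math.sin(2 * math.pi * frequency_hz * t)  # between -1.0 and 1.0
        samples.append(round(amplitude * max_level))
    return samples

# 10 ms of a 440 Hz tone at two different qualities.
coarse = sample_sine(440, 0.01, 8000, 4)      # low sampling rate, 4-bit samples
fine = sample_sine(440, 0.01, 44100, 16)      # higher sampling rate, 16-bit samples

print(len(coarse), len(fine))                 # 80 441
print(len(set(coarse)), len(set(fine)))       # distinct amplitude levels used
```

More samples per second and more bits per sample both increase the amount of data, which is the trade-off behind the quality settings above.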
SAMPLING
TOO MUCH DATA
• Storing all of the sampling data for a piece of music requires a very large amount of memory
• We can compress the data in a similar way to the methods we have already seen (lossless)
• Luckily if we don’t mind losing some of the data we can use techniques which compress way more (lossy)
• In the case of audio we design our algorithms to throw away data which humans can’t perceive
• https://www.youtube.com/watch?v=KGZ0een8vSE (1:20)
IMAGES
• Image sampling: record colour or intensity at fixed, discrete intervals in two dimensions
• Pixels: individual recorded samples
• RGB encoding scheme:
• Colours are combinations of red, green, and blue
• One byte each for red, green, and blue
• Raster graphics store a picture as a two-dimensional grid of pixel values
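A raster image in the RGB scheme can be sketched in Python as a grid of (red, green, blue) tuples. The 2 by 3 image and the helper to_bytes are made up for illustration; real image files add headers and usually compression on top of this.

```python
# A tiny raster image: 2 rows of 3 pixels, one byte per colour channel.
RED, GREEN, BLUE = (255, 0, 0), (0, 255, 0), (0, 0, 255)
WHITE, BLACK, YELLOW = (255, 255, 255), (0, 0, 0), (255, 255, 0)

image = [
    [RED, GREEN, BLUE],
    [WHITE, BLACK, YELLOW],
]

def to_bytes(pixels):
    """Flatten the grid row by row into raw bytes (3 bytes per pixel)."""
    return bytes(value for row in pixels for pixel in row for value in pixel)

raw = to_bytes(image)
print(len(raw))        # 18 = 2 rows x 3 columns x 3 bytes
print(raw[:3].hex())   # ff0000 (the first pixel, pure red)
print(2 ** 24)         # 16777216 distinct colours with 24 bits per pixel
```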
DIGITIZED IMAGE
USE MORE BITS FOR MORE COLOURS
IMAGES AND MOVIES
• We can lose even more information in images and we don’t notice it
• Downsample colour
• https://www.youtube.com/watch?v=n_uNPbdenRs 5:00 minutes in
• if you are interested the presenter of the video above has a really good explanation of the discrete cosine transform used in JPEG images (this is not part of this course, but it is really interesting) https://www.youtube.com/watch?v=Q2aEzeMDHMA
HOW MUCH DATA?
• What is the space necessary to store the following data, uncompressed?
• 1000 integer values - depends on the size of the integer
• 10-page text paper - depends on how many characters on a page
• 60-second sound file - depends on sampling rate and bit depth
• 480 by 640 image - depends on the colour depth
• Data compression: storing data in a reduced-size form to save space/time
• Lossless: data can be perfectly restored - Huffman, RLE
• Lossy: data cannot be perfectly restored - common audio, image and video file types, e.g. mp3, jpeg
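The "depends on" answers above become concrete once values are chosen. In the Python sketch below every default is an assumption picked for illustration (4-byte integers, 3000 one-byte characters per page, 44,100 samples per second at 16 bits in stereo, 24-bit colour), and the function names are hypothetical.

```python
def integers_bytes(count, bytes_per_int=4):
    """Space for count integers; 4 bytes (32 bits) each is an assumption."""
    return count * bytes_per_int

def text_bytes(pages, chars_per_page=3000, bytes_per_char=1):
    """Space for plain text; 3000 ASCII characters per page is an assumption."""
    return pages * chars_per_page * bytes_per_char

def sound_bytes(seconds, sample_rate=44100, bit_depth=16, channels=2):
    """Space for uncompressed audio; the defaults are assumed example values."""
    return seconds * sample_rate * (bit_depth // 8) * channels

def image_bytes(width, height, bits_per_pixel=24):
    """Space for an uncompressed raster image; 24-bit RGB is an assumption."""
    return width * height * bits_per_pixel // 8

print(integers_bytes(1000))   # 4000
print(text_bytes(10))         # 30000
print(sound_bytes(60))        # 10584000
print(image_bytes(480, 640))  # 921600
```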
