A brief guide to binary data

Mike Sandiford, March 2001

These notes are designed to provide an introductory-level knowledge
appropriate to understanding the basics of digital data formats.

The problem with characters!

In the earth sciences we mostly deal with spatially located digital data that may
vary with time. A typical spatial data set can be represented by x, y and z values
where x may be longitude, y latitude and z the spatial variable such as elevation,
magnetic intensity or gravity. Data sets can be very large! For example, a global
data set consisting of elevation readings at ~30 arc-second spacing (about 1 km at
the equator) requires approximately 1 billion individual elevation estimates. The
fundamental question in regard to computing is how should we store this
information most efficiently?

One way is to store the numbers in text format, with the x, y and z values of
each data point written as characters on a separate line. A data point near
Melbourne might look like:

145.705E 37.755S 1023

with each x-y-z triplet requiring about 19 characters to describe. To store this
information line by line we would additionally need another character to
represent the line ending. The full data set is therefore of the order of 20 billion
times the storage requirement of a single character. In the ASCII text format
convention, each character is stored in 8 bits of information (= 1 byte), implying
that our data set requires of the order of 20 Gigabytes of storage. Fortunately,
data file sizes can be greatly reduced by storing data in binary format, especially
when the data are gridded (i.e. on a regularly spaced array of x- and y-values,
such that the x- and y-values do not need to be explicitly recorded for each
z-value). For example, the gridded binary representation of this data set in short
integer format is ~2 Gbytes.
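
The storage estimates above can be checked with a few lines of Python. This is
only a rough sketch: it assumes full global coverage at exactly 30 arc-second
spacing and uses the approximate figure of 20 characters per line quoted above.
The variable names are illustrative.

```python
# Rough storage estimate for a global 30 arc-second elevation grid.
# Assumptions (illustrative only): full global coverage, 120 samples per degree.
SAMPLES_PER_DEGREE = 120             # 3600 arc-seconds / 30
n_cols = 360 * SAMPLES_PER_DEGREE    # samples along a line of latitude
n_rows = 180 * SAMPLES_PER_DEGREE    # samples along a line of longitude
n_points = n_cols * n_rows

CHARS_PER_LINE = 20                  # ~19 characters plus a line ending
BYTES_PER_SHORT = 2                  # one 16-bit integer per z-value

text_bytes = n_points * CHARS_PER_LINE
binary_bytes = n_points * BYTES_PER_SHORT

print(f"data points:        {n_points:,}")
print(f"ASCII text file:    {text_bytes / 1e9:.1f} Gbytes")
print(f"gridded short ints: {binary_bytes / 1e9:.1f} Gbytes")
```

The point count comes out a little under 1 billion, so the text and binary sizes
land close to the 20 Gbyte and 2 Gbyte figures given above.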

However, binary data are difficult to deal with because we have to know the
form of the data before we can read it. The notes below provide some
background to understanding the binary format representation of digital data.

Bits and bytes

Computers deal only in 1s and 0s, each of which can be stored in a single bit.
Numbers greater than 1 therefore have to be represented in binary form (i.e.,
base 2), as a series of 1s and 0s.

Sets of 8 bits, termed a byte, form the basic functional data processing unit on
most computers. A byte can be used to describe 256 different numbers or
characters (0000 0000 to 1111 1111). Unsigned bytes correspond to the decimal
range 0-255.

Column     8th         7th        6th      5th     4th    3rd   2nd   1st

Decimal    10^7        10^6       10^5     10^4    10^3   10^2  10^1  10^0
           10,000,000  1,000,000  100,000  10,000  1,000  100   10    1

Binary     2^7         2^6        2^5      2^4     2^3    2^2   2^1   2^0
           128         64         32       16      8      4     2     1

Thus, the decimal number 179 (1x10^2 + 7x10^1 + 9x10^0) is represented in binary
byte form as 1011 0011 (1x2^7 + 1x2^5 + 1x2^4 + 1x2^1 + 1x2^0 = 128 + 32 + 16 +
2 + 1).
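
The conversion of 179 can be reproduced in Python. The sketch below is one way
of doing it; the helper names to_byte_string and place_values are made up for
this example.

```python
def to_byte_string(n):
    """Return the 8-bit binary form of an unsigned byte value (0-255)."""
    if not 0 <= n <= 255:
        raise ValueError("value does not fit in one unsigned byte")
    bits = format(n, "08b")              # zero-padded to 8 binary digits
    return bits[:4] + " " + bits[4:]     # split into two groups of 4 bits


def place_values(n):
    """List the powers of two (largest first) that add up to n."""
    return [2**k for k in range(7, -1, -1) if n & (1 << k)]


print(to_byte_string(179))   # 1011 0011
print(place_values(179))     # [128, 32, 16, 2, 1]
```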

Many graphics formats (e.g. GIF) store the colour index in arrays of bytes,
implying that the maximum number of colours in any individual image is 256. A
512 x 512 pixel image in byte form with no compression occupies 262 kbytes (plus
whatever is required for the header information).

Bytes that describe values in the decimal range -128 to +127 are
considered to be signed and are given a different binary representation. The
initial bit is set to 1 for negative numbers and 0 for positive numbers. For
positive numbers the remaining 7 bits describe values from 0 to 127. Negative
numbers are represented by the two's complement convention, in which all bits
of the corresponding positive number are flipped and 1 is added:

Decimal 127 is signed binary 0111 1111

Decimal -127 is signed binary 1000 0001 (note the added 1)

In order to read byte data you must know whether the data are signed or
unsigned. Thus the signed byte 1000 0000 (decimal -128) corresponds to the
unsigned decimal value 128.
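
The difference between the two readings can be seen by unpacking the same byte
both ways. This sketch uses Python's standard struct module, in which the format
code "B" means an unsigned byte and "b" a signed byte. The helper
twos_complement is a name invented for this example.

```python
import struct

raw = bytes([0b10000001])                       # the bit pattern 1000 0001

unsigned_value = struct.unpack("B", raw)[0]     # read as an unsigned byte
signed_value = struct.unpack("b", raw)[0]       # read as a signed byte


def twos_complement(n):
    """8-bit two's complement bit pattern of a signed value (-128 to 127)."""
    if not -128 <= n <= 127:
        raise ValueError("value does not fit in one signed byte")
    return format(n & 0xFF, "08b")              # keep the lowest 8 bits


print(unsigned_value, signed_value)             # 129 -127
print(twos_complement(127))                     # 01111111
print(twos_complement(-127))                    # 10000001
```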

Byte arithmetic

Because byte data can only represent a limited number of values,
mathematical operations on byte data are subject to certain oddities. For
example, it is easy to understand that:

1000 0000 + 0111 1111 = 1111 1111 (128b + 127b = 255b)

(the b signifies byte arithmetic). However, the following is a bit more tricky:

1111 1111 + 0000 0010 = 0000 0001 (255b + 2b = 1b)

The true sum, 257, needs a ninth bit; that bit is discarded, leaving 1.
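
Unsigned byte arithmetic of this kind can be imitated by keeping only the lowest
8 bits of a result. This is a minimal sketch, and byte_add is a name chosen for
illustration.

```python
def byte_add(a, b):
    """Add two unsigned bytes, keeping only the lowest 8 bits of the sum."""
    return (a + b) & 0xFF        # same as (a + b) % 256


print(byte_add(128, 127))        # 255
print(byte_add(255, 2))          # 1
```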

Hexadecimal numbers

The hexadecimal numbering scheme (base 16) provides a convenient and
simplified form of representing binary numbers used by computers. Each byte
can be represented as two 4-bit numbers (a 4-bit number allows 2^4 or 16
different possibilities). In the hexadecimal form, the decimal numbers 0
through 15 are represented by 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. Thus,
the binary number 1011 0011 (decimal 179) is represented by hexadecimal B3.
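
The link between binary, hexadecimal and decimal forms can be checked directly
in Python, as in this short sketch (variable names are illustrative):

```python
value = 0b10110011                 # binary 1011 0011

high_nibble = value >> 4           # upper 4 bits: 1011
low_nibble = value & 0x0F          # lower 4 bits: 0011
hex_text = format(value, "02X")    # two hexadecimal digits, upper case

print(value)                       # 179
print(high_nibble, low_nibble)     # 11 3
print(hex_text)                    # B3
print(int("B3", 16))               # 179
```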

Integers

Sixteen bit (2-byte) binary numbers can represent 65,536 different numbers and
thus can be used to represent unsigned integers in the (decimal) range 0 to
65,535, or signed integers in the range -32,768 to 32,767. The 2-byte integer
representation is sometimes referred to as short integers. Long integers are
32 bit (4-byte) and can be used to represent integers in the decimal range
-2,147,483,648 to 2,147,483,647.

A 512x512 array of short integers in binary format occupies 524 kbytes of
memory (1/2 Mbyte).
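
The integer ranges and the array size quoted above follow from the number of
bits available. A small sketch, with int_range as an illustrative helper name:

```python
def int_range(n_bytes, signed):
    """Smallest and largest integer that fit in n_bytes bytes."""
    bits = 8 * n_bytes
    if signed:
        # Two's complement: half the bit patterns are negative numbers.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


print(int_range(2, signed=False))    # (0, 65535)
print(int_range(2, signed=True))     # (-32768, 32767)
print(int_range(4, signed=True))     # (-2147483648, 2147483647)

array_bytes = 512 * 512 * 2          # 512x512 short integers, 2 bytes each
print(array_bytes)                   # 524288
```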

Byte ordering conventions (big-endian and little-endian).

The logical way of ordering bytes in integers and real numbers follows the
convention associated with ordering bits in bytes with the most significant
number first (as in the logic associated with decimal numbers where 1234
means 1x1000 + 2x100 + 3x10 + 4x1 and not 1x1 + 2x10 + 3x100 + 4x1000).

Systems originally based on Motorola processors (many Unix and Macintosh
computers) follow this big-endian, or most significant byte first, convention,
whereas Intel computers have the reverse, or little-endian, byte ordering.
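
The effect of byte order can be demonstrated with Python's standard struct
module, where the prefix ">" requests big-endian and "<" little-endian packing.
The value 1023 is simply the elevation from the earlier example. This is a
sketch of what happens when a file is read with the wrong convention.

```python
import struct
import sys

value = 1023                                 # hexadecimal 03FF

big = struct.pack(">h", value)               # most significant byte first
little = struct.pack("<h", value)            # least significant byte first

print(big.hex(), little.hex())               # 03ff ff03

# Reading big-endian bytes as if they were little-endian gives nonsense.
misread = struct.unpack("<h", big)[0]
print(misread)                               # -253

# Byte order of the machine running this script: 'little' or 'big'.
print(sys.byteorder)
```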

Real numbers (single precision and double precision floats)

Real numbers pose special problems, which are solved by the floating-point
format, which uses a binary encoded form of exponential notation. In
exponential notation the decimal number 0.00014376 is represented as 1.4376 x
10^-4, consisting of the fractional component 1.4376 and the exponent -4. In the
IEEE single-precision (4-byte) format, floating point numbers are stored as a
sign bit and two (binary format) integers in the following order:

1 bit for the sign, 8 bits for the exponent and 23 bits for the fraction

Double precision (8-byte) floating point format is:

1 bit for the sign, 11 bits for the exponent and 52 bits for the fraction

Single precision floats encompass the decimal range of approximately 10^-38
to 10^38. In addition, the IEEE floating point convention defines special
values that cannot be represented by the above convention: Zero (for 0), Inf
(for infinity) and NaN (for not-a-number).

A 512x512 array of single precision floats in binary format occupies 1048
kbytes of memory (1 Mbyte).
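
The sign, exponent and fraction fields of a single precision float can be pulled
apart with the standard struct module, as sketched below. Two details used here
come from the IEEE 754 standard rather than from these notes: the exponent is
stored with 127 added to it, and the stored fraction omits a leading 1. The
function names are illustrative.

```python
import struct


def float32_fields(x):
    """Split x, stored as an IEEE single precision float, into its bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))   # 4 bytes as one integer
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with 127 added
    fraction = bits & 0x7FFFFF         # 23 bits, leading 1 not stored
    return sign, exponent, fraction


def fields_to_value(sign, exponent, fraction):
    """Rebuild an ordinary value (not Zero, Inf or NaN) from its fields."""
    return (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (exponent - 127)


print(float32_fields(1.0))             # (0, 127, 0)
print(float32_fields(-2.0))            # (1, 128, 0)
print(float32_fields(0.15625))         # (0, 124, 2097152)
print(float32_fields(float("inf")))    # (0, 255, 0)
```

The special values use the reserved exponents: all exponent bits 0 for Zero, and
all exponent bits 1 for Inf (fraction zero) and NaN (fraction non-zero).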

Header files and header records

The data format, layout and geographic referencing of a binary file are
often encoded within a header record or an associated header file (often
indicated with the file suffix .hdr). Header records usually occupy a specific
number of bytes at the beginning of the binary file, such that the first data entry
is offset from the beginning of the file by a specific amount. To read the header
you need to understand the format of the header record. Header files are
usually in ASCII text format and thus are easier to read than header records.
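
The idea of a header record followed by offset data can be sketched as follows.
The layout used here is hypothetical and invented purely for illustration: an
8-byte header holding the number of rows and columns as two big-endian 32-bit
integers, followed by the grid as big-endian short integers. Real formats define
their own headers, which must be looked up in their documentation.

```python
import io
import struct

# Hypothetical header layout (an assumption for this example only):
# two big-endian 32-bit integers giving the number of rows and columns.
HEADER_FORMAT = ">ii"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)     # 8 bytes


def write_grid(stream, rows):
    """Write the header record, then the grid values as 2-byte integers."""
    n_rows, n_cols = len(rows), len(rows[0])
    stream.write(struct.pack(HEADER_FORMAT, n_rows, n_cols))
    for row in rows:
        stream.write(struct.pack(f">{n_cols}h", *row))


def read_grid(stream):
    """Read the header first; the data start HEADER_SIZE bytes into the file."""
    n_rows, n_cols = struct.unpack(HEADER_FORMAT, stream.read(HEADER_SIZE))
    n_values = n_rows * n_cols
    values = struct.unpack(f">{n_values}h", stream.read(2 * n_values))
    return [list(values[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


# An in-memory buffer stands in for a file on disk.
buffer = io.BytesIO()
write_grid(buffer, [[1023, -5, 0], [12, 300, -32768]])
buffer.seek(0)
grid = read_grid(buffer)
print(grid)
```

Without knowledge of the header size and the data type, the same bytes could not
be interpreted correctly, which is the difficulty described above.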

Scientific data formats

There are a number of special scientific data formats used in earth science.
Two of the most common are the HDF and netCDF formats. These formats not
only store the data in a variety of binary forms, but also store attributes and
comments relating to the data. In addition, the HDF format can store
information relating to colour mapping used in the generation of images, and is
therefore commonly used to store scientific data visualisations.

The netCDF and HDF formats have the advantage of being machine-independent
and thus are easily transported between different computers.

Multi-band files

Often more than one measurement is made at an individual station (or
geographic point). For example, airborne radiometric survey data record K2O,
Th and U concentrations. Satellite remote sensed data such as LANDSAT and
SPOT are examples of multi-band data. Measurements of each data type are
stored in discrete bands. Multi-band files may be stored on disk in one of three
ways, namely:

BSQ or band-sequential, in which each band is stored in full (i.e. band 1, band 2,
band 3, ...);

BIL or band-interleaved by line, in which bands are stored as sequential line
data (i.e. line 1 band 1, line 1 band 2, line 1 band 3, ..., line 2 band 1, line 2
band 2, line 2 band 3, ...); and

BIP or band-interleaved by pixel, in which bands are stored point by point (i.e.
point 1 band 1, point 1 band 2, point 1 band 3, ..., point 2 band 1, point 2 band
2, point 2 band 3, ...).
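
The three orderings differ only in which index varies fastest. The sketch below
computes where a given value sits in the file for each layout, counting samples
from 0. The function name and argument names are assumptions made for this
example, and lines are taken to be rows of pixels.

```python
def sample_index(layout, band, line, pixel, n_bands, n_lines, n_pixels):
    """Position (in samples, from 0) of one value in a multi-band file."""
    if layout == "BSQ":      # whole bands one after another
        return (band * n_lines + line) * n_pixels + pixel
    if layout == "BIL":      # for each line, that line from every band in turn
        return (line * n_bands + band) * n_pixels + pixel
    if layout == "BIP":      # for each pixel, its value from every band in turn
        return (line * n_pixels + pixel) * n_bands + band
    raise ValueError(f"unknown layout: {layout}")


# Example: 3 bands, 2 lines of 4 pixels; locate band 1, line 1, pixel 2.
for layout in ("BSQ", "BIL", "BIP"):
    print(layout, sample_index(layout, 1, 1, 2, n_bands=3, n_lines=2, n_pixels=4))
```

Multiplying the sample position by the number of bytes per value (and adding any
header size) gives the byte offset of that value in the file.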

Programs for reading binary data

While it is relatively simple (given some basic programming knowledge) to
write programs that read binary data, most of us do not want to waste our time
doing this! Fortunately, there are numerous programs that are well adapted for
this, although the downside is that, since reading binary data is a relatively
specialized activity, the programs are usually very expensive. The high-end
programs, such as ERmapper and ENVI, combine the ability to import data
from almost all standard formats with a vast arsenal of routines for geographic
registration, processing and visualising the data. As the course progresses we
will investigate some of these using ERmapper, which is probably the most
widely used application in Earth sciences for operating on large gridded binary
data sets.
