These Notes Are Designed To Provide An Introductory-Level Knowledge Appropriate To Understanding The Basics of Digital Data Formats
In the earth sciences we mostly deal with spatially located digital data that may
vary with time. A typical spatial data set can be represented by x, y and z values
where x may be longitude, y latitude and z the spatial variable such as elevation,
magnetic intensity or gravity. Data sets can be very large! For example, a global
data set consisting of elevation readings at ~30 arc-second spacing (about 1 km at
the equator) requires approximately 1 billion individual elevation estimates. The
fundamental question in regard to computing is how should we store this
information most efficiently?
In a plain ASCII text file the data could be listed as one x-y-z triplet per line,
with each x-y-z triplet requiring about 19 characters to describe. To store this
information line by line we would additionally need another character to
represent the line-ending. The full data set is therefore of the order of 20 billion
times the storage requirements of each character. In the ASCII text format
convention, characters are encoded by 8 bits of information (= 1 byte), implying
our data set requires of the order of 20 Gigabytes of storage. Fortunately, data file
sizes can be greatly reduced by storing data in binary format, especially when
the data are gridded (i.e., on a regularly spaced array of x- and y- values such that
the x- and y- values do not need to be explicitly recorded for each z-value). For
example, the gridded binary representation of this dataset in short integer format
is ~2 Gbytes.
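The estimate above can be checked with a few lines of arithmetic. The sketch below assumes a full global grid (360 x 180 degrees) sampled every 30 arc-seconds, 20 characters per text record as in the notes, and 2 bytes per gridded value; the variable names are illustrative only.

```python
# Rough storage estimate for a global grid at 30 arc-second spacing.
# Assumes 20 characters per text record (19 plus a line ending), as in the notes.
SECONDS_PER_DEGREE = 3600
SPACING_ARCSEC = 30

n_cols = 360 * SECONDS_PER_DEGREE // SPACING_ARCSEC  # samples in longitude
n_rows = 180 * SECONDS_PER_DEGREE // SPACING_ARCSEC  # samples in latitude
n_points = n_cols * n_rows

ascii_bytes = n_points * 20   # one x-y-z text record per grid point
binary_bytes = n_points * 2   # one 2-byte short integer per grid point

print(f"grid points : {n_points:,}")                   # 933,120,000
print(f"ASCII xyz   : {ascii_bytes / 1e9:.1f} GB")     # 18.7 GB
print(f"binary grid : {binary_bytes / 1e9:.1f} GB")    # 1.9 GB
```

The factor of ten between the two figures comes entirely from dropping the explicit x and y values and storing z as a 2-byte integer rather than as text.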
However, binary data are difficult to deal with because we have to know the
form of the data before we can read it. The notes below provide some
background to understanding the binary format representation of digital data.
Computers deal only in 1s and 0s, which can be stored in a single bit.
Numbers greater than 1 therefore have to be represented in binary form (i.e.,
base 2), as a series of 1s and 0s.
Sets of 8 bits, termed a byte, form the basic functional data processing unit on
most computers. A byte can be used to describe 256 different numbers or
characters (0000 0000-1111 1111). Unsigned bytes correspond to the decimal
range 0-255.
Thus, the decimal number 179 (1x10^2 + 7x10^1 + 9x10^0) is represented in binary
byte form as 1011 0011 (1x2^7 + 1x2^5 + 1x2^4 + 1x2^1 + 1x2^0 = 128 + 32 + 16 + 2 +
1).
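The conversion of 179 to binary can be reproduced as follows. This is a small sketch using only built-in Python functions; the variable names are chosen for illustration.

```python
value = 179

# Format as an 8-digit binary string, padded with leading zeros.
bits = format(value, "08b")
print(bits)  # 10110011

# Collect the power of two contributed by each bit that is set.
place_values = [2 ** (7 - i) for i, bit in enumerate(bits) if bit == "1"]
print(place_values)       # [128, 32, 16, 2, 1]
print(sum(place_values))  # 179

# Converting the bit string back to an integer recovers the original value.
recovered = int(bits, 2)
print(recovered)          # 179
```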
Many graphics formats (e.g., GIFs) store the colour index in arrays of bytes,
implying that the maximum number of colours in any individual image is 256. A 512 x
512 pixel image in byte form with no compression occupies 262 kbytes (plus
whatever is required for the header information).
Bytes used to describe values in the decimal range -128 to +127 are
considered to be signed and are given a different binary representation. The
initial bit is set to 1 for negative numbers and 0 for positive numbers. For
positive numbers the remaining 7 bits describe values from 0-127. Negative
numbers are represented by the twos complement convention, in which all bits of
the corresponding positive number are flipped and 1 is added. For example, +5 is
0000 0101, so -5 is 1111 1010 + 1 = 1111 1011.
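The flip-and-add-one rule can be written out directly. The function below is a minimal sketch; its name, `twos_complement_byte`, is hypothetical and it handles only values that fit in a signed byte.

```python
def twos_complement_byte(value):
    """Return the 8-bit twos complement pattern of value as a bit string."""
    if not -128 <= value <= 127:
        raise ValueError("value does not fit in a signed byte")
    if value >= 0:
        return format(value, "08b")
    magnitude = -value
    flipped = magnitude ^ 0xFF       # flip all 8 bits of the positive number
    pattern = (flipped + 1) & 0xFF   # add 1, keeping only the lowest 8 bits
    return format(pattern, "08b")


print(twos_complement_byte(5))     # 00000101
print(twos_complement_byte(-5))    # 11111011
print(twos_complement_byte(-1))    # 11111111
print(twos_complement_byte(-128))  # 10000000
```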
In order to read byte data you must know whether the data are signed or
unsigned. Thus the signed byte 1000 0000, which represents decimal -128,
corresponds to the unsigned decimal value 128.
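To illustrate why the signedness must be known in advance, the sketch below reads the same single byte twice with Python's standard `struct` module, once with the unsigned format code `B` and once with the signed code `b`.

```python
import struct

raw = bytes([0b10000000])  # the bit pattern 1000 0000

# Same bits, two interpretations.
(unsigned_value,) = struct.unpack("B", raw)  # unsigned byte
(signed_value,) = struct.unpack("b", raw)    # signed (twos complement) byte

print(unsigned_value)  # 128
print(signed_value)    # -128
```

Nothing in the byte itself records which interpretation is intended; that information has to come from the file's documentation or header.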
Byte arithmetic
Sums that stay within the 8-bit range behave as expected, but results that
exceed 255 wrap around, because only the lowest 8 bits are kept (the b signifies
byte arithmetic):
1111 1111 + 0000 0010 = 0000 0001 (255b + 2b = 1b)
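Python integers do not overflow, so byte arithmetic has to be imitated by discarding everything above the lowest 8 bits. The helper `add_bytes` below is a hypothetical name used only for this sketch.

```python
def add_bytes(a, b):
    """Add two unsigned bytes, keeping only the lowest 8 bits of the sum."""
    return (a + b) & 0xFF


result = add_bytes(0b11111111, 0b00000010)
print(result)                 # 1
print(format(result, "08b"))  # 00000001

# A sum that stays below 256 is unaffected by the mask.
print(add_bytes(1, 2))        # 3
```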
Hexadecimal numbers
Integers
Sixteen bit (2-byte) binary numbers can represent 65,536 different numbers and
thus can be used to represent unsigned integers in the (decimal) range 0 to
65,535, or signed integers in the range -32,768 to 32,767. The 2-byte integer
representation is sometimes referred to as short integers. Long integers are
32 bit (4-byte) and can be used to represent integers in the decimal range
-2,147,483,648 to 2,147,483,647.
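These ranges follow directly from the number of bits available. The sketch below derives them; `integer_range` is an illustrative helper, not part of any library.

```python
def integer_range(n_bytes, signed):
    """Return (minimum, maximum) for an integer stored in n_bytes bytes."""
    bits = 8 * n_bytes
    if signed:
        # One bit pattern in two is used for negative values.
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


print(integer_range(1, signed=False))  # (0, 255)
print(integer_range(2, signed=False))  # (0, 65535)
print(integer_range(2, signed=True))   # (-32768, 32767)
print(integer_range(4, signed=True))   # (-2147483648, 2147483647)
```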
The logical way of ordering bytes in integers and real numbers follows the
convention associated with ordering bits in bytes with the most significant
number first (as in the logic associated with decimal numbers where 1234
means 1x1000 + 2x100 + 3x10 + 4x1 and not 1x1 + 2x10 + 3x100 + 4x1000).
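The effect of byte order can be seen by writing the same 2-byte integer both ways. In the sketch below, "big" is the most-significant-byte-first order described above; the opposite order ("little", least significant byte first) is not described in the text above and is included only for comparison.

```python
value = 1234  # 0x04D2 in hexadecimal

# Most significant byte first, as in the convention described above.
big = value.to_bytes(2, byteorder="big")
# Least significant byte first (the opposite convention).
little = value.to_bytes(2, byteorder="little")

print(big.hex())     # 04d2
print(little.hex())  # d204

# Reading bytes with the wrong assumed order gives a different number.
misread = int.from_bytes(big, byteorder="little")
print(misread)       # 53764
```

Because both orders are in use on different machines, the byte order of a binary file is one of the things a reader has to know before interpreting multi-byte values.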
Real numbers pose special problems, which are solved by the floating-point
format, which uses a binary encoded form of exponential notation. In
exponential format the decimal number 0.00014376 is represented as 1.4376 x
10^-4, consisting of the fractional component 1.4376 and the exponent -4. In the
IEEE single-precision (4-byte) format, floating-point numbers are stored as a
sign bit followed by the binary-encoded exponent and fraction, in the following order:
1 bit for the sign, 8 bits for the exponent and 23 bits for the fraction
Double precision (8-byte) floating point format is:
1 bit for the sign, 11 bits for the exponent and 52 bits for the fraction
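The single-precision layout can be inspected by packing a number into 4 bytes and separating the three fields with bit shifts and masks. This is a sketch using the standard `struct` module; `float32_fields` is a hypothetical helper name. The reconstruction step relies on two details of the IEEE 754 standard that are not stated in the notes above: the exponent is stored with an offset (bias) of 127, and normalised numbers carry an implicit leading 1 in front of the fraction.

```python
import struct


def float32_fields(x):
    """Split x, stored as IEEE single precision, into (sign, exponent, fraction)."""
    # Pack as a 4-byte float, then reinterpret the same bytes as a 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, fraction


sign, exponent, fraction = float32_fields(-0.15625)
print(sign, exponent, fraction)  # 1 124 2097152

# Rebuild the value (valid for normalised numbers only).
rebuilt = (-1) ** sign * (1 + fraction / 2 ** 23) * 2.0 ** (exponent - 127)
print(rebuilt)                   # -0.15625
```

Here -0.15625 equals -1.25 x 2^-3, so the stored exponent is 127 - 3 = 124 and the fraction field encodes the 0.25.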
The data format, layout and geographic referencing of a binary file
are often described in a header record or an associated header file (often
indicated with the file suffix .hdr). Header records usually occupy a specific
number of bytes at the beginning of the binary file, such that the first data entry
is offset from the beginning of the file by a specific amount. To read the header
you need to understand the format of the header record. Header files are
usually in ASCII text format and thus are easier to read than header records.
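The idea of a header record followed by data at a fixed offset can be sketched as follows. The layout used here is entirely hypothetical (an 8-byte header holding the number of columns and rows as two 4-byte integers, followed by 2-byte signed integers, all most significant byte first); real formats define their own header contents and sizes.

```python
import os
import struct
import tempfile

HEADER_BYTES = 8  # hypothetical header: two 4-byte integers (columns, rows)


def write_grid(path, n_cols, n_rows, values):
    """Write a small grid with the hypothetical 8-byte header record."""
    with open(path, "wb") as f:
        f.write(struct.pack(">ii", n_cols, n_rows))
        f.write(struct.pack(f">{len(values)}h", *values))


def read_grid(path):
    """Read the header record, then the data that start at the header offset."""
    with open(path, "rb") as f:
        n_cols, n_rows = struct.unpack(">ii", f.read(HEADER_BYTES))
        f.seek(HEADER_BYTES)  # first data entry is offset by the header size
        count = n_cols * n_rows
        values = struct.unpack(f">{count}h", f.read(2 * count))
    return n_cols, n_rows, list(values)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    write_grid(demo_path, 3, 2, [10, -20, 30, 40, 50, -60])
    result = read_grid(demo_path)

print(result)  # (3, 2, [10, -20, 30, 40, 50, -60])
```

Without knowing the header size, the integer width, the signedness and the byte order, the same file could not be read back correctly.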
There are a number of special scientific data formats used in earth science.
Two of the most common are the HDF and netCDF formats. These formats not
only store the data in a variety of binary forms, but also store attributes and
comments relating to the data. In addition, the HDF format can store
information relating to colour mapping used in the generation of images, and is
therefore commonly used to store scientific data visualisations.
The netCDF and HDF formats have the advantage of being machine-independent
and thus are easily transported between different computers.
Multi-band files