2 Data Models
Introduction
Data in a GIS represent a simplified view of the real world. Physical entities or phenomena are approximated by data in a GIS. These data include information on the spatial location and extent of the physical entities, and information on their non-spatial properties.
Each entity is represented by a spatial feature or cartographic object in the GIS, and so there is an entity-object correspondence. Because every computer system has limits, only a subset of the essential characteristics is represented for each entity. As illustrated in Figure 2-1, we may represent lakes in a region by a set of polygons. These polygons are associated with a set of essential characteristics that define each lake. All other information for the area may be ignored, e.g., information on the roads, buildings, slope, or soil characteristics. Only lake boundaries and essential lake characteristics have been saved in this example.
Essential characteristics are defined by the person, group, or organization that develops the spatial data or uses the GIS. The set of characteristics used to represent an entity
Figure 2-1: A physical entity is represented by a spatial object in a GIS. Here, the physical boundaries of
lakes are represented by lines.
26 GIS Fundamentals
Figure 2-2: Levels of abstraction in the representation of spatial entities. The real world is repre-
sented in successively more machine-compatible but humanly obscure forms.
Chapter 2: Data Models 27
Figure 2-3: Coordinate and attribute data are used to represent entities.
objects. A coordinate most often consists of a pair of numbers that specify location in relation to an origin. The coordinates quantify the distance from the origin when measured along a standard direction. Single or groups of coordinates are organized to represent the shapes and boundaries that define the objects. Coordinate information is an important part of the data model, and models differ in how they represent these coordinates. Coordinates are usually expressed in one of many standard coordinate systems. The coordinate systems are usually based upon standardized map projections (discussed in Chapter 3) that unambiguously define the coordinate values for every point in an area.
Typically there are two distinct types of data used to define cartographic objects (Figure 2-3). First, coordinate or geometric data define the location and shape of the objects. Second, attribute data are collected and referenced to each object. These attribute data record the non-spatial components of an object, such as a name, color, pH, or cash value.
Attribute data are linked with coordinate data to help define each cartographic object in the GIS. The attribute data are linked to the corresponding cartographic objects in the spatial part of the GIS database. Keys, labels, or other indices are used so that the spatial and attribute data may be viewed, related, and manipulated together.
Most conceptualizations view the world as a set of layers (Figure 2-4). Each layer organizes the spatial and attribute data for a given set of cartographic objects in the region of interest. These are often referred to as thematic layers. As an example consider a GIS database that includes a soils data layer, a population data layer, an elevation data layer, and a roads data layer. The roads layer “contains” only roads data, including the location and properties of roads in the analysis area. There are no data regarding the
Figure 2-4: Spatial data are most often viewed as a set of thematically distinct layers.
Coordinate Data
Coordinates define location in two or three-dimensional space. Coordinate pairs, e.g., x and y, or coordinate triples, x, y, and z, are used to define the shape and location of each spatial object or phenomenon.
Spatial data in a GIS most often use a Cartesian coordinate system, so named after Rene Descartes, a French mathematician. Cartesian systems define two or three orthogonal (right-angle) axes. Two-dimensional Cartesian systems define x and y axes in a plane (Figure 2-5, left). Three-dimensional Cartesian systems define a z axis, orthogonal to both the x and y axes. An ori-
Figure 2-5: Two dimensional (left) and three dimensional (right) Cartesian coordinate systems.
type of line, even though the state highways may vary. Some may be paved with concrete, others with bitumen. Some may have wide shoulders, others not, or dividing barriers of concrete, versus a broad vegetated median. We realize these differences can exist within this conceptualization.
There are two main conceptualizations used for digital spatial data. The first conceptualization defines discrete objects using a vector data model. Vector data models use discrete elements such as points, lines, and polygons to represent the geometry of real-world entities (Figure 2-8).
A farm field, a road, a wetland, cities, and census tracts are examples of discrete entities that may be represented by discrete objects. Points are used to define the locations of “small” objects such as wells, buildings, or ponds. Lines may be used to represent linear objects, e.g., rivers or roads, or to identify the boundary between what is a part of the object and what is not a part of the object. We may map landcover for a region of interest, and we categorize discrete areas as a uniform landcover type. A forest may share an edge with a pasture, and this boundary is represented by lines. The boundaries between two polygons may not be discrete on the ground, for example, a forest edge may grade into a mix of trees and grass, then to pasture; however in the vector conceptualization, a line between two landcover types will be drawn to indicate a discrete, abrupt transition between the two types. Lines and points have coordinate locations, but points have no dimension, and lines have no dimension perpendicular to their direction. Area features may be defined by a closed, connected set of lines.
The second common conceptualization identifies and represents grid cells for a given region of interest. This conceptualization employs a raster data model (Figure 2-8). Raster cells are arrayed in a row and column pattern to provide “wall-to-wall” coverage of a study region. Cell values are used to represent the type or quality of mapped variables. The raster model is used most commonly with variables that may change
Figure 2-9: Coordinates define spatial location and shape. Attributes record the important non-spatial
characteristics of features in a vector data model.
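The vector conceptualization described above can be illustrated with a minimal Python sketch. It assumes a hypothetical `Feature` class that pairs a list of (x, y) coordinates with a dictionary of attributes; the class name, the example values, and the helper functions are illustrative assumptions, not part of any particular GIS package.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Feature:
    """One vector object: a geometry type, its coordinates, and its attributes."""
    kind: str                       # "point", "line", or "polygon"
    coords: list                    # ordered list of (x, y) coordinate pairs
    attributes: dict = field(default_factory=dict)


# Hypothetical example features; the values are made up for illustration.
well = Feature("point", [(3.0, 4.0)], {"name": "Well 7"})
road = Feature("line", [(0.0, 0.0), (3.0, 4.0), (6.0, 4.0)], {"surface": "concrete"})
pasture = Feature(
    "polygon",
    [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)],  # closed ring
    {"landcover": "pasture"},
)


def line_length(coords):
    """Sum the lengths of the straight segments between adjacent coordinates."""
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(coords, coords[1:]))


def polygon_area(coords):
    """Area of a closed ring using the shoelace formula."""
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(coords, coords[1:]))
    return abs(twice_area) / 2


print(line_length(road.coords))      # 8.0
print(polygon_area(pasture.coords))  # 12.0
```

The point carries location only, the line has length but no width, and the polygon is defined by a closed, connected set of line segments, which mirrors the description in the text.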
lated irregular network (TIN) is an example of such a data model. This model is most often used to represent surfaces, such as elevations, through a combination of point, line, and area features. Many consider this a special, admittedly well-developed, type of vector data model. Variants or other representations related to raster data models also exist. We choose two broad categories for clarity in an introductory text, and introduce variants as appropriate later in this and other chapters.
Vector data models will be described in the next section, including commonly found variants. Sections describing raster data models, TIN data models, and data structure then follow.
Vector Data Models
A vector data model uses sets of coordinates and associated attribute data to define discrete objects. Groups of coordinates define the location and boundaries of discrete objects, and these coordinate data plus associated attributes are used to create vector objects representing the real-world entities (Figure 2-9).
There are three basic types of vector objects: points, lines, and polygons (Figure 2-10). A point uses a single coordinate pair to represent the location of an entity that is considered to have no dimension. Gas wells, light poles, accident locations, and survey points are examples of entities often represented as point objects in a spatial database. Some of these have real physical dimension, but for the purposes of the GIS users they may be represented as points. In effect, this means the size or dimension of the entity is not important spatial information, only the central location. Attribute data are attached to each point, and these attribute data record the important non-spatial characteristics of the point entities. When using a point to represent a light pole, important attribute information might be the height of the pole, the type of light and power source, and the last date the pole was serviced.
Linear features, often referred to as arcs, are represented as lines when using vector data models. Lines are most often represented as an ordered set of coordinate pairs. Each line is made up of line segments that run between adjacent coordinates in the ordered set (Figure 2-10). A long, straight line may be represented by two coordinate pairs, one at the start and one at the end of the line. Curved linear entities are most often represented as a collection of short, straight line segments, although curved lines are at times represented by a mathematical equation describing a geometric shape. Lines typically have a starting point, an ending point, and intermediate points to represent the shape of the linear entity. Starting points and ending points for a line are sometimes referred to as nodes, while intermediate points in a line are referred to as vertices (Figure 2-10). Attributes may be attached to the whole line, line segments, or to nodes and vertices along the lines.
Area entities are most often represented by closed polygons. These polygons are formed by a set of connected lines, either one line with an ending point that connects back to the starting point, or as a set of lines connected start-to-end (Figure 2-10). Polygons have an interior region and may entirely enclose other polygons in this region. Polygons may be adjacent to other polygons and thus share “bordering” or “edge” lines with other polygons. Attribute data may be attached to the polygons, e.g., area, perimeter, landcover type, or county name may be linked to each polygon.
Figure 2-10: Points, nodes and vertices define point, line, and polygon features in a vector data model.
Note that there is no uniformly superior way to represent features. Some feature types may appear to be more “naturally” represented as points, e.g., manhole covers as points, roads as lines, and parks as polygons. However, in a very detailed data set, the manhole covers may be represented as circles, and both edges of the roads may be drawn and the roads represented as polygons. The representation depends as much on the detail, accuracy, and intended use of the data set as on our common conception or the general shape of the objects.
A set of features may be alternatively represented as points, lines, or polygons (Figure 2-11). Some applications may require that only the location of some portion of a feature be recorded, e.g., general building location, and select a point representation as sufficient (Figure 2-11, left). Other users may be interested in the outline of the feature and so require representation by lines, while polygon representations may be preferred for another application (Figure 2-11 mid and right, respectively). Our intended use often determines our conceptual model of a feature, and hence the vector type we use to represent the features.
The Spaghetti Vector Model
The spaghetti model is an early vector data model that was originally developed to organize and manipulate line data. Lines are captured individually with explicit starting and ending nodes, and intervening vertices used to define the shape of the line. The spaghetti model records each line separately. The model does not explicitly enforce or
record connections of line segments when they cross, nor when two line ends meet (Figure 2-12a). A shared polygon boundary may be represented twice, with a line for each polygon on either side of the boundary. Data in this form are similar in some respects to a plate of cooked spaghetti, with no ends connected and no intersections when lines cross.
The spaghetti model severely limits spatial data analysis and is little used except for very basic data entry or translation. Rudimentary data entry and editing software may initially store data as “spaghetti”. Because spaghetti data are unstructured, lines often do not connect when they should, and many common spatial analyses are inefficient, impossible, or give incorrect results. Area calculation, layer overlay, and many other analyses require “clean” spatial data in which all polygons close and lines meet correctly. Spaghetti data are most often processed to produce more structured, useful forms.
Topological Vector Models
Topological vector models specifically address many of the shortcomings of spaghetti data models. Early GIS developers realized that they could greatly improve the speed, accuracy, and utility of many spatial data operations by enforcing strict connectivity, by recording connectivity and adjacency, and by maintaining information on the relationships between and among points, lines, and polygons in spatial data. These early developers found it useful to record information on the topological characteristics of data sets.
Topology is the study of geometric properties that do not change when the forms are bent, stretched or undergo similar transformations. Polygon adjacency is an example of a topologically invariant property, because the list of neighbors to any given polygon does not change during geometric stretching or bending (Figure 2-12, b and c). Topological vector models explicitly record topological relationships such as adjacency and connectivity in the data files. These relationships may be recorded separately from the coordinate data and hence do not change when data are stretched or bent, e.g., when converting between coordinate systems.
Topological vector models may also enforce particular types of topological relationships. Planar topology requires that all features occur on a two-dimensional surface. There can be no overlaps among lines or
Figure 2-12: Spaghetti (a), topological (b), and topological-warped (c) vector data. Figures b and c
are topologically identical because they have the same connectivity and adjacency.
polygons in the same layer (Figure 2-13). When planar topology is enforced, lines may not cross over or under other lines. At each line crossing there must be an intersection. The top left of Figure 2-13 shows a non-planar graph. Four line segments coincide. At some locations the lines intersect and a node is present, but at some locations a line passes over or under another line segment. These lines are non-planar because if forced to be in the same plane, all line crossings would intersect at a node. The top right of Figure 2-13 shows planar topology enforced for these same four line segments. Nodes, shown as white-filled circles, are found at each line crossing.
Non-planarity may also occur for polygons, as shown at the bottom of Figure 2-13. Two polygons overlap slightly at an edge. This may be due to an error, e.g., the two polygons share a boundary but have been recorded with an overlap, or there may be two areas that overlap in some way. On the left the polygons are non-planar, that is, they occur one above the other. If topological planarity is enforced, these two polygons must be resolved into three separate, non-overlapping polygons. Nodes are placed at the intersections of the polygon boundaries (lower right, Figure 2-13).
There are additional topological constructs besides planarity that may be recorded or enforced in topological data structures. For example, polygons may be exhaustive, in that there are no gaps, holes or “islands” in a set of polygons. Line direction may be recorded, so that a “from” and “to” node are identified in each line. Directionality aids the representation of river or street networks, where there may be a natural flow direction.
There is no single, uniform set of topological relationships that are included in all topological data models. Different researchers or software vendors have incorporated different topological information in their data structures. Planar topology is often included, as are representations of adjacency
(which polygons are next to which) and connectivity (which lines connect to which). However, much of this information can be generated “on-the-fly”, during processing. Topological relationships may be constructed only as needed, each time a data layer is accessed. Some GIS software packages create and maintain detailed topological relationships in their data. This results in more complex and perhaps larger data structures, but access is often faster, and topology provides more consistent, “cleaner” data. Other systems maintain little topological information in the data structures, but compute and act upon topology as needed during specific processing.
Topology may also be specified between layers, because we may wish to enforce spatial relationships between entities that are stored in different data layers. As an example, consider a cadastral data layer that stores property lines, and a housing data layer that stores building footprints (Figure 2-14). Rules may be specified that prevent polygons in the housing data layer from crossing property lines in the cadastral data layer. A violation would indicate a building that crosses a property line. Most such instances occur as a result of small errors in data entry or misalignment among data layers. Topological restrictions between two data layers avoid these inconsistencies. Exceptions may be granted in those few cases when a building truly does cross property lines.
Figure 2-14: Topological rules may be enforced across data layers. Here, rules may be specified to avoid overlap between objects in different layers.
There are many other types of topological constraints that may be enforced, both within and between layers. Dangles, lines that do not connect to other lines, may be proscribed, or limited to be greater or less than some threshold length. Lines and points may be required to coincide, for example, water pumps in one data layer and water pipes in another, or lines in separate layers may be required to intersect or be coincident. While these topological rules add complexity to vector data sets, they may also improve the logical consistency and value of these data.
Topological vector models often use codes and tables to record topology. As described above, nodes are the starting and ending points of lines. Each node and line is given a unique identifier. Sequences of nodes and lines are recorded as a list of identifiers, and point, line, and polygon topology recorded in a set of tables. The vector features and tables in Figure 2-15 illustrate one form of this topological coding.
Many GIS software systems are written such that the topological coding is not visible to users, nor directly accessible by the users. Tools are provided to ensure the topology is created and maintained, e.g., there may be directives that require that polygons in two layers do not overlap, or to ensure planarity for all line crossings. However, the topological tables these commands build are often quite large, complex, and linked in an obscure way, and therefore hidden from users.
Point topology is often quite simple. Points are typically independent of each other, so they may be recorded as individual identifiers, perhaps with coordinates included, and in no particular order (Figure 2-15, top).
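The dangle rule mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular GIS package implements it: lines are stored as hypothetical ordered coordinate lists, two line ends are treated as connected only when their coordinates match exactly, and `find_dangles` and its `max_length` parameter are made-up names.

```python
import math
from collections import Counter

# Hypothetical line layer: three lines forming a closed triangle, plus a short
# overshoot (L4) whose far end touches no other line.
lines = {
    "L1": [(0, 0), (4, 0)],
    "L2": [(4, 0), (4, 3)],
    "L3": [(4, 3), (0, 0)],
    "L4": [(4, 0), (5, 0)],
}


def find_dangles(lines, max_length=None):
    """Return ids of lines with at least one end that meets no other line end.

    If max_length is given, only dangles no longer than that are reported,
    which is one way to apply a threshold-length rule.
    """
    # Count how many line ends fall on each end-point coordinate.
    end_counts = Counter()
    for coords in lines.values():
        end_counts[coords[0]] += 1
        end_counts[coords[-1]] += 1

    dangles = []
    for line_id, coords in lines.items():
        has_free_end = end_counts[coords[0]] == 1 or end_counts[coords[-1]] == 1
        if not has_free_end:
            continue
        length = sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))
        if max_length is None or length <= max_length:
            dangles.append(line_id)
    return dangles


print(find_dangles(lines))                  # ['L4']
print(find_dangles(lines, max_length=0.5))  # []
```

Real data rarely match to the last decimal place, so a practical check would usually compare end points within a snapping tolerance rather than by exact equality.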
Line topology typically includes substantial structure, and identifies at a minimum the beginning and ending points of each line (Figure 2-15, middle). Variables record the topology and may be organized in a table. These variables may include a line identifier, the starting node, and the ending node for each line. In addition, lines may be assigned a direction, and the polygons to the left and right of the lines recorded. In most cases left and right are defined in relation to the direction of travel from the starting node to the ending node.
Polygon topology may also be defined by tables (Figure 2-15, bottom). The tables may record the polygon identifiers and the list of connected lines that define the polygon. Edge lines are often recorded in sequential order. The lines for a polygon form a closed loop, so the starting node of the first line in the list also serves as the ending node for the last line in the list. Note that there may be a “background” polygon defined by the outside area. This background polygon is not a closed polygon as are all the rest; however, it may be defined for consistency and to provide entries in the line topology table.
Finally, note that there may be coordinate tables (not shown in Figure 2-15) that record the identifiers and locations of each node, and coordinates for each vertex within a line or polygon. Node locations are recorded with coordinate pairs for each node, while line locations are represented by an identifier and a list of vertex coordinates for each line.
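The tables described above can be sketched as plain Python dictionaries. This is a minimal illustration, not the layout used by Figure 2-15 or by any specific software: the identifiers, the two-square example, and the function name `adjacent_polygons` are assumptions made for the sketch. Two unit squares, A and B, share one edge, and everything else is the background polygon.

```python
# Coordinate table: node id -> (x, y).
node_table = {"n1": (1.0, 0.0), "n2": (1.0, 1.0)}

# Line topology table: line id -> (start node, end node, left polygon, right polygon).
# Left and right are relative to travel from the start node to the end node.
line_table = {
    "a": ("n2", "n1", "A", "outside"),  # outer boundary of A, counterclockwise
    "b": ("n1", "n2", "B", "outside"),  # outer boundary of B, counterclockwise
    "c": ("n1", "n2", "A", "B"),        # shared edge between A and B
}

# Polygon topology table: polygon id -> list of bounding lines.
polygon_table = {"A": ["c", "a"], "B": ["b", "c"]}


def adjacent_polygons(polygon_id):
    """Find neighbors by table lookup only; no coordinates are examined."""
    neighbors = set()
    for line_id in polygon_table[polygon_id]:
        _start, _end, left, right = line_table[line_id]
        neighbors.update((left, right))
    neighbors.discard(polygon_id)  # a polygon is not its own neighbor
    return neighbors


print(adjacent_polygons("A") == {"B", "outside"})  # True
```

Because adjacency is read from the left and right polygon columns, the answer would not change if the coordinates in `node_table` were stretched or reprojected.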
Figure 2-15: An example of possible vector feature topology and tables. Additional or different
tables and data may be recorded to store topological information.
Figure 2-15 illustrates the inter-related structure inherent in the tables that record topology. Point or node records may be related to lines, which in turn may be related to polygons. All these may then be linked in complex ways to coordinate tables that record location.
Topological vector models greatly enhance many vector data operations. Adjacency analyses are reduced to a table lookup, an operation that is relatively simple to program and quick to execute in most software systems. For example, an analyst may want to identify all polygons adjacent to a city. Assume the city is represented as a single polygon. The operation reduces to 1) scanning the polygon topology table to find the polygon labeled city and reading the list of lines that bound the polygon, and 2) scanning this list of lines for the city polygon, accumulating a list of all left and right polygons. Polygons adjacent to the city may be identified from this list. List searches on topological tables are typically much faster than searches involving coordinate data.
Topological vector models also enhance many other spatial data operations. Network and other connectivity analyses are concerned with the flow of resources through defined pathways. Topological vector models explicitly record the connections of a set of pathways and so facilitate network analyses. Overlay operations are also enhanced when using topological vector models. The mechanics of overlay operations are discussed in greater detail in Chapter 9; however, we will state here that they involve identifying line adjacency, intersection, and resultant polygon formation. The interior and exterior regions of existing and new polygons must be determined, and these regions depend on polygon topology. Hence, topological data are useful in many spatial analyses.
There are limitations and disadvantages to topological vector models. First, there are computational costs in defining the topological structure of a vector data layer. Software must determine the connectivity and adjacency information, assign codes, and build the topological tables. Computational costs are typically quite modest with current computer technologies.
Second, the data must be very “clean”, in that all lines must begin and end with a node, all lines must connect correctly, and all polygons must be closed. Unconnected lines or unclosed polygons will cause errors during analyses. Significant human effort may be required to ensure clean vector data because each line and polygon must be checked. Software may help by flagging or fixing “dangling” nodes that do not connect to other nodes, and by automatically identifying all polygons. Each dangling node and polygon may then be checked, and edited as needed to correct errors.
These limitations are far outweighed by the gains in efficiency and analytical capabilities provided by topological vector models. Many current vector GIS packages use topological vector models in some form.
Vector Features and Attribute Tables
Topological vector models are used to define spatial features in a data layer. As we described earlier in this chapter, these features are associated with non-spatial attributes. Typically a table is used to organize the attributes, and there is a linkage between rows in the table and the spatial data in the topological data layer (Figure 2-16). The most common relationship is a one to one linkage between each entry in the attribute table and feature in the data layer. This means for each feature in the data layer there is one and only one entry in the table. Occasionally there may be layers with a many to one relationship between table entries and multiple features in a data layer.
The table typically has unique values in one “identifier” column of the table, one value for each unique spatial feature in the layer. This ID column is often listed first in the table, though not always. The item in the ID column may be used to distinguish each unique feature in the data layer, e.g., when
editing, in analysis, or when combining the spatial data in the layer to spatial data in other layers. Additional attributes are organized in respective columns, with values appropriate for the corresponding spatial feature. The ID may be used to tie these values to the specific feature, e.g., the area of a polygon may be stored in an area column, and associated with the polygon through the unique ID (Figure 2-16).
Figure 2-16: Features in a topological data layer typically have a one to one relationship with entries in an associated attribute table. The attribute table typically contains a column with a unique identifier, or ID, for each feature.
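The ID linkage between features and attribute rows can be sketched as follows. This is a minimal illustration with made-up data: the column names `ID`, `class`, and `area`, the feature geometries, and the helper `attributes_for` are all assumptions for the example.

```python
# Spatial part of the layer: feature ID -> polygon ring (hypothetical coordinates).
features = {
    1: [(0, 0), (4, 0), (4, 3), (0, 3), (0, 0)],
    2: [(4, 0), (6, 0), (6, 3), (4, 3), (4, 0)],
}

# Attribute table: one row per feature, keyed by the same ID values.
attribute_table = [
    {"ID": 1, "class": "forest", "area": 12.0},
    {"ID": 2, "class": "pasture", "area": 6.0},
]

# Index the rows by ID so a feature's attributes are found by lookup.
rows_by_id = {row["ID"]: row for row in attribute_table}

# A one to one linkage means IDs are unique and every feature has exactly one row.
is_one_to_one = (len(rows_by_id) == len(attribute_table)
                 and set(rows_by_id) == set(features))


def attributes_for(feature_id):
    """Return the attribute row tied to a feature through its unique ID."""
    return rows_by_id[feature_id]


print(is_one_to_one)               # True
print(attributes_for(2)["class"])  # pasture
```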
Figure 2-18: The number of cells in a raster data set depends on the cell size. For a given area, the cell number grows with the inverse square of the cell size, e.g., halving the cell size causes a four-fold increase in cell number.
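The four-fold figure in the caption can be checked with a short calculation. The sketch below assumes a rectangular study area and square cells; the function name `cell_count` and the 1000 meter example area are hypothetical.

```python
import math


def cell_count(width_m, height_m, cell_size_m):
    """Number of square cells needed to cover a rectangular area wall-to-wall."""
    columns = math.ceil(width_m / cell_size_m)
    rows = math.ceil(height_m / cell_size_m)
    return rows * columns


# A 1000 m by 1000 m area at three cell sizes, each half the previous one.
for size in (100, 50, 25):
    print(size, cell_count(1000, 1000, size))
# 100 100
# 50 400
# 25 1600
```

Each halving of the cell size doubles both the number of rows and the number of columns, so the total count is multiplied by four.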
Table 2-1: Types of data represented by raster cell values. (from L. Usery, pers.
comm.)
may be considered to apply for the cell. In some instances the variable may be uniform over the raster cell, and hence the value is correct over the cell. However, under most conditions there is within-cell variation, and the raster cell value represents the average, central, or most common value found in the cell. Consider a raster data set representing weekly income with a cell dimension that is 300 meters (980 feet) on a side. Further suppose that there is a raster cell with a value of 710. The entire 300 by 300 meter area is considered to have this value of 710 dollars per week. There may be many households within the raster cell, none of which may earn exactly 710 dollars per week. However the 710 dollars may be the average, the highest point, or some other representative value for the area covered by the cell. While raster cells often represent the average or the value measured at the center of the cell, they may also represent the median, maximum, or other statistic for the cell area.
An alternative interpretation of the raster cell applies the value to the central point of the cell. Consider a raster grid containing elevation values. Cells may be specified that are 200 meters square, and an elevation value assigned to each square. A cell with a value of 8000 meters (26,200 feet) may be assumed to have that value at the center of the cell, and not for the entire cell.
A raster data model may also be used to represent discrete data, e.g., to represent landcover in an area. Raster cells typically hold numeric or single-letter alphabetic characters, so some coding scheme must be defined to identify each discrete value. Each code may be found at many raster cells (Figure 2-19).
Raster cell values may be assigned and interpreted in at least seven different ways (Table 2-1). We have described three, a ras-
Second, differences in the assignment rules may substantially alter the data layer, as shown in our simple example. More potential cell types in complex landscapes may increase the assignment sensitivity. Smaller cell sizes reduce the significance of classes in the assignment rule, but at the cost of increased data volumes.
A similar problem may occur when more than one line or point occurs within a raster cell. If two points occur, then which point ID is assigned? If two lines occur, then which line ID should be assigned?
Raster Features and Attribute Tables
Raster layers may also have associated attribute tables. This is most common when nominal data are represented, but may also be adopted with ordinal or interval/ratio data. Just as with topological vector data, features in the raster layer may be linked to rows in an attribute table, and these rows may describe the essential non-spatial characteristics of the features.
Figure 2-21a and b show data represented in both vector (a) and raster (b) data models. Vector data are shown in Figure 2-21a with a one to one correspondence between polygon features and rows in the attribute table. The IDorg column is used as an identifier for each polygon, and the attributes class and area are assigned for each. Figure 2-21b shows a raster data set that represents the same data and maintains a one to one relationship in the data table. An additional column, cell-ID, must be added to uniquely identify each raster location, and the corresponding attributes IDorg, class, and area repeated for each cell. Note that the area values are the same for all cells and hence all rows in the table.
The nature of the raster data model often affects the characteristics of associated attribute tables and may require adjustments in how attribute and spatial data are represented. Note that maintaining a one to one correspondence between raster cells and rows in the attribute table comes at some cost - the large size of the attribute table. While the vector representation of these data requires an attribute table with five rows, representing these same features using a raster data model results in 100 rows. As the raster data set grows in edge dimension, the data volume grows with the square of that dimension.
We often use raster data sets with billions of cells. If we insist on a one to one cell/attribute relationship, the table may become too large. Even simple processes such as sorting, searching, and subsetting records become prohibitively time consuming. Display and redraw rates become low, reducing the utility of these data in particular and GIS in general.
An alternative relationship is often adopted to avoid these problems. We often allow a many to one relationship between the raster cells and the attribute table (Figure 2-21c). Many raster cells may refer to a single row in the attribute table. This substantially reduces the size of the attribute table for most data sets. It does this at the cost of some spatial ambiguity: there may be multiple, non-contiguous patches for a specific type, e.g., the upper left and lower right portion of the raster data set in Figure 2-21c are both of class 10, and are recognized as distinct features in the vector and one to one raster representation, but are represented by the same attribute entry in the many to one raster representation. This reduces the flexibility of an attribute table, and in effect creates multipart areas. The data for the represented variable may be summarized by class - however, these classes may or may not be spatially contiguous.
An alternative has been to maintain the one to one relationship, but to index all the raster cells in a contiguous group, thereby reducing the number of rows in the attribute table. This requires software tools to develop and maintain the indices, e.g., to create them and reconstitute the indexing after resampling. These indexing schemes add overhead and increase data model complexity, thereby removing one of the advantages of raster data sets over vector data sets.
Figure 2-21: Raster data models may combine spatial and attribute data in ways both similar to and different from vector data. A typical vector representation of area features is shown in (a), which maintains a one to one relationship between polygons and table rows. Raster data may also maintain this one to one relationship (b). A many to one relationship between cells and table rows may be adopted (c) because raster data sets often contain a prohibitively large number of cells.
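The difference in table size between the one to one and many to one relationships can be illustrated with a small sketch. The grid values, the `CELL_AREA` constant, and the function names below are hypothetical and are not taken from Figure 2-21; they only show how the row count changes between the two designs.

```python
from collections import Counter

CELL_AREA = 100.0  # assumed area of a single cell, in squared map units

# Hypothetical raster of class codes; rows are listed from top to bottom.
class_grid = [
    [10, 10, 15, 15],
    [10, 15, 15, 20],
    [20, 20, 15, 10],
    [20, 20, 10, 10],
]

def one_to_one_table(grid):
    """Build one attribute row per raster cell, keyed by a cell ID."""
    n_cols = len(grid[0])
    rows = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            rows.append({"cell_id": r * n_cols + c, "class": value})
    return rows

def many_to_one_table(grid):
    """Build one attribute row per class; many cells share each row."""
    counts = Counter(value for row in grid for value in row)
    return {
        cls: {"cells": n, "area": n * CELL_AREA}
        for cls, n in sorted(counts.items())
    }

cell_table = one_to_one_table(class_grid)
class_table = many_to_one_table(class_grid)
print(len(cell_table), "rows in the one to one table")
print(len(class_table), "rows in the many to one table")
```

In this sketch the class 10 cells in the upper left and lower right share a single row of `class_table`, which is the spatial ambiguity the text describes: the table alone cannot tell the two patches apart.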
A Comparison of Raster and Vector Data Models
The question often arises, “which are better, raster or vector data models?” The answer is neither and both. Neither of the two classes of data models is better in all conditions or for all data. Both have advantages and disadvantages relative to each other and to additional, more complex data models. In some instances it is preferable to maintain data in a raster model, and in others in a vector model. Most data may be represented in both, and may be converted among data models. As an example, elevation may be represented as a set of contour lines in a vector data model or as a set of elevations in a raster grid. The choice often depends on a number of factors, including the predominant type of data (discrete or continuous), the expected types of analyses, available storage, the main sources of input data, and the expertise of the human operators.
Raster data models exhibit several advantages relative to vector data models. First, raster data models are particularly suitable for representing themes or phenomena that change frequently in space. Each raster cell may contain a value different from its neighbors. Thus trends as well as more rapid variability may be represented.
Raster data structures are generally simpler, particularly when a fixed cell size is used. Most raster models store cells as sets of rows, with cells organized from left to right, and rows stored from top to bottom. This organization is quite easy to code in an array structure in most computer languages.
Raster data models also facilitate easy overlays, at least relative to vector models. Each raster cell in a layer occupies a given position corresponding to a given location on the Earth's surface. Data in different layers align cell-to-cell over this position. Thus, overlay involves locating the desired grid cell in each data layer and comparing the values found for the given cell location. This cell look-up is quite rapid in most raster data structures, and hence layer overlay is quite simple and rapid when using a raster data model.
Finally, raster data structures are the most practical method for storing, displaying, and manipulating digital image data, such as aerial photographs and satellite imagery. Digital image data are an important source of information when building, viewing, and analyzing spatial databases. Image display and analysis are based on raster operations to sharpen details on the image, specify the brightness, contrast, and colors for display, and to aid in the extraction of information.
Vector data models provide some advantages relative to raster data models. First, vector models generally lead to more compact data storage, particularly for discrete objects. Large homogeneous regions are recorded by the coordinate boundaries in a vector data model. These regions are recorded as a set of cells in a raster data model. The perimeter grows more slowly than the area for most feature shapes, so the amount of data required to represent an area increases much more rapidly with a raster data model. Vector data are much more compact than raster data for most themes and levels of spatial detail.
Vector data are a more natural means for representing networks and other connected linear features. Vector data by their nature store information on intersections (nodes) and the linkages between them (lines). Traffic volume, speed, timing, and other factors may be associated with lines and intersections to model many kinds of networks.
Vector data models are easily presented in a preferred map format. Humans are familiar with continuous line and rounded curve representations in hand- or machine-drawn maps, and vector-based maps show these curves. Raster data often show a “stair-step” edge for curved boundaries, particularly when the cell size is large relative to the resolution at which the raster is displayed. Cell edges are often visible for lines, and the width and stair-step pattern change as lines curve. Vector data may be plotted
Characteristic          Raster                                    Vector
storage requirements    large for most data sets without          small for most data sets
                        compression
display and output      good for images, but discrete features    map-like, with continuous curves,
                        may show “stairstep” edges                poor for images
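The row-by-row storage and cell-to-cell overlay described in the comparison of raster and vector models can be sketched in a few lines. This is a minimal illustration under stated assumptions: both layers are stored as flat lists in row-major order with row 0 at the top, and the layer values, class codes, and function names are hypothetical.

```python
def cell_index(row, col, n_cols):
    """Position of cell (row, col) in a flat row-major list; row 0 is the top row."""
    return row * n_cols + col

def overlay(layer_a, layer_b, combine):
    """Combine two aligned layers cell by cell with a user-supplied rule."""
    if len(layer_a) != len(layer_b):
        raise ValueError("layers must cover the same grid")
    return [combine(a, b) for a, b in zip(layer_a, layer_b)]

N_ROWS, N_COLS = 3, 4

# Hypothetical land cover codes: 1 = forest, 2 = cropland, 3 = water.
land_cover = [1, 1, 2, 2,
              1, 2, 2, 3,
              3, 3, 2, 1]

# Hypothetical slope values, in percent, on the same grid.
slope_pct = [5, 12, 3, 8,
             15, 2, 9, 0,
             0, 1, 4, 20]

# Flag cells that are forest and steeper than 10 percent.
steep_forest = overlay(land_cover, slope_pct, lambda c, s: int(c == 1 and s > 10))

# Look up the result for the cell in row 1, column 0.
print(steep_forest[cell_index(1, 0, N_COLS)])
```

Because the layers align cell to cell, the overlay needs no geometric computation at all; each output value depends only on the two input values at the same list position.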
coordinate. The cell in which the point resides is given a number or other code identifying the point feature occurring at the cell location. If the cell size is too large, two or more vector points may fall in the same cell, and either an ambiguous cell identifier is assigned, or a more complex numbering and assignment scheme is implemented. Typically a cell size is chosen such that the diagonal cell dimension is smaller than the distance between the two closest point features.
Vector line features in a data layer may also be converted to a raster data model. Raster cells may be coded using different criteria. One simple method assigns a value to a cell if a vector line intersects with any part of the cell (Figure 2-22, left). This ensures the maintenance of connected lines in the raster form of the data. This assignment rule often leads to wider than appropriate lines because several adjacent cells may be assigned as part of the line, particularly when the line meanders near cell edges. Other assignment rules may be applied, for example, assigning a cell as occupied by a line only when the cell center is near a vector line segment (Figure 2-22, right). “Near” may be defined as some sub-cell distance, e.g., 1/3 the cell width. Lines passing through the corner of a cell will not be recorded as in the cell. This may lead to thinner linear features in the raster data set, but often at the cost of line discontinuities.
Figure 2-22: Vector-to-raster conversion. Two assignment rules result in different raster coding near lines, but in this case not near points.
The output from vector-to-raster conversion depends on the algorithm used. You may get a different output data layer when a different conversion algorithm is used, even though you use the same input. This brings up an important point to remember when applying any spatial operations. The output often depends in subtle ways on the spatial operation. What appear to be quite small differences in the algorithm or key defining parameters may lead to quite different results. Small changes in the assignment distance or rule in a vector-to-raster conversion operation may result in large differences in output data sets, even with the same input. There is often no clear a priori best method. Empirical tests or previous experiences are often useful guides to determine the best method with a given data set or conversion problem. The ease of spatial manipulation in a GIS provides a powerful and often easy to use set of tools. The GIS user should bear in mind that these tools may be more efficient at producing errors as well as more efficient at providing correct results. Until sufficient experience is obtained with a suite of algorithms, in this case vector-to-raster conversion, small, controlled tests should be performed to verify the accuracy of a given method or set of constraining parameters.
Area features are converted from vector to raster with methods similar to those used for vector line features. Boundaries among different polygons are identified as in vector-to-raster conversion for lines. Interior regions are then identified, and each cell in the interior region is assigned a given value. Note that the border cells containing the boundary lines must be assigned. As with vector-to-raster conversion of linear features, there are several methods to determine if a given border cell should be assigned as part of the area feature. One common method assigns the cell to the area if more than one-half the cell is within the vector polygon. Another common method assigns a raster cell to an area feature if any part of the raster cell is within the area contained within the vector polygon. Assignment results will vary with the method used.
Up to this point we have covered vector-to-raster data conversion. Data may also be converted in the opposite direction, in that raster data may be converted to vector data. Point, line, or area features represented by raster cells are converted to corresponding vector data coordinates and structures. Point features are represented as single raster cells. Each vector point feature is usually assigned the coordinate of the corresponding cell center.
Linear features represented in a raster environment may be converted to vector lines. Conversion to vector lines typically involves identifying the continuous connected set of grid cells that form the line. Cell centers are typically taken as the locations of vertices along the line (Figure 2-23). Lines may then be “smoothed” using a mathematical algorithm to remove the “stair-step” effect.
Figure 2-23: Raster data may be converted to vector formats, and may involve line smoothing or other operations to remove the “stair-step” effect.
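The two line-assignment rules for vector-to-raster conversion described earlier in this section (assign a cell if the line touches any part of it, or only if the cell center is near the line) can be compared with a small sketch. This is one possible implementation under stated assumptions, not the method of any particular GIS: the grid origin is at the lower left, row 0 is the top row, cells are square, the intersection test uses a standard Liang-Barsky style clipping check, and all function names and coordinates are hypothetical.

```python
import math

def point_segment_distance(px, py, ax, ay, bx, by):
    """Shortest distance from point (px, py) to the segment from a to b."""
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def segment_touches_cell(ax, ay, bx, by, xmin, ymin, xmax, ymax):
    """Return True if the segment intersects the closed cell rectangle."""
    dx, dy = bx - ax, by - ay
    t_enter, t_exit = 0.0, 1.0
    for p, q in ((-dx, ax - xmin), (dx, xmax - ax),
                 (-dy, ay - ymin), (dy, ymax - ay)):
        if p == 0.0:
            if q < 0.0:
                return False  # parallel to this edge and outside it
        else:
            t = q / p
            if p < 0.0:
                t_enter = max(t_enter, t)
            else:
                t_exit = min(t_exit, t)
            if t_enter > t_exit:
                return False
    return True

def rasterize_segment(a, b, n_rows, n_cols, cell_size, rule, near_fraction=1.0 / 3.0):
    """Code cells as 1 (line present) or 0 using the 'touch' or 'center' rule."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        ymax = (n_rows - row) * cell_size  # row 0 is the top row
        ymin = ymax - cell_size
        for col in range(n_cols):
            xmin = col * cell_size
            xmax = xmin + cell_size
            if rule == "touch":
                hit = segment_touches_cell(a[0], a[1], b[0], b[1], xmin, ymin, xmax, ymax)
            elif rule == "center":
                cx, cy = xmin + cell_size / 2.0, ymin + cell_size / 2.0
                dist = point_segment_distance(cx, cy, a[0], a[1], b[0], b[1])
                hit = dist <= near_fraction * cell_size
            else:
                raise ValueError("rule must be 'touch' or 'center'")
            grid[row][col] = int(hit)
    return grid

start, end = (0.0, 0.3), (4.0, 1.9)
touch = rasterize_segment(start, end, 2, 4, 1.0, "touch")
center = rasterize_segment(start, end, 2, 4, 1.0, "center")
for name, grid in (("touch", touch), ("center", center)):
    print(name)
    for row in grid:
        print(" ", row)
```

For this test line the touch rule codes five connected cells, while the center rule with a 1/3 cell threshold codes only three and leaves a gap in the second column, the kind of line discontinuity the text warns about. Changing `near_fraction` changes the output for the same input line, which is a small controlled test of the sort the text recommends.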
Figure 2-24: A TIN data model defines a set of adjacent triangles over a sample space. Sample
points, facets, and edges are components of TIN data models.
Figure 2-26: Data may often be represented in several data models. Digital elevation data are commonly represented in raster (DEM), vector (contours), and TIN data models.
This may include not only information on the city boundary, but also streets, building locations, waterways, or other features that might be in separate data structures in a layered topological vector model. The topology could be included, but would likely be incorporated within the single object. Topological relationships to exterior objects may also be represented, e.g., relationships to adjacent cities or counties.
The object-oriented data model has both advantages and disadvantages when compared to traditional topological vector and raster data models. Some geographic entities may be naturally and easily identified as discrete units for particular problems, and so may be naturally amenable to an object-oriented approach. A power or water distribution system may be defined in this manner, where entities such as pumping stations or holding reservoirs may be discretely defined. However, it is more difficult to represent continuously varying features, such as elevation, with an object-oriented approach. In addition, for many problems the definition and indexing of objects may be quite complex. It has proven difficult to develop generic tools that may quickly and efficiently implement object-oriented models.
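As a rough illustration of the object-oriented idea, the water distribution example above might be sketched with a few classes. All class names, attributes, and values here are hypothetical and chosen only for illustration; real object-oriented GIS models are considerably more elaborate.

```python
from dataclasses import dataclass, field

@dataclass
class PumpingStation:
    name: str
    location: tuple          # (x, y) coordinate pair
    capacity_lps: float      # hypothetical attribute: liters per second

@dataclass
class Reservoir:
    name: str
    boundary: list           # list of (x, y) vertices outlining the reservoir
    volume_m3: float         # hypothetical attribute: storage volume

@dataclass
class WaterSystem:
    """A single object that bundles geometry, attributes, and relationships."""
    name: str
    stations: list = field(default_factory=list)
    reservoirs: list = field(default_factory=list)
    adjacent_systems: list = field(default_factory=list)  # names of neighboring systems

    def total_capacity(self):
        """Sum the pumping capacity over all stations in the system."""
        return sum(s.capacity_lps for s in self.stations)

system = WaterSystem(name="North District")
system.stations.append(PumpingStation("Station A", (10.0, 20.0), 150.0))
system.stations.append(PumpingStation("Station B", (14.5, 22.0), 90.0))
system.reservoirs.append(
    Reservoir("Upper Reservoir", [(0, 0), (5, 0), (5, 4), (0, 4)], 2.0e6)
)
system.adjacent_systems.append("South District")
print(system.total_capacity())
```

Discrete entities such as stations and reservoirs map cleanly onto objects like these. A continuously varying surface such as elevation has no comparably natural object boundary, which is the difficulty the text notes.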
than one byte of storage is required for each value. Two bytes will store numbers up to 65,535. Terrestrial elevations measured in feet or meters are all below this value, so two bytes of data are often used to store elevation data. Real numbers such as 12.19 or 865.3 typically require more bytes, and are effectively split, e.g., two bytes for the whole part of the real number, and four bytes for the fractional portion.
Binary numbers are often used to represent codes. Spatial and attribute data may then be represented as text or as standard codes. This is particularly common when raster or vector data are converted for export or import among different GIS software systems. For example, Arc/Info, a widely used GIS, produces several export formats that are in text or binary formats. Idrisi, another popular GIS, supports binary and alphanumeric raster formats.
One of the most common number coding schemes uses ASCII designators. ASCII stands for the American Standard Code for Information Interchange. ASCII is a standardized, widespread data format that uses seven bits, or the numbers 0 through 127, to represent text and other characters. An extended ASCII, or ANSI scheme, uses these same codes, plus an extra binary bit to represent numbers between 128 and 255. These codes are then used in many programs, including GIS, particularly for data export or exchange.
ASCII codes allow us to easily and uniformly represent alphanumeric characters such as letters, punctuation, other characters, and numbers. ASCII converts binary numbers to alphanumeric characters through an index. Each alphanumeric character corresponds to a specific number between 0 and 255, which allows any sequence of characters to be represented by a sequence of numbers. One byte is required to represent each character in extended ASCII coding, so ASCII data sets are typically much larger than binary data sets. Geographic data in a GIS may use a combination of binary and ASCII data stored in files. Binary data are typically used for coordinate information, and ASCII or other codes may be used for attribute data.
Pointers
Files may be linked by file pointers or other structures. A pointer is an address or index that connects one file location to another. Pointers are a common way to organize information within and across multiple files. Figure 2-28 depicts an example of the use of pointers to organize spatial data. In Figure 2-28, a polygon is composed of a set of lines. Pointers are used to link the set of lines that form each polygon. There is a pointer from each line to the successive string of lines that form the polygon.
Pointers help by organizing data in such a way as to improve access speed. Unorganized data would require time-consuming searches each time a polygon boundary was to be identified. Pointers also allow efficient use of storage space. In our example, each line segment is stored only once. Several polygons may point to the line segment, as it is typically much more space-efficient to add pointers than to duplicate the line segment.
Figure 2-28: Pointers are used to organize vector data. Pointers reduce redundant storage and increase speed of access.
Data Compression
Data compression is common in GIS. Compressions are based on algorithms that reduce the size of a computer file while maintaining the information contained in the file. Compression algorithms may be “lossless”, in that all information is maintained during compression, or “lossy”, in that some information is lost. A lossless compression algorithm will produce an exact copy when it is applied and then the appropriate decompression algorithm is applied. A lossy algorithm will alter the data when it is applied and then the appropriate decompression algorithm is applied. Lossy algorithms are most often used with image data, and are uncommonly applied to thematic spatial data.
Data compression is most often applied to discrete raster data, for example, when representing polygon or area information in a raster GIS. There are redundant data elements in raster representations of large homogeneous areas. Each raster cell within a homogeneous area will have the same code as most or all of the adjacent cells. Data compression algorithms remove much of this redundancy.
Run-length coding is a common data compression method. This compression technique is based on recording sequential runs of raster cell values. Each run is recorded as the value and the run length. Seven sequential cells of type A might be listed as A7 instead of AAAAAAA. Thus, seven cells would be represented by two characters. Consider the data recorded in Figure 2-29, where each line of raster cells is represented by a set of run-length codes. In general run-length coding reduces data volume, as shown for the top three rows in Figure 2-29. Note that in some instances run-length coding increases the data volume, most often when there are no long runs. This occurs in the last line of Figure 2-29, where frequent changes in adjacent cell values result in many short runs. However, for most thematic data sets containing area information, run-length coding substantially reduces the size of raster data sets.
Figure 2-29: Run-length coding is a common and relatively simple method for compressing raster data. The left number in the run-length pair is the number of cells in the run, and the right is the cell value. Thus, the 2:9 listed at the start of the first line indicates a run of length two for the cell value 9.
There is also some data access cost in run-length coding. Standard raster data access involves simply counting the number of cells across a row to locate a given cell. Access to a cell in run-length coding must be computed by summing along the run-length codes. This is typically a minor additional cost, but in some applications the trade-off between speed and data volume may be objectionable.
Quad tree representations are another raster compression method. Quad trees are similar to run-length codings in that they are most often used to compress raster data sets when representing area features. Quad trees may be viewed as a raster data structure with a variable spatial resolution. Raster cell sizes are combined and adjusted within the data layer to fit into each specific area feature (Figure 2-30). Large raster cells that fit entirely into one uniform area are assigned. Successively smaller cells are then fit, halving the cell dimension at each iteration, until the smallest cell size is reached.
The dynamically varying cell size in a quad tree representation requires more sophisticated indexing than simple raster data sets. Pointers are used to link data elements in a tree-like structure, hence the name quad trees. There are many ways to structure the data pointers, from large to small, or by dividing quadrants, and these methods are beyond the scope of an introductory text. Further information on the structure of quad trees may be found in the references at the end of this chapter.
A quad tree representation may save considerable space when a raster data set includes large homogeneous areas. Each large area may be represented mostly by a few large cells representing the main body, and sets of recursively smaller cells along the margins to represent the spatial detail at the edges of homogeneous areas. As with most data compression algorithms, space savings are not guaranteed. There may be conditions where the additional indexing overhead requires more space than is saved. As with run-length coding, this most often occurs in spatially complex areas.
There are many other data compression methods that are commonly applied. JPEG and wavelet compression algorithms are often applied to reduce the size of spatial data, particularly image or other data. Generic bit and byte-level compression methods may be applied to any files for compression or communications. There is usually some cost in time to the compression and decompression.
Common File Formats
A small number of file formats are commonly used to store and transfer spatial data. Some of these file structures arose from distribution formats adopted by governmental departments or agencies. Others are based on formats specified by software vendors, while some have been devised by standards-making bodies. Some knowledge of the types, properties, and common naming conventions of these file formats is helpful to the GIS practitioner.
Common geographic data formats may be placed into three large classes: raster, vector, and attribute. Raster formats may be further split into single-band and multi-band file types. Multi-band raster data sets are most often used to store and distribute image data, while single-band raster data sets are used to store both single-band images and non-image spatial data. Table 2-3 summarizes some of the most common spatial data formats.
Summary
In this chapter we have described our main ways of conceptualizing spatial entities, and of representing these entities as spatial features in a computer. We commonly employ two conceptualizations, also called spatial data models: a raster data model and a vector data model. Both models use a combination of coordinates, defined in a Cartesian or spherical system, and attributes, to represent our spatial features. Features are usually segregated by thematic type in layers.
Vector data models describe the world as a set of point, line, and area features. Attributes may be associated with each feature. A vector data model splits the world into discrete features, and often supports topological relationships. Vector models are most often used to represent features that are considered discrete, and are compatible with vector maps, a common output form.
Raster data models are based on grid cells, and represent the world as a “checkerboard”, with uniform values within each cell. A raster data model is a natural choice for representing features that vary continuously across space, such as temperature or precipitation. Data may be converted between raster and vector data models.
We use data structures and computer codes to represent our conceptualizations in more abstract, but computer-compatible forms. These structures may be optimized to reduce storage space and increase access speed, or to enhance processing based on the nature of our spatial data.
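As one concrete instance of a structure optimized for storage space, the run-length coding described under Data Compression can be sketched as follows. The pairs follow the (run length, cell value) order used in the Figure 2-29 caption. The function names and the sample rows are hypothetical, and the sketch handles one raster row at a time.

```python
def rle_encode(row):
    """Encode one raster row as a list of (run_length, value) pairs."""
    runs = []
    for value in row:
        if runs and runs[-1][1] == value:
            runs[-1][0] += 1        # extend the current run
        else:
            runs.append([1, value])  # start a new run
    return [tuple(run) for run in runs]

def rle_decode(runs):
    """Rebuild the original row from (run_length, value) pairs."""
    row = []
    for length, value in runs:
        row.extend([value] * length)
    return row

def rle_value_at(runs, col):
    """Find the value at column col by summing run lengths along the row."""
    if col < 0:
        raise IndexError("column out of range")
    start = 0
    for length, value in runs:
        if col < start + length:
            return value
        start += length
    raise IndexError("column out of range")

homogeneous = [9, 9, 6, 6, 6, 6, 6, 6, 8, 8]
complex_row = [1, 2, 3, 2, 1]

coded = rle_encode(homogeneous)
print(coded)                      # three pairs for ten cells
print(rle_encode(complex_row))    # five pairs for five cells
print(rle_value_at(coded, 7))     # cell access requires a running sum
```

The first row compresses from ten numbers to six, while the second grows from five numbers to ten, which mirrors the text's point that run-length coding can increase data volume when there are no long runs. The `rle_value_at` function shows the access cost: locating a cell means stepping through the runs rather than indexing directly.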
shapefile, ESRI       .shp, .shx, .dbf, .prj, and others    Three or more binary files that include the vector coordinate, attribute, and other information (V).
TIGER, U.S. Census    tgrxxyyy, stfzz                       Set of files by U.S. census areas; xx is a state code, yyy an area code, zz numbers for various file types.
MIF/MID, MapInfo      .mif, .mid                            Map Interchange File, vector and raster data transport from MapInfo.
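Several of the formats above store data in binary form. The size difference between binary and ASCII encodings discussed earlier in this chapter can be illustrated with a short sketch. The elevation values are hypothetical, and the two-byte unsigned little-endian layout is an assumption chosen for illustration rather than the layout of any format listed in the table.

```python
import struct

# Hypothetical elevations in meters; each fits in two bytes (0 to 65,535).
elevations = [1204, 1210, 1198, 1187, 2045, 2051]

# Binary encoding: two bytes per value, unsigned, little-endian.
layout = "<%dH" % len(elevations)
binary = struct.pack(layout, *elevations)

# ASCII encoding: one byte per character, values separated by spaces.
text = " ".join(str(v) for v in elevations).encode("ascii")

print(len(binary), "bytes in binary form")
print(len(text), "bytes in ASCII form")

# Both encodings are lossless; the values can be recovered exactly.
from_binary = list(struct.unpack(layout, binary))
from_text = [int(token) for token in text.decode("ascii").split()]

# A single ASCII character is itself stored as a number, e.g. "A" is 65.
print(ord("A"), format(ord("A"), "08b"))
```

Here the ASCII form takes more than twice the space of the binary form, and the gap widens for values with more digits. The ASCII form remains readable in any text editor, which is one reason text formats persist for data exchange.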
Suggested Reading
Batty, M and Xie, Y., Model structures, exploratory spatial data analysis, and aggregation, International
Journal of Geographical Information Systems, 1994, 8:291-307.
Bhalla, N., Object-oriented data models: a perspective and comparative review, Journal of Information
Science, 1991, 17:145-160.
Bregt, A. K., Denneboom, J, Gesink, H. J., and van Randen, Y., Determination of rasterizing error: a
case study with the soil map of The Netherlands, International Journal of Geographical Information
Systems, 1991, 5:361-367.
Carrara, A., Bitelli, G., and Carla, R., Comparison of techniques for generating digital terrain models
from contour lines, International Journal of Geographical Information Systems, 1997, 11:451-473.
Congalton, R.G., Exploring and evaluating the consequences of vector-to-raster and raster-to-vector
conversion, Photogrammetric Engineering and Remote Sensing, 63:425-434.
Holroyd, F. and Bell, S. B. M., Raster GIS: Models of raster encoding, Computers and Geosciences,
1992, 18:419-426.
Joao, E. M., Causes and Consequences of Map Generalization, Taylor and Francis, London, 1998.
Kumler, M.P., An intensive comparison of triangulated irregular networks (TINs) and digital elevation
models, Cartographica, 1994, 31:1-99.
Langran, G., Time in Geographical Information Systems, Taylor and Francis, London, 1992.
Laurini, R. and Thompson, D., Fundamentals of Spatial Information Systems, Academic Press, London,
1992.
Lee, J., Comparison of existing methods for building triangular irregular network models of terrain from
grid digital elevation models, International Journal of Geographical Information Systems, 5:267-
285.
Maguire, D. J., Goodchild, M. F., and Rhind, D. eds., Geographical Information Systems: Principles and
Applications, Longman Scientific, Harlow, 1991.
Nagy, G. and Wagle, S. G., Approximation of polygonal maps by cellular maps, Communications of the
Association of Computational Machinery, 1979, 22:518-525.
Peuquet, D. J., A conceptual framework and comparison of spatial data models, Cartographica, 1984,
21:66-113.
Peuquet, D. J., An examination of techniques for reformatting digital cartographic data. Part II: the
raster to vector process, Cartographica, 1981, 18:375-394.
Piwowar, J. M., LeDrew, E. F., and Dudycha, D. J., Integration of spatial data in vector and raster
formats in geographical information systems, International Journal of Geographical Information
Systems, 1990, 4:429-444.
Peucker, T. K. and Chrisman, N., Cartographic Data Structures, The American Cartographer, 1975, 2:55-69.
Rossiter, D. G., A theoretical framework for land evaluation, Geoderma, 1996, 72:165-190.
Shaffer, C.A., Samet, H., and Nelson R. C., QUILT: a geographic information system based on
quadtrees, International Journal of Geographical Information Systems, 1990, 4:103-132.
Sklar, F. and Costanza, R. Quantitative methods in landscape ecology: the analysis and interpretation of
landscape heterogeneity. in: Turner, M. and Gardner, R., editors. The development of dynamic
spatial models for landscape ecology: A review and prognosis. New York: Springer-Verlag; 90:239-288.
Tomlinson, R. F., The impact of the transition from analogue to digital cartographic representation, The
American Cartographer, 1988, 15:249-262.
Wehde, M., Grid cell size in relation to errors in maps and inventories produced by computerized map
processes, Photogrammetric Engineering and Remote Sensing, 48:1289-1298.
Worboys, M. F., GIS: A Computing Perspective, Taylor and Francis, London, 1995.
Zeiler, M., Modeling Our World: The ESRI Guide to Geodatabase Design, ESRI Press, Redlands, 1999.
Study Questions
Define a data model and describe the two most commonly used data models.
What is topology, and why is it important? What is planar topology, and when might
you want non-planar vs. planar topology?
What are the respective advantages and disadvantages of vector data models vs. raster data models?
Under what conditions are mixed cells a problem in raster data models? In what ways
may the problem of mixed cells be addressed?
What are binary and ASCII numbers? Can you convert the following decimal num-
bers to a binary form: 8, 12, 244?
Why do we need to compress data? Which are most commonly compressed, raster
data or vector data? Why?
What is a pointer when used in the context of spatial data, and how are they helpful in
organizing spatial data?