Geostatistics
TUTORIAL
Exploratory spatial data analysis
ANA CRISTINA COSTA | ccosta@novaims.unl.pt
April 2016
TUTORIAL: Exploratory Spatial Data Analysis 2
Table of Contents
1 Introduction
List of Figures
Figure 1: R5D.xls file in Excel
Figure 4: Selecting all data in order to copy and paste it in a new worksheet
Figure 10: Illustration of the relationship between the mean, the median and the mode
Figure 11: Scatterplots illustrating different types of association between two variables
Figure 13: R5D index versus the weather stations’ elevation
Figure 18: Setting symbol properties for graduated colours example.
Figure 19: Setting graduated colours for R5D data posting
Figure 21: Setting points’ labels for R5D data posting
Figure 23: Histogram on the Explore Data menu of the Geostatistical Analyst toolbar
Figure 24: Regional histogram and descriptive statistics of the R5D index
Figure 25: Selecting units in the map to investigate spatial regimes
Figure 26: Setting the threshold value of the indicator map in the Classification window
Figure 27: Indicator map of the R5D index using 67.4 mm as threshold
Figure 29: Naming the new field of the attribute table
Figure 30: Python parser to code the values of an indicator variable in the Field Calculator window
Figure 31: Attribute table of the R5D_1990 layer with the new indicator variables (Quartile1, Quartile2, Quartile3)
Figure 32: Symbology of the 1st indicator variable map (1st quartile of R5D)
Figure 36: Illustration of the Cluster and Outlier Analysis (Anselin Local Moran's I) tool
Figure 37: Illustration of z-score peaks in the Spatial Autocorrelation by Distance graph.
Figure 39: Spatial autocorrelation of the R5D index by distance (z-scores of the Local Moran’s I statistic)
Figure 40: Spatial autocorrelation of the R5D index by distance (z-scores of the Local Moran’s I statistic) with the Beginning Distance parameter equal to 20000 meters
Figure 41: Cluster and Outlier Analysis (Anselin Local Moran's I) dialog-box
List of Tables
Table 1: Summary statistics of the R5D index in 1990 and stations’ elevation
Table 2: Summary of when to use the mean, median and mode
1 INTRODUCTION
The purpose of this tutorial is to explain how to explore data using descriptive statistics
in Excel and, mainly, how to use Exploratory Spatial Data Analysis (ESDA) tools in
ArcGIS/ArcMap. This tutorial was produced using ArcGIS 10.1.
The concepts and tools are illustrated with the R5D index data collected in the
south of Portugal during the 1990s (provided in the R5D Excel file), particularly
for the year 1990. The study area boundary (polygon) is provided in the StudyArea shapefile
(Limit folder). For further details on the data used in this tutorial, refer to the
LabData_info.pdf file.
The R5D index data are available for the whole of the 1990s, but in this tutorial we will
only use the data for 1990. Therefore, the first step of the analysis is
to filter the data and create a new table with the data rows corresponding to that
year. Afterwards, we will explore the data using descriptive statistics and graphic tools in
Excel and ArcGIS/ArcMap.
1. Click any cell in the first row of data (or select the row with the field/variable
names).
2. Turn on the Filter tool by pressing the Filter button in the DATA menu (Figure 2).
This adds a menu to each field name, which is accessed through the small
arrow in each cell of the first row.
3. In the menu (small arrow) of the year field, clear the “Select All” check box and then
select 1990 (Figure 3).
4. Select all rows and columns, and then copy and paste them into a new worksheet (Figure
4).
5. Change the name of this worksheet from Sheet2 to R5D_1990 (Figure 5).
Figure 4: Selecting all data in order to copy and paste it in a new worksheet
Before analysing the R5D index, we need to remove the missing values of R5D, which are
coded with the NODATA value -9999. These R5D values indicate that daily
precipitation data are available for the corresponding year and location, but that there are not
enough daily data to compute the R5D index (see the LabData_info.pdf file). This is why
these missing values were assigned a code, instead of being left blank.
To delete the R5D values equal to -9999, we will use the Filter tool:
2. In the menu (small arrow) of the R5D field, clear the “Select All” check box and then
select -9999.
3. Delete the (only) row that has been selected in the previous step (Figure 6).
4. Turn off the Filter tool by pressing the Filter button in the DATA menu.
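For readers who prefer to script the two filtering operations described above (keeping only 1990 and dropping the NODATA code), the following is a minimal sketch in plain Python. The records, station names and the helper `filter_year` are hypothetical and only illustrate the rule; they are not part of the tutorial data or of Excel.

```python
NODATA = -9999  # missing-value code used for R5D in the tutorial data


def filter_year(rows, year=1990, nodata=NODATA):
    """Keep the rows of the given year whose R5D value is not the NODATA code."""
    return [r for r in rows if r["year"] == year and r["R5D"] != nodata]


# Hypothetical records (station names and values are made up for illustration)
rows = [
    {"station": "A", "year": 1990, "R5D": 75.2},
    {"station": "B", "year": 1990, "R5D": -9999},
    {"station": "A", "year": 1991, "R5D": 64.0},
    {"station": "C", "year": 1990, "R5D": 101.3},
]

r5d_1990 = filter_year(rows)
print([r["station"] for r in r5d_1990])
```

Only stations A and C survive in this made-up example: station B has the NODATA code and the second record of station A belongs to another year.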
1. In Excel 2007, click the MS Office button, and then press the Excel Options
button. In Excel 2013, select Options from the File menu.
4. In the Add-Ins dialog box, select the Analysis ToolPak item (Figure 7). Afterwards,
click the OK button.
Notes:
- If the Analysis ToolPak is not listed in the Add-Ins dialog box, click the Browse…
button to locate it.
- If a pop-up message appears informing you that the Analysis ToolPak is not installed
on the computer, click the Yes button to install it.
5. The new Data Analysis button is added to the Data menu (Figure 8).
Additional resources:
See this YouTube video on how to enable the Analysis ToolPak add-in in Excel
2007: https://www.youtube.com/watch?v=6nCP65Nbm0E [accessed: 13 April,
2016]
See this YouTube video on how to enable the Analysis ToolPak add-in in Excel
2013: https://www.youtube.com/watch?v=c-lp4RKxHIM&annotation_id=annotation_3061641839&feature=iv&src_vid=E_apzh8oCU8
[accessed: 13 April, 2016]
See this YouTube video on how to enable a Data Analysis plug-in alternative
for Mac: https://www.youtube.com/watch?v=LRZTvAFfKEU&nohtml5=False
[accessed: 13 April, 2016]
First, we will explore the data of the R5D and elev variables, separately, by using
univariate descriptive statistics¹. A dataset with two variables contains what is called
bivariate data. Afterwards, we will therefore investigate the possible existence of a
relationship between the R5D index and the stations’ elevation using a bivariate analysis,
which is based on the correlation coefficient and the scatterplot.
3. Fill in the fields of the Descriptive Statistics dialog box as follows (Figure 10):
a) Click the Input Range field, and select (with the mouse) all data of the R5D and
elev variables, including the names of these variables from the first row.
¹ Univariate descriptive statistics are statistics computed separately for each variable.
The descriptive statistics of the R5D index (Table 1) lead to the following conclusions for 1990:
- The R5D index was measured at 92 [count] weather stations in the south of
Portugal.
- The distribution of the R5D index is slightly skewed with a right tail [positively
asymmetric], because the mean is slightly greater than the median (Figure 11).
- The regional average of the R5D index is equal to 88 mm [mean], and the typical
deviation from this value is equal to 26.9 mm [standard deviation].
- The R5D index is smaller than 81.25 mm [median] in 50% of the weather stations.
- The R5D index varies widely in the south of Portugal: the minimum value
was observed in Castro Verde (49 mm) and the maximum in Comporta (184.1
mm).
The descriptive statistics of the weather stations’ elevation (Table 1) lead to the following conclusions:
- The average elevation is equal to 188 m, and the typical deviation from this value
is equal to 113.5 m.
- 50% of the weather stations are located at an elevation below 174.5 m.
- The stations’ elevation varies widely in the south of Portugal: the lowest
station is located in Montevil (5 m) and the highest in São Julião (530 m).
Table 1: Summary statistics of the R5D index in 1990 and stations’ elevation
Statistic             R5D        Elevation
Mean                  88.004     188.098
Standard Error        2.802      11.829
Median                81.25      174.5
Mode                  115.7      170
Standard Deviation    26.872     113.465
Sample Variance       722.107    12874.199
Kurtosis              0.809      0.359
Skewness              0.904      0.665
Range                 135.1      525
Minimum               49         5
Maximum               184.1      530
Sum                   8096.4     17305
Count                 92         92
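Several entries of Table 1 can be derived from the others, which is a useful sanity check on the Excel output. The short sketch below recomputes them from the R5D column using the standard definitions (mean = sum / count, standard error = standard deviation / square root of count, variance = squared standard deviation, range = maximum minus minimum). The variable names are illustrative only.

```python
import math

# Values copied from the R5D column of Table 1
n = 92
total = 8096.4
std_dev = 26.872
minimum, maximum = 49.0, 184.1
median = 81.25

mean = total / n                    # Sum / Count
std_error = std_dev / math.sqrt(n)  # standard error of the mean
variance = std_dev ** 2             # sample variance = squared standard deviation
value_range = maximum - minimum     # Range = Maximum - Minimum

print(round(mean, 3), round(std_error, 3), round(variance, 1), round(value_range, 1))
print("mean greater than median (right tail):", mean > median)
```

The recomputed variance differs from the tabulated 722.107 only in the third decimal place, because the standard deviation used here is already rounded.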
Figure 11: Illustration of the relationship between the mean, the median and the mode
NOTES:
The value of the Mode computed by Excel for continuous data is misleading. The mode is the
most frequent value in the data set, but with continuous data it is unlikely
that any one value occurs more often than the others. For continuous
data, the mode corresponds to the highest bar in a histogram. Normally, the mode is used
for categorical data, where we wish to know which is the most common category.
The mean is dragged in the direction of the extreme values in the tail of a
skewed distribution. The more skewed the distribution, the greater the difference
between the median and the mean, and the greater the emphasis that should be placed on the
median as opposed to the mean. The best measure of central tendency for
the different types of variables is shown in Table 2.
Table 2: Summary of when to use the mean, median and mode
Type of variable                 Measure of central tendency
Categorical – nominal            Mode
Categorical – ordinal            Median
Discrete (counts; not skewed)    Mean
Discrete (counts; skewed)        Median
Continuous (not skewed)          Mean
Continuous (skewed)              Median
Additional resources:
See this interactive tutorial to learn more about descriptive statistics and
exploratory data analysis tools: “Summarizing Distributions”, in: Online Statistics
Education: An Interactive Multimedia Course of Study, Rice University (Lead
Developer), University of Houston Clear Lake, Tufts University.
http://onlinestatbook.com/2/summarizing_distributions/summarizing_distributions.html
[accessed: 14 April, 2016]
When the points cluster along a straight line, the relationship is named a linear
relationship. If the points cluster along a curved line, it is named nonlinear. The strength
of the relationship is depicted by the cloud of points in the scatterplot. The tighter the
points cluster about a line (linear or nonlinear), the stronger the relationship between
the two variables. If the points are not clustered very closely about a line, the association
is weaker. Diffuse clouds of points indicate the absence of a relationship between the
two variables. When one variable (Y) increases with the second variable (X), we say that
X and Y have a positive association. Conversely, when Y decreases as X increases, we say
that they have a negative association.
Figure 12: Scatterplots illustrating different types of association between two variables
The following steps describe how to compute the correlation coefficient between the
R5D index and the other quantitative variables (x-longitude, y-latitude, elev-elevation),
using the Analysis ToolPak:
1. Delete the year column, because it is redundant.
2. Click the Data Analysis button from the Data menu (Figure 8).
4. Fill in the fields of the Correlation dialog box as follows (Figure 13):
a) Click the Input Range field, and select (with the mouse) all data of the x, y,
R5D and elev variables, including the names of these variables from the first
row.
b) Click the Labels in First Row option.
c) Verify that the Output Options is set to New Worksheet Ply.
d) Click the OK button.
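The correlation coefficient measures linear association only, so a value close to zero does not by itself rule out a nonlinear relationship. The sketch below implements the usual Pearson formula in plain Python and applies it to two made-up data sets (they are not the tutorial data): one perfectly linear and one perfectly curved. The function name `pearson_r` is an illustrative choice.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustration data
x = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * v + 1 for v in x]   # points lie exactly on a straight line
curved = [v ** 2 for v in x]      # points lie exactly on a parabola

r_linear = pearson_r(x, linear)
r_curved = pearson_r(x, curved)
print(round(r_linear, 3), round(r_curved, 3))
```

The curved data set is fully determined by x and yet its coefficient is zero, which is why a scatterplot should always accompany the correlation matrix.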
The correlation between the R5D index and the other quantitative variables is very low
(Table 3), which indicates that either the relationship is not linear or there is no
association between them. Considering that the R5D is a precipitation index, we would
expect the R5D values to increase with elevation (positive
association). To investigate further the relationship between the R5D index and
the stations’ elevation, the scatterplot of these variables should be produced. It leads to the
conclusion that there is no association between them (Figure 14).
x y R5D elev
x 1
y 0.2774 1
Additional resources:
See this interactive tutorial to learn more about exploring the possible
relationship between two variables: “Describing Bivariate Data”, in: Online
Statistics Education: An Interactive Multimedia Course of Study, Rice University
(Lead Developer), University of Houston Clear Lake, Tufts University.
http://onlinestatbook.com/2/describing_bivariate_data/bivariate.html [accessed:
14 April, 2016]
2. Add the Excel table with the R5D data by selecting: File menu + Add Data + Add XY
Data....
3. In the Add XY Data... dialog box, browse for the R5D.xls file on your computer, and
select the R5D$ worksheet. The X and Y fields are automatically recognised (Figure
15).
4. Click the OK button in the Table Does Not Have Object-ID Field pop-up window.
search specifications. The Query Builder may be used somewhat like a wizard, as it
allows you to use buttons and lists to construct your query²:
- You can construct valid SQL queries regardless of your data source.
- You can build common queries with no prior knowledge of SQL.
- The conditional operators are filtered based on the chosen field type.
There are a few ways you can gain access to the Query Builder if you need to perform a
query on your feature layer or table records, including the Layer Properties dialog box as
follows:
1. Right-click the R5D$ Events layer in the Contents pane, and select Properties… to
open the Layer Properties dialog box.
2. From the Definition Query tab click the Query Builder… button. Proceed as follows,
or write down the query: `year` = 1990 AND `R5D` <> -9999 (Figure 16).
² “Write a query in the query builder”, ArcGIS Pro Tool reference, http://pro.arcgis.com/en/pro-app/help/mapping/navigation/write-a-query-in-the-query-builder.htm. [Accessed: 15 April 2016]
If you open up the Attribute Table (right-click the R5D$ Events layer, and click Open
Attribute Table), you can see that all 92 records correspond to the year 1990, and that
all R5D values are different from –9999.
The following describes how to export the previously selected data in the R5D layer to a
Shapefile (or Feature Class):
1. Right-click the R5D$ Events layer, and select Data > Export data….
2. Click the Browse button and navigate to the location where you want to store
the shapefile.
3. Change the Save as type drop-down menu to Shapefile. Type the following name for
the new shapefile: R5D_1990.
5. Click the Yes button to add the shapefile as a layer to the current map.
After adding the R5D_1990 shapefile as a layer to the map, the R5D$ Events layer can be
removed by selecting Remove after right-clicking the R5D$ Events layer.
2. The Extensions dialog box lists the extensions currently installed on your system that
work with the application you are using (i.e., ArcMap). Extensions are listed in this
dialog box whether or not you have registered them or whether or not licenses are
currently available for them on your License Manager. To enable the Geostatistical
Analyst extension, check the box next to it.
Enabling an extension does not cause the extension's user interface to appear
automatically; it simply enables any controls that the extension provides. If the
extension's controls are on a toolbar, such as the ArcGIS Geostatistical Analyst extension
toolbar, you will still need to display the toolbar by choosing it from the Toolbars pull-
right menu in the Customize menu (Figure 18).
Additional resources:
This YouTube video shows ArcMap beginners how to use ArcMap 10.
Topics include opening projects, the organization of the data view, adding data to
the project, using tools and toolbars, and saving projects in different ArcMap
version formats. A Basic Introduction to ArcMap 10:
https://www.youtube.com/watch?v=hqHCJUudPvs&list=PL63EB94891DE02AA9
[accessed: 15 April, 2016]
Besides the Regional Histogram (Histogram tool) and the Thiessen polygons (Voronoi
Maps tool), the following sections illustrate the use of Data Posting, Indicator Maps and
the Local Moran's I statistic.
Proceed as follows to display the R5D values using graduated colours (Figure 19 and
Figure 20):
1. Right-click the R5D_1990 layer in the Contents pane, and select Properties….
4. In the Value field select R5D, which is the numeric field that contains the
quantitative data we want to map.
The default classification method is Natural Breaks, which seeks to reduce the
variance within classes and maximize the variance between classes.
The Equal Interval method sets the value ranges in each category equal in
size. The entire range of data values is divided equally into however many
categories have been chosen.
6. Optionally, select a Normalization field to normalize the data. The values in this field
will be used to divide the Value field to create ratios.
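To make the Equal Interval method concrete, the sketch below computes the class breaks by dividing the data range into equally wide classes, and then looks up the class of a value. The minimum and maximum are the R5D values reported in Table 1; the choice of 5 classes and the function names are assumptions made for illustration, and the code does not reproduce how ArcMap computes or rounds its breaks internally.

```python
def equal_interval_breaks(minimum, maximum, n_classes):
    """Upper bounds of n_classes equally wide classes covering [minimum, maximum]."""
    width = (maximum - minimum) / n_classes
    return [minimum + width * k for k in range(1, n_classes + 1)]


def class_index(value, breaks):
    """0-based index of the first class whose upper bound is >= value."""
    for k, upper in enumerate(breaks):
        if value <= upper:
            return k
    return len(breaks) - 1  # guard against floating-point overshoot at the maximum


# R5D minimum and maximum from Table 1; 5 classes is an arbitrary choice
breaks = equal_interval_breaks(49.0, 184.1, 5)
print([round(b, 2) for b in breaks])
print(class_index(88.0, breaks))
```

With a right-skewed variable such as R5D, equally wide classes tend to put most stations in the lowest classes, which is one reason to compare this method with Natural Breaks before settling on a map.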
The map of graduated colours provides a first insight into the spatial distribution of the
R5D index (Figure 21):
- The spatial distribution of the R5D index is fairly homogeneous in the centre of
the study region, varying between 49 mm and 104.7 mm, with the exception of
two locations with higher values.
- The highest values are located in the southern part (Algarve region), in the
northwest corner (Setubal and Troia peninsulas), and in the northeast corner (S.
Mamede mountain range).
- The lowest values are located inland, in the centre of the study region.
- There is no apparent trend over the study domain.
Proceed as follows to display the data values as labels next to the data points (Figure
22):
1. Right-click the R5D_1990 layer in the Contents pane, and select Properties….
4. Select R5D from the Label Field drop-down menu. Click the OK button.
Adding the values of the R5D index to the points’ locations in the map (Figure 23) does
not improve our knowledge of the data distribution and patterns, because the density
of points is high. In this case, the simple use of graduated colours is more appropriate.
2. Check if the Layer field is set to R5D_1990. Otherwise, click the Layer drop-down
arrow and select R5D_1990.
3. Click the Attribute drop-down arrow and select R5D. You may want to resize the
Histogram dialog box so you can also see the map.
Figure 24: Histogram on the Explore Data menu of the Geostatistical Analyst toolbar
Source: ESRI, “Exercise 2: Exploring your data”, ArcGIS Resource Center, Desktop 10,
http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#/Exercise_2_Exploring_your_data/0031000000p8000000/.
[Accessed: 15 April 2016]
The distribution of the R5D attribute is depicted in the histogram (Figure 25) with the
range of values separated into 10 classes (number of Bars). The frequency of data within
each class is represented by the height of each bar.
Generally, the important features of the distribution are its central value, spread, and
symmetry. The histogram indicates that the data is unimodal (one hump) and slightly
asymmetric, as expected from the exploratory data analysis in Excel. The right tail of the
distribution indicates the presence of a few sample points with large R5D values.
While keeping the Ctrl key pressed, click the histogram bars with R5D values ranging
from 1.44 to 1.84 (144 to 184 mm). Note that the x-axis values have been rescaled by a
factor of 10⁻² (i.e., divided by 100) to make them easier to read. The sample points within this range are
selected on the map (Figure 25). These sample points are located within the Troia
peninsula (northwest corner) and Algarve (southern area). Clicking the white
background of the histogram clears the selection.
Figure 25: Regional histogram and descriptive statistics of the R5D index
Click the Select Features button in the Tools toolbar. Select a few points by
dragging the mouse over the map. The bars corresponding to the selected observations
are shown in the histogram (Figure 26). This functionality is useful to investigate the
possible existence of spatial regimes. A proportional effect might be expected in the
southern area, because this is where the R5D index exhibits higher values. However, the points in
this area are spread throughout the histogram, so there is no evidence of a
proportional effect in this area.
Descriptive statistics are also shown in the upper-right corner of the Histogram
window. Besides the summary statistics previously discussed in the Exploratory data
analysis in Excel section, the 1st and 3rd quartiles are also presented:
- The 1st quartile equal to 67.4 means that the R5D index is smaller than 67.4 mm
in 25% of the weather stations (sample points).
- The 3rd quartile equal to 107.95 means that the R5D index is smaller than 107.95
mm in 75% of the weather stations (sample points).
Note that the median corresponds to the 2nd quartile (50% of the sample values are
smaller than the median).
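As an illustration of how quartiles split a sample, the sketch below uses Python's standard `statistics` module on a small made-up sample (not the R5D data). The `method="inclusive"` option is one of several interpolation conventions; statistical packages differ slightly in this respect, so the result is not guaranteed to match the values displayed by ArcMap or Excel for the same data.

```python
import statistics

# Made-up sample, already sorted for readability
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 returns the three cut points that split the data into four parts
q1, q2, q3 = statistics.quantiles(sample, n=4, method="inclusive")

# Share of observations at or below the 1st quartile
share_at_or_below_q1 = sum(v <= q1 for v in sample) / len(sample)

print(q1, q2, q3)
print(q2 == statistics.median(sample))
print(round(share_at_or_below_q1, 2))
```

In a sample this small the share at or below the 1st quartile is only roughly 25%; the approximation improves as the number of observations grows.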
Optionally, you can add the histogram to the layout by pressing the Add to Layout
button. Close the histogram dialog box.
Additional resources:
For further details, see this ArcGIS Help Library topic: “Histograms”, in: ArcGIS
Help Library, ArcGIS Resource Center, ESRI.
http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#/Histograms/00310000000p000000/
[accessed: 15 April, 2016]
This YouTube video shows how to use the Histogram tool in Geostatistical
Analyst: https://www.youtube.com/watch?v=ZGPmcBq4Eac
[accessed: 15 April, 2016]
To illustrate how to create indicator maps of the R5D index, we will use as thresholds the
quartile values determined in the previous section:
- The 1st quartile (25th percentile) is equal to 67.4 mm;
- The 2nd quartile (median = 50th percentile) is equal to 81.25 mm;
- The 3rd quartile (75th percentile) is equal to 107.95 mm.
The indicator maps can be produced by simply using the Symbology tool, or by using a more
sophisticated approach that consists of creating the indicator variables in the attribute
table.
The following explains how to create indicator maps using the Symbology tool:
2. Right-click the duplicate R5D_1990 layer in the Contents pane, and select Properties.
³ Deciles are similar to quartiles. While quartiles split the sorted data into four equal parts (at the 25th, 50th and 75th
percentiles), deciles split them into ten equal parts (at the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th and
90th percentiles).
5. In the Value field select R5D, and click the Classify button.
6. In the Classification window (Figure 27): change the number of Classes to 2; change
the Classification Method to ‘Manual’; insert 67.4 in the first value of the Break
Values field (type 67,4 if your regional settings use a decimal comma). Click OK. Figure 28 shows the resulting indicator map.
7. Repeat all the previous steps with the necessary modifications for the 2nd and
3rd quartiles.
Figure 27: Setting the threshold value of the indicator map in the Classification window
Figure 28: Indicator map of the R5D index using 67.4 mm as threshold
2. Right-click the R5D_1990 layer in the Contents pane, and select Open Attribute
Table.
3. Editing the attributes of features and values in tables takes place within an edit
session. When you've completed your edits, you can save them and end the edit
session. To start an edit session for the R5D_1990 layer, select Start Editing from the
Editor button in the Editor toolbar.
4. Click the Options button in the Table window, and select Add Field… (Figure 29).
5. Type in the new field name as 'Quartile1' and keep the type as Short Integer (Figure
30).
6. The new indicator variable 'Quartile1' takes the value 1 if the R5D values are smaller
than or equal to 67.4, and takes the value 0 otherwise.
a) Right click on the 'Quartile1' field name and select Field Calculator….
b) In the Field Calculator window, switch to Python parser and type (Figure 31):
1 if !R5D! <= 67.4 else 0
c) Click Ok. You can now see the values of the 'Quartile1' field equal to 1 when the
R5D record is smaller than or equal to 67.4. Otherwise, the 'Quartile1' field is equal
to 0.
d) Select Save Edits from the Editor button in the Editor toolbar.
e) Select Stop Editing from the Editor button in the Editor toolbar.
7. Repeat steps 4 to 6 with the necessary modifications for the 2nd and 3rd
quartiles. At the end of this process, the attribute table of the R5D_1990 layer should
look like Figure 32.
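The coding rule applied in the Field Calculator can be checked outside ArcMap with a few lines of plain Python. The sketch below applies the same `<=` rule for the three quartile thresholds given in this tutorial; the function names and the dictionary are illustrative helpers, not part of ArcGIS, and the `!R5D!` field delimiters of the Field Calculator are replaced by an ordinary function argument.

```python
# Quartile thresholds (mm) reported in this tutorial
QUARTILES = {"Quartile1": 67.4, "Quartile2": 81.25, "Quartile3": 107.95}


def indicator(value, threshold):
    """Return 1 if value <= threshold, otherwise 0."""
    return 1 if value <= threshold else 0


def indicator_fields(r5d):
    """Indicator value of one R5D record for each quartile threshold."""
    return {name: indicator(r5d, t) for name, t in QUARTILES.items()}


# Minimum, median and maximum R5D values from Table 1
for r5d in (49.0, 81.25, 184.1):
    print(r5d, indicator_fields(r5d))
```

A value exactly equal to a threshold is coded 1, which matches the "smaller than or equal to" wording of step 6.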
Figure 31: Python parser to code the values of an indicator variable in the Field Calculator window
Figure 32: Attribute table of the R5D_1990 layer with the new indicator variables (Quartile1, Quartile2,
Quartile3)
2. Right-click the duplicate R5D_1990 layer in the Contents pane, and select Properties.
5. In the Value field select Quartile1, and click the Add All Values button.
6. Edit the Labels of the values 0 and 1, for example as in Figure 33. Change the symbol
properties to your preference. Click Ok.
7. Repeat all the previous steps considering the necessary modifications for the 2nd and
3rd quartiles.
Figure 33: Symbology of the 1st indicator variable map (1st quartile of R5D)
Using five or six indicator maps would be more informative. Nevertheless, the indicator
maps of the R5D index (Figure 34) confirm the previous conclusions. There is no
apparent trend over the study domain. The lowest values are located inland, in the
centre of the study region. The highest values are located in the southern part (Algarve
region), in the northwest corner (Setubal and Troia peninsulas), and in the northeast
corner (S. Mamede mountain range). Moreover, in the centre of the study domain, there are
two locations with high values surrounded by lower values.
Many spatial statistics analysis techniques assume that data are stationary, meaning that the
relationship between two points and their values depends on the distance between
them, not on their exact location. Any location inside a Thiessen polygon is
closer to that polygon's data point than to any other data point. This allows exploring the
variation of each sample point based on its relationship to surrounding sample points.
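The defining property of a Thiessen (Voronoi) polygon can be expressed as a nearest-neighbour search: a location belongs to the polygon of whichever sample point is closest to it. The sketch below shows this with three hypothetical stations in arbitrary planar coordinates; the names, coordinates and the function `nearest_station` are made up for illustration and the code does not build the polygon geometry itself.

```python
import math


def nearest_station(point, stations):
    """Name of the station closest to point, i.e. the station whose
    Thiessen polygon contains the point."""
    return min(stations, key=lambda name: math.dist(point, stations[name]))


# Hypothetical stations with planar (x, y) coordinates
stations = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (0.0, 10.0)}

for location in [(2.0, 1.0), (8.0, 1.0), (1.0, 9.0)]:
    print(location, "->", nearest_station(location, stations))
```

Euclidean distance is only appropriate for projected coordinates; for longitude and latitude in degrees a geodesic distance would be needed instead.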
The following illustrates the use of the Voronoi Map tool of ArcGIS Geostatistical Analyst
to explore the R5D index data:
a) Check if the Layer field is set to R5D_1990. Otherwise, click the Layer drop-down
arrow and select R5D_1990.
d) Optionally, change the statistic used to assign values to the polygons in the Type
field. The default is ‘Simple’, which corresponds to the R5D value recorded at the
sample point within each polygon.
e) Optionally, you can add the Voronoi map to the layout by pressing the Add to
Layout button.
f) Optionally, use the Export… button to save the Voronoi map as a Shapefile or
Feature Class, which can be added as a layer. All the statistics in the Type field are
saved in the attribute table of the Voronoi map layer, thus they can be easily
displayed using the Symbology tool. The Simple statistic corresponds to the R5D
field of the Voronoi map layer. Try it!
3. The polygons that are selected in the tool view are linked to points in the ArcMap
data view, which are also highlighted.
The (simple) Voronoi map of the R5D index leads to the following conclusions:
- The spatial distribution of the R5D index is fairly homogeneous in the centre of
the study region, where the spatial autocorrelation pattern seems to be
anisotropic (i.e., ellipse shaped) with the major continuity direction in the south-
southwest/north-northeast (points separated by a large distance in this direction
are similar, as opposed to the perpendicular direction). However, in the southern
region (Algarve), the major continuity direction seems to be west-east. This
indicates the presence of two different spatial regimes in the study domain.
- The highest values are located in the southern part (Algarve region), in the
northwest corner (Setubal and Troia peninsulas), and in the northeast corner (S.
Mamede mountain range).
- The lowest values are located inland, in the centre of the study region.
- Nevertheless, the Cluster Voronoi map (Figure 36) depicts six polygons that
might correspond to spatial outliers. However, as stated before, the daily
precipitation data used to compute the R5D index were subject to a thorough
quality analysis, so we can assume that there are no measurement errors or
anomalies in the data. The R5D values of these polygons most likely reflect local
meteorological conditions that are different from the surrounding areas.
Given a set of features and an analysis attribute⁴, the Cluster and Outlier Analysis
(Anselin Local Moran's I) tool identifies statistically significant local clusters and spatial
outliers using Anselin’s Local Moran's I statistic. This tool creates a new output
feature class with the following attributes for each feature in the input feature class:
Local Moran's I value, z-score, p-value, and COType, which is a code representing the
cluster type for each statistically significant feature (Figure 37). The cluster/outlier type
(COType) field distinguishes between a statistically significant cluster of high values (HH),
cluster of low values (LL), outlier in which a high value is surrounded primarily by low
values (HL), and outlier in which a low value is surrounded primarily by high values (LH).
Results are only reliable if the dataset contains at least 30 features.
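To give an intuition for the four cluster/outlier codes, the sketch below computes a textbook form of the Local Moran's I statistic with row-standardised weights and labels each feature by the sign of its own deviation and of its neighbours' average deviation. This is a simplified illustration under stated assumptions: the neighbour lists, the data and the function `local_moran` are hypothetical, the scaling term may differ from the one used by ArcGIS, and no z-scores or p-values are computed, whereas the tool only assigns a COType to statistically significant features.

```python
def local_moran(values, neighbours):
    """Local Moran's I and high/low quadrant label for each feature.

    values:     list of attribute values, one per feature
    neighbours: neighbours[i] lists the indices of the features treated as
                neighbours of feature i (assumed non-empty)
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]      # deviations from the mean
    m2 = sum(d * d for d in z) / n      # variance-like scaling term
    results = []
    for i in range(n):
        # Row-standardised spatial lag: average deviation of the neighbours
        lag = sum(z[j] for j in neighbours[i]) / len(neighbours[i])
        local_i = z[i] * lag / m2
        label = ("H" if z[i] > 0 else "L") + ("H" if lag > 0 else "L")
        results.append((local_i, label))
    return results


# Hypothetical transect of five features; each one neighbours the adjacent ones
values = [1, 2, 20, 2, 1]
neighbours = [[1], [0, 2], [1, 3], [2, 4], [3]]

results = local_moran(values, neighbours)
for value, (local_i, label) in zip(values, results):
    print(value, round(local_i, 2), label)
```

In this made-up transect the central high value is surrounded by low values, so its statistic is negative (label HL), while the two end features sit among other low values and have a positive statistic (label LL).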
⁴ This tool requires an input field such as a count, rate, or other numeric measurement.
Figure 37: Illustration of the Cluster and Outlier Analysis (Anselin Local Moran's I) tool
Source: ESRI, “Cluster and Outlier Analysis (Anselin Local Moran's I)”, ArcGIS for Desktop.
ArcGIS Pro, http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/cluster-and-outlier-analysis-anselin-local-moran-s.htm.
[Accessed: 21 April 2016]
A positive value for I indicates that a feature [point or polygon] has neighbouring
features with similarly high or low attribute values; this feature is part of a cluster. A
negative value for I indicates that a feature has neighbouring features with dissimilar
values; this feature is an outlier. In either instance, the p-value for the feature must be
small enough for the cluster or outlier to be considered statistically significant. By
default, they are considered statistically significant if the p-value is smaller than 0.05
(95% confidence level). This analytical approach creates issues with both multiple testing
and dependency. The following paragraphs explain these issues and how to deal with
them⁵:
Multiple Testing — With a confidence level of 95 percent, probability theory tells us that
there are 5 out of 100 chances that a spatial pattern could appear structured (clustered
or dispersed, for example) and could be associated with a statistically significant p-value,
when in fact the underlying spatial processes promoting the pattern are truly random.
We would falsely reject the ‘complete spatial randomness’ null hypothesis in these cases
because of the statistically significant p-values. Five chances out of 100 seems quite
conservative until you consider that local spatial statistics perform a test for every
feature in the dataset. If there are 10,000 features, for example, we might expect as
many as 500 false results.
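The arithmetic behind this figure can be sketched in a few lines of Python. The function names are hypothetical, and the second function additionally assumes that the tests are independent, which spatial data generally violates.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false rejections when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more false rejections, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


# 10,000 features tested at the 95 percent confidence level
print(round(expected_false_positives(10000)))
print(prob_at_least_one_false_positive(10))
```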
Spatial Dependency — Features near to each other tend to be similar; more often than
not spatial data exhibits this type of dependency. Nonetheless, many statistical tests
require features to be independent. For local pattern analysis tools this is because
spatial dependency can artificially inflate statistical significance. Spatial dependency is
exacerbated with local pattern analysis tools because each feature is evaluated within
the context of its neighbours, and features that are near each other will likely share
many of the same neighbours. This overlap accentuates spatial dependency.
⁵ ESRI (2016). “What is a z-score? What is a p-value?”. ArcGIS Pro, Tool Reference, Spatial Statistics toolbox,
http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/what-is-a-z-score-what-is-a-p-value.htm.
[Accessed: 21 April 2016]
ESRI (2016). “Modeling spatial relationships”. ArcGIS Pro, Tool Reference, Spatial Statistics toolbox,
http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/modeling-spatial-relationships.htm.
[Accessed: 21 April 2016]
In ArcGIS 10.2 or later, the Cluster and Outlier Analysis (Anselin Local Moran's I) tool
provides an optional Boolean parameter, the False Discovery Rate (FDR) Correction,
which will potentially account for multiple testing and spatial dependency. For this
method, statistically significant p-values are ranked from smallest (strongest) to largest
(weakest), and based on the false positive estimate, the weakest are removed from this
list. The remaining features with statistically significant p-values are identified by the
COType field in the output feature class. When no FDR correction is applied, features
with p-values smaller than 0.05 are considered statistically significant. The FDR
correction reduces this p-value threshold from 0.05 to a value that better reflects the 95
percent confidence level given multiple testing. While not perfect, empirical tests show
this method performs much better than assuming that each local test is performed in
isolation, or applying the classical, overly conservative, multiple test methods (e.g.,
Bonferroni or Sidak corrections)⁶.
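To make the ranking idea concrete, the following Python sketch applies the Benjamini-Hochberg step-up procedure, a standard FDR method. Whether ArcGIS implements exactly this variant is an assumption; the function names and the p-values are hypothetical.

```python
def fdr_threshold(p_values, alpha=0.05):
    """Largest p-value still significant under Benjamini-Hochberg, or None."""
    ranked = sorted(p_values)  # smallest (strongest) first
    m = len(ranked)
    threshold = None
    for rank, p in enumerate(ranked, start=1):
        # The rank-th smallest p-value is compared with alpha * rank / m
        if p <= alpha * rank / m:
            threshold = p
    return threshold


def fdr_significant(p_values, alpha=0.05):
    """Boolean flag per feature: significant after the FDR correction?"""
    t = fdr_threshold(p_values, alpha)
    return [t is not None and p <= t for p in p_values]


# Hypothetical p-values for five features
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
print([p < 0.05 for p in p_values])  # without correction
print(fdr_significant(p_values))     # with correction
```

With these hypothetical values, four features pass the uncorrected 0.05 threshold, while only the two strongest remain after the correction, because the effective threshold drops to 0.008.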
Distance Band or Threshold Distance sets the scale of analysis for most
conceptualizations of spatial relationships (e.g., Inverse distance and Fixed distance
band). It is a positive numeric value representing a cut-off distance. Features outside the
specified cut-off for a target feature are ignored in the analysis for that feature. The
Calculate Distance Band from Neighbor Count tool will evaluate minimum, average, and
maximum distances for a specified number of neighbours and can help you determine
an appropriate distance band value to use for analysis. See also Selecting a fixed distance
band value for additional guidelines. These are a few recommendations:
- Use a distance band that is large enough to ensure all features will have at least
one neighbour, or results will not be valid.
- Especially if the values for the input field are skewed, each feature should have
about eight neighbours.
- Use a distance band that reflects maximum spatial autocorrelation. Run the
Incremental Spatial Autocorrelation tool and note where the resulting z-scores
seem to peak. Use the distance associated with the peak value for your analysis.
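The first two recommendations can be checked with a small Python sketch. It only mimics the idea behind the Calculate Distance Band from Neighbor Count tool (minimum, average and maximum distance to the k-th nearest neighbour) and is not the tool itself; the function names and coordinates are hypothetical, and planar (projected) coordinates are assumed.

```python
import math


def kth_neighbour_distances(points, k=1):
    """Distance from each point to its k-th nearest neighbour."""
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return out


def distance_band_summary(points, k=1):
    """Minimum, average and maximum distance to the k-th nearest neighbour."""
    d = kth_neighbour_distances(points, k)
    return min(d), sum(d) / len(d), max(d)


def neighbour_counts(points, band):
    """Number of neighbours each point has within the distance band."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= band)
        for i, p in enumerate(points)
    ]


# Hypothetical projected coordinates
points = [(0, 0), (3, 4), (3, 0), (10, 0)]
d_min, d_avg, d_max = distance_band_summary(points, k=1)
# d_max is the smallest band that gives every feature at least one neighbour
print(d_min, d_avg, d_max)
print(neighbour_counts(points, d_max))
```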
⁶ Caldas de Castro, M., & Singer, B. H. (2006). Controlling the false discovery rate: a new application to
account for multiple and dependent tests in local statistics of spatial association. Geographical Analysis,
38(2), 180-208.
⁷ If none of the options for the Conceptualization of Spatial Relationships parameter work well for your
analysis, you can create an ASCII text file or table with the feature-to-feature relationships you want and
then use these to build a spatial weights matrix file. If one of the options above is close, but not perfect for
your purposes, you can use the Generate Spatial Weights Matrix tool to create a basic SWM file, then edit
your spatial weights matrix file.
Before applying the Cluster and Outlier Analysis (Anselin Local Moran's I) tool, we also
need to select an appropriate Conceptualization of Spatial Relationships⁷:
- Fixed distance band: works well for point data. It is often a good option for
polygon data when there is a large variation in polygon size (very large polygons
at the edge of the study area and very small polygons at the centre of the study
area, for example), and you want to ensure a consistent scale of analysis.
- Zone of indifference: works well when fixed distance is appropriate but imposing
sharp boundaries on neighbourhood relationships is not an accurate
representation of your data. Keep in mind that the zone of indifference
conceptual model considers every feature to be a neighbour of every other
feature. Consequently, this option is not appropriate for large datasets since the
Distance Band or Threshold Distance value supplied does not limit the number of
neighbours but only specifies where the intensity of spatial relationships begins
to wane.
- K nearest neighbours: effective when you want to ensure you have a minimum
number of neighbours for your analysis. Especially when the values associated
with your features are skewed (are not normally distributed), it is important that
each feature is evaluated within the context of at least eight or so neighbours
(this is a rule of thumb only).
When the distribution of your data varies across your study area so that some
features are far away from all other features, this method works well. Note,
however, that the spatial context of your analysis changes depending on
variations in the sparsity/density of your features. When fixing the scale of
analysis is less important than fixing the number of neighbours, the K nearest
neighbours method is appropriate.
- Delaunay triangulation: good option when your data includes island polygons
(isolated polygons that do not share any boundaries with other polygons), or in
cases where there is a very uneven spatial distribution of features. It is not
appropriate when you have coincident features, however.
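The difference between some of these conceptualizations is easier to see as spatial weights. The Python sketch below builds weights for a fixed distance band, inverse distance squared and K nearest neighbours; zone of indifference and Delaunay triangulation are left out. The function names and coordinates are hypothetical, planar coordinates are assumed, and coincident points are not handled.

```python
import math


def pairwise_distances(points):
    """Euclidean distance between every pair of (x, y) points."""
    return [[math.dist(p, q) for q in points] for p in points]


def fixed_distance_band(points, band):
    """Weight 1 for neighbours within the band, 0 otherwise."""
    d = pairwise_distances(points)
    n = len(points)
    return [[1.0 if i != j and d[i][j] <= band else 0.0 for j in range(n)]
            for i in range(n)]


def inverse_distance_squared(points, band=None):
    """Weight 1/d^2; an optional band acts as a cut-off distance.

    Coincident points (d == 0) raise ZeroDivisionError in this sketch.
    """
    d = pairwise_distances(points)
    n = len(points)
    return [[1.0 / d[i][j] ** 2
             if i != j and (band is None or d[i][j] <= band) else 0.0
             for j in range(n)]
            for i in range(n)]


def k_nearest_neighbours(points, k):
    """Weight 1 for each of the k closest features, 0 otherwise."""
    d = pairwise_distances(points)
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])
        for j in others[:k]:
            weights[i][j] = 1.0
    return weights


# Hypothetical points: the third one is relatively isolated
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
print(fixed_distance_band(points, 1.5))   # last row has no neighbours
print(k_nearest_neighbours(points, 1))    # every row has exactly one
print(inverse_distance_squared(points))   # influence decays with distance
```

In this hypothetical layout the fixed band leaves the isolated point without neighbours, whereas K nearest neighbours always assigns one, at the cost of a scale of analysis that varies from feature to feature.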
Table 4 indicates how different choices for the Conceptualization of Spatial Relationships
parameter behave for each possible input type of the Distance Band or Threshold
Distance parameter.
Table 4: Interaction of the Conceptualization of Spatial Relationships parameter with possible Distance
Band or Threshold Distance values
Adapted from: ESRI (2016). “Modeling spatial relationships”. ArcGIS Pro, Tool Reference, Spatial
Statistics toolbox, http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/modeling-spatial-relationships.htm.
[Accessed: 21 April 2016]
Distance Band / Threshold Distance = 0
- Inverse Distance, Inverse Distance Squared: No threshold or cut-off is applied; every
feature is a neighbour of every other feature.
- Fixed Distance Band, Zone of Indifference: Invalid. Runtime error will be generated.
- Polygon Contiguity, Delaunay Triangulation, K Nearest Neighbours: Ignored.
Distance Band / Threshold Distance = blank
- Inverse Distance, Inverse Distance Squared: A default distance will be computed. This
default will be the minimum distance to ensure that every feature has at least one
neighbour.
- Fixed Distance Band, Zone of Indifference: A default distance will be computed. This
default will be the minimum distance to ensure that every feature has at least one
neighbour.
- Polygon Contiguity, Delaunay Triangulation, K Nearest Neighbours: Ignored.
Distance Band / Threshold Distance = positive number
- Inverse Distance, Inverse Distance Squared: The nonzero, positive value specified will
be used as a cut-off distance; neighbour relationships will only exist among features
within this distance of each other.
- Fixed Distance Band, Zone of Indifference: For fixed distance band, only features
within this specified cut-off of each other will be neighbours. For zone of indifference,
features within this specified cut-off of each other will be neighbours; features outside
the cut-off are neighbours too, but they are assigned a smaller and smaller
weight/influence as distance increases.
- Polygon Contiguity, Delaunay Triangulation, K Nearest Neighbours: Ignored.
Considering that the R5D index corresponds to a continuous variable, we will select
Inverse Distance Squared for the Conceptualization of Spatial Relationships. We will use
the Incremental Spatial Autocorrelation tool to select an appropriate Threshold
Distance. This tool measures spatial autocorrelation for a series of distances and
optionally creates a line graph of those distances and the corresponding z-scores of the
Global Moran’s I statistic. Z-scores reflect the intensity of spatial clustering, and
statistically significant peak z-scores indicate distances where spatial processes
promoting clustering (i.e., positive spatial autocorrelation) are most pronounced. These
peak distances are often appropriate values to use in tools that require a Distance Band
or Threshold Distance parameter.
When more than one statistically significant peak is present, clustering is pronounced at
each of those distances. Select the peak distance that best corresponds to the scale of
analysis you are interested in; often this is the first statistically significant peak
encountered (Figure 38).
If you are working with point data and the z-score never peaks (in other words, it just
keeps decreasing), it means there are many different spatial processes operating at a
variety of spatial scales and you will likely need to come up with different criteria for
determining the fixed distance to use in your analysis.
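The peak-selection rule can be sketched in Python as follows. This is only an illustration of the reasoning, not the tool's own logic: the function name and the z-scores are hypothetical, a peak is taken to be an interior point higher than both of its neighbours, and 1.96 is used as the critical z-score for the 95% confidence level.

```python
def first_significant_peak(distances, z_scores, z_crit=1.96):
    """Return (distance, z-score) of the first significant local maximum.

    Returns None when the z-scores never peak above z_crit.
    """
    for i in range(1, len(z_scores) - 1):
        is_peak = z_scores[i] > z_scores[i - 1] and z_scores[i] > z_scores[i + 1]
        if is_peak and z_scores[i] > z_crit:
            return distances[i], z_scores[i]
    return None


# Hypothetical distances (meters) and z-scores with two peaks
distances = [10000, 20000, 30000, 40000, 50000]
z_scores = [2.1, 3.4, 3.0, 3.8, 3.5]
print(first_significant_peak(distances, z_scores))

# A series that keeps decreasing has no peak
print(first_significant_peak([1, 2, 3, 4], [5.0, 4.0, 3.0, 2.0]))
```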
Figure 38: Illustration of z-score peaks in the Spatial Autocorrelation by Distance graph.
Source: ESRI (2016). “Incremental Spatial Autocorrelation”. ArcGIS Pro, Tool Reference, Spatial Statistics
toolbox > Analyzing Patterns toolset, http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/incremental-spatial-autocorrelation.htm.
[Accessed: 21 April 2016]
1. Open ArcToolbox, and browse to Spatial Statistics Tools > Analyzing Patterns.
a) Select the R5D_1990 layer in the Input Features field. This is the input
feature class.
b) Select R5D in the Input Field field. This is the analysis attribute.
Notes: Before using this tool, be sure to project your data if your study area extends
beyond 30 degrees.
If the tool fails to execute and returns the message "Error 001143 background
server threw an exception", disable background processing and run the tool
again:
If we run the Incremental Spatial Autocorrelation tool again, but with the Beginning
Distance parameter equal to 20000, we obtain the graph in Figure 41. However, the
Output Report File also warns that “At least one distance increment
resulted in features with no neighbours which may invalidate the significance of the
corresponding results” for the first three points (20000, 30894.87 and 41789.75 meters). In
other words, the significance of the z-score may be invalid for at least one of these points.
Therefore, the previous value found seems to be adequate.
Figure 40: Spatial autocorrelation of the R5D index by distance (z-scores of the Global Moran’s I statistic)
Figure 41: Spatial autocorrelation of the R5D index by distance (z-scores of the Global Moran’s I statistic)
with the Beginning Distance parameter equal to 20000 meters
Finally, we have all the information we need to use the Cluster and Outlier Analysis
(Anselin Local Moran's I) tool:
1. Open ArcToolbox, and browse to Spatial Statistics Tools > Mapping Clusters.
3. In the Cluster and Outlier Analysis (Anselin Local Moran's I) dialog-box (Figure 42):
b) Select R5D in the Input Field field. This is the analysis attribute.
c) In the Output Feature Class field, click the Browse button and navigate
to the location where you want to store the resulting feature class. Save it as a
shapefile named R5D_LocalMoran.
f) In ArcGIS 10.2 or later, check the False Discovery Rate (FDR) Correction
option.
Results show that there are no spatial outliers (Figure 43). A cluster of low values is
located in the centre of the study domain, and two clusters of high values are located in
the southern area (Algarve), more specifically near the Monchique (west) and Caldeirão
(east) mountain ranges. A single point in the northwest corner (Troia peninsula)
corresponds to a third cluster of high values, which confirms that this extreme value is
not an outlier. The remaining 69 points are not statistically significant, which means that
we do not have enough evidence to reject the ‘complete spatial randomness’ hypothesis
with 95% confidence. Note that these results were obtained without the False Discovery
Rate (FDR) Correction, which is not available in ArcGIS 10.1.
Figure 42: Cluster and Outlier Analysis (Anselin Local Moran's I) dialog-box
Given a set of features and an associated attribute, the Spatial Autocorrelation (Morans I)
tool evaluates whether the pattern expressed is clustered, dispersed, or random. When the
z-score or p-value
indicates statistical significance, a positive Moran's I index value indicates tendency
toward clustering while a negative Moran's I index value indicates tendency toward
dispersion.
In general, the Global Moran's I statistic is bounded by –1 and 1. This is always the case
when your weights are row standardized. When you do not row standardize the
weights, there may be instances where the statistic value falls outside the –1 to 1 range,
and this indicates a problem with your parameter settings. For polygon features, you will
almost always want to row standardize.
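The effect of row standardization can be illustrated with a minimal Python sketch of the Global Moran's I statistic, using the usual textbook form: n divided by the sum of all weights, times the weighted cross-products of deviations from the mean, divided by the sum of squared deviations. The function names, weights and values are hypothetical, and no z-score or p-value is computed.

```python
def row_standardize(weights):
    """Divide each weight by its row sum so that every row sums to 1."""
    out = []
    for row in weights:
        s = sum(row)
        out.append([w / s for w in row] if s else list(row))
    return out


def global_morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n) if i != j)
    den = sum(d * d for d in dev)
    return (n / s0) * num / den


# Hypothetical example: four features along a line, binary contiguity weights
binary = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
trend = [1, 2, 3, 4]        # similar values next to each other
alternating = [1, 4, 1, 4]  # dissimilar values next to each other

print(global_morans_i(trend, binary))
print(global_morans_i(trend, row_standardize(binary)))
print(global_morans_i(alternating, row_standardize(binary)))
```

The smoothly increasing series gives a positive index (tendency toward clustering), while the alternating series gives a negative one (tendency toward dispersion).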
Similarly to the Cluster and Outlier Analysis (Anselin Local Moran's I) tool, we must
choose an appropriate Conceptualization of Spatial Relationships method, as well as the
Distance Band or Threshold Distance. All previous recommendations on these issues are
valid.
Proceed as follows to use the Spatial Autocorrelation (Morans I) tool with the R5D index
data:
1. Open ArcToolbox, and browse to Spatial Statistics Tools > Analyzing Patterns.
a) Select the R5D_1990 layer in the Input Features field. This is the input
feature class.
b) Select R5D in the Input Field field. This is the analysis attribute.
c) Check the Generate Report (optional) option. The path to the HTML report
will be included with the messages summarising the tool execution
parameters⁸. If this window is closed, open it by selecting: Geoprocessing
menu > Results.
For the R5D index data, the Global Moran's I value of 0.51 is statistically significant
(p-value < 0.05) (Table 5), which provides evidence that the spatial distribution of high
values and/or low values in the dataset is more spatially clustered than would be
expected if underlying spatial processes were random.
⁸ The HTML report is saved in the default folder. For example:
C:\...\My Documents\ArcGIS\Default.gdb\MoransI_Result.html.
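The link between a z-score and the 0.05 significance threshold can be sketched in Python, assuming that the z-score follows a standard normal distribution under the null hypothesis and that the test is two-sided. The function name is hypothetical and the z-scores below are illustrative, not the values reported for the R5D index.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative z-scores: 1.96 corresponds approximately to p = 0.05
for z in (1.0, 1.96, 2.58):
    print(z, round(two_sided_p_value(z), 4))
```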