
An overview of the Spatial Statistics toolbox


ArcGIS 10.3


The Spatial Statistics toolbox contains statistical tools for analyzing spatial distributions, patterns, processes, and relationships. While there
may be similarities between spatial and nonspatial (traditional) statistics in terms of concepts and objectives, spatial statistics are unique in
that they were developed specifically for use with geographic data. Unlike traditional nonspatial statistical methods, they incorporate space
(proximity, area, connectivity, and/or other spatial relationships) directly into their mathematics.
The tools in the Spatial Statistics toolbox allow you to summarize the salient characteristics of a spatial distribution (determine the mean
center or overarching directional trend, for example), identify statistically significant spatial clusters (hot spots/cold spots) or spatial outliers,
assess overall patterns of clustering or dispersion, group features based on attribute similarities, identify an appropriate scale of analysis, and
explore spatial relationships. In addition, for those tools written with Python, the source code is available to encourage you to learn from,
modify, extend, and/or share these and other analysis tools with others.

Note: The tools in the Spatial Statistics toolbox will not work directly with an XY Event Layer (a layer
created from a table containing x-coordinate and y-coordinate fields). Use the Copy Features tool
to first convert the XY Event data into a feature class before you run your analysis.
When using shapefiles, keep in mind that they cannot store null values. Tools or other procedures
that create shapefiles from nonshapefile inputs may store or interpret null values as zero. In some
cases, nulls are stored as very large negative values in shapefiles. This can lead to unexpected
results. See Geoprocessing considerations for shapefile output for more information.
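For example, a minimal arcpy sketch of the XY Event Layer conversion described above (the table name, coordinate fields, spatial reference, and output path are hypothetical):

import arcpy

# Hypothetical inputs: a table of incident records with longitude/latitude fields.
incident_table = r"C:\data\incidents.gdb\incident_records"
xy_layer = "incidents_xy_layer"
output_fc = r"C:\data\incidents.gdb\incident_points"

# Build a temporary XY event layer from the coordinate fields ...
arcpy.MakeXYEventLayer_management(incident_table, "LONGITUDE", "LATITUDE",
                                  xy_layer, arcpy.SpatialReference(4326))

# ... then persist it as a feature class the Spatial Statistics tools can use.
# (Consider projecting the result before running distance-based analyses.)
arcpy.CopyFeatures_management(xy_layer, output_fc)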

Toolset | Description
Analyzing Patterns | These tools evaluate if features, or the values associated with features, form a clustered, dispersed, or random spatial pattern.
Mapping Clusters | These tools may be used to identify statistically significant hot spots, cold spots, or spatial outliers. There are also tools to identify or group features with similar characteristics.
Measuring Geographic Distributions | These tools address questions such as: Where's the center? What's the shape and orientation? How dispersed are the features?
Modeling Spatial Relationships | These tools model data relationships using regression analyses or construct spatial weights matrices.
Rendering | These tools may be helpful for rendering analysis results.
Utilities | These utility tools perform a variety of miscellaneous functions: computing areas, assessing minimum distances, exporting variables and geometry, converting spatial weights files, and collecting coincident points.

Spatial Statistics toolsets

Additional resources:
www.esriurl.com/spatialstats contains an up-to-date list of all of the resources available for using the Spatial Statistics tools, including:
 Tutorials
 Videos
 Free web seminars
 Books, articles, and white papers
 Sample scripts and case studies

Copyright © 1995-2014 Esri. All rights reserved.

Spatial Statistics toolbox licensing


Instructions
(DO NOT TRANSLATE THIS SECTION) There is one table for each toolset within the toolbox. If there are nested toolsets (a toolset within a
toolset), the nested toolset has its own table as well (for example, the Data Management toolbox has "Projections and
Transformations/Feature toolset"). If the toolbox does not contain toolsets, or there are tools that are not in a particular toolset, there will
be a table for the toolbox.

Note: DO NOT use the <esri_license> attribute that you see in the tool prolog of your tool, toolset,
or toolbox documentation. Be sure that's set to "none" (click inside the <esri_license> tag and
look at the attribute inspector in XMetaL). The problem with setting the <esri_license>
attribute is that it causes the publisher to output something like "This topic applies to arcinfo
license level only". The problem with this is that we end up with license info in two places and we
have to maintain both places. We don't want to do that. It was a nightmare at 9.3, led to all
sorts of mistakes, and now that we're going to basic, standard, and advanced licenses, it all has
to be redone. There's one place, one place only, where we doc gp tool licenses, and it's in these
tables.

 The first row of the table contains the default values for each tool in the table.
 Table Cell Values:
 Y—The tool is available for the product level found in the column header (ArcGIS for Advanced, Basic, Standard)

 N—The tool is not available for the product level. For example, the Erase tool is only available with an Advanced license, so the row would have "N" "N" "Y".


 L—The tool has a limitation at the license level. THIS IS NEW AT 10.1 and is used when constructing the new header with the 3
check boxes. For example, the Buffer tool has a couple of parameters that are disabled at the Basic and Standard level. Putting an
'L' tells the publisher to construct a header that has a special symbol for the level; instead of a check mark, it's a partially filled-in
box. WHEN AN 'L' IS USED, there is almost always a note in the last column to the effect "Some parameters are limited by license
level—see the tool reference page for more detail."—see below.
 empty—Inherit the default value found in the first row of the table.
 Extension name—If the tool requires an extension, enter the name(s) of the extension. If there is more than one extension (that
is, spatial analyst or geostatistical analyst), separate the extensions with "or" (that is, "Spatial Analyst or Geostatistical Analyst").
Use variables for the extension name (see below; we've pasted these variables into these instructions). When published, we add
"Requires". For example, "Requires Spatial Analyst"—go look at a published tool reference page if you don't understand how we
use the extension names. Please don't do any formatting of text (bold, italic, and so on)—plain text please. Note that we're not
using the new corporate branded phrases like "ArcGIS Network Analyst extension"
 The last column is for any special licensing information. Do NOT use <esri_note> here—just type in the text. The only use for this
cell is for those tools that license certain parameters (or parameter values) at a certain license level. Typically, this is only for tools in
the Analysis toolbox, such as Buffer. In these cases, we just want a general statement, such as "Some parameters are limited by license
level—see the tool reference page for more detail." In these cases, the table cell will typically have an "L" instead of a "Y" or "N"—see
above. YOU DO NOT NEED TO SAY THINGS LIKE "This tool is available with an ArcGIS for Basic or ArcGIS for Standard license, if the Spatial Analyst or 3D Analyst extension is installed." This situation is captured by entering "Spatial Analyst or 3D Analyst" in the ArcGIS for Basic and Standard cells and a "Y" in the ArcGIS Advanced cell.
 There is no default note column. That is, the last column of the first row (containing toolset default values) is ignored.
 To add a new tool, add a new row to the table using Table > Insert Rows or Columns in XMetaL. Please insert the new tool row in
alphabetical order. Make sure the tool is an xref to the CORRECT tool reference page.
 To add a new toolset, insert a new <esri_section1> in the appropriate place (again -- alphabetical order please) and make it look like
other toolsets -- section title = Toolset: <toolset name>, first row = Toolset: <toolset name>, etc. Note that we anticipate one table
per toolset so don't add extra tables.
BUGS:

Caution: Cells in this first column should not have a <p> tag. That is, <p><xref>your
tool</xref></p> DOES NOT WORK... your tool will show up as unlicensed. You just
want <xref>your tool</xref> in this column. You should edit in "Tags on" view (View
menu > Tags On) to ensure you don't have <p> tags around your <xref>. This is a silly
bug, but a bug nonetheless.

Caution: Do not use <draft-comment> in any of the first 3 columns. There's a bug in the
publisher where text inside a <draft-comment> GETS PUBLISHED TO OUTPUT.

Common Extensions
(DO NOT TRANSLATE THIS SECTION) Here are the common extensions variables for you to cut/paste. If there's a missing extension in this
list, no worries, it's just a spell-checked list of official names--just add your extension/product name to the tables. These variables are
found in $\Shared Libraries\Reused Content\Product Names. At the bottom of the Product Names topic is a section titled "Extensions - short
names" that contains these variables. See your doc lead if you have questions about editing the Product Names topic or inserting variables.
 3D Analyst
 Spatial Analyst
 Network Analyst
 Geostatistical Analyst
 Schematics
 Tracking Analyst
 Data Interoperability
 Data Reviewer
 Workflow Manager
 Roads and Highways
 Military Analyst
 Military Overlay Editor (MOLE)

Toolset: Analyzing Patterns

Tool | ArcGIS for Desktop Basic | ArcGIS for Desktop Standard | ArcGIS for Desktop Advanced | Note
toolset: Analyzing Patterns | Y | Y | Y |
Average Nearest Neighbor | | | |
High/Low Clustering | | | |
Incremental Spatial Autocorrelation | | | |
Multi-Distance Spatial Cluster Analysis | | | |
Spatial Autocorrelation | | | |

Toolset: Mapping Clusters

Tool | ArcGIS for Desktop Basic | ArcGIS for Desktop Standard | ArcGIS for Desktop Advanced | Note
toolset: Mapping Clusters | Y | Y | Y |
Cluster and Outlier Analysis: Anselin Local Moran's I | | | |
Grouping Analysis | | | |
Hot Spot Analysis | | | |
Optimized Hot Spot Analysis | | | | ArcGIS Spatial Analyst extension for the Density Surface parameter
Similarity Search | | | |

Toolset: Measuring Geographic Distributions

Tool | ArcGIS for Desktop Basic | ArcGIS for Desktop Standard | ArcGIS for Desktop Advanced | Note
toolset: Measuring Geographic Distributions | Y | Y | Y |
Central Feature | | | |
Directional Distribution (Standard Deviational Ellipse) | | | |
Linear Directional Mean | | | |
Mean Center | | | |
Median Center | | | |
Standard Distance | | | |

Toolset: Modeling Spatial Relationships

Tool | ArcGIS for Desktop Basic | ArcGIS for Desktop Standard | ArcGIS for Desktop Advanced | Note
toolset: Modeling Spatial Relationships | Y | Y | Y |
Exploratory Regression | | | |
Generate Network Spatial Weights | Network Analyst | Network Analyst | Network Analyst |
Generate Spatial Weights Matrix | | | |
Geographically Weighted Regression | Spatial Analyst or Geostatistical Analyst | Spatial Analyst or Geostatistical Analyst | Y |
Ordinary Least Squares | | | |

Toolset: Rendering

Tool | ArcGIS for Desktop Basic | ArcGIS for Desktop Standard | ArcGIS for Desktop Advanced | Note
toolset: Rendering | Y | Y | Y |
Cluster/Outlier Analysis with Rendering | | | |
Collect Events with Rendering | | | |
Count Rendering | | | |
Hot Spot Analysis with Rendering | | | |
Z Score Rendering | | | |

Toolset: Utilities

Tool | ArcGIS for Desktop Basic | ArcGIS for Desktop Standard | ArcGIS for Desktop Advanced | Note
toolset: Utilities | Y | Y | Y |
Calculate Areas | | | |
Calculate Distance Band from Neighbor Count | | | |
Collect Events | | | |
Convert Spatial Weights Matrix to Table | | | |
Export Feature Attribute to ASCII | | | |

Copyright © 1995-2014 Esri. All rights reserved.

Spatial Statistics toolbox sample applications


Epidemiologists, crime analysts, demographers, emergency response planners, transportation analysts, archaeologists, wildlife biologists, retail
analysts, and many other GIS practitioners increasingly need advanced spatial analysis tools. Spatial statistics help fill this need.
Spatial statistics allow you to
 Summarize the key characteristics of a distribution
 Identify statistically significant spatial clusters (hot spots/cold spots) and spatial outliers
 Assess overall patterns of clustering or dispersion
 Partition features into similar groups

 Identify features with similar characteristics
 Model spatial relationships
Summarize Key Characteristics

Questions | Tools | Examples
Where is the center? | Mean Center or Median Center | Where is the population center, and how is it changing over time?
Which feature is most accessible? | Central Feature | Where should the new support center be located?
What is the dominant direction or orientation? | Linear Directional Mean | What is the primary wind direction in the winter? How are fault lines oriented in this region?
How dispersed, compact, or integrated are features? | Standard Distance or Directional Distribution (Standard Deviational Ellipse) | Which gang operates over the broadest territory? Which disease strain has the widest distribution? Based on animal sightings, to what extent are species integrated?
Are there directional trends? | Directional Distribution (Standard Deviational Ellipse) | What is the orientation of the debris field? Where is the debris concentrated?

Identify Statistically Significant Clusters

Questions | Tools | Examples
Where are the hot spots? Where are the cold spots? How intense is the clustering? | Hot Spot Analysis (Getis-Ord Gi*) or Cluster and Outlier Analysis (Anselin Local Moran's I) or Optimized Hot Spot Analysis | Where are the sharpest boundaries between affluence and poverty? Where are biological diversity and habitat quality highest?
Where are the outliers? | Cluster and Outlier Analysis (Anselin Local Moran's I) | Where do we find anomalous spending patterns in Los Angeles?
How can resources be most effectively deployed? | Hot Spot Analysis (Getis-Ord Gi*) or Optimized Hot Spot Analysis | Where do we see unexpectedly high rates of diabetes? Where are kitchen fires a higher-than-expected proportion of residential fires? Do crimes committed during the daytime have the same spatial pattern as those committed at night?
Which locations are farthest from the problem? | Hot Spot Analysis (Getis-Ord Gi*) or Optimized Hot Spot Analysis | Where should evacuation sites be located?
Which features are most alike? What does the spatial fabric of the data look like? | Grouping Analysis | Which crimes in the database are most similar to the one just committed? Are there distinct spatial regimes of test scores? Which regions are associated with high test scores and which with low test scores? Which disease incidents are likely part of the same outbreak based on space, time, and symptoms?
Which features are most similar or most dissimilar? | Similarity Search | Which locations have similar characteristics to those with my best performing stores? How do salaries for my employees compare to salaries for equivalent jobs in other cities most like mine? Which crimes in the database most closely match a particular crime of interest?

Assess Overall Spatial Patterns

Questions | Tools | Examples
Do spatial characteristics differ? | Spatial Autocorrelation (Global Moran's I) or Average Nearest Neighbor | Which types of crime are most spatially concentrated? Which plant species is most dispersed across the study area?
Is the spatial pattern changing over time? | Spatial Autocorrelation (Global Moran's I) or High/Low Clustering (Getis-Ord General G) | Are rich and poor becoming more or less spatially segregated? Is there an unexpected spike in pharmaceutical purchases? Is the disease remaining geographically fixed over time, or is it spreading to neighboring places? Are containment efforts effective?
Are the spatial processes similar? | Multi-Distance Spatial Cluster Analysis (Ripley's K Function) | Does the spatial pattern of the disease mirror the spatial pattern of the population at risk? Does the spatial pattern for commercial burglary deviate from the spatial pattern for commercial establishments?
Is the data spatially correlated? | Spatial Autocorrelation (Global Moran's I) | Do regression residuals exhibit statistically significant spatial autocorrelation?
At which distances is spatial clustering most pronounced? | Incremental Spatial Autocorrelation | Which distance best reflects an appropriate scale for my analysis?

Model Relationships

Questions | Tools | Examples
Is there a correlation? How strong is the relationship? Which variables are the most consistent predictors? Are the relationships consistent across the study area? | Ordinary Least Squares (OLS), Exploratory Regression, and Geographically Weighted Regression (GWR) | What is the relationship between educational attainment and income? Is the relationship consistent across the study area? Is there a positive relationship between vandalism and residential burglary? Which combinations of the candidate explanatory variables will yield properly specified regression models? Does illness increase with proximity to water features?
What factors might contribute to particular outcomes? Where else might there be a similar response? | Ordinary Least Squares (OLS), Exploratory Regression, and Geographically Weighted Regression (GWR) | What are the key variables that explain high forest fire frequency? What demographic characteristics contribute to high rates of public transportation usage? Which environments should be protected to encourage reintroduction of an endangered species?
Where will mitigation measures be most effective? | Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR) | Where do kids consistently turn in high test scores? What characteristics seem to be associated? Where is each characteristic most important? What factors are associated with a higher-than-expected proportion of traffic accidents? Which factors are the strongest predictors in each high-accident location?
How might the pattern change? What can be done to prepare? | Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR) | Where are the 911 call hot spots? Which variables effectively predict call volumes? Given future projections, what is the expected demand for emergency response resources?
Why is this location a hot spot? Why is this location a cold spot? | Hot Spot Analysis (Getis-Ord Gi*), Ordinary Least Squares (OLS), and Geographically Weighted Regression (GWR) | Why are cancer rates so high in particular areas? Why are literacy rates low in some regions? Are there places in the United States where people are persistently dying young? Why?

GIS offers many different approaches for analyzing spatial data. Sometimes visual analysis is sufficient: a map is created, and it reveals all the
information needed to make a decision. Other times, however, it is difficult to draw conclusions from a map alone. Cartographers make choices
when a map is constructed: which features are included or excluded, how features are symbolized, the classification thresholds that determine whether a feature appears bright red or a less-intense pink, how titles are worded, and so on. All these cartographic elements help
communicate the context and scope of the problem being analyzed, but they can also change the characteristics of what we see and,
consequently, can change our interpretation. Spatial statistics help cut through some of the subjectivity to get more directly at spatial patterns,
trends, processes, and relationships. When your analytic questions are especially difficult or the decisions made as a result of your analysis are
exceptionally critical, it is important to examine your data and the context of your problem from a variety of perspectives. Spatial statistics
offer powerful tools that can effectively supplement and enhance visual, cartographic, and traditional (nonspatial) statistical approaches to
spatial data analysis.

Additional resources
 Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. Esri Press, 2005.
 For a current list of free Esri Virtual Campus web seminars, tutorials, short videos, presentations, and articles, see Spatial Statistics
Resources.

Related Topics
An overview of the Spatial Statistics toolbox
Modeling spatial relationships
Regression analysis basics

Copyright © 1995-2014 Esri. All rights reserved.

Modeling spatial relationships


This document provides additional information about tool parameters but also introduces essential vocabulary and concepts that are important
when you analyze your data using the Spatial Statistics tools. Use this document as a reference when you need additional information about
tool parameters.

Note:  Calculations based on either Euclidean or Manhattan distance require projected data to
accurately measure distances. Consequently, whenever distance is a component of your
analysis, which is almost always the case with spatial statistics, project your data using a
Projected Coordinate System (rather than a Geographic Coordinate System based on degrees, minutes, and seconds).


 The tools in the Spatial Statistics toolbox will not work directly with XY Event Layers. Use the Copy Features tool to first convert the XY Event data into a feature class before you run your analysis.
 When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Conceptualization of spatial relationships


An important difference between spatial and traditional (aspatial or nonspatial) statistics is that spatial statistics integrate space and spatial
relationships directly into their mathematics. Consequently, many of the tools in the spatial statistics toolbox require the user to select a
value for the Conceptualization of Spatial Relationships parameter prior to analysis. Common conceptualizations include inverse distance,
travel time, fixed distance, K nearest neighbors, and contiguity. The conceptualization of spatial relationships you use will depend on what
you're measuring. If you're measuring clustering of a particular species of seed-propagating plant, for example, inverse distance is probably
most appropriate. However, if you are assessing the geographic distribution of a region's commuters, travel time or travel cost might be
better choices for describing those spatial relationships. For some analyses, space and time might be less important than more abstract
concepts such as familiarity (the more familiar something is, the more functionally near it is) or spatial interaction (there are many more
phone calls, for example, between Los Angeles and New York than between New York and a smaller town nearer to New York, like
Poughkeepsie; some might argue that Los Angeles and New York are functionally closer).
The Grouping Analysis tool contains a parameter called Spatial Constraints, and while the parameter options are similar to those described
for the Conceptualization of Spatial Relationships parameter, they are used differently. When a spatial constraint is imposed, only features
that share at least one neighbor (as defined by contiguity, nearest neighbor relationships, or triangulation methods) may belong to the
same group. Additional information and examples are included in How Grouping Analysis works.
Options for the Conceptualization of Spatial Relationships parameter are discussed below. The option you select determines neighbor
relationships for tools that assess each feature within the context of neighboring features. These tools include the Spatial Autocorrelation
(Global Moran's I), Hot Spot Analysis (Getis-Ord Gi*), and Cluster and Outlier Analysis (Anselin Local Moran's I) tools. Note that some of
these options are only available if you use the Generate Spatial Weights Matrix or Generate Network Spatial Weights tools.

Inverse distance, inverse distance squared (impedance)

With the Inverse Distance options, the conceptual model of spatial relationships is one of impedance, or distance decay. All features
impact/influence all other features, but the farther away something is, the smaller the impact it has. You will generally want to specify a
Distance Band or Threshold Distance value when you use an inverse distance conceptualization to reduce the number of required
computations, especially with large datasets. When no distance band or threshold distance is specified, a default threshold value is
computed for you. You can force all features to be a neighbor of all other features by setting Distance Band or Threshold Distance to
zero.
Inverse Euclidean distance is appropriate for modeling continuous data, like temperature variations, for example. Inverse Manhattan
distance might work best when analyses involve the locations of hardware stores or other fixed urban facilities, in the case where road
network data isn't available. The conceptual model when you use the Inverse Distance Squared option is the same as with Inverse
Distance except the slope is sharper, so neighbor influences drop off more quickly and only a target feature's closest neighbors will exert
substantial influence in computations for that feature.
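As a rough illustration of this distance-decay idea (plain Python, not the toolbox's internal implementation), inverse distance and inverse distance squared weights could be sketched as:

def inverse_distance_weight(d, power=1, threshold=None):
    """Distance-decay weight: 1/d when power is 1, 1/d**2 when power is 2.
    Features beyond the threshold get a weight of 0; a threshold of None
    means every feature influences every other feature."""
    if threshold is not None and d > threshold:
        return 0.0        # outside the distance band: no influence
    if d == 0:
        return 1.0        # coincident features: treated here as maximum influence (a simplification)
    return 1.0 / d ** power

# Influence drops off much faster with the squared (impedance) form.
print(inverse_distance_weight(5, power=1))    # 0.2
print(inverse_distance_weight(5, power=2))    # 0.04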

Distance band (sphere of influence)

For some tools, like Hot Spot Analysis, a fixed distance band is the default conceptualization of spatial relationships. With the Fixed
Distance Band option, you impose a sphere of influence, or moving window conceptual model of spatial interactions onto the data. Each
feature is analyzed within the context of those neighboring features located within the distance you specify for Distance Band or
Threshold Distance. Neighbors within the specified distance are weighted equally. Features outside the specified distance don't influence
calculations (their weight is zero). Use the Fixed Distance Band method when you want to evaluate the statistical properties of your data
at a particular (fixed) spatial scale. If you are studying commuting patterns and know that the average journey to work is 15 miles, for
example, you may want to use a 15-mile fixed distance for your analysis. See Selecting a fixed distance for strategies that can help you
identify an appropriate scale of analysis.

Zone of indifference


The Zone of Indifference option for the Conceptualization of Spatial Relationships parameter combines the Inverse Distance and Fixed
Distance Band models. Features within the distance band or threshold distance are included in analyses for the target feature. Once the
critical distance is exceeded, the level of influence (the weighting) quickly drops off. Suppose you're looking for a job and have the choice
between a job five miles away and another job six miles away. You probably won't think much about distance in making a decision about
which job to take. Now, suppose you have the choice between one job five miles away and another 20 miles away. In this case, distance
becomes more of an impedance and may be factored into your decision making. Use this method when you want to hold the scale of
analysis fixed but don't want to impose sharp boundaries on the neighboring features included in target feature computations.

Polygon contiguity (first order)


For polygon feature classes, you can choose CONTIGUITY_EDGES_ONLY (sometimes called the Rook's Case) or
CONTIGUITY_EDGES_CORNERS (sometimes referred to as Queen's Case). For EDGES_ONLY, polygons that share an edge (that have
coincident boundaries) are included in computations for the target polygon. Polygons that do not share an edge are excluded from the
target feature computations. For EDGES_CORNERS, polygons that share an edge and/or a corner will be included in computations for the
target polygon. If any portion of two polygons overlap, they are considered neighbors and will be included in each other's computations.
Use one of these contiguity conceptualizations with polygon features in cases where you are modeling some type of contagious process or
are dealing with continuous data represented as polygons.

K nearest neighbors
Neighbor relationships may also be constructed so that each feature is assessed within the spatial context of a specified number of its
closest neighbors. If K (the number of neighbors) is 8, then the eight closest neighbors to the target feature will be included in
computations for that feature. In locations where feature density is high, the spatial context of the analysis will be smaller. Similarly, in
locations where feature density is sparse, the spatial context for the analysis will be larger. An advantage to this model of spatial
relationships is that it ensures there will be some neighbors for every target feature, even when feature densities vary widely across the
study area. This method is available using the Generate Spatial Weights Matrix tool. The K_NEAREST_NEIGHBORS option with 8 for Number
of Neighbors is the default conceptualization used with Exploratory Regression to assess regression residuals.

Delaunay triangulation (natural neighbors)


The Delaunay Triangulation option constructs neighbors by creating Voronoi triangles from point features or from feature centroids such
that each point/centroid is a triangle node. Nodes connected by a triangle edge are considered neighbors. Using Delaunay triangulation
ensures every feature will have at least one neighbor even when data includes islands and/or widely varying feature densities. Do not use
the Delaunay Triangulation option when you have coincident features. This method is available using the Generate Spatial Weights Matrix
tool.

Space-Time window
With this option you define feature relationships in terms of both a space (fixed distance) and a time (fixed-time interval) window. This
option is available when you create a spatial weights matrix file using the Generate Spatial Weights Matrix tool. When you select
SPACE_TIME_WINDOW, you will also be required to specify a Date/Time Field, a Date/Time Interval Type (HOURS, DAYS, or MONTHS, for
example), and a Date/Time Interval Value. The interval value is an integer. If you selected HOURS for the Interval Type and a 3 for
Interval Value, for example, two features would be considered neighbors if the values in their Date/Time field were within three hours of
each other. With this conceptualization, features are neighbors if they fall within the specified distance and also fall within the specified
time interval of the target feature. As one possible example, you would select the SPACE_TIME_WINDOW Conceptualization of Spatial
Relationships if you wanted to create a spatial weights matrix file to use with Hot Spot Analysis in order to identify space-time hot
spots. Additional information, including how to visualize results, is presented in Space-Time Analysis.

Get spatial weights from file (user-defined spatial relationships)


You can create a file to store feature neighbor relationships using either the Generate Spatial Weights Matrix tool or the Generate
Network Spatial Weights tool. If you want to define spatial relationships using travel time or travel costs derived from a network dataset,
create a spatial weights matrix file using the Generate Network Spatial Weights tool, then use the resultant SWM file for your analyses. If
the spatial relationships for your features are defined in a table, use the Generate Spatial Weights Matrix tool to convert that table into a
spatial weights matrix (.swm) file. Particular fields should be included in your table in order to use the CONVERT_TABLE option to obtain an
SWM file. You can also provide a path to a formatted ASCII text file that defines your own custom conceptualization of spatial
relationships (based on spatial interaction, for example).

Selecting a conceptualization of spatial relationships: Best practices


The more realistically you can model how features interact with each other in space, the more accurate your results will be. Your choice
for the Conceptualization of Spatial Relationships parameter should reflect inherent relationships among the features you are analyzing.
Sometimes your choice will also be influenced by characteristics of your data.
The inverse distance methods, for example, are most appropriate with continuous data or to model processes where the closer two
features are in space, the more likely they are to interact/influence each other. With this spatial conceptualization, every feature is
potentially a neighbor of every other feature, and with large datasets, the number of computations involved will be enormous. You should always try to include a Distance Band or Threshold Distance value when using the inverse distance conceptualizations. This is particularly
important for large datasets. If you leave the Distance Band or Threshold Distance parameter blank, a threshold distance will be
computed for you, but this may not be the most appropriate distance for your analysis; the default distance threshold will be the
minimum distance that ensures every feature has at least one neighbor.
The fixed distance band method works well for point data. It is the default option used by the Hot Spot Analysis (Getis-Ord Gi*) tool. It
is often a good option for polygon data when there is a large variation in polygon size (very large polygons at the edge of the study area
and very small polygons at the center of the study area, for example), and you want to ensure a consistent scale of analysis. See
Selecting a fixed distance below for strategies to help you determine an appropriate distance band value for your analysis.
The zone of indifference conceptualization works well when Fixed Distance is appropriate but imposing sharp boundaries on
neighborhood relationships is not an accurate representation of your data. Keep in mind that the Zone of Indifference conceptual model
considers every feature to be a neighbor of every other feature. Consequently, this option is not appropriate for large datasets since the
Distance Band or Threshold Distance value supplied does not limit the number of neighbors but only specifies where the intensity of
spatial relationships begins to wane.
The polygon contiguity conceptualizations are effective when polygons are similar in size and distribution, and when spatial
relationships are a function of polygon proximity (the idea that if two polygons share a boundary, spatial interaction between them
increases). When you select a polygon contiguity conceptualization, you will almost always want to select row standardization for tools
that have the Row Standardization parameter.
The K nearest neighbors option is effective when you want to ensure you have a minimum number of neighbors for your analysis.
Especially when the values associated with your features are skewed (are not normally distributed), it is important that each feature is
evaluated within the context of at least eight or so neighbors (this is a rule of thumb only). When the distribution of your data varies
across your study area so that some features are far away from all other features, this method works well. Note, however, that the
spatial context of your analysis changes depending on variations in the sparsity/density of your features. When fixing the scale of analysis
is less important than fixing the number of neighbors, the K nearest neighbors method is appropriate.
Some analysts consider Delaunay triangulation a way to construct natural neighbors for a set of features. This method is a good option
when your data includes island polygons (isolated polygons that do not share any boundaries with other polygons) or in cases where
there is a very uneven spatial distribution of features. It is not appropriate when you have coincident features, however. Similar to the K
nearest neighbors method, Delaunay triangulation ensures every feature has at least one neighbor but uses the distribution of the data
itself to determine how many neighbors each feature gets.
The Space-Time Window options allow you to define feature relationships in terms of both their spatial and their temporal proximity.
You would use this option if you wanted to identify space-time hot spots, or construct groups where membership was constrained by
space and time proximity. Examples of space-time analysis as well as strategies for effectively rendering the results from this type of
analysis are provided in Space-Time Analysis.
For some applications, spatial interaction is best modeled in terms of travel time or travel distance. If you are modeling accessibility to
urban services, for example, or looking for urban crime hot spots, modeling spatial relationships in terms of a network is a good option.
Use the Generate Network Spatial Weights tool to create a spatial weights matrix file (.swm) prior to analysis; select
GET_SPATIAL_WEIGHTS_FROM_FILE for your Conceptualization of Spatial Relationships value, then, for the Weights Matrix File parameter,
provide the full path to the SWM file you created.
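A hedged arcpy sketch of that workflow follows; the dataset paths, field names, and impedance attribute are hypothetical, and the positional parameter order and keyword values shown should be verified against the Generate Network Spatial Weights and Hot Spot Analysis tool reference pages for your release:

import arcpy

arcpy.CheckOutExtension("Network")   # the network weights tool requires Network Analyst

crime_fc = r"C:\data\city.gdb\crime_incidents"
network = r"C:\data\city.gdb\streets\streets_ND"
swm_file = r"C:\data\crime_drivetime.swm"
hotspot_fc = r"C:\data\city.gdb\crime_hotspots"

# Build feature relationships from network travel cost rather than straight-line
# distance (only the leading parameters are shown; the remaining optional
# parameters are left at their defaults).
arcpy.GenerateNetworkSpatialWeights_stats(crime_fc, "MYID", swm_file,
                                          network, "TravelTime")

# Point Hot Spot Analysis at the .swm file by choosing
# GET_SPATIAL_WEIGHTS_FROM_FILE for the conceptualization parameter.
arcpy.HotSpots_stats(crime_fc, "INCIDENT_COUNT", hotspot_fc,
                     "GET_SPATIAL_WEIGHTS_FROM_FILE", "EUCLIDEAN_DISTANCE",
                     "ROW", "", "", swm_file)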

Tip: ESRI Data & Maps, free to ArcGIS users, contains StreetMap data including a prebuilt network
dataset in SDC format. The coverage for this dataset is the United States and Canada. These
network datasets can be used directly by the Generate Network Spatial Weights tool.

If none of the options for the Conceptualization of Spatial Relationships parameter work well for your analysis, you can create an ASCII
text file or table with the feature-to-feature relationships and use these to build a spatial weights matrix file. If one of the options above
is close, but not perfect for your purposes, you can use the Generate Spatial Weights Matrix tool to create a basic SWM file, then edit
your spatial weights matrix file.

Selecting a fixed-distance band value


Think of the fixed distance band you select as a moving window that momentarily settles on top of each feature and looks at that
feature within the context of its neighbors. There are several guidelines to help you identify an appropriate distance band for analysis:
 Select a distance based on what you know about the geographic extent of the spatial processes promoting clustering for the
phenomena you are studying. Often, you won't know this, but if you do, you should use your knowledge to select a distance value.
Suppose, for example, you know that the average journey-to-work commute distance is 15 miles. Using 15 miles for the distance
band is a good strategy for analyzing commuting data.
 Use a distance band that is large enough to ensure all features will have at least one neighbor, or results will not be valid. Especially
if the input data is skewed (does not create a nice bell curve when you plot the values as a histogram), you will want to make sure
that your distance band is neither too small (most features have only one or two neighbors) nor too large (several features include all
other features as neighbors), because that would make resultant z-scores less reliable. The z-scores are reliable (even with skewed
data) as long as the distance band is large enough to ensure several neighbors (approximately eight) for each feature. Even if none of
the features have all other features as neighbors, performance issues and even potential memory limitations can result if you create a
distance band where features have thousands of neighbors.
 Sometimes ensuring all features have at least one neighbor results in some features having many thousands of neighbors, and this is
not ideal. This can happen when some of your features are spatial outliers. To resolve this problem, determine an appropriate
distance band for all but the spatial outliers, and use the Generate Spatial Weights Matrix tool to create a spatial weights matrix file
using that distance. When you run the Generate Spatial Weights Matrix tool, however, specify a minimum number of neighbors value
for the Number of Neighbors parameter. Example: Suppose you are evaluating access to healthy food in Los Angeles County using
census tract data. You know that more than 90 percent of the population live within three miles of shopping opportunities. If you are
analyzing census tracts you will find that distances between tracts (based on tract centroids) in the downtown region are about 1,000
meters on average, but distances between tracts in outlying areas are more than 18,000 meters. To ensure every feature has at least
one neighbor, your distance band would need to be more than 18,000 meters, and this scale of analysis (distance) is not appropriate
for the questions you are asking. The solution is to create a spatial weights matrix file for the census tract feature class using the
Generate Spatial Weights Matrix tool. Specify a Threshold Distance of about 4,800 meters (approximately three miles) and a minimum number of neighbors value (let's say 2) for the Number of Neighbors parameter. This will apply the 4,800 meter fixed-distance neighborhood to all features except those that do not have at least two neighbors using that distance. For those outlier features (and only for those outlier features), the distance will be expanded just far enough to ensure every feature has at least two neighbors. (A scripted sketch of this workflow follows this list.)


 Use a distance band that reflects maximum spatial autocorrelation. Whenever you see spatial clustering on the landscape, you are
seeing evidence of underlying spatial processes at work. The distance band that exhibits maximum clustering, as measured by the
Incremental Spatial Autocorrelation tool, is the distance where those spatial processes are most active, or most pronounced. Run the
Incremental Spatial Autocorrelation tool and note where the resulting z-scores seem to peak. Use the distance associated with the
peak value for your analysis.

Note: Distance values should be entered using the same units as specified by the geoprocessing
environment output coordinate system.

 Every peak represents a distance where the processes promoting spatial clustering are pronounced. Multiple peaks are common.
Generally, the peaks associated with larger distances reflect broad trends (a broad east-to-west trend, for example, where the
west is a giant hot spot and the east is a giant cold spot); generally, you will be most interested in peaks associated with
smaller distances, often the first peak.
 An inconspicuous peak often means there are many different spatial processes operating at a variety of spatial scales. You
probably want to look for other criteria to determine which fixed distance to use for your analysis (perhaps the most effective
distance for remediation).
 If the z-score never peaks (in other words, it just keeps increasing) and if you are using aggregated data (for example,
counties), it usually means the aggregation scheme is too coarse; the spatial processes of interest are operating at a scale that
is smaller than the scale of your aggregation units. If you can move to a smaller scale of analysis (moving from counties to
tracts, for example), this may help find a peak distance. If you are working with point data and the z-score never peaks, it
means there are many different spatial processes operating at a variety of spatial scales and you will likely need to come up
with different criteria for determining the fixed distance to use in your analysis. You will also want to check that your Beginning
Distance when you run the Incremental Spatial Autocorrelation tool isn't too large.
 If you do not specify a beginning distance, the Incremental Spatial Autocorrelation tool will use the distance that ensures all
features have at least one neighbor. If your data includes spatial outliers, that distance might be too large for your analysis,
however, and may be the reason you do not see a pronounced peak in the Output Report File. The solution is to run the
Incremental Spatial Autocorrelation tool on a selection set that temporarily excludes all spatial outliers. If a peak is found with
the outliers excluded, use the strategy outlined above with that peak distance applied to all of your features (including the
spatial outliers), and force each feature to have at least one or two neighbors. If you're not sure if any of your features are
spatial outliers:
 For polygon data, render polygon areas using a Standard Deviation rendering scheme and consider polygons with areas
that are greater than three standard deviations to be spatial outliers. You can use Calculate Field or the Geometry
Calculator to create a field with polygon areas if you don't already have one.
 For point data, use the Near tool to compute each feature's nearest neighbor distance. To do this, set both the Input
Features and Near Features to your point dataset. Once you have a field with nearest neighbor distances, render those
values using a Standard Deviation rendering scheme and consider distances that are greater than three standard
deviations to be spatial outliers.

 Try not to get stuck on the idea that there is only one correct distance band. Reality is never that simple. Most likely, there are
multiple/interacting spatial processes promoting observed clustering. Rather than thinking you need one distance band, think of the
pattern analysis tools as effective methods for exploring spatial relationships at multiple spatial scales. Consider that when you
change the scale (change the distance band value), you could be asking a different question. Suppose you are looking at income
data. With small distance bands, you can examine neighborhood income patterns, middle scale distances might reflect community or
city income patterns, and the largest distance bands would highlight broad regional income patterns.
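Continuing the Los Angeles census tract example above, a hedged arcpy sketch of that fix (the paths and ID field are hypothetical, and the positional parameter order and keyword values should be checked against the Generate Spatial Weights Matrix tool reference):

import arcpy

tracts_fc = r"C:\data\la_county.gdb\census_tracts"
swm_file = r"C:\data\tracts_fixed_4800m.swm"

# Fixed-distance neighborhoods of roughly three miles (4,800 meters), with a
# minimum of 2 neighbors so spatial outliers still get a valid neighborhood.
arcpy.GenerateSpatialWeightsMatrix_stats(tracts_fc, "TRACT_ID", swm_file,
                                         "FIXED_DISTANCE",      # conceptualization
                                         "EUCLIDEAN",           # distance method
                                         1,                     # exponent (not used here)
                                         4800,                  # threshold distance
                                         2,                     # minimum number of neighbors
                                         "ROW_STANDARDIZATION")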

Distance method
Many of the tools in the Spatial Statistics toolbox use distance in their calculations. These tools provide you with the choice of either
Euclidean or Manhattan distance.
 Euclidean distance is calculated as

D = sqrt((x1 - x2)**2 + (y1 - y2)**2)

where (x1,y1) is the coordinate for point A, (x2,y2) is the coordinate for point B, and D is the straight-line distance between points A and B.


 Manhattan distance is calculated as

D = abs(x1 - x2) + abs(y1 - y2)

where (x1,y1) is the coordinate for point A, (x2,y2) is the coordinate for point B, and D is the vertical plus horizontal difference between
points A and B. It is the distance you must travel if you are restricted to north–south and east–west travel only. This method is generally
more appropriate than Euclidean distance when travel is restricted to a street network and where actual street network travel costs are not
available.
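Both formulas translate directly into code; a small self-contained sketch (plain Python, independent of the toolbox):

import math

def euclidean_distance(x1, y1, x2, y2):
    """Straight-line distance between point A (x1, y1) and point B (x2, y2)."""
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

def manhattan_distance(x1, y1, x2, y2):
    """Horizontal plus vertical distance, as if travel were restricted to a street grid."""
    return abs(x1 - x2) + abs(y1 - y2)

# Both assume projected coordinates (meters or feet, for example), not degrees.
print(euclidean_distance(0, 0, 3, 4))    # 5.0
print(manhattan_distance(0, 0, 3, 4))    # 7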

When your input features are not projected (i.e., when coordinates are given in degrees, minutes, and seconds) or when the output
coordinate system is set to a Geographic Coordinate System, or when you specify an output feature class path to a feature dataset that has
a Geographic Coordinate System spatial reference, distances will be computed using chordal measurements and the Distance Method
parameter will be disabled. Chordal distance measurements are used because they can be computed quickly and provide very good
estimates of true geodesic distances, at least for points within about thirty degrees of each other. Chordal distances are based on a sphere
rather than the true oblate ellipsoid shape of the earth. Given any two points on the earth's surface, the chordal distance between them is
the length of a line, passing through the three dimensional earth, to connect those two points. Chordal distances are reported in meters.

Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal distances
are not a good estimate of geodesic distances beyond 30 degrees.

Self-potential (field giving intrazonal weight)


Several tools in the Spatial Statistics toolbox allow you to provide a field representing the weight to use for self-potential. Self-potential is
the distance or weight between a feature and itself. Often, this weight is zero, but in some cases, you may want to specify another fixed
value or a different value for every feature. If your conceptualization of spatial relationships is based on distances traveled within and
among census tracts, for example, you might decide to model self-potential to reflect average intrazonal travel costs based on polygon size:

dii = 0.5 * (Ai / π)**0.5

where dii is the travel cost associated with intrazonal travel for polygon feature i, and Ai is the area associated with polygon feature i.
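For example, this intrazonal weight can be computed per polygon as follows (a sketch; the area units should match the distance units used elsewhere in the analysis):

import math

def intrazonal_weight(polygon_area):
    """Self-potential dii = 0.5 * sqrt(Ai / pi): half the radius of a circle having
    the same area as the polygon, used as an average within-zone travel distance."""
    return 0.5 * math.sqrt(polygon_area / math.pi)

# A tract covering 9,000,000 square meters gets an intrazonal distance of about 846 meters.
print(intrazonal_weight(9000000))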

Standardization
Row standardization is recommended whenever the distribution of your features is potentially biased due to sampling design or an imposed
aggregation scheme. When row standardization is selected, each weight is divided by its row sum (the sum of the weights of all neighboring
features). Row standardized weighting is often used with fixed distance neighborhoods and almost always used for neighborhoods based on
polygon contiguity. This is to mitigate bias due to features having different numbers of neighbors. Row standardization will scale all weights
so they are between 0 and 1, creating a relative, rather than absolute, weighting scheme. Anytime you are working with polygon features
representing administrative boundaries, you will likely want to choose the Row Standardization option.
Examples:
 Suppose you have ALL crime incidents. In some parts of your study area there are lots of points because those are places with lots of
crime. In other parts, there are few points, because those are low crime areas. The density of the points is a very good reflection (is
representative) of what you're trying to understand: crime spatial patterns. You probably would not Row Standardize your spatial
weights.
 Suppose you've taken soil samples. For some reason (the weather was nice or you happened to be in a location where you didn't have
to climb fences, swim through swamps, or hike to the top of a mountain), you have lots of samples in some parts of the study area, but
fewer in others. In other words, the density of your points is not strictly the result of a carefully planned random sample; some of your
own biases may have been introduced. Further, where you have more points is not necessarily a reflection of the underlying spatial
distribution of the data you're analyzing. To help minimize any bias that may have been introduced during the sampling process, you
will want to Row Standardize your spatial weights. When you row standardize, the fact that one feature has two neighbors and another
has 18 doesn't have a big impact on results; all the weights sum to 1.
 Whenever you aggregate your data, you are imposing a structure on it. Rarely will that structure be a good reflection of the data you
are analyzing and the questions you are asking. For example, while census polygons (like census tracts) are designed around
population, even if your analysis involves population-related questions, you will still likely row standardize your weights because those
polygons represent just one of many ways they could have been drawn. With polygon data you will almost always want to Row


Standardize your spatial weights.
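A tiny sketch of what row standardization does to one feature's neighbor weights (illustrative only):

def row_standardize(neighbor_weights):
    """Divide each neighbor weight by the row sum so the row's weights sum to 1."""
    total = float(sum(neighbor_weights.values()))
    return {nid: w / total for nid, w in neighbor_weights.items()}

# A feature with two neighbors and a feature with four neighbors both end up with
# rows summing to 1, so differing neighbor counts no longer dominate the result.
print(row_standardize({2: 1.0, 7: 1.0}))
print(row_standardize({1: 1.0, 3: 1.0, 5: 1.0, 9: 1.0}))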

Distance band or threshold distance


Distance Band or Threshold Distance sets the scale of analysis for most conceptualizations of spatial relationships (for example, Inverse
Distance, Fixed Distance Band). It is a positive numeric value representing a cutoff distance. Features outside the specified cutoff for a
target feature are ignored in the analysis for that feature. With Zone of Indifference, however, the influence of features outside the given
distance is reduced in relation to proximity, while those inside the distance threshold are equally considered.
Choosing an appropriate distance is important. Some spatial statistics require each feature to have at least one neighbor for the analysis to
be reliable. If the value you set for Distance Band or Threshold Distance is too small (so that some features have no neighbors), a warning
message appears suggesting that you try again with a larger distance value. The Calculate Distance Band from Neighbor Count tool will
evaluate minimum, average, and maximum distances for a specified number of neighbors and can help you determine an appropriate
distance band value to use for analysis. See also Selecting a fixed distance band value for additional guidelines.
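For example, a hedged arcpy sketch of asking that tool for the distances needed to give every feature eight neighbors (the input path is hypothetical, and the order of the derived outputs should be verified against the tool reference page):

import arcpy

result = arcpy.CalculateDistanceBand_stats(
    r"C:\data\city.gdb\crime_incidents",   # input features (hypothetical)
    8,                                     # number of neighbors
    "EUCLIDEAN_DISTANCE")

# The tool reports minimum, average, and maximum neighbor distances as derived outputs.
min_d, avg_d, max_d = [result.getOutput(i) for i in range(3)]
print(min_d, avg_d, max_d)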
When no value is specified, a default threshold distance is computed. The table below indicates how different choices for the
Conceptualization of Spatial Relationships parameter behave for each of three possible input types (negative values are not valid):

Value | Inverse Distance, Inverse Distance Squared | Fixed Distance Band, Zone of Indifference | Polygon Contiguity, Delaunay Triangulation, K Nearest Neighbors
0 | No threshold or cutoff is applied; every feature is a neighbor of every other feature. | Invalid. A runtime error will be generated. | Ignored.
blank | A default distance will be computed. This default will be the minimum distance to ensure that every feature has at least one neighbor. | A default distance will be computed. This default will be the minimum distance to ensure that every feature has at least one neighbor. | Ignored.
positive number | The nonzero, positive value specified will be used as a cutoff distance; neighbor relationships will only exist among features within this distance of each other. | For Fixed Distance Band, only features within this specified cutoff of each other will be neighbors. For Zone of Indifference, features within this specified cutoff of each other will be neighbors; features outside the cutoff are neighbors too, but they are assigned a smaller and smaller weight/influence as distance increases. | Ignored.

Distance band options

Number of neighbors
Specify a positive integer to represent the number of neighbors to include in the analysis for each target feature. When the value chosen for
the Conceptualization of Spatial Relationships parameter is K Nearest Neighbors, each target feature will be evaluated within the context of
the closest K features (where K is the number of neighbors specified). For Inverse Distance or Fixed Distance Band, when you run the
Generate Spatial Weights Matrix tool, specifying a value for the Number of Neighbors parameter will ensure that each feature has a
minimum of K neighbors. For the polygon contiguity methods, any feature that does not have the Number of Neighbors specified will get
additional neighbors based on feature centroid proximity. For the Generate Network Spatial Weights tool, specifying a value for the
Maximum Number of Neighbors parameter will ensure no feature has more than the value specified. For the Grouping Analysis tool,
providing a value for the Number of Neighbors encourages feature proximity within each group. Specifying 6 neighbors, for example, will
limit groups to features sharing at least one of six nearest neighbors to other features in the group.

Weights matrix file


Several tools allow you to define spatial relationships among features by providing a path to a spatial weights matrix file. Spatial weights
are numbers that reflect the distance, time, or other cost between each feature and every other feature in the dataset. The spatial weights
matrix file may be created using the Generate Spatial Weights Matrix tool or Generate Network Spatial Weights tool, or it may be a simple
ASCII file.
When the spatial weights matrix file is a simple ASCII text file, the first line should be the name of a unique ID field. This gives you the
flexibility to use any numeric field in your dataset as the ID when generating this file; however, the ID field must be type INTEGER and
have unique values for every feature. After the first line, the spatial weights file should be formatted into three columns:
 From feature ID
 To feature ID
 Weight
For example, suppose you have three gas stations. The field you are using as the ID field is called StationID, and the feature IDs are 1, 2,
and 3. You want to model spatial relationships among these three gas stations using travel time in minutes. You could create an ASCII file
that might look like the following:
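(The entries below are reconstructed from the travel times described in the next paragraph; any station pairs not mentioned there would be listed in the same three-column from-ID, to-ID, weight format.)

StationID
1 1 0
1 2 0.1
1 3 0.142857
3 1 0.166667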

Generally, when weights represent distance or time, they are inverted (for example, 1/10 when the distance is 10 miles or 10 minutes) so
that nearer features have a larger weight than features that are farther away. Notice from the weights above that gas station 1 is 10
minutes from gas station 2. Notice also that travel time is not symmetrical in this example (traveling from gas station 1 to gas station 3 is 7
minutes, but traveling from gas station 3 to gas station 1 is only 6 minutes). Notice that the weight between gas station 1 and itself is 0
and that there is no entry for gas station 2 to itself. Missing entries are assumed to have a weight of 0.
Typing the values for the spatial weights matrix file can be a tedious job at best, even for small datasets. A better approach is to use the
Generate Spatial Weights Matrix tool or to write a quick Python script to perform this task for you.
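As a minimal sketch of such a script (the file path, field name, and travel times below are hypothetical), the ASCII format described above can be written with a few lines of Python:

# Write an ASCII spatial weights file: the first line is the unique ID field name,
# followed by from-ID, to-ID, weight on each subsequent line.
travel_minutes = {
    (1, 2): 10, (2, 1): 10,
    (1, 3): 7,  (3, 1): 6,
    (2, 3): 15, (3, 2): 15,   # hypothetical values
}

with open(r"C:\data\station_weights.txt", "w") as f:
    f.write("StationID\n")
    for (from_id, to_id), minutes in sorted(travel_minutes.items()):
        weight = 1.0 / minutes if minutes > 0 else 0.0   # invert so nearer features weigh more
        f.write("{0} {1} {2:.6f}\n".format(from_id, to_id, weight))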


Spatial weights matrix file (.swm)


The Generate Spatial Weights Matrix or Generate Network Spatial Weights tool will create a spatial weights matrix file (.swm) defining the
spatial relationships among all the features in your dataset based on the parameters you specify. This file is created in binary file format so
the values in the file cannot be viewed directly. To view or edit the feature relationships in an SWM file, use the Convert Spatial Weights Matrix to Table tool.
When the spatial relationships among features are stored in a table, you may use the Generate Spatial Weights Matrix tool to convert that
table into a spatial weights matrix file (.swm). The table will need the following fields:

<Unique ID field name>: An integer field that exists in the input feature class with a unique ID for each feature. This is the from feature ID.

NID: An integer field containing neighbor feature IDs. This is the to feature ID.

WEIGHT: The numeric weight quantifying the spatial relationship between the from and to features. Larger values reflect bigger weights and stronger influence, or interaction, between two features.

Required Table Fields
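A hedged sketch of the round trip between an SWM file and a table is shown below. The tool calls follow the naming pattern used elsewhere in this document, but the file and field names are hypothetical and the exact parameter order should be verified against the tool reference for your ArcGIS release.

import arcpy
arcpy.env.workspace = r"C:\data"

# SWM -> table, so the feature relationships can be inspected or edited
arcpy.ConvertSpatialWeightsMatrixtoTable_stats("euclidean6Neighs.swm", "weights_table.dbf")

# Table -> SWM, using the CONVERT_TABLE conceptualization
# (positional placeholders "#" skip the distance-related parameters)
arcpy.GenerateSpatialWeightsMatrix_stats("myFeatures.shp", "MyID",
                                         "fromTable.swm", "CONVERT_TABLE",
                                         "#", "#", "#", "#",
                                         "ROW_STANDARDIZATION",
                                         "weights_table.dbf")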

Sharing spatial weights matrix files


The output from the Generate Spatial Weights Matrix and Generate Network Spatial Weights tools is an SWM file. This file is tied to the
input feature class, the unique ID field, and the output coordinate system settings when the SWM file was created. Other people can
duplicate the spatial relationships you define for analysis by using your SWM file and either the same input feature class, or a feature
class linking all or a subset of the features to a matching Unique ID field. Especially if you plan to share your SWM files with others, try to
avoid the situation where your output coordinate system differs from the spatial reference associated with your input feature class. A
better strategy is to project the input feature class, then set the output coordinate system to Same as Input Feature Class prior to
creating spatial weights matrix files.
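For example, that workflow can be sketched roughly as follows (the dataset, output name, and coordinate system WKID are hypothetical):

import arcpy

# Project the input feature class first...
sr = arcpy.SpatialReference(26910)   # e.g., NAD 1983 UTM Zone 10N (hypothetical choice)
arcpy.Project_management(r"C:\data\tracts.shp", r"C:\data\tracts_utm.shp", sr)

# ...then keep the Output Coordinate System environment consistent with the
# (now projected) input before building the spatial weights matrix file.
arcpy.env.outputCoordinateSystem = sr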

Related Topics
Spatial Statistics toolbox sample applications
What is a z-score? What is a p-value?

Copyright © 1995-2014 Esri. All rights reserved.

What is a z-score? What is a p-value?

Locate topic

Most statistical tests begin by identifying a null hypothesis. The null hypothesis for the pattern analysis tools (Analyzing Patterns toolset and
Mapping Clusters toolset) is Complete Spatial Randomness (CSR), either of the features themselves or of the values associated with those
features. The z-scores and p-values returned by the pattern analysis tools tell you whether you can reject that null hypothesis or not. Often,
you will run one of the pattern analysis tools, hoping that the z-score and p-value will indicate that you can reject the null hypothesis, because
it would indicate that rather than a random pattern, your features (or the values associated with your features) exhibit statistically significant
clustering or dispersion. Whenever you see spatial structure such as clustering in the landscape (or in your spatial data), you are seeing
evidence of some underlying spatial processes at work, and as a geographer or GIS analyst, this is often what you are most interested in.
The p-value is a probability. For the pattern analysis tools, it is the probability that the observed spatial pattern was created by some random
process. When the p-value is very small, it means it is very unlikely (small probability) that the observed spatial pattern is the result of random
processes, so you can reject the null hypothesis. You might ask: How small is small enough? Good question. See the table and discussion
below.
Z-scores are standard deviations. If, for example, a tool returns a z-score of +2.5, you would say that the result is 2.5 standard deviations.
Both z-scores and p-values are associated with the standard normal distribution as shown below.

Very high or very low (negative) z-scores, associated with very small p-values, are found in the tails of the normal distribution. When you run
a feature pattern analysis tool and it yields small p-values and either a very high or a very low z-score, this indicates it is unlikely that the
observed spatial pattern reflects the theoretical random pattern represented by your null hypothesis (CSR).
To reject the null hypothesis, you must make a subjective judgment regarding the degree of risk you are willing to accept for being wrong (for
falsely rejecting the null hypothesis). Consequently, before you run the spatial statistic, you select a confidence level. Typical confidence levels are 90, 95, or 99 percent. A confidence level of 99 percent would be the most conservative in this case, indicating that you are unwilling to
reject the null hypothesis unless the probability that the pattern was created by random chance is really small (less than a 1 percent
probability).

Confidence Levels
The table below shows the uncorrected critical p-values and z-scores for different confidence levels.

Note: Tools that allow you to apply the False Discovery Rate (FDR) will use corrected critical p-values.
Those critical values will be the same or smaller than those shown in the table below.

z-score (Standard Deviations) p-value (Probability) Confidence level

< -1.65 or > +1.65 < 0.10 90%

< -1.96 or > +1.96 < 0.05 95%

< -2.58 or > +2.58 < 0.01 99%

Consider an example. The critical z-score values when using a 95 percent confidence level are -1.96 and +1.96 standard deviations. The
uncorrected p-value associated with a 95 percent confidence level is 0.05. If your z-score is between -1.96 and +1.96, your uncorrected p-
value will be larger than 0.05, and you cannot reject your null hypothesis because the pattern exhibited could very likely be the result of
random spatial processes. If the z-score falls outside that range (for example, -2.5 or +5.4 standard deviations), the observed spatial
pattern is probably too unusual to be the result of random chance, and the p-value will be small to reflect this. In this case, it is possible to
reject the null hypothesis and proceed with figuring out what might be causing the statistically significant spatial structure in your data.
A key idea here is that the values in the middle of the normal distribution (z-scores like 0.19 or -1.2, for example) represent the expected
outcome. When the absolute value of the z-score is large and the probabilities are small (in the tails of the normal distribution), however,
you are seeing something unusual and generally very interesting. For the Hot Spot Analysis tool, for example, unusual means either a
statistically significant hot spot or a statistically significant cold spot.
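The relationship between z-scores and two-tailed p-values under the standard normal distribution can be checked with a few lines of Python (illustrative only; the tools report these values for you):

import math

def two_tailed_p(z):
    # Standard normal cumulative distribution function via the error function
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

for z in (1.65, 1.96, 2.58):
    print("z = %.2f  ->  p = %.3f" % (z, two_tailed_p(z)))
# Roughly 0.10, 0.05, and 0.01: the uncorrected critical values for the
# 90, 95, and 99 percent confidence levels listed in the table above.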

FDR Correction
The local spatial pattern analysis tools, including Hot Spot Analysis and Cluster and Outlier Analysis (Anselin Local Moran's I), provide an
optional Boolean parameter Apply False Discovery Rate (FDR) Correction. When this parameter is checked, the False Discovery Rate (FDR)
procedure will potentially reduce the critical p-value thresholds shown in the table above in order to account for multiple testing and spatial
dependency. The reduction, if any, is a function of the number of input features and the neighborhood structure employed.
Local spatial pattern analysis tools work by considering each feature within the context of neighboring features and determining if the local
pattern (a target feature and its neighbors) is statistically different from the global pattern (all features in the dataset). The z-score and p-
value results associated with each feature determines if the difference is statistically significant or not. This analytical approach creates
issues with both multiple testing and dependency.
Multiple Testing—With a confidence level of 95 percent, probability theory tells us that there are 5 out of 100 chances that a spatial pattern
could appear structured (clustered or dispersed, for example) and could be associated with a statistically significant p-value, when in fact
the underlying spatial processes promoting the pattern are truly random. We would falsely reject the CSR null hypothesis in these cases
because of the statistically significant p-values. Five chances out of 100 seems quite conservative until you consider that local spatial
statistics perform a test for every feature in the dataset. If there are 10,000 features, for example, we might expect as many as 500 false
results.
Spatial Dependency—Features near to each other tend to be similar; more often than not spatial data exhibits this type of dependency.
Nonetheless, many statistical tests require features to be independent. For local pattern analysis tools this is because spatial dependency
can artificially inflate statistical significance. Spatial dependency is exacerbated with local pattern analysis tools because each feature is
evaluated within the context of its neighbors, and features that are near each other will likely share many of the same neighbors. This
overlap accentuates spatial dependency.
There are at least three approaches for dealing with both the multiple test and spatial dependency issues. The first approach is to ignore the
problem on the basis that the individual test performed for each feature in the dataset should be considered in isolation. With this approach,
however, it is very likely that some statistically significant results will be incorrect (appear to be statistically significant when in fact the
underlying spatial processes are random). The second approach is to apply a classical multiple testing procedure such as the Bonferroni or
Sidak corrections. These methods are typically too conservative, however. While they will greatly reduce the number of false positives they
will also miss finding statistically significant results when they do exist. A third approach is to apply the FDR correction which estimates the
number of false positives for a given confidence level and adjusts the critical p-value accordingly. For this method statistically significant p-
values are ranked from smallest (strongest) to largest (weakest), and based on the false positive estimate, the weakest are removed from
this list. The remaining features with statistically significant p-values are identified by the Gi_Bin or COType fields in the output feature
class. While not perfect, empirical tests show this method performs much better than assuming that each local test is performed in
isolation, or applying the traditional, overly conservative, multiple test methods. The additional resources section provides more information
about the FDR correction.
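The general idea can be illustrated with a Benjamini-Hochberg style calculation; this is a conceptual sketch only, not the exact procedure the tools implement, and the p-values below are made up:

import numpy as np

def fdr_threshold(p_values, alpha=0.05):
    """Largest p-value that survives a Benjamini-Hochberg FDR correction (or None)."""
    p = np.sort(np.asarray(p_values, dtype=float))
    m = len(p)
    ranks = np.arange(1, m + 1)
    passing = p <= (ranks / float(m)) * alpha
    return p[passing].max() if passing.any() else None

# Seven hypothetical local p-values; five are significant before correction
p_vals = [0.001, 0.008, 0.02, 0.04, 0.049, 0.12, 0.35]
print(fdr_threshold(p_vals))   # 0.02 -- the corrected critical p-value is smaller than 0.05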

The Null Hypothesis and Spatial Statistics


Several statistics in the Spatial Statistics toolbox are inferential spatial pattern analysis techniques including Spatial Autocorrelation (Global
Moran's I), Cluster and Outlier Analysis (Anselin Local Moran's I), and Hot Spot Analysis (Getis-Ord Gi*). Inferential statistics are grounded
in probability theory. Probability is a measure of chance, and underlying all statistical tests (either directly or indirectly) are probability
calculations that assess the role of chance on the outcome of your analysis. Typically, with traditional (nonspatial) statistics, you work with
a random sample and try to determine the probability that your sample data is a good representation (is reflective) of the population at
large. As an example, you might ask "What are the chances that the results from my exit poll (showing candidate A will beat candidate B by
a slim margin) will reflect final election results?" But with many spatial statistics, including the spatial autocorrelation type statistics listed
above, very often you are dealing with all available data for the study area (all crimes, all disease cases, attributes for every census block,
and so on). When you compute a statistic for the entire population, you no longer have an estimate at all. You have a fact. Consequently, it
makes no sense to talk about likelihood or probabilities anymore. So how can the spatial pattern analysis tools, often applied to all data in
the study area, legitimately report probabilities? The answer is that they can do this by postulating, via the null hypothesis, that the data is,
in fact, part of some larger population. Consider this in more detail.
The Randomization Null Hypothesis—Where appropriate, the tools in the Spatial Statistics toolbox use the randomization null hypothesis as
the basis for statistical significance testing. The randomization null hypothesis postulates that the observed spatial pattern of your data
represents one of many (n!) possible spatial arrangements. If you could pick up your data values and throw them down onto the features in your study area, you would have one possible spatial arrangement of those values. (Note that picking up your data values and throwing
them down arbitrarily is an example of a random spatial process). The randomization null hypothesis states that if you could do this
exercise (pick them up, throw them down) infinite times, most of the time you would produce a pattern that would not be markedly
different from the observed pattern (your real data). Once in a while you might accidentally throw all the highest values into the same
corner of your study area, but the probability of doing that is small. The randomization null hypothesis states that your data is one of many,
many, many possible versions of complete spatial randomness. The data values are fixed; only their spatial arrangement could vary.
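This "pick them up, throw them down" idea is essentially a permutation test. The toy sketch below illustrates it for a global Moran's I statistic computed with NumPy; the values, weights matrix, and number of permutations are all hypothetical, and the Spatial Statistics tools compute their z-scores analytically under this null hypothesis rather than by brute-force permutation.

import numpy as np

def morans_i(x, w):
    """Global Moran's I for a value vector x and a spatial weights matrix w."""
    z = x - x.mean()
    return (len(x) / w.sum()) * z.dot(w).dot(z) / z.dot(z)

values = np.array([2.0, 3.0, 10.0, 11.0, 12.0, 1.0])     # hypothetical attribute values
w = np.array([[0, 1, 0, 0, 0, 1],                        # hypothetical binary neighbor matrix
              [1, 0, 1, 0, 0, 0],
              [0, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 0],
              [0, 0, 0, 1, 0, 1],
              [1, 0, 0, 0, 1, 0]], dtype=float)

np.random.seed(42)
observed = morans_i(values, w)
# Hold the values fixed and shuffle their arrangement over the locations many times
shuffled = np.array([morans_i(np.random.permutation(values), w) for _ in range(999)])
# Pseudo p-value: how often a random arrangement is at least as clustered as the observed one
pseudo_p = (np.sum(shuffled >= observed) + 1.0) / (len(shuffled) + 1.0)
print(observed, pseudo_p)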
The Normalization Null Hypothesis—A common alternative null hypothesis, not implemented for the Spatial Statistics toolbox, is the
normalization null hypothesis. The normalization null hypothesis postulates that the observed values are derived from an infinitely large,
normally distributed population of values through some random sampling process. With a different sample you would get different values,
but you would still expect those values to be representative of the larger distribution. The normalization null hypothesis states that the
values represent one of many possible samples of values. If you could fit your observed data to a normal curve and randomly select values
from that distribution to toss onto your study area, most of the time you would produce a pattern and distribution of values that would not
be markedly different from the observed pattern/distribution (your real data). The normalization null hypothesis states that your data and
their arrangement are one of many, many, many possible random samples. Neither the data values nor their spatial arrangement are fixed.
The normalization null hypothesis is only appropriate when the data values are normally distributed.

Additional Resources:
 Ebdon, David. Statistics in Geography. Blackwell, 1985.
 Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.
 Goodchild, M. F. Spatial Autocorrelation. CATMOG 47. Geo Books, 1986.
 Caldas de Castro, Marcia, and Burton H. Singer. "Controlling the False Discovery Rate: A New Application to Account for Multiple and Dependent Tests in Local Statistics of Spatial Association." Geographical Analysis 38, pp. 180-208, 2006.

Related Topics
High/Low Clustering (Getis-Ord General G)
Spatial Autocorrelation (Global Moran's I)
Cluster and Outlier Analysis (Anselin Local Moran's I)
Hot Spot Analysis (Getis-Ord Gi*)
Ordinary Least Squares (OLS)
Optimized Hot Spot Analysis
Emerging Hot Spot Analysis

Copyright © 1995-2014 Esri. All rights reserved.

An overview of the Analyzing Patterns toolset

Locate topic

Identifying geographic patterns is important for understanding how geographic phenomena behave.
Although you can get a sense of the overall pattern of features and their associated values by mapping them, calculating a statistic quantifies
the pattern. This makes it easier to compare patterns for different distributions or different time periods. Often the tools in the Analyzing
Patterns toolset are a starting point for more in-depth analyses. Using the Incremental Spatial Autocorrelation tool to identify distances where
the processes promoting spatial clustering are most pronounced, for example, might help you select an appropriate distance (scale of analysis)
to use for investigating hot spots (Hot Spot Analysis).
The tools in the Analyzing Patterns toolset are inferential statistics; they start with the null hypothesis that your features, or the values
associated with your features, exhibit a spatially random pattern. They then compute a p-value representing the probability of observing such a pattern if the null hypothesis were true (that is, the probability that the observed pattern is simply one of many possible versions of complete spatial randomness). Calculating a
probability may be important if you need to have a high level of confidence in a particular decision. If there are public safety or legal
implications associated with your decision, for example, you may need to justify your decision using statistical evidence.
The Analyzing Patterns tools provide statistics that quantify broad spatial patterns. These tools answer questions such as, "Are the features in
the dataset, or the values associated with the features in the dataset, spatially clustered?" and "Is the clustering becoming more or less
intense over time?". The following table lists the tools available and provides a brief description of each.

Average Nearest Neighbor: Calculates a nearest neighbor index based on the average distance from each feature to its nearest neighboring feature.

High/Low Clustering: Measures the degree of clustering for either high values or low values using the Getis-Ord General G statistic.

Incremental Spatial Autocorrelation: Measures spatial autocorrelation for a series of distances and optionally creates a line graph of those distances and their corresponding z-scores. Z-scores reflect the intensity of spatial clustering, and statistically significant peak z-scores indicate distances where spatial processes promoting clustering are most pronounced. These peak distances are often appropriate values to use for tools with a Distance Band or Distance Radius parameter.

Spatial Autocorrelation: Measures spatial autocorrelation based on feature locations and attribute values using the Global Moran's I statistic.

Multi-Distance Spatial Cluster Analysis (Ripley's K Function): Determines whether features, or the values associated with features, exhibit statistically significant clustering or dispersion over a range of distances.

Analyzing patterns tools

Related Topics
An overview of the Spatial Statistics toolbox

Copyright © 1995-2014 Esri. All rights reserved.


Average Nearest Neighbor (Spatial Statistics)

Locate topic

Summary
Calculates a nearest neighbor index based on the average distance from each feature to its nearest neighboring feature.
You can access the results of this tool (including the optional report file) from the Results window. If you disable background processing,
results will also be written to the Progress dialog box.
Learn more about how Average Nearest Neighbor Distance works

Illustration

Usage
 The Average Nearest Neighbor tool returns five values: Observed Mean Distance, Expected Mean Distance, Nearest Neighbor Index,
z-score, and p-value. These values are accessible from the Results window and are also passed as derived output values for potential
use in models or scripts. Optionally, this tool will create an HTML file with a graphical summary of results. Double-clicking on the
HTML entry in the Results window will open the HTML file in the default Internet browser. Right-clicking on the Messages entry in the
Results window and selecting View will display the results in a Message dialog box.

Note:  If this tool is part of a custom model tool, the HTML link will only appear in the Results
window if it is set as a model parameter prior to running the tool.
 For best display of HTML graphics, ensure your monitor is set to 96 DPI.

 The z-score and p-value results are measures of statistical significance which tell you whether or not to reject the null hypothesis.
Note, however, that the statistical significance for this method is strongly impacted by study area size (see below). For the Average
Nearest Neighbor statistic, the null hypothesis states that features are randomly distributed.
 The Nearest Neighbor Index is expressed as the ratio of the Observed Mean Distance to the Expected Mean Distance. The expected distance is the average distance between neighbors in a hypothetical random distribution. If the index is less than 1, the pattern exhibits clustering; if the index is greater than 1, the trend is toward dispersion or competition. (A small numeric sketch of this ratio appears after these usage notes.)
 The average nearest neighbor method is very sensitive to the Area value (small changes in the Area parameter value can result in
considerable changes in the results). Consequently, the Average Nearest Neighbor tool is most effective for comparing different
features in a fixed study area. The picture below is a classic example of how identical feature distributions can be dispersed or
clustered depending on the study area specified.

 If an Area parameter value is not specified, then the area of the minimum enclosing rectangle around the input features is used.
Unlike the extent, a minimum enclosing rectangle will not necessarily align with the x- and y-axes.
 When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the
output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal
distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances,
at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points
on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to
connect those two points. Chordal distances are reported in meters.


Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal
distances are not a good estimate of geodesic distances beyond 30 degrees.

 When chordal distances are used in the analysis, the Area parameter, if specified, should be given in meters.
 Prior to ArcGIS 10.2.1, you would see a warning message if the parameters and environment settings you selected would result in
calculations being performed using Geographic Coordinates (degrees, minutes, seconds). This warning advised you to project your
data into a Projected Coordinate System so that distance calculations would be accurate. Beginning at 10.2.1, however, this tool
calculates chordal distances whenever Geographic Coordinate System calculations are required.

Caution: Because of this change, there is a small chance that you will need to modify models that
incorporate this tool if your models were created prior to ArcGIS 10.2.1 and if your models
include hard-coded Geographic Coordinate System parameter values. If, for example, a
distance parameter is set to something like 0.0025 degrees, you will need to convert that
fixed value from degrees to meters and resave your model.

 There are special cases of input features that would result in invalid (zero-area) minimum enclosing rectangles. In these cases, a
small value derived from the input feature XY tolerance will be used to create the minimum enclosing rectangle. For example, if all
features are coincident (that is, all have the exact same X and Y coordinates), the area for a very small square polygon around the
single location will be used in calculations. Another example would be if all features align perfectly (for example, 3 points in a straight
line); in this case the area of a rectangle polygon with a very small width around the features will be used in computations. It is
always best to supply an Area value when using the Average Nearest Neighbor tool.
 Although this tool will work with polygon or line data, it is most appropriate for event, incident, or other fixed-point feature data. For
line and polygon features, the true geometric centroid for each feature is used in computations. For multipoint, polyline, or polygons
with multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is
1, for line features is length, and for polygon features is area.

Legacy: In ArcGIS 10, optional graphical output is no longer displayed automatically. Instead, an
HTML file summarizing results is created. To view results, double-click the HTML file in the
Results window. Custom scripts or model tools created prior to ArcGIS 10 that use this tool
may need to be rebuilt. To rebuild these custom tools, open them, remove the Display
Results Graphically parameter, and resave.

 This tool will optionally create an HTML file summarizing results. HTML files will not automatically appear in the Catalog window. If
you want HTML files to be displayed in Catalog, open the ArcCatalog application, select the Customize menu option, click ArcCatalog
Options, and select the File Types tab. Click on the New Type button and specify HTML for File Extension.

 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.
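As referenced in the usage notes above, the Nearest Neighbor Index can be sketched numerically. The expected mean distance formula below is the standard one for a random point pattern; the point count, area, and observed distance are hypothetical.

import math

n = 100               # number of points (hypothetical)
area = 1000000.0      # study area size, in squared map units (hypothetical)
observed_mean = 42.0  # measured average nearest neighbor distance (hypothetical)

expected_mean = 0.5 * math.sqrt(area / n)   # expected mean distance under complete spatial randomness
ann_ratio = observed_mean / expected_mean
print(expected_mean, ann_ratio)             # 50.0 and 0.84; a ratio below 1 points toward clustering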

Syntax
AverageNearestNeighbor_stats (Input_Feature_Class, Distance_Method, {Generate_Report}, {Area})

Parameter (Data Type): Explanation

Input_Feature_Class (Feature Layer): The feature class, typically a point feature class, for which the average nearest neighbor distance will be calculated.

Distance_Method (String): Specifies how distances are calculated from each feature to neighboring features.
 EUCLIDEAN_DISTANCE —The straight-line distance between two points (as the crow flies)
 MANHATTAN_DISTANCE —The distance between two points measured along axes at right angles (city block); calculated by summing the (absolute) difference between the x- and y-coordinates

Generate_Report (Boolean, Optional)

Area (Double, Optional): A numeric value representing the study area size. The default value is the area of the minimum enclosing rectangle that would encompass all features (or all selected features). Units should match those for the Output Coordinate System.

Code Sample
AverageNearestNeighbor example 1 (Python window)
The following Python window script demonstrates how to use the AverageNearestNeighbor tool.

import arcpy
arcpy.env.workspace = r"C:\data"
arcpy.AverageNearestNeighbor_stats("burglaries.shp", "EUCLIDEAN_DISTANCE", "NO_REPORT", "#")

AverageNearestNeighbor example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the AverageNearestNeighbor tool.

# Analyze crime data to determine if spatial patterns are statistically significant

# Import system modules
import arcpy

# Local variables...
workspace = "C:/data"
crime_data = "burglaries.shp"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Obtain Nearest Neighbor Ratio and z-score
    # Process: Average Nearest Neighbor...
    nn_output = arcpy.AverageNearestNeighbor_stats(crime_data, "EUCLIDEAN_DISTANCE", "NO_REPORT", "#")

    # Create list of Average Nearest Neighbor output values by splitting the result object
    print "The nearest neighbor index is: " + nn_output[0]
    print "The z-score of the nearest neighbor index is: " + nn_output[1]
    print "The p-value of the nearest neighbor index is: " + nn_output[2]
    print "The expected mean distance is: " + nn_output[3]
    print "The observed mean distance is: " + nn_output[4]
    print "The path of the HTML report: " + nn_output[5]

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Current_workspace, Scratch_workspace, Output_coordinate_system, Geographic_transformations

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference. When the Output Coordinate System is based on degrees, minutes, and seconds,
geodesic distances are estimated using chordal distances.

Related Topics
An overview of the Analyzing Patterns toolset
Modeling spatial relationships
What is a z-score? What is a p-value?
Using the Results window
Multi-Distance Spatial Cluster Analysis (Ripley's K Function)
Calculate Areas
Spatial Autocorrelation (Global Moran's I)
How Average Nearest Neighbor works

Copyright © 1995-2014 Esri. All rights reserved.

High/Low Clustering (Getis-Ord General G) (Spatial Statistics)

Locate topic


Summary
Measures the degree of clustering for either high values or low values using the Getis-Ord General G statistic.
You can access the results of this tool (including the optional report file) from the Results window. If you disable background processing,
results will also be written to the Progress dialog box.
Learn more about how High/Low Clustering: Getis-Ord General G works

Illustration

Usage
 The High/Low Clustering tool returns five values: Observed General G, Expected General G, Variance, z-score, and p-value. These
values are accessible from the Results window and are also passed as derived output values for potential use in models or scripts.
Optionally, this tool will create an HTML file with a graphical summary of results. Double-clicking on the HTML file in the Results
window will open the HTML file in the default Internet browser. Right-clicking on the Messages entry in the Results window and
selecting View will display the results in a Message dialog box.

Note:  If this tool is part of a custom model tool, the HTML link will only appear in the Results
window if it is set as a model parameter prior to running the tool.
 For best display of HTML graphics, ensure your monitor is set to 96 DPI.

 The Input Field should contain a variety of nonnegative values. You will get an error message if the Input Field contains negative
values. In addition, the math for this statistic requires some variation in the variable being analyzed; it cannot solve if all input values
are 1, for example. If you want to use this tool to analyze the spatial pattern of incident data, consider aggregating your incident
data. The Optimized Hot Spot Analysis tool may also be used to analyze the spatial pattern of incident data.

Note: Incident data are points representing events (crime, traffic accidents) or objects (trees,
stores) where your focus is on presence or absence rather than some measured attribute
associated with each point.

 The z-score and p-value are measures of statistical significance which tell you whether or not to reject the null hypothesis. For this
tool, the null hypothesis states that the values associated with features are randomly distributed.
 The z-score is based on the randomization null hypothesis computation. For more information on z-scores, see What is a z-score?
What is a p-value?
 The higher (or lower) the z-score, the stronger the intensity of the clustering. A z-score near zero indicates no apparent clustering
within the study area. A positive z-score indicates clustering of high values. A negative z-score indicates clustering of low values.
 When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the
output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal
distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances,
at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points
on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to
connect those two points. Chordal distances are reported in meters.

Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal
distances are not a good estimate of geodesic distances beyond 30 degrees.

 When chordal distances are used in the analysis, the Distance Band or Threshold Distance parameter, if specified, should be given in
meters.
 Prior to ArcGIS 10.2.1, you would see a warning message if the parameters and environment settings you selected would result in
calculations being performed using Geographic Coordinates (degrees, minutes, seconds). This warning advised you to project your
data into a Projected Coordinate System so that distance calculations would be accurate. Beginning at 10.2.1, however, this tool calculates chordal distances whenever Geographic Coordinate System calculations are required.

Caution: Because of this change, there is a small chance that you will need to modify models that
incorporate this tool if your models were created prior to ArcGIS 10.2.1 and if your models
include hard-coded Geographic Coordinate System parameter values. If, for example, a
distance parameter is set to something like 0.0025 degrees, you will need to convert that
fixed value from degrees to meters and resave your model.

 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.

Legacy: In ArcGIS 10, optional graphical output is no longer displayed automatically. Instead, an
HTML file summarizing results is created. To view results, double-click the HTML file in the
Results window. Custom scripts or model tools created prior to ArcGIS 10 that use this tool
may need to be rebuilt. To rebuild these custom tools, open them, remove the Display
Results Graphically parameter, and resave.

 This tool will optionally create an HTML file summarizing results. HTML files will not automatically appear in the Catalog window. If
you want HTML files to be displayed in Catalog, open the ArcCatalog application, select the Customize menu option, click ArcCatalog
Options, and select the File Types tab. Click on the New Type button and specify HTML for File Extension.

 Your choice for the Conceptualization of Spatial Relationships parameter should reflect inherent relationships among the features you
are analyzing. The more realistically you can model how features interact with each other in space, the more accurate your results
will be. Recommendations are outlined in Selecting a Conceptualization of Spatial Relationships: Best Practices. Here are some
additional tips:
 A binary weighting scheme is recommended for this statistic: fixed distance, polygon contiguity, K nearest neighbors or Delaunay
triangulation. Select NONE for the Standardization parameter.
 FIXED_DISTANCE_BAND
The default Distance Band or Threshold Distance will ensure each feature has at least one neighbor, and this is important. But
often, this default will not be the most appropriate distance to use for your analysis. Additional strategies for selecting an
appropriate scale (distance band) for your analysis are outlined in Selecting a fixed distance band value.
 INVERSE_DISTANCE or INVERSE_DISTANCE_SQUARED (not recommended)
When zero is entered for the Distance Band or Threshold Distance parameter, all features are considered neighbors of all other
features; when this parameter is left blank, the default distance will be applied.
Weights for distances less than 1 become unstable when they are inverted. Consequently, features separated by less than 1 unit of distance are given a weight of 1.
For the inverse distance options (not recommended for this tool), any two points that are coincident will be given a weight of
one to avoid zero division. This assures features are not excluded from analysis.
 Additional options for the Conceptualization of Spatial Relationships parameter, including space-time relationships, are available
using the Generate Spatial Weights Matrix or Generate Network Spatial Weights tools. To take advantage of these additional options,
use one of these tools to construct the spatial weights matrix file prior to analysis; select GET_SPATIAL_WEIGHTS_FROM_FILE for the
Conceptualization of Spatial Relationships parameter; and for the Weights Matrix File parameter, specify the path to the spatial
weights file you created.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.
 If you provide a Weights Matrix File with an .swm extension, this tool is expecting a spatial weights matrix file created using either
the Generate Spatial Weights Matrix or Generate Network Spatial Weights tools; otherwise, this tool is expecting an ASCII-formatted
spatial weights matrix file. In some cases, behavior is different depending on which type of spatial weights matrix file you use:
 ASCII-formatted spatial weights matrix files:
 Weights are used as is. Missing feature-to-feature relationships are treated as zeros.

 If the weights are row standardized, results will likely be incorrect for analyses on selection sets. If you need to run your
analysis on a selection set, convert the ASCII spatial weights file to an SWM file by reading the ASCII data into a table and
using the CONVERT_TABLE option with the Generate Spatial Weights Matrix tool.
 SWM-formatted spatial weights matrix file:
 If the weights are row standardized, they will be restandardized for selection sets; otherwise, weights are used as is.

 Running your analysis with an ASCII-formatted spatial weights matrix file is memory intensive. For analyses on more than 5,000 features, consider converting your ASCII-formatted spatial weights matrix file into an SWM-formatted file. First put your ASCII
weights into a formatted table (using Excel, for example). Next, run the Generate Spatial Weights Matrix tool using CONVERT_TABLE
for the Conceptualization of Spatial Relationships parameter. The output will be an SWM-formatted spatial weights matrix file.
 The Modeling Spatial Relationships help topic provides additional information about this tool's parameters.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Syntax
HighLowClustering_stats (Input_Feature_Class, Input_Field, {Generate_Report}, Conceptualization_of_Spatial_Relationships,
Distance_Method, Standardization, {Distance_Band_or_Threshold_Distance}, {Weights_Matrix_File})

Parameter (Data Type): Explanation

Input_Feature_Class (Feature Layer): The feature class for which the General G statistic will be calculated.

Input_Field (Field): The numeric field to be evaluated.

Generate_Report (Boolean, Optional)

Conceptualization_of_Spatial_Relationships (String): Specifies how spatial relationships among features are conceptualized.
 INVERSE_DISTANCE —Nearby neighboring features have a larger influence on the computations for a target feature than features that are far away.
 INVERSE_DISTANCE_SQUARED —Same as INVERSE_DISTANCE except that the slope is sharper, so influence drops off more quickly, and only a target feature's closest neighbors will exert substantial influence on computations for that feature.
 FIXED_DISTANCE_BAND —Each feature is analyzed within the context of neighboring features. Neighboring features inside the specified critical distance receive a weight of 1 and exert influence on computations for the target feature. Neighboring features outside the critical distance receive a weight of zero and have no influence on a target feature's computations.
 ZONE_OF_INDIFFERENCE —Features within the specified critical distance of a target feature receive a weight of 1 and influence computations for that feature. Once the critical distance is exceeded, weights (and the influence a neighboring feature has on target feature computations) diminish with distance.
 CONTIGUITY_EDGES_ONLY —Only neighboring polygon features that share a boundary or overlap will influence computations for the target polygon feature.
 CONTIGUITY_EDGES_CORNERS —Polygon features that share a boundary, share a node, or overlap will influence computations for the target polygon feature.
 GET_SPATIAL_WEIGHTS_FROM_FILE —Spatial relationships are defined in a spatial weights file. The path to the spatial weights file is specified in the Weights Matrix File parameter.

Distance_Method (String): Specifies how distances are calculated from each feature to neighboring features.
 EUCLIDEAN_DISTANCE —The straight-line distance between two points (as the crow flies)
 MANHATTAN_DISTANCE —The distance between two points measured along axes at right angles (city block); calculated by summing the (absolute) difference between the x- and y-coordinates

Standardization (String): Row standardization is recommended whenever the distribution of your features is potentially biased due to sampling design or an imposed aggregation scheme.
 NONE —No standardization of spatial weights is applied.
 ROW —Spatial weights are standardized; each weight is divided by its row sum (the sum of the weights of all neighboring features).

Distance_Band_or_Threshold_Distance (Double, Optional): Specifies a cutoff distance for Inverse Distance and Fixed Distance options. Features outside the specified cutoff for a target feature are ignored in analyses for that feature. However, for Zone of Indifference, the influence of features outside the given distance is reduced with distance, while those inside the distance threshold are equally considered. The distance value entered should match that of the output coordinate system. For the Inverse Distance conceptualizations of spatial relationships, a value of 0 indicates that no threshold distance is applied; when this parameter is left blank, a default threshold value is computed and applied. This default value is the Euclidean distance that ensures every feature has at least one neighbor. This parameter has no effect when Polygon Contiguity or Get Spatial Weights From File spatial conceptualizations are selected.

Weights_Matrix_File (File, Optional): The path to a file containing weights that define spatial, and potentially temporal, relationships among features.

Code Sample


HighLowClustering example 1 (Python window)


The following Python window script demonstrates how to use the High/Low Clustering tool.

import arcpy
arcpy.env.workspace = r"C:\data"
arcpy.HighLowClustering_stats("911Count.shp", "ICOUNT","false", "GET_SPATIAL_WEIGHTS_FROM_FILE","EUCLIDEAN_DISTANCE"

HighLowClustering example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the High/Low Clustering tool.

# Analyze the spatial distribution of 911 calls in a metropolitan area
# using the High/Low Clustering (Getis-Ord General G) tool

# Import system modules
import arcpy

# Set the geoprocessor object property to overwrite existing outputs
arcpy.gp.overwriteOutput = True

# Local variables...
workspace = r"C:\Data"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Copy the input feature class and integrate the points to snap
    # together at 500 feet
    # Process: Copy Features and Integrate
    cf = arcpy.CopyFeatures_management("911Calls.shp", "911Copied.shp",
                                       "#", 0, 0, 0)

    integrate = arcpy.Integrate_management("911Copied.shp #", "500 Feet")

    # Use Collect Events to count the number of calls at each location
    # Process: Collect Events
    ce = arcpy.CollectEvents_stats("911Copied.shp", "911Count.shp", "Count", "#")

    # Add a unique ID field to the count feature class
    # Process: Add Field and Calculate Field
    af = arcpy.AddField_management("911Count.shp", "MyID", "LONG", "#", "#", "#", "#",
                                   "NON_NULLABLE", "NON_REQUIRED", "#",
                                   "911Count.shp")

    cf = arcpy.CalculateField_management("911Count.shp", "MyID", "[FID]", "VB")

    # Create Spatial Weights Matrix for Calculations
    # Process: Generate Spatial Weights Matrix...
    swm = arcpy.GenerateSpatialWeightsMatrix_stats("911Count.shp", "MYID",
                                                   "euclidean6Neighs.swm",
                                                   "K_NEAREST_NEIGHBORS",
                                                   "#", "#", "#", 6,
                                                   "NO_STANDARDIZATION")

    # Cluster Analysis of 911 Calls
    # Process: High/Low Clustering (Getis-Ord General G)
    hs = arcpy.HighLowClustering_stats("911Count.shp", "ICOUNT",
                                       "false",
                                       "GET_SPATIAL_WEIGHTS_FROM_FILE",
                                       "EUCLIDEAN_DISTANCE", "NONE",
                                       "#", "euclidean6Neighs.swm")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Current_workspace, Scratch_workspace, Output_coordinate_system, Geographic_transformations

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference. When the Output Coordinate System is based on degrees, minutes, and seconds,
geodesic distances are estimated using chordal distances.

Related Topics
An overview of the Analyzing Patterns toolset
Modeling spatial relationships
What is a z-score? What is a p-value?


Using the Results window


Spatial Autocorrelation (Global Moran's I)
Hot Spot Analysis (Getis-Ord Gi*)
Spatial weights
How High/Low Clustering (Getis-Ord General G) works

Copyright © 1995-2014 Esri. All rights reserved.

Incremental Spatial Autocorrelation (Spatial Statistics)

Locate topic

Summary
Measures spatial autocorrelation for a series of distances and optionally creates a line graph of those distances and their corresponding z-
scores. Z-scores reflect the intensity of spatial clustering, and statistically significant peak z-scores indicate distances where spatial
processes promoting clustering are most pronounced. These peak distances are often appropriate values to use for tools with a Distance
Band or Distance Radius parameter.
Learn more about how Incremental Spatial Autocorrelation works
Learn more about how Spatial Autocorrelation (Global Moran's I) works

Illustration

Z-score peaks reflect distances where the spatial processes promoting clustering are most pronounced.

Usage
 This tool can help you select an appropriate Distance Threshold or Radius for tools that have these parameters, such as Hot Spot
Analysis or Point Density.
 The Incremental Spatial Autocorrelation tool measures spatial autocorrelation for a series of distance increments and reports, for each
distance increment, the associated Moran's Index, Expected Index, Variance, z-score and p-value. These values are accessible from
the Results window by right-clicking on the Messages entry and selecting View. The tool also passes, as derived output, the first
peak z-score and maximum peak z-score for potential use in models or scripts (see, for example, the sample script below).
 When more than one statistically significant peak is present, clustering is pronounced at each of those distances. Select the peak
distance that best corresponds to the scale of analysis you are interested in; often this is the first statistically significant peak
encountered.
 The Input Field should contain a variety of values. The math for this statistic requires some variation in the variable being analyzed;
it cannot solve if all input values are 1, for example. If you want to use this tool to analyze the spatial pattern of incident data,
consider aggregating your incident data.
 When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the
output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal
distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances,
at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points
on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to
connect those two points. Chordal distances are reported in meters.

Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal
distances are not a good estimate of geodesic distances beyond 30 degrees.

 When chordal distances are used in the analysis, the Beginning Distance and Distance Increment parameters, if specified, should be
given in meters.
 Prior to ArcGIS 10.2.1, you would see a warning message if the parameters and environment settings you selected would result in
calculations being performed using Geographic Coordinates (degrees, minutes, seconds). This warning advised you to project your
data into a Projected Coordinate System so that distance calculations would be accurate. Beginning at 10.2.1, however, this tool
calculates chordal distances whenever Geographic Coordinate System calculations are required.

Caution: Because of this change, there is a small chance that you will need to modify models that
incorporate this tool if your models were created prior to ArcGIS 10.2.1 and if your models
include hard-coded Geographic Coordinate System parameter values. If, for example, a
distance parameter is set to something like 0.0025 degrees, you will need to convert that fixed value from degrees to meters and resave your model.

 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.
 For polygon features, you will almost always want to choose ROW for the Row Standardization parameter. Row Standardization
mitigates bias when the number of neighbors each feature has is a function of the aggregation scheme or sampling process, rather
than reflecting the actual spatial distribution of the variable you are analyzing.
 If no Beginning Distance is given, the default value is the minimum distance for which each feature in the dataset has at least one
neighbor. This may not be the most appropriate beginning distance if your dataset includes locational outliers.
 If no Increment Distance is given, the smaller of either the average nearest neighbor distance or (Td - B) / I is used, where Td is a
maximum threshold distance, B is the Beginning Distance and I is the Number of Distance Bands. This algorithm ensures calculations
will always be performed for the Number of Distance Bands specified and that the largest distance bands won't be so large that some
features have all or almost all other features as neighbors.
 If the Beginning Distance and/or Increment Distance specified will result in a distance band that is larger than the maximum
threshold distance, the Increment Distance will automatically be scaled down. To avoid this adjustment you can decrease the
Increment Distance and/or decrease the Number of Distance Bands specified.
 It is possible to run out of memory when you run this tool. This generally occurs when you specify a Beginning Distance and/or
Increment Distance resulting in features having many, many neighbors. You generally do not want to create spatial relationships
where your features have thousands of neighbors. Use a smaller value for the Increment Distance and temporarily remove locational
outliers so that you can start with a smaller Beginning Distance value.
 Even if you let the tool calculate a Beginning Distance and Increment Distance for you, processing time can be long for large
datasets. You can improve performance by:
 Temporarily removing locational outliers.
 Selecting features in a representative portion of the study area and running the analysis on just those features, rather than analyzing all features.
 Taking a random sample of features from the dataset and running the analysis on just those sampled features.
 Distances are always based on the Output Coordinate System environment setting. The default setting for the Output Coordinate
System environment is Same as Input. Input features are projected to the output coordinate system prior to analysis.
 The optional Output Table will contain the distance value at each iteration, the Moran's I Index value, the expected Moran's I index value, the variance, the z-score, and the p-value. A peak is a distance at which the z-score increases and then decreases at the next distance band. For example, if this tool returns z-scores of 2.95, 3.68, and 3.12 for distances of 50, 100, and 150 meters, the peak is at 100 meters (see the sketch following these usage notes).
 The optional Output Report File is created as a PDF file and may be accessed from the Results window by double-clicking on the file
name.
 This tool will optionally create a PDF report summarizing results. PDF files do not automatically appear in the Catalog window. If you want PDF files to be displayed in Catalog, open the ArcCatalog application, select the Customize menu option, click ArcCatalog Options, and select the File Types tab. Click on the New Type button and specify PDF for File Extension.

 On machines configured with the ArcGIS language packages for Chinese or Japanese, you might notice missing text or formatting problems in the PDF Output Report File. These problems can be corrected by changing the font settings.
 When no peak z-scores are identified, both the first peak z-score and maximum peak z-score derived output parameters return a
blank.
 When using this tool in Python scripts, the result object returned from tool execution has the following outputs:
Position Description Data Type
0 First Peak Double
1 Max Peak Double
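
Below is a minimal sketch of the peak rule described above, using the example z-scores from these notes as stand-in data. In practice, the distance/z-score pairs would come from the optional Output Table, and the first and maximum peaks correspond to the First Peak and Max Peak derived outputs listed above.

# Example distance/z-score series from the usage notes above.
zscores = [(50, 2.95), (100, 3.68), (150, 3.12)]

# A peak is a z-score that is higher than the z-scores immediately before and after it.
peaks = []
for i in range(1, len(zscores) - 1):
    distance, z = zscores[i]
    if z > zscores[i - 1][1] and z > zscores[i + 1][1]:
        peaks.append((distance, z))

if peaks:
    first_peak = peaks[0]                      # analogous to the First Peak output
    max_peak = max(peaks, key=lambda p: p[1])  # analogous to the Max Peak output
    print "First peak: %s  Max peak: %s" % (first_peak, max_peak)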

Syntax
IncrementalSpatialAutocorrelation_stats (Input_Features, Input_Field, Number_of_Distance_Bands, {Beginning_Distance},
{Distance_Increment}, {Distance_Method}, {Row_Standardization}, {Output_Table}, {Output_Report_File})


Parameter (Data Type): Explanation

Input_Features (Feature Layer): The feature class for which spatial autocorrelation will be measured over a series of distances.

Input_Field (Field): The numeric field used in assessing spatial autocorrelation.

Number_of_Distance_Bands (Long): The number of times to increment the neighborhood size and analyze the dataset for spatial autocorrelation. The starting point and size of the increment are specified in the Beginning Distance and Distance Increment parameters, respectively.

Beginning_Distance (Double, Optional): The distance at which to start the analysis of spatial autocorrelation and the distance from which to increment. The value entered for this parameter should be in the units of the Output Coordinate System environment setting.

Distance_Increment (Double, Optional): The distance to increase after each iteration. The distance used in the analysis starts at the Beginning Distance and increases by the amount specified in the Distance Increment. The value entered for this parameter should be in the units of the Output Coordinate System environment setting.

Distance_Method (String, Optional): Specifies how distances are calculated from each feature to neighboring features.
 EUCLIDEAN —The straight-line distance between two points (as the crow flies)
 MANHATTAN —The distance between two points measured along axes at right angles (city block); calculated by summing the (absolute) difference between the x- and y-coordinates

Row_Standardization (Boolean, Optional):
 NONE —No standardization of spatial weights is applied.
 ROW —Spatial weights are standardized; each weight is divided by its row sum (the sum of the weights of all neighboring features).

Output_Table (Table, Optional): The table to be created with each distance band and associated z-score result.

Output_Report_File (File, Optional): The PDF file to be created containing a line graph summarizing results.

Code Sample
IncrementalSpatialAutocorrelation example 1 (Python window)
The following Python window script demonstrates how to use the IncrementalSpatialAutocorrelation tool.

import arcpy, os
import arcpy.stats as SS
arcpy.env.workspace = r"C:\ISA"
SS.IncrementalSpatialAutocorrelation("911CallsCount.shp", "ICOUNT", "20", "", "", "EUCLIDEAN",
"ROW_STANDARDIZATION", "outTable.dbf", "outReport.pdf")

IncrementalSpatialAutocorrelation example (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the IncrementalSpatialAutocorrelation tool.


# Hot Spot Analysis of 911 calls in a metropolitan area


# using the Incremental Spatial Autocorrelation and Hot Spot Analysis Tool

# Import system modules


import arcpy, os
import arcpy.stats as SS

# Set geoprocessor object property to overwrite existing output, by default


arcpy.gp.overwriteOutput = True

# Local variables
workspace = r"C:\ISA"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Copy the input feature class and integrate the points to snap together at 30 feet
    # Process: Copy Features and Integrate
    cf = arcpy.CopyFeatures_management("911Calls.shp", "911Copied.shp", "#", 0, 0, 0)
    integrate = arcpy.Integrate_management("911Copied.shp #", "30 Feet")

    # Use Collect Events to count the number of calls at each location
    # Process: Collect Events
    ce = SS.CollectEvents("911Copied.shp", "911Count.shp")

    # Use Incremental Spatial Autocorrelation to get the peak distance
    # Process: Incremental Spatial Autocorrelation
    isa = SS.IncrementalSpatialAutocorrelation(ce, "ICOUNT", "20", "", "", "EUCLIDEAN",
                                               "ROW_STANDARDIZATION", "outTable.dbf", "outReport.pdf")

    # Hot Spot Analysis of 911 Calls
    # Process: Hot Spot Analysis (Getis-Ord Gi*)
    distance = isa.getOutput(2)
    hs = SS.HotSpots(ce, "ICOUNT", "911HotSpots.shp", "Fixed Distance Band",
                     "Euclidean Distance", "None", distance, "", "")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Current_workspace, Scratch_workspace, Output_coordinate_system, Geographic_transformations

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference. When the Output Coordinate System is based on degrees, minutes, and seconds,
geodesic distances are estimated using chordal distances.

Related Topics
An overview of the Analyzing Patterns toolset
Modeling spatial relationships
What is a z-score? What is a p-value?
Using the Results window
How Incremental Spatial Autocorrelation works
Spatial Autocorrelation (Global Moran's I)
Spatial weights
How Spatial Autocorrelation (Global Moran's I) works

Copyright © 1995-2014 Esri. All rights reserved.

Multi-Distance Spatial Cluster Analysis (Ripley's K Function) (Spatial Statistics)


Summary
Determines whether features, or the values associated with features, exhibit statistically significant clustering or dispersion over a range of
distances.
Learn more about how Multi-Distance Spatial Cluster Analysis works

Illustration


Measure of spatial clustering/dispersion over a range of distances.

Usage
 This tool requires projected data to accurately measure distances.
 Tool output is a table with fields: ExpectedK and ObservedK containing the expected and observed K values, respectively. Because
the L(d) transformation is applied, the ExpectedK values will always match the Distance value. A field named DiffK contains the
Observed K values minus the Expected K values. If a confidence interval option is specified, two additional fields named LwConfEnv
and HiConfEnv will be included in the Output Table as well. These fields contain confidence interval information for each iteration of
the tool, as specified by the Number of Distance Bands parameter. The K function will optionally create a graph layer summarizing
results.
 When the observed K value is larger than the expected K value for a particular distance, the distribution is more clustered than a
random distribution at that distance (scale of analysis). When the observed K value is smaller than the expected K value, the
distribution is more dispersed than a random distribution at that distance. When the observed K value is larger than the HiConfEnv
value, spatial clustering for that distance is statistically significant. When the observed K value is smaller than the LwConfEnv value,
spatial dispersion for that distance is statistically significant. Additional information about interpretation is found in How Multi-Distance Spatial Cluster Analysis (Ripley's K-function) works. A short script sketch applying these rules to the Output Table follows these usage notes.
 Enable the Display Results Graphically parameter to create a line graph summarizing tool results. The expected results will be
represented by a blue line while the observed results will be a red line. Deviation of the observed line above the expected line
indicates that the dataset is exhibiting clustering at that distance. Deviation of the observed line below the expected line indicates
that the dataset is exhibiting dispersion at that distance. The line graph is created as a graph layer; graph layers are temporary and
will be deleted when you close ArcMap. If you right-click the graph layer and select Save, the graph can be written to a Graph File. If
you save your map document after saving your graph, a link to the graph file will be saved with your .mxd. For more information
about graph files, see Exploring and visualizing data with graphs.
 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 The Weight Field is most appropriately used when it represents the number of incidents or counts.
 When no Weight Field is specified, the largest DiffK value tells you the distance where spatial processes promoting clustering are
most pronounced.
 The following explains how the confidence envelope is computed:
 No Weight Field
When no Weight Field is specified, the confidence envelope is constructed by distributing points randomly in the study area and calculating L(d) for that distribution. Each random distribution of the points is called a "permutation". If 99 permutations are selected, for example, the tool will randomly distribute the set of points 99 times for each iteration. After distributing the points 99 times, the tool selects, for each distance, the observed K values that deviated above and below the expected K value by the greatest amount; these values become the confidence interval.
 Including a Weight Field
When a Weight Field is specified, only the weight values are randomly redistributed to compute confidence envelopes; the point
locations remain fixed. In essence, when a Weight Field is specified, locations remain fixed and the tool evaluates the clustering
of feature values in space. On the other hand, when no Weight Field is specified the tool analyzes clustering/dispersion of
feature locations.
 Because the confidence envelope is constructed from random permutations, the values defining the confidence envelope will
change from one run to the next, even when parameters are identical. If you set a seed value, however, for the Random Number
Generator geoprocessing environment, repeat analyses will produce consistent results.
 The number of permutations selected for the Compute Confidence Envelope parameter may be loosely translated to confidence
levels: 9 for 90%, 99 for 99%, and 999 for 99.9%.
 When no study area is specified, the tool uses a minimum enclosing rectangle as the study area polygon. Unlike the extent, a
minimum enclosing rectangle will not necessarily align with the x- and y-axes.
 The k-function statistic is very sensitive to the size of the study area. Identical arrangements of points can exhibit clustering or
dispersion depending on the size of the study area enclosing them. Therefore, it is imperative that the study area boundaries are
carefully considered. The picture below is a classic example of how identical feature distributions can be dispersed or clustered
depending on the study area specified.

 A study area feature class is required if USER_PROVIDED_STUDY_AREA_FEATURE_CLASS is chosen for the Study Area Method parameter.
 If a Study Area Feature Class is specified, it should have exactly one single part feature (the study area polygon).
 If no Beginning Distance or Distance Increment is specified, then default values are calculated for you based on the extent of the
Input Feature Class.
 The K function has an undercount bias for features located near the study area boundary. The Boundary Correction Method parameter provides methods for addressing this bias.


 NONE
No specific boundary correction is applied. However, points in the Input Feature Class that fall outside the user-specified study
area are used in neighbor counts. This method is appropriate if you've collected data from a very large study area but only need
to analyze smaller areas well within the boundaries of data collection.
 SIMULATE_OUTER_BOUNDARY_VALUES
This method creates points outside the study area boundary that mirror those found inside the boundary in order to correct for
underestimates near the edges. Points that are within a distance equal to the maximum distance band of an edge of the study
area are mirrored. The mirrored points are used so that edge points will have more accurate neighbor estimates. The diagram
below illustrates what points will be used in the calculation and which will be used only for edge correction.

 REDUCE_ANALYSIS_AREA
This edge correction technique shrinks the size of the analysis area by a distance equal to the largest distance band to be used
in the analysis. After shrinking the study area, points found outside of the new study area will be considered only when neighbor
counts are being assessed for points still inside the study area. They will not be used in any other way during the k-function
calculation. The diagram below illustrates which points will be used in the calculation and which will be used only for edge
correction.

 RIPLEY_EDGE_CORRECTION_FORMULA
This method checks each point's distance from the edge of the study area and its distance to each of its neighbors. All neighbors
that are further away from the point in question than the edge of the study area are given extra weighting. This edge correction
method is only appropriate for square or rectangular shaped study areas, or when you select MINIMUM_ENCLOSING_RECTANGLE
for the Study Area Method parameter.
 When no boundary correction is applied, the undercount bias increases as the analysis distance increases. If you enable the Display
Results Graphically parameter, you will notice that the ObservedK line droops at the larger distances.

 Mathematically, the Multi-Distance Spatial Cluster Analysis tool uses a common transformation of Ripley's k-function where the
expected result with a random set of points is equal to the input distance. The transformation L(d) is shown below.
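
In standard notation (consistent with the symbol definitions that follow):

L(d) = \sqrt{\frac{A \sum_{i=1}^{N} \sum_{j=1,\, j \neq i}^{N} k(i, j)}{\pi\, N (N - 1)}}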

where A is area, N is the number of points, d is the distance and k(i, j) is the weight, which (if there is no boundary correction) is 1
when the distance between i and j is less than or equal to d and 0 when the distance between i and j is greater than d. When edge
correction is applied, the weight of k(i, j) is modified slightly.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.
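
Below is a minimal sketch of one way to apply the interpretation rules from these usage notes to the Output Table, using the field names listed above (ExpectedK, ObservedK, LwConfEnv, HiConfEnv). It assumes a confidence envelope option was specified, and the table path is a placeholder.

import arcpy

# Placeholder path; substitute the Output Table written by the K Function tool.
ktable = r"C:\data\kFunResult.dbf"

# Field names as listed in the usage notes above. Because the L(d) transformation
# is applied, the ExpectedK value in each row equals that row's distance.
fields = ["ExpectedK", "ObservedK", "LwConfEnv", "HiConfEnv"]

with arcpy.da.SearchCursor(ktable, fields) as rows:
    for expectedK, observedK, lwConf, hiConf in rows:
        if observedK > hiConf:
            label = "statistically significant clustering"
        elif observedK < lwConf:
            label = "statistically significant dispersion"
        elif observedK > expectedK:
            label = "clustering (not significant)"
        else:
            label = "dispersion (not significant)"
        print "Distance %s: %s" % (expectedK, label)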

Syntax
MultiDistanceSpatialClustering_stats (Input_Feature_Class, Output_Table, Number_of_Distance_Bands, {Compute_Confidence_Envelope},
{Display_Results_Graphically}, {Weight_Field}, {Beginning_Distance}, {Distance_Increment}, {Boundary_Correction_Method},
{Study_Area_Method}, {Study_Area_Feature_Class})

Parameter (Data Type): Explanation

Input_Feature_Class (Feature Layer): The feature class upon which the analysis will be performed.

Output_Table (Table): The table to which the results of the analysis will be written.

Number_of_Distance_Bands (Long): The number of times to increment the neighborhood size and analyze the dataset for clustering. The starting point and size of the increment are specified in the Beginning Distance and Distance Increment parameters, respectively.

Compute_Confidence_Envelope (String, Optional): The confidence envelope is calculated by randomly placing feature points (or feature values) in the study area. The number of points/values randomly placed is equal to the number of points in the feature class. Each set of random placements is called a "permutation" and the confidence envelope is created from these permutations. This parameter allows you to select how many permutations you want to use to create the confidence envelope.
 0_PERMUTATIONS_-_NO_CONFIDENCE_ENVELOPE —Confidence envelopes are not created.
 9_PERMUTATIONS —Nine sets of points/values are randomly placed.
 99_PERMUTATIONS —99 sets of points/values are randomly placed.
 999_PERMUTATIONS —999 sets of points/values are randomly placed.

Display_Results_Graphically (Boolean, Optional):
 NO_DISPLAY —No graphical summary will be created (default).
 DISPLAY_IT —A graphical summary will be created as a graph layer.

Weight_Field (Field, Optional): A numeric field with weights representing the number of features/events at each location.

Beginning_Distance (Double, Optional): The distance at which to start the cluster analysis and the distance from which to increment. The value entered for this parameter should be in the units of the Output Coordinate System.

Distance_Increment (Double, Optional): The distance to increment during each iteration. The distance used in the analysis starts at the Beginning Distance and increments by the amount specified in the Distance Increment. The value entered for this parameter should be in the units of the Output Coordinate System.

Boundary_Correction_Method (String, Optional): Method to use to correct for underestimates in the number of neighbors for features near the edges of the study area.
 NONE —No edge correction is applied. However, if the input feature class already has points that fall outside the study area boundaries, these will be used in neighborhood counts for features near boundaries.
 SIMULATE_OUTER_BOUNDARY_VALUES —This method simulates points outside the study area so that the number of neighbors near edges is not underestimated. The simulated points are the "mirrors" of points near edges within the study area boundary.
 REDUCE_ANALYSIS_AREA —This method shrinks the study area such that some points are found outside of the study area boundary. Points found outside the study area are used to calculate neighbor counts but are not used in the cluster analysis itself.
 RIPLEY_EDGE_CORRECTION_FORMULA —For all the points (j) in the neighborhood of point i, this method checks to see if the edge of the study area is closer to i, or if j is closer to i. If j is closer, extra weight is given to the point j. This edge correction method is only appropriate for square or rectangular shaped study areas.

Study_Area_Method (String, Optional): Specifies the region to use for the study area. The K Function is sensitive to changes in study area size so careful selection of this value is important.
 MINIMUM_ENCLOSING_RECTANGLE —Indicates that the smallest possible rectangle enclosing all of the points will be used.
 USER_PROVIDED_STUDY_AREA_FEATURE_CLASS —Indicates that a feature class defining the study area will be provided in the Study Area Feature Class parameter.

Study_Area_Feature_Class (Feature Layer, Optional): Feature class that delineates the area over which the input feature class should be analyzed. Only to be specified if User-provided Study Area Feature Class is selected for the Study Area Method parameter.

Code Sample
Multi-DistanceSpatialClusterAnalysis Example (Python Window)
The following Python Window script demonstrates how to use the Multi-DistanceSpatialClusterAnalysis tool.

import arcpy
arcpy.env.workspace = r"C:\data"
arcpy.MultiDistanceSpatialClustering_stats("911Calls.shp","kFunResult.dbf", 11,
"0_PERMUTATIONS_-_NO_CONFIDENCE_ENVELOPE",
"NO_REPORT", "#", 1000, 200,"REDUCE_ANALYSIS_AREA",
"MINIMUM_ENCLOSING_RECTANGLE", "#")

Multi-DistanceSpatialClusterAnalysis Example (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the Multi-DistanceSpatialClusterAnalysis tool.


# Use Ripley's K-Function to analyze the spatial distribution of 911


# calls in Portland Oregon

# Import system modules


import arcpy

# Set the geoprocessor object property to overwrite existing outputs


arcpy.gp.overwriteOutput = True

# Local variables...
workspace = r"C:\Data"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Set Distance Band Parameters: Analyze clustering of 911 calls from
    # 1000 to 3000 feet by 200 foot increments
    numDistances = 11
    startDistance = 1000.0
    increment = 200.0

    # Process: Run K-Function...
    kFun = arcpy.MultiDistanceSpatialClustering_stats("911Calls.shp",
                        "kFunResult.dbf", numDistances,
                        "0_PERMUTATIONS_-_NO_CONFIDENCE_ENVELOPE",
                        "NO_REPORT", "#", startDistance, increment,
                        "REDUCE_ANALYSIS_AREA",
                        "MINIMUM_ENCLOSING_RECTANGLE", "#")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Random_number_generator

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis, so values entered for the Beginning Distance and
Distance Increment parameters should match those specified in the Output Coordinate System. All mathematical computations are
based on the Output Coordinate System spatial reference.

Related Topics
An overview of the Analyzing Patterns toolset
Modeling spatial relationships
What is a z-score? What is a p-value?
Using the Results window
Average Nearest Neighbor
Spatial Autocorrelation (Global Moran's I)
How Multi-Distance Spatial Cluster Analysis (Ripley's K-function) works

Copyright © 1995-2014 Esri. All rights reserved.

Spatial Autocorrelation (Global Moran's I) (Spatial Statistics)


Summary
Measures spatial autocorrelation based on feature locations and attribute values using the Global Moran's I statistic.
You can access the results of this tool (including the optional report file) from the Results window. If you disable background processing,
results will also be written to the Progress dialog box.
Learn more about how Spatial Autocorrelation (Global Moran's I) works

Illustration

Usage
 The Spatial Autocorrelation tool returns five values: the Moran's I Index, Expected Index, Variance, z-score, and p-value. These values are accessible from the Results window and are also passed as derived output values for potential use in models or scripts.
Optionally, this tool will create an HTML file with a graphical summary of results. Double-clicking on the HTML file in the Results
window will open the HTML file in the default Internet browser. Right-clicking on the Messages entry in the Results window and
selecting View will display the results in a Message dialog box. If you execute this tool in the foreground, output values will also be
displayed in the progress dialog box.

Note:  If this tool is part of a custom model tool, the HTML link will only appear in the Results
window if it is set as a model parameter prior to running the tool.
 For best display of HTML graphics, ensure your monitor is set to 96 DPI.

 Given a set of features and an associated attribute, this tool evaluates whether the pattern expressed is clustered, dispersed, or
random. When the z-score or p-value indicates statistical significance, a positive Moran's I index value indicates tendency toward
clustering while a negative Moran's I index value indicates tendency toward dispersion.
 This tool calculates a z-score and p-value to indicate whether or not you can reject the null hypothesis. In this case, the null
hypothesis states that feature values are randomly distributed across the study area.
 The z-score is based on the randomization null hypothesis computation. For more information on z-scores, see What is a z-score?
What is a p-value?
 The Input Field should contain a variety of values. The math for this statistic requires some variation in the variable being analyzed;
it cannot solve if all input values are 1, for example. If you want to use this tool to analyze the spatial pattern of incident data,
consider aggregating your incident data. Optimized Hot Spot Analysis may also be used to analyze the spatial pattern of incident
data.

Note: Incident data are points representing events (crime, traffic accidents) or objects (trees,
stores) where your focus is on presence or absence rather than some measured attribute
associated with each point.

 When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the
output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal
distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances,
at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points
on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to
connect those two points. Chordal distances are reported in meters. A small illustrative sketch of a chordal distance computation follows these usage notes.

Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal
distances are not a good estimate of geodesic distances beyond 30 degrees.

 When chordal distances are used in the analysis, the Distance Band or Threshold Distance parameter, if specified, should be given in
meters.
 Prior to ArcGIS 10.2.1, you would see a warning message if the parameters and environment settings you selected would result in
calculations being performed using Geographic Coordinates (degrees, minutes, seconds). This warning advised you to project your
data into a Projected Coordinate System so that distance calculations would be accurate. Beginning at 10.2.1, however, this tool
calculates chordal distances whenever Geographic Coordinate System calculations are required.

Caution: Because of this change, there is a small chance that you will need to modify models that
incorporate this tool if your models were created prior to ArcGIS 10.2.1 and if your models
include hard-coded Geographic Coordinate System parameter values. If, for example, a
distance parameter is set to something like 0.0025 degrees, you will need to convert that
fixed value from degrees to meters and resave your model.

Legacy: In ArcGIS 10, optional graphical output is no longer displayed automatically. Instead, an
HTML file summarizing results is created. To view results, double-click the HTML file in the
Results window. Custom scripts or model tools created prior to ArcGIS 10 that use this tool
may need to be rebuilt. To rebuild these custom tools, open them, remove the Display
Results Graphically parameter, and resave.

 This tool will optionally create an HTML file summarizing results. HTML files will not automatically appear in the Catalog window. If
you want HTML files to be displayed in Catalog, open the ArcCatalog application, select the Customize menu option, click ArcCatalog
Options, and select the File Types tab. Click on the New Type button and specify HTML for File Extension.


 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 Your choice for the Conceptualization of Spatial Relationships parameter should reflect inherent relationships among the features you
are analyzing. The more realistically you can model how features interact with each other in space, the more accurate your results
will be. Recommendations are outlined in Selecting a Conceptualization of Spatial Relationships: Best Practices. Here are some
additional tips:
 FIXED_DISTANCE_BAND
The default Distance Band or Threshold Distance will ensure each feature has at least one neighbor, and this is important. But
often, this default will not be the most appropriate distance to use for your analysis. Additional strategies for selecting an
appropriate scale (distance band) for your analysis are outlined in Selecting a fixed distance band value.
 INVERSE_DISTANCE or INVERSE_DISTANCE_SQUARED
When zero is entered for the Distance Band or Threshold Distance parameter, all features are considered neighbors of all other
features; when this parameter is left blank, the default distance will be applied.
Weights for distances less than 1 become unstable when they are inverted. Consequently, features separated by less than 1 unit of distance are given a weight of 1.
For the inverse distance options (INVERSE_DISTANCE, INVERSE_DISTANCE_SQUARED, or ZONE_OF_INDIFFERENCE), any two points
that are coincident will be given a weight of one to avoid zero division. This assures features are not excluded from analysis.
 Additional options for the Conceptualization of Spatial Relationships parameter, including space-time relationships, are available
using the Generate Spatial Weights Matrix or Generate Network Spatial Weights tools. To take advantage of these additional options,
use one of these tools to construct the spatial weights matrix file prior to analysis; select GET_SPATIAL_WEIGHTS_FROM_FILE for the
Conceptualization of Spatial Relationships parameter; and for the Weights Matrix File parameter, specify the path to the spatial
weights file you created.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.
 If you provide a Weights Matrix File with an .swm extension, this tool is expecting a spatial weights matrix file created using either
the Generate Spatial Weights Matrix or Generate Network Spatial Weights tools; otherwise, this tool is expecting an ASCII-formatted
spatial weights matrix file. In some cases, behavior is different depending on which type of spatial weights matrix file you use:
 ASCII-formatted spatial weights matrix files:
 Weights are used as is. Missing feature-to-feature relationships are treated as zeros.

 If the weights are row standardized, results will likely be incorrect for analyses on selection sets. If you need to run your
analysis on a selection set, convert the ASCII spatial weights file to an SWM file by reading the ASCII data into a table, then
using the CONVERT_TABLE option with the Generate Spatial Weights Matrix tool.
 SWM-formatted spatial weights matrix file:
 If the weights are row standardized, they will be restandardized for selection sets; otherwise, weights are used as is.

 Running your analysis with an ASCII-formatted spatial weights matrix file is memory intensive. For analyses on more than 5,000
features, consider converting your ASCII-formatted spatial weights matrix file into an SWM-formatted file. First put your ASCII
weights into a formatted table (using Excel, for example). Next, run the Generate Spatial Weights Matrix tool using CONVERT_TABLE
for the Conceptualization of Spatial Relationships parameter. The output will be an SWM-formatted spatial weights matrix file.

Note: It is possible to run out of memory when you run this tool. This generally occurs when you
select Conceptualization of Spatial Relationships and Distance Band or Threshold Distance
resulting in features having many, many neighbors. You generally do not want to define
spatial relationships so that features have thousands of neighbors. You want all features to
have at least one neighbor and almost all features to have at least eight neighbors.

 For polygon features, you will almost always want to choose ROW for the Standardization parameter. Row Standardization mitigates
bias when the number of neighbors each feature has is a function of the aggregation scheme or sampling process, rather than
reflecting the actual spatial distribution of the variable you are analyzing.
 The Modeling Spatial Relationships help topic provides additional information about this tool's parameters.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more information.
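
As a rough illustration of the chordal distance concept described in these usage notes (a sketch only, not the tool's internal implementation), the example below converts two geographic points to earth-centered Cartesian coordinates on the WGS84 spheroid and measures the straight-line distance through the earth between them. The coordinate values are placeholders.

import math

def chordal_distance_m(lat1, lon1, lat2, lon2):
    """Approximate chordal (straight-line through the earth) distance in
    meters between two geographic points on the WGS84 spheroid."""
    a = 6378137.0             # WGS84 semimajor axis (meters)
    f = 1 / 298.257223563     # WGS84 flattening
    e2 = f * (2 - f)          # first eccentricity squared

    def to_xyz(lat, lon):
        # Geodetic latitude/longitude (degrees) to earth-centered X, Y, Z (meters)
        lat, lon = math.radians(lat), math.radians(lon)
        n = a / math.sqrt(1 - e2 * math.sin(lat) ** 2)  # prime vertical radius
        x = n * math.cos(lat) * math.cos(lon)
        y = n * math.cos(lat) * math.sin(lon)
        z = n * (1 - e2) * math.sin(lat)
        return x, y, z

    x1, y1, z1 = to_xyz(lat1, lon1)
    x2, y2, z2 = to_xyz(lat2, lon2)
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)

# Example: two placeholder points roughly one degree of latitude apart (about 111 km)
print chordal_distance_m(45.0, -122.0, 46.0, -122.0)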

Syntax
SpatialAutocorrelation_stats (Input_Feature_Class, Input_Field, {Generate_Report}, Conceptualization_of_Spatial_Relationships,
Distance_Method, Standardization, {Distance_Band_or_Threshold_Distance}, {Weights_Matrix_File})
Parameter (Data Type): Explanation

Input_Feature_Class (Feature Layer): The feature class for which spatial autocorrelation will be calculated.

Input_Field (Field): The numeric field used in assessing spatial autocorrelation.

Generate_Report (Boolean, Optional): Specifies whether the tool will create a graphical summary of results as an HTML report file.

Conceptualization_of_Spatial_Relationships (String): Specifies how spatial relationships among features are defined.
 INVERSE_DISTANCE —Nearby neighboring features have a larger influence on the computations for a target feature than features that are far away.
 INVERSE_DISTANCE_SQUARED —Same as INVERSE_DISTANCE except that the slope is sharper, so influence drops off more quickly, and only a target feature's closest neighbors will exert substantial influence on computations for that feature.
 FIXED_DISTANCE_BAND —Each feature is analyzed within the context of neighboring features. Neighboring features inside the specified critical distance (Distance_Band_or_Threshold) receive a weight of one and exert influence on computations for the target feature. Neighboring features outside the critical distance receive a weight of zero and have no influence on a target feature's computations.
 ZONE_OF_INDIFFERENCE —Features within the specified critical distance (Distance_Band_or_Threshold) of a target feature receive a weight of one and influence computations for that feature. Once the critical distance is exceeded, weights (and the influence a neighboring feature has on target feature computations) diminish with distance.
 CONTIGUITY_EDGES_ONLY —Only neighboring polygon features that share a boundary or overlap will influence computations for the target polygon feature.
 CONTIGUITY_EDGES_CORNERS —Polygon features that share a boundary, share a node, or overlap will influence computations for the target polygon feature.
 GET_SPATIAL_WEIGHTS_FROM_FILE —Spatial relationships are defined by a specified spatial weights file. The path to the spatial weights file is specified by the Weights_Matrix_File parameter.

Distance_Method (String): Specifies how distances are calculated from each feature to neighboring features.
 EUCLIDEAN_DISTANCE —The straight-line distance between two points (as the crow flies)
 MANHATTAN_DISTANCE —The distance between two points measured along axes at right angles (city block); calculated by summing the (absolute) difference between the x- and y-coordinates

Standardization (String): Row standardization is recommended whenever the distribution of your features is potentially biased due to sampling design or an imposed aggregation scheme.
 NONE —No standardization of spatial weights is applied.
 ROW —Spatial weights are standardized; each weight is divided by its row sum (the sum of the weights of all neighboring features).

Distance_Band_or_Threshold_Distance (Double, Optional): Specifies a cutoff distance for Inverse Distance and Fixed Distance options. Features outside the specified cutoff for a target feature are ignored in analyses for that feature. However, for Zone of Indifference, the influence of features outside the given distance is reduced with distance, while those inside the distance threshold are equally considered. The distance value entered should match that of the output coordinate system.
For the Inverse Distance conceptualizations of spatial relationships, a value of 0 indicates that no threshold distance is applied; when this parameter is left blank, a default threshold value is computed and applied. This default value is the Euclidean distance that ensures every feature has at least one neighbor.
This parameter has no effect when Polygon Contiguity or Get Spatial Weights From File spatial conceptualizations are selected.

Weights_Matrix_File (File, Optional): The path to a file containing weights that define spatial, and potentially temporal, relationships among features.

Code Sample
SpatialAutocorrelation example 1 (Python window)
The following Python window script demonstrates how to use the SpatialAutocorrelation tool.


import arcpy
arcpy.env.workspace = r"c:\data"
arcpy.SpatialAutocorrelation_stats("olsResults.shp", "Residual", "NO_REPORT",
                                   "GET_SPATIAL_WEIGHTS_FROM_FILE", "EUCLIDEAN_DISTANCE",
                                   "NONE", "#", "euclidean6Neighs.swm")

SpatialAutocorrelation example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the SpatialAutocorrelation tool.

# Analyze the growth of regional per capita incomes in US


# Counties from 1969 -- 2002 using Ordinary Least Squares Regression

# Import system modules


import arcpy

# Set the geoprocessor object property to overwrite existing outputs


arcpy.gp.overwriteOutput = True

# Local variables...
workspace = r"C:\Data"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Growth as a function of {log of starting income, dummy for South
    # counties, interaction term for South counties, population density}
    # Process: Ordinary Least Squares...
    ols = arcpy.OrdinaryLeastSquares_stats("USCounties.shp", "MYID",
                        "olsResults.shp", "GROWTH",
                        "LOGPCR69;SOUTH;LPCR_SOUTH;PopDen69",
                        "olsCoefTab.dbf",
                        "olsDiagTab.dbf")

    # Create Spatial Weights Matrix (Can be based off input or output FC)
    # Process: Generate Spatial Weights Matrix...
    swm = arcpy.GenerateSpatialWeightsMatrix_stats("USCounties.shp", "MYID",
                        "euclidean6Neighs.swm",
                        "K_NEAREST_NEIGHBORS",
                        "#", "#", "#", 6)

    # Calculate Moran's I Index of Spatial Autocorrelation for
    # OLS Residuals using a SWM File.
    # Process: Spatial Autocorrelation (Morans I)...
    moransI = arcpy.SpatialAutocorrelation_stats("olsResults.shp", "Residual",
                        "NO_REPORT", "GET_SPATIAL_WEIGHTS_FROM_FILE",
                        "EUCLIDEAN_DISTANCE", "NONE", "#",
                        "euclidean6Neighs.swm")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Current_workspace, Scratch_workspace, Output_coordinate_system, Geographic_transformations

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference. When the Output Coordinate System is based on degrees, minutes, and seconds,
geodesic distances are estimated using chordal distances.

Related Topics
An overview of the Analyzing Patterns toolset
Modeling spatial relationships
What is a z-score? What is a p-value?
Using the Results window
Average Nearest Neighbor
Cluster and Outlier Analysis (Anselin Local Moran's I)
Hot Spot Analysis (Getis-Ord Gi*)
Spatial weights
How Spatial Autocorrelation (Global Moran's I) works

Copyright © 1995-2014 Esri. All rights reserved.

How Average Nearest Neighbor works



The Average Nearest Neighbor tool measures the distance between each feature centroid and its nearest neighbor's centroid location. It then
averages all these nearest neighbor distances. If the average distance is less than the average for a hypothetical random distribution, the
distribution of the features being analyzed is considered clustered. If the average distance is greater than a hypothetical random distribution,
the features are considered dispersed. The average nearest neighbor ratio is calculated as the observed average distance divided by the
expected average distance (with expected average distance being based on a hypothetical random distribution with the same number of
features covering the same total area).

Calculations
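
The equations below give the calculations in standard form (a sketch consistent with the description above), where n is the number of features, A is the area of the study region, and d_i is the distance between feature i and its nearest neighboring feature; equation numbers match the references in the Interpretation section.

\mathrm{ANN} = \frac{\bar{D}_O}{\bar{D}_E} \qquad (1)

\bar{D}_O = \frac{\sum_{i=1}^{n} d_i}{n} \qquad (2)

\bar{D}_E = \frac{0.5}{\sqrt{n / A}} \qquad (3)

z = \frac{\bar{D}_O - \bar{D}_E}{SE} \qquad (4)

SE = \frac{0.26136}{\sqrt{n^{2} / A}} \qquad (5)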

Interpretation
If the index (average nearest neighbor ratio) is less than 1, the pattern exhibits clustering. If the index is greater than 1, the trend is
toward dispersion.
The equations used to calculate the average nearest neighbor distance index (1) and z-score (4) are based on the assumption that the
points being measured are free to locate anywhere within the study area (for example, there are no barriers, and all cases or features are
located independently of one another). The p-value is a numerical approximation of the area under the curve for a known distribution,
limited by the test statistic. See What is a z-score? What is a p-value? for more information about these statistics.

Caution: The z-score and p-value for this statistic are sensitive to changes in the study area or changes to
the Area parameter. For this reason, only compare z-score and p-value results from this statistic
when the study area is fixed.

Output
The Average Nearest Neighbor tool returns five values: observed mean distance, expected mean distance, nearest neighbor index, z-score,
and p-value. These values are accessible from the Results window and are also passed as derived output values for potential use in models
or scripts. Optionally, this tool will create an HTML file with a graphic summary of results. Double-clicking the HTML file in the Results
window will open the HTML file in the default Internet browser. Right-clicking the Messages entry in the Results window and selecting View
will display the results in a Message dialog box.


Possible applications
 Evaluate competition or territory: Quantify and compare the spatial distribution of a variety of plant or animal species within a fixed
study area; compare average nearest neighbor distances for different types of businesses within a city.
 Monitor changes over time: Evaluate changes in spatial clustering for a single type of business within a fixed study area over time.
 Compare an observed distribution to a control distribution: In a timber analysis, you may want to compare the pattern of harvested
areas to the pattern of harvestable areas to determine if cut areas are more clustered than you would expect, given the distribution of
harvestable timber overall.

Additional resources
The following books have further information about this tool:
Ebdon, David. Statistics in Geography. Blackwell, 1985.
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

How High/Low Clustering (Getis-Ord General G) works


The High/Low Clustering tool measures the concentration of high or low values for a given study area.

Calculations

View additional General G statistic computations.
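
In standard form (a sketch consistent with the discussion below), with x_i and x_j the attribute values for features i and j and w_{i,j} the spatial weight between them, the General G statistic is:

G = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{i,j}\, x_i x_j}{\sum_{i=1}^{n} \sum_{j=1}^{n} x_i x_j}, \quad \forall\, j \neq i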


Notice that the only difference between the numerator and the denominator is the weighting (wij). High/Low Clustering will only work with
positive values. Consequently, if your weights are binary (0/1) or are always less than 1, the range for General G will be between 0 and 1.
A binary weighting scheme is recommended for this statistic. Select Fixed Distance Band, Polygon Contiguity, K Nearest Neighbors, or
Delaunay Triangulation for the Conceptualization of Spatial Relationships parameter. Select None for the Standardization parameter.


Interpretation
The High/Low Clustering (Getis-Ord General G) tool is an inferential statistic, which means that the results of the analysis are interpreted
within the context of the null hypothesis. The null hypothesis for the High/Low Clustering (General G) statistic states that there is no spatial
clustering of feature values. When the p-value returned by this tool is small and statistically significant, the null hypothesis can be rejected
(see What is a z-score? What is a p-value?). If the null hypothesis is rejected, then the sign of the z-score becomes important. If the z-
score value is positive, the observed General G index is larger than the expected General G index, indicating high values for the attribute
are clustered in the study area. If the z-score value is negative, the observed General G index is smaller than the expected index, indicating
that low values are clustered in the study area.
The High/Low Clustering (Getis-Ord General G) tool is most appropriate when you have a fairly even distribution of values and are looking
for unexpected spatial spikes of high values. Unfortunately, when both the high and low values cluster, they tend to cancel each other out.
If you are interested in measuring spatial clustering when both the high values and the low values cluster, use the Spatial Autocorrelation
tool.
The null hypothesis for both the High/Low Clustering (Getis-Ord General G) and the Spatial Autocorrelation (Global Moran's I) tool is
complete spatial randomness (CSR); values are randomly distributed among the features in the dataset, reflecting random spatial processes
at work. However, the interpretation of z-scores for the High/Low Clustering tool is very different from the interpretation of z-scores for the
Spatial Autocorrelation (Global Moran's I) tool:

Result: The p-value is not statistically significant.
  High/Low Clustering and Spatial Autocorrelation: You cannot reject the null hypothesis. It is quite possible that the spatial distribution of feature attribute values is the result of random spatial processes. Said another way, the observed spatial pattern of values could well be one of many, many possible versions of complete spatial randomness.

Result: The p-value is statistically significant, and the z-score is positive.
  High/Low Clustering: You may reject the null hypothesis. The spatial distribution of high values in the dataset is more spatially clustered than would be expected if underlying spatial processes were truly random.
  Spatial Autocorrelation: You may reject the null hypothesis. The spatial distribution of high values and/or low values in the dataset is more spatially clustered than would be expected if underlying spatial processes were truly random.

Result: The p-value is statistically significant, and the z-score is negative.
  High/Low Clustering: You may reject the null hypothesis. The spatial distribution of low values in the dataset is more spatially clustered than would be expected if underlying spatial processes were truly random.
  Spatial Autocorrelation: You may reject the null hypothesis. The spatial distribution of high values and low values in the dataset is more spatially dispersed than would be expected if underlying spatial processes were truly random. A dispersed spatial pattern often reflects some type of competitive process: a feature with a high value repels other features with high values; similarly, a feature with a low value repels other features with low values.

Output
The High/Low Clustering tool returns five values: the observed General G, expected General G, variance, z-score, and p-value. These values
are accessible from the Results window and are passed as derived output values for potential use in models or scripts. Optionally, this tool
will create an HTML file with a graphic summary of results. Double-clicking the HTML file in the Results window will open the HTML file in the
default Internet browser.
In addition, right-clicking the Messages entry in the Results window and selecting View will display the results in a Message dialog box.

Frequently asked questions


Q: Results from the Hot Spot Analysis (Getis-Ord Gi*) tool indicate statistically significant hot spots. Why aren't results from the High/Low
Clustering (Getis-Ord General G) tool statistically significant too?
A: Global statistics like the High/Low Clustering (Getis-Ord General G) tool assess the overall pattern and trend of your data. They are most
effective when the spatial pattern is consistent across the study area. Local statistics tools (like Hot Spot Analysis) assess each feature
within the context of neighboring features and compare the local situation to the global situation. Consider an example. When you compute
a mean or average for a set of values, you are also computing a global statistic. If all the values are near 20, the mean will also be near 20,
and that result will be a very good representation/summary of the dataset as a whole. But if half of the values are near 1 and the other half
of the values are near 100, the mean will be near 50. There might not be any data values anywhere near 50, so the mean value is not a
good representation/summary of the dataset as a whole. If you create a histogram of the data values, however, you will see the bimodal
distribution. Similarly, global spatial statistics, including the High/Low Clustering tool, are most effective when the spatial processes being
measured are consistent across the study area. Results will then be a good representation/summary of the overall spatial pattern. For more
information, see Getis and Ord (1992), cited below, and the analysis of SIDS they present.

Q: Why are the results from the High/Low Clustering (Getis-Ord General G) tool different than the results from the Spatial Autocorrelation
(Global Moran's I) tool?


A: See the table above. These tools measure different spatial patterns.

Q: Can you compare the z-scores or p-values from this tool to results from an analysis of a different study area?
A: Results really are not comparable unless the study area and parameters used for analysis are fixed (the same for all the analyses you
want to compare). If the study area, however, comprises a fixed set of polygons, and the analysis parameters are fixed, you can compare
z-scores for a particular attribute over time. Suppose, for example, you want to analyze trends in clustering of over-the-counter (OTC)
medication purchases at the tract level for a particular county. You could run High/Low Clustering for each time period, then create a line
graph of the results. If you found that the z-scores were statistically significant and increasing, you could conclude that the intensity of
spatial clustering for high OTC purchases was increasing.

Q: Does feature size impact analysis?


A: The size of your features can affect your results. If your large polygons, for example, tend to have low values and your smaller polygons tend to have high values, the observed General G index may be higher than the expected General G index even when high and low values are equally concentrated, because there are more pairs of small polygons within the specified distance.

Potential applications
 Look for unexpected spikes in the number of emergency room visits, which might indicate an outbreak of a local or regional health problem.
 Compare the spatial pattern of different types of retail within a city to see which types cluster near their competition to take advantage of comparison shopping (automobile dealerships, for example) and which types repel competition (fitness centers/gyms, for example).
 Summarize the level at which spatial phenomena cluster to examine changes at different times or in different locations. For example, it is known that cities and their populations cluster. Using High/Low Clustering analysis, you can compare the level of population clustering within a single city over time (analysis of urban growth and density).

Additional resources
Getis, Arthur, and J. K. Ord. "The Analysis of Spatial Association by Use of Distance Statistics." Geographical Analysis 24, no. 3. 1992.
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

How Incremental Spatial Autocorrelation works


With much of the spatial data analysis you do, the scale of your analysis will be important. The default Conceptualization of Spatial
Relationships for the Hot Spot Analysis tool, for example, is FIXED_DISTANCE_BAND and requires you to specify a distance value. For many
density tools you will be asked to provide a Radius. The distance you select should relate to the scale of the question you are trying to answer
or to the scale of remediation you are considering. Suppose, for example, you want to understand childhood obesity. What is your scale of
analysis? Is it at the individual household or neighborhood level? If so, the distance you use to define your scale of analysis will be small,
encompassing the homes within a block or two of each other. Alternatively, what will be the scale of remediation? Perhaps your question
involves where to increase after-school fitness programs as a way to potentially reduce childhood obesity. In that case, your distance will likely
be reflective of school zones. Sometimes it’s fairly easy to determine an appropriate scale of analysis; if you are analyzing commuting patterns
and know that the average journey to work is 12 miles, for example, then 12 miles would be an appropriate distance to use for your analysis.
Other times it is more difficult to justify any particular analysis distance. This is when the Incremental Spatial Autocorrelation tool is most
helpful.
Whenever you see spatial clustering in the landscape, you are seeing evidence of underlying spatial processes at work. Knowing something
about the spatial scale at which those underlying processes operate can help you select an appropriate analysis distance. The Incremental
Spatial Autocorrelation tool runs the Spatial Autocorrelation (Global Moran’s I) tool for a series of increasing distances, measuring the intensity
of spatial clustering for each distance. The intensity of clustering is determined by the z-score returned. Typically, as the distance increases, so
does the z-score, indicating intensification of clustering. At some particular distance, however, the z-score generally peaks. Sometimes you will
see multiple peaks.


Peaks reflect distances where the spatial processes promoting clustering are most pronounced. The color of each point on the graph
corresponds to the statistical significance of the z-score values.

One strategy for identifying an appropriate scale of analysis is to select the distance associated with the statistically significant peak that best
reflects the scale of your question. Often this is the first statistically significant peak.

How do I select the Beginning Distance and Distance Increment values?


All distance measurements are based on feature centroids and the default Beginning Distance is the smallest distance that will ensure every
feature has at least one neighboring feature. This is generally a good choice unless your dataset includes locational outliers. If you
determine that you do have locational outliers, select all but the outlier features and run Incremental Spatial Autocorrelation on just the
selected features. If you find a peak distance for the selection set, use that distance to create a spatial weights matrix file based on all of
your features (even the outliers). When you run the Generate Spatial Weights Matrix tool to create the spatial weights matrix file, set the
Number of Neighbors parameter to a value that ensures all features will have at least that many neighboring features.

The default Increment Distance is the average distance to each feature's nearest neighboring feature. If you've determined an appropriate
starting distance using the strategies above and still don't see a peak distance, you may want to experiment with smaller or larger
increment distances.

What if the graph never peaks?


In some cases, you will use the Incremental Spatial Autocorrelation tool and get a graph with a z-score that just continues to rise with
increasing distances; there is no peak. This most often happens in cases where data has been aggregated and the scale of the processes
impacting your Input Field variable is smaller than the aggregation scheme. You can try making your Distance Increment smaller to see if
this captures more subtle peaks. Sometimes, however, you won't get a peak because there are multiple spatial processes, each operating
at a different distance, in your study area. This is often the case with large point datasets that are noisy (no clear spatial pattern to the
point data values you're analyzing). In this case, you will need to justify your scale of analysis using some other criteria.

Interpreting results
When you run the Incremental Spatial Autocorrelation tool in the foreground, the z-score results for each distance are written to the
Progress window. This output is also available from the Results window. If you right-click on the Messages entry in the Results window and
select View, the tool results are displayed in a Message dialog box. When you specify a path for the optional Output Table parameter, a
table is created that includes fields for Distance, MoransI, ExpectedI, Variance, z_score, and p_value. By examining the z-score
values in the Progress window, Message dialog box, or Output Table, you can determine if there are any peak distances. More typically,
however, you would identify peak distances by looking at the graphic in the optional Output Report file. The report has three pages. An
example of the first page of the report is shown below. Notice that this graph has three peak z-scores associated with distances of 5000,
9000, and 13000 feet. A halo will be drawn to highlight both the first peak distance and the maximum peak distance, but all peaks
represent distances where the spatial processes promoting clustering are most pronounced. You can select the peak that best reflects the
scale of your analytical question. In some cases, there will only be one halo because the first and the maximum peaks are found at the
same distance. If none of the z-score peaks are statistically significant, then none of the peaks will have the light blue halo. Notice that the
color of the plotted z-score corresponds to the legend showing the critical values for statistical significance.


On page two of the report, the distances and z-score values are presented in table format. The last page of the report documents the
parameter settings used when the tool was run. To get a report file, provide a path for the Output Report parameter.
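
As a hedged illustration of working with the Output Table described above, the short script below reads that table and reports local peak z-score distances. The table path is hypothetical; the field names (Distance, z_score, p_value) follow the Output Table description above.

import arcpy

# Hypothetical Output Table created by the Incremental Spatial Autocorrelation tool
table = r"C:\Data\Analysis.gdb\ISA_Results"

# Read the rows; these field names follow the Output Table description above
rows = [row for row in arcpy.da.SearchCursor(table, ["Distance", "z_score", "p_value"])]
rows.sort(key=lambda r: r[0])  # order rows by distance

# A peak is a z-score that is larger than the z-scores at the neighboring distances
for prev, curr, nxt in zip(rows, rows[1:], rows[2:]):
    if curr[1] > prev[1] and curr[1] > nxt[1]:
        tag = "statistically significant" if curr[2] <= 0.05 else "not significant"
        print("Peak z-score {0:.2f} at distance {1:.0f} ({2})".format(curr[1], curr[0], tag))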

Additional resources
 Videos outlining some best practices for performing a hot spot analysis:
 Hot Spot Analysis Part 1

 Hot Spot Analysis Part 2


 Hot Spot Analysis Part 3
 Tutorial and video walking through an analysis of Dengue Fever data that uses the Incremental Spatial Autocorrelation tool:
 Spatial Pattern Analysis Tutorial

 Spatial Pattern Analysis of Dengue Fever Video


 See Selecting a Fixed Distance Band in Modeling Spatial Relationships.
 How Hot Spot Analysis Works includes a discussion of finding an appropriate scale of analysis.
 For an up-to-date list of all of the spatial statistics resources available, go to www.esriurl.com/spatialstats.

Copyright © 1995-2014 Esri. All rights reserved.

How Multi-Distance Spatial Cluster Analysis (Ripley's K-function) works


The Multi-Distance Spatial Cluster Analysis tool, based on Ripley's K-function, is another way to analyze the spatial pattern of incident point
data. A distinguishing feature of this method from others in this toolset (Spatial Autocorrelation and Hot Spot Analysis) is that it summarizes
spatial dependence (feature clustering or feature dispersion) over a range of distances. In many feature pattern analysis studies, the selection
of an appropriate scale of analysis is required. For example, a Distance Band or Threshold Distance is often needed for the analysis. When
exploring spatial patterns at multiple distances and spatial scales, patterns change, often reflecting the dominance of particular spatial
processes at work. Ripley's K-function illustrates how the spatial clustering or dispersion of feature centroids changes when the neighborhood
size changes.
When using this tool, specify the number of distances to evaluate and, optionally, a starting distance and/or distance increment. With this
information, the tool computes the average number of neighboring features associated with each feature; neighboring features are those
closer than the distance being evaluated. As the evaluation distance increases, each feature will typically have more neighbors. If the average
number of neighbors for a particular evaluation distance is higher/larger than the average concentration of features throughout the study area,
the distribution is considered clustered at that distance.
Use this tool when you are interested in examining how the clustering/dispersion of your features changes at different distances (different
scales of analysis).

Calculations
A number of variations of Ripley's original K-function have been suggested. Here, a common transformation of the K-function, often
referred to as L(d), is implemented:
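
The formula graphic is not reproduced in this extract. The commonly cited form of this transformation, given here for reference and as an assumption about the exact variant the tool uses, is

    L(d) = \sqrt{ \frac{A \sum_{i=1}^{n} \sum_{j=1,\, j \neq i}^{n} k_{i,j}}{\pi \, n (n - 1)} }

where A is the area of the study region, n is the number of features, and k_{i,j} is the weight for the pair of features i and j, equal to 1 when the pair is separated by less than the evaluation distance d (before any boundary correction) and 0 otherwise.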

With the L(d) transformation, the Expected K value is equal to the distance.


The default Beginning Distance and Distance Increment values are computed as follows:
 We always know the Number of Distance Bands (the default value is 10). We will use this Iterations value to compute a default Distance
Increment if one isn't provided.

 We initially compute a Maximum Distance value as 25 percent of the maximum extent length of a minimum enclosing rectangle around
the input features. If the Boundary Correction Method is REDUCE_ANALYSIS_AREA, then the Maximum Distance is set to the larger of
either 25 percent of the maximum extent length or 50 percent of the minimum extent length of the minimum enclosing rectangle.
 If a Beginning Distance is provided, the Distance Increment is (Maximum Distance - Beginning Distance) / Iterations.
 If no Beginning Distance is provided, the Distance Increment is Maximum Distance / Iterations, and the Beginning Distance is set to the
Distance Increment value (these defaults are sketched in the example below).
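
A minimal sketch of these default calculations in plain Python, using a hypothetical study-area extent:

# Hypothetical extent of the minimum enclosing rectangle around the input features (map units)
extent_width, extent_height = 12000.0, 7000.0
num_distance_bands = 10                     # the Iterations value; the default is 10
boundary_correction = "REDUCE_ANALYSIS_AREA"
beginning_distance = None                   # None means no Beginning Distance was provided

# Maximum Distance: 25 percent of the maximum extent length; for REDUCE_ANALYSIS_AREA,
# the larger of that value and 50 percent of the minimum extent length
max_distance = 0.25 * max(extent_width, extent_height)
if boundary_correction == "REDUCE_ANALYSIS_AREA":
    max_distance = max(max_distance, 0.5 * min(extent_width, extent_height))

if beginning_distance is not None:
    distance_increment = (max_distance - beginning_distance) / num_distance_bands
else:
    distance_increment = max_distance / num_distance_bands
    beginning_distance = distance_increment

print("Beginning Distance: {0}, Distance Increment: {1}".format(beginning_distance, distance_increment))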

Interpreting unweighted K-function results


When the observed K value is larger than the expected K value for a particular distance, the distribution is more clustered than a random
distribution at that distance (scale of analysis). When the observed K value is smaller than the expected K value, the distribution is more
dispersed than a random distribution at that distance. When the observed K value is larger than the HiConfEnv value, spatial clustering for
that distance is statistically significant. When the observed K value is smaller than the LwConfEnv value, spatial dispersion for that distance
is statistically significant.
When no Weight Field is specified, the confidence envelope is constructed by distributing points randomly in the study area and
calculating k for that distribution. Each random distribution of the points is called a "permutation". If 99_PERMUTATIONS is selected, for
example, the tool will randomly distribute the set of points 99 times for each iteration. After distributing the points 99 times, the tool selects,
for each distance, the K values that deviated above and below the Expected K value by the greatest amount; these values become the
confidence envelope. The confidence envelopes tend to follow (have the same shape and location as) the blue Expected K line for unweighted
K.
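
A simplified sketch of that permutation logic is shown below (plain Python with NumPy). It assumes a unit-square study area, uses the count-based L(d) form, and ignores the tool's boundary correction options, so it is illustrative rather than a reproduction of the tool's calculations.

import numpy as np

np.random.seed(0)

# Hypothetical observed point pattern in a unit-square study area
observed = np.random.rand(200, 2)
area = 1.0
distances = np.linspace(0.02, 0.2, 10)

def l_function(points, d, study_area):
    # Count-based L(d): pairs closer than d, scaled so the expected value equals d under CSR
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    pair_count = (dist < d).sum() - n          # drop the self-pairs on the diagonal
    return np.sqrt(study_area * pair_count / (np.pi * n * (n - 1)))

# 99 permutations: redistribute the same number of points at random and recompute L(d)
perms = np.array([[l_function(np.random.rand(*observed.shape), d, area) for d in distances]
                  for _ in range(99)])

lw_conf_env = perms.min(axis=0)   # the lowest L(d) across permutations for each distance
hi_conf_env = perms.max(axis=0)   # the highest L(d) across permutations for each distance
observed_l = np.array([l_function(observed, d, area) for d in distances])
print(np.column_stack((distances, observed_l, lw_conf_env, hi_conf_env)))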

Interpreting weighted K-function results


The K-function always evaluates feature spatial distribution in relation to Complete Spatial Randomness (CSR), even when a weight field is
provided. You can think of the weight as representing the number of coincident features at each feature location. For example, a feature
with a weight of 3 may be interpreted as 3 coincident features. There is one difference, however: a feature cannot be its own neighbor.
Consequently, you would get a different result for a dataset where there are 3 individual coincident points with a weight of 1 (all would be
counted as neighbors of each other) than you would for a dataset with a single point with a weight of 3 (a feature is not counted as a
neighbor of itself). Results from the weighted K-function will always be more clustered than results without a weight field. It is useful to run
the K-function on the points without a weight to get a baseline indicating how much clustering is associated with feature locations alone.
You can then compare the baseline to weighted results to get a feel for how much additional clustering or dispersion is added when the
weight is considered. The weighted K-function shows the clustering (dispersion) over and above (or under and below) what would be
obtained from the unweighted pattern alone. In fact, instead of CSR, you can use results from the unweighted K-function to represent the expected
pattern (with its own confidence envelope). There are two possible null hypotheses in this case:
1. The pattern of weighted features is not significantly more clustered (dispersed) than the underlying pattern of those features. You
reject the null hypothesis if the observed weighted results fall outside the unweighted results confidence envelope.
2. The pattern of weighted points is more clustered (dispersed) than chance would have it. You reject the null hypothesis if the
observed unweighted results fall within the confidence envelope for the weighted K-function results.

When a Weight Field is specified, only the weight values are randomly redistributed to compute confidence envelopes; the point locations
remain fixed. In essence, when a Weight Field is specified, locations remain fixed and the tool evaluates the clustering of feature values in
space. Because results are strongly structured by the fixed locations of the features, for weighted K analyses the confidence envelope tends
to follow/mirror the red Observed K line.

Additional resources
Bailey, T. C., and A. C. Gatrell. Interactive Spatial Data Analysis. Longman Scientific & Technical, Harlow, U.K. 395 pp. 1995.
Boots, B., and A. Getis. Point Pattern Analysis. Sage University Paper Series on Quantitative Applications in the Social Sciences, series no.
07–001. Sage Publications. 1988.
Getis, A. Interactive Modeling Using Second-Order Analysis. Environment and Planning A, 16: 173–183. 1984.
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

How Spatial Autocorrelation (Global Moran's I) works


The Spatial Autocorrelation (Global Moran's I) tool measures spatial autocorrelation based on both feature locations and feature values
simultaneously. Given a set of features and an associated attribute, it evaluates whether the pattern expressed is clustered, dispersed, or
random. The tool calculates the Moran's I Index value and both a z-score and p-value to evaluate the significance of that Index. P-values are
numerical approximations of the area under the curve for a known distribution, limited by the test statistic.

Calculations

View additional mathematics for Global Moran's I
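
The mathematics graphic is not reproduced in this extract. For reference, the standard form of the statistic, consistent with the description that follows although the original figure's notation may differ slightly, is

    I = \frac{n}{S_0} \cdot \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{i,j} \, z_i z_j}{\sum_{i=1}^{n} z_i^{2}},
    \qquad z_i = x_i - \bar{X}, \qquad S_0 = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{i,j}

where x_i is the attribute value for feature i, \bar{X} is the mean of the attribute, and w_{i,j} is the spatial weight between features i and j. The associated z-score is z_I = (I - \mathrm{E}[I]) / \sqrt{\mathrm{V}[I]}, with \mathrm{E}[I] = -1/(n-1).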


The math behind the Global Moran's I statistic is shown above. The tool computes the mean and variance for the attribute being evaluated.
Then, for each feature value, it subtracts the mean, creating a deviation from the mean. Deviation values for all neighboring features
(features within the specified distance band, for example) are multiplied together to create a cross-product. Notice that the numerator for
the Global Moran's I statistic includes these summed cross-products. Suppose features A and B are neighbors, and the mean for all feature
values is 10. Notice the range of possible cross-product results:

Feature values     Deviations     Cross-product

A = 50, B = 40     40, 30         1200
A = 8,  B = 6      -2, -4         8
A = 20, B = 2      10, -8         -80

When values for neighboring features are either both larger than the mean or both smaller than the mean, the cross-product will be
positive. When one value is smaller than the mean and the other is larger than the mean, the cross-product will be negative. In all cases,
the larger the deviation from the mean, the larger the cross-product result. If the values in the dataset tend to cluster spatially (high values
cluster near other high values; low values cluster near other low values), the Moran's Index will be positive. When high values repel other
high values, and tend to be near low values, the Index will be negative. If positive cross-product values balance negative cross-product
values, the Index will be near zero. The numerator is normalized by the variance so that Index values fall between -1.0 and +1.0 (see the
FAQ section below for exceptions).
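The short sketch below reproduces the cross-product arithmetic from the table above (plain Python):

mean = 10.0

# (A, B) pairs of neighboring feature values from the table above
pairs = [(50, 40), (8, 6), (20, 2)]

for a, b in pairs:
    dev_a, dev_b = a - mean, b - mean   # deviations from the mean
    cross = dev_a * dev_b               # cross-product of the two deviations
    print("A={0} B={1} deviations ({2}, {3}) cross-product {4}".format(a, b, dev_a, dev_b, cross))
    # yields 1200, 8, and -80 for these pairs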
After the Spatial Autocorrelation (Global Moran's I) tool computes the Index value, it computes the Expected Index value. The Expected and
Observed Index values are then compared. Given the number of features in the dataset and the variance for the data values overall, the
tool computes a z-score and p-value indicating whether this difference is statistically significant or not. Index values cannot be interpreted
directly; they can only be interpreted within the context of the null hypothesis.

Interpretation
The Spatial Autocorrelation (Global Moran's I) tool is an inferential statistic, which means that the results of the analysis are always
interpreted within the context of its null hypothesis. For the Global Moran's I statistic, the null hypothesis states that the attribute being
analyzed is randomly distributed among the features in your study area; said another way, the spatial processes promoting the observed
pattern of values are random chance. Imagine that you could pick up the values for the attribute you are analyzing and throw them down
onto your features, letting each value fall where it may. This process (picking up and throwing down the values) is an example of a random
chance spatial process.
When the p-value returned by this tool is statistically significant, you can reject the null hypothesis. The table below summarizes
interpretation of results:

The p-value is not statistically significant: You cannot reject the null hypothesis. It is quite possible that the spatial distribution of
feature values is the result of random spatial processes. The observed spatial pattern of feature values could very well be one of
many, many possible versions of complete spatial randomness (CSR).

The p-value is statistically significant, and the z-score is positive: You may reject the null hypothesis. The spatial distribution of high
values and/or low values in the dataset is more spatially clustered than would be expected if underlying spatial processes were random.

The p-value is statistically significant, and the z-score is negative: You may reject the null hypothesis. The spatial distribution of high
values and low values in the dataset is more spatially dispersed than would be expected if underlying spatial processes were random. A
dispersed spatial pattern often reflects some type of competitive process: a feature with a high value repels other features with high
values; similarly, a feature with a low value repels other features with low values.


Note: The null hypothesis for both the High/Low Clustering (General G) tool and the Spatial
Autocorrelation (Global Moran's I) tool is Complete Spatial Randomness. The interpretation of z-
scores for the High/Low Clustering (General G) tool is different, however.

Output
The Spatial Autocorrelation (Global Moran's I) tool returns five values: the Moran's Index, Expected Index, Variance, z-score, and p-value.
These values are accessible from the Results window and are passed as derived output values for potential use in models or scripts.
Optionally, this tool will create an HTML file with a graphical summary of results. Double-clicking the HTML file in the Results window will
open the HTML file in the default Internet browser.

Tool output is accessible from the Results window.

Right-clicking the Messages entry in the Results window and selecting View will also display the results in a Message dialog box.

Best practice guidelines


 Does the Input Feature Class contain at least 30 features? Results aren't reliable with fewer than 30 features.
 Is the Conceptualization of Spatial Relationships you selected appropriate? See Selecting a Conceptualization of Spatial Relationships.
 Is the Distance Band or Threshold Distance appropriate? See Selecting a Fixed Distance.
 All features should have at least one neighbor.

 No feature should have all other features as a neighbor.


 Especially if the values for the Input Field are skewed, you want features to have about eight neighbors each.
 Should you Row Standardize? For polygon features, you will almost always want to row standardize. See Standardization.

FAQs
Q: Results from the Hot Spot Analysis (Getis-Ord Gi*) tool indicate statistically significant hot spots. Why aren't results from the Spatial
Autocorrelation (Global Moran's I) tool statistically significant too?
A: Global statistics like the Spatial Autocorrelation (Global Moran's I) tool assess the overall pattern and trend of your data. They are most
effective when the spatial pattern is consistent across the study area. Local statistics (like the Hot Spot Analysis (Getis-Ord Gi*) tool) assess
each feature within the context of neighboring features and compare the local situation to the global situation. Consider an example. When
you compute a mean or average for a set of values, you are also computing a global statistic. If all the values are near 20, the mean will
also be near 20, and that result will be a very good representation/summary of the dataset as a whole. But if half of the values are near 1
and the other half of the values are near 100, the mean will be near 50. There might not be any data values anywhere near 50, so the
mean value is not a good representation/summary of the dataset as a whole. If you create a histogram of the data values, you will see the
bimodal distribution. Similarly, global spatial statistics, including the Spatial Autocorrelation (Global Moran's I) tool, are most effective when
the spatial processes being measured are consistent across the study area. Results will then be a good representation/summary of the
overall spatial pattern. For more information, see Getis and Ord (1992) cited below, and the analysis of SIDS they present.

Q: Why are the results from High/Low Clustering (Getis-Ord General G) different from the results from Spatial Autocorrelation (Global
Moran's I)?
A: These tools measure different spatial patterns. See the comparison table in the High/Low Clustering (Getis-Ord General G) documentation for more information.

Q: Can you compare the z-scores or p-values from this tool to results from analyses for different study areas?
A: Results are not comparable across different study areas. When the study area is fixed, however (for example, all analyses are for
Counties in California), the Input Field is comparable (for example, all analyses involve some type of population count), and the tool
parameters are the same (Fixed Distance with a Distance Band or Threshold Distance of 5,000 meters and Row Standardization, for
example), you may compare statistically significant z-scores to get a sense of the intensity of spatial clustering or spatial dispersion or to
better understand trends over time. You can also run the analysis for a series of increasing Distance Band or Threshold Distance values to
see the distance/scale where the processes promoting spatial clustering are most pronounced.
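
A minimal sketch of that distance-series strategy is shown below. The workspace, feature class, and field are hypothetical, the call follows this tool's parameter pattern, and the derived-output position used for the z-score is an assumption you should confirm in the Results window.

import arcpy

arcpy.env.workspace = r"C:\Data\Analysis.gdb"   # hypothetical workspace

# Run Global Moran's I for a series of increasing distance bands (hypothetical inputs)
for distance in range(5000, 30001, 5000):
    result = arcpy.SpatialAutocorrelation_stats("Counties", "POP_COUNT", "NO_REPORT",
                                                "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE",
                                                "ROW", distance)
    # The Moran's Index, Expected Index, Variance, z-score, and p-value are passed as
    # derived outputs; position 3 for the z-score is an assumption (check your Results window)
    zscore = float(result.getOutput(3))
    print("Distance {0}: z-score {1:.2f}".format(distance, zscore))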

Q: Why am I getting a Moran's Index greater than 1.0 or less than -1.0?
A: In general, the Global Moran's Index is bounded by -1.0 and 1.0. This is always the case when your weights are row standardized.


When you don't row standardize the weights, there may be instances where the Index value falls outside the -1.0 to 1.0 range, and this
indicates a problem with your parameter settings. The most common problems are the following:
 The Input Field is strongly skewed (create a histogram of the data values to see this; a quick numeric check is sketched after this list), and the Conceptualization of Spatial
Relationships or Distance Band is such that some features have very few neighbors. The Global Moran's I statistic is asymptotically
normal, which means for skewed data, you will want each feature to have at least eight neighbors. The default value computed for the
Distance Band or Threshold Distance parameter ensures that every feature has at least one neighbor, but this may not be sufficient,
especially when values in the Input Field are strongly skewed.
 An Inverse Distance Conceptualization of Spatial Relationships is used, and the inverted distances are very small.
 Row standardization is not selected, but should be. Whenever your data has been aggregated, unless the aggregation scheme relates
directly to the field you are analyzing, you should select row standardization.
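
A quick, hedged way to check for skew in the Input Field before interpreting an out-of-range Index is sketched below; the feature class and field names are hypothetical.

import arcpy
import numpy

# Hypothetical feature class and analysis field
values = numpy.array([row[0] for row in
                      arcpy.da.SearchCursor(r"C:\Data\Analysis.gdb\Tracts", ["OTC_SALES"])
                      if row[0] is not None])

mean, median, std = values.mean(), numpy.median(values), values.std()
skew = 3.0 * (mean - median) / std   # Pearson's second skewness coefficient
print("mean {0:.2f}, median {1:.2f}, skewness {2:.2f}".format(mean, median, skew))
# A skewness value far from 0 suggests aiming for about eight neighbors per feature
# and considering row standardization, as described above.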

Potential applications
 Help identify an appropriate neighborhood distance for a variety of spatial analysis methods by finding the distance where spatial
autocorrelation is strongest.
 Measure broad trends in ethnic or racial segregation over time—is segregation increasing or decreasing?
 Summarize the diffusion of an idea, disease, or trend over space and time—is the idea, disease, or trend remaining isolated and
concentrated, or spreading and becoming more diffuse?

Additional resources
The following books and journal articles have further information about this tool:
Getis, Arthur, and J. K. Ord. "The Analysis of Spatial Association by Use of Distance Statistics." Geographical Analysis 24, no. 3. 1992.
Goodchild, Michael F. Spatial Autocorrelation. Catmog 47, Geo Books. 1986.
Griffith, Daniel. Spatial Autocorrelation: A Primer. Resource Publications in Geography, Association of American Geographers. 1987.
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

An overview of the Mapping Clusters toolset


The Mapping Clusters tools perform cluster analysis to identify the locations of statistically significant hot spots, cold spots, spatial outliers, and
similar features. The Mapping Clusters toolset is particularly useful when action is needed based on the location of one or more clusters. An
example would be the assignment of additional police officers to deal with a cluster of burglaries. Pinpointing the location of spatial clusters is
also important when looking for potential causes of clustering; where a disease outbreak occurs can often provide clues about what might be
causing it. Unlike the methods in the Analyzing Patterns toolset, which answer the question, "Is there spatial clustering?" with Yes or No, the
Mapping Clusters tools allow visualization of the cluster locations and extent. These tools answer the questions, "Where are the clusters (hot
spots/cold spots)?", "Where are the spatial outliers?", and "Which features are most alike?".

Tool                           Description

Cluster and Outlier Analysis   Given a set of weighted features, identifies statistically significant hot spots, cold spots, and spatial
                               outliers using the Anselin Local Moran's I statistic.

Grouping Analysis              Groups features based on feature attributes and optional spatial/temporal constraints.

Hot Spot Analysis              Given a set of weighted features, identifies statistically significant hot spots and cold spots using the
                               Getis-Ord Gi* statistic.

Optimized Hot Spot Analysis    Given incident points or weighted features (points or polygons), creates a map of statistically
                               significant hot and cold spots using the Getis-Ord Gi* statistic. It evaluates the characteristics of the
                               input feature class to produce optimal results.

Similarity Search              Identifies which candidate features are most similar or most dissimilar to one or more input features
                               based on feature attributes.
Mapping Clusters tools

Related Topics
An overview of the Spatial Statistics toolbox

Copyright © 1995-2014 Esri. All rights reserved.

Cluster and Outlier Analysis (Anselin Local Moran's I) (Spatial Statistics)


Summary
Given a set of weighted features, identifies statistically significant hot spots, cold spots, and spatial outliers using the Anselin Local Moran's
I statistic.
Learn more about how Cluster and Outlier Analysis (Anselin Local Moran's I) works

Illustration


Usage
 This tool creates a new Output Feature Class with the following attributes for each feature in the Input Feature Class: Local Moran's I
index, z-score, p-value, and cluster/outlier type (COType).
 The z-scores and p-values are measures of statistical significance which tell you whether or not to reject the null hypothesis, feature
by feature. In effect, they indicate whether the apparent similarity (a spatial clustering of either high or low values) or dissimilarity (a
spatial outlier) is more pronounced than one would expect in a random distribution. The p-values and z-scores in the Output Feature
Class do not reflect any FDR (False Discovery Rate) corrections.

 A high positive z-score for a feature indicates that the surrounding features have similar values (either high values or low values).
The COType field in the Output Feature Class will be HH for a statistically significant cluster of high values and LL for a statistically
significant cluster of low values.
 A low negative z-score (for example, less than -3.96) for a feature indicates a statistically significant spatial data outlier. The COType
field in the Output Feature Class will indicate if the feature has a high value and is surrounded by features with low values (HL) or if
the feature has a low value and is surrounded by features with high values (LH).
 The COType field will always indicate statistically significant clusters and outliers for a 95 percent confidence level. Only statistically
significant features have values for the COType field. When you check the optional Apply False Discovery Rate (FDR) Correction
parameter, statistical significance is based on a corrected 95 percent confidence level.
 Default rendering for the Output Feature Class is based on the values in the COType field.
 The z-score is based on the randomization null hypothesis computation. For more information on z-scores, see What is a z-score?
What is a p-value?
 When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the
output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal
distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances,
at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points
on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to
connect those two points. Chordal distances are reported in meters.

Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal
distances are not a good estimate of geodesic distances beyond 30 degrees.

 When chordal distances are used in the analysis, the Distance Band or Threshold Distance parameter, if specified, should be given in
meters.

Legacy: Prior to ArcGIS 10.2.1, you would see a warning message if the parameters and
environment settings you selected would result in calculations being performed using
Geographic Coordinates (degrees, minutes, seconds). This warning advised you to project
your data into a Projected Coordinate System so that distance calculations would be
accurate. Beginning at 10.2.1, however, this tool calculates chordal distances whenever
Geographic Coordinate System calculations are required.

Caution: Because of this change, there is a small chance that you will need to modify models that
incorporate this tool if your models were created prior to ArcGIS 10.2.1 and if your models
include hard-coded Geographic Coordinate System parameter values. If, for example, a
distance parameter is set to something like 0.0025 degrees, you will need to convert that
fixed value from degrees to meters and resave your model.

 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 The Input Field should contain a variety of values. The math for this statistic requires some variation in the variable being analyzed;
it cannot solve if all input values are 1, for example. If you want to use this tool to analyze the spatial pattern of incident data,
consider aggregating your incident data. The Optimized Hot Spot Analysis tool may also be used to analyze the spatial pattern of
incident data.

Note: Incident data are points representing events (crime, traffic accidents) or objects (trees,
stores) where your focus is on presence or absence rather than some measured attribute
associated with each point.

 Your choice for the Conceptualization of Spatial Relationships parameter should reflect inherent relationships among the features you
are analyzing. The more realistically you can model how features interact with each other in space, the more accurate your results
will be. Recommendations are outlined in Selecting a Conceptualization of Spatial Relationships: Best Practices. Here are some
additional tips:
 FIXED_DISTANCE_BAND
The default Distance Band or Threshold Distance will ensure each feature has at least one neighbor, and this is important. But
often, this default will not be the most appropriate distance to use for your analysis. Additional strategies for selecting an
appropriate scale (distance band) for your analysis are outlined in Selecting a fixed distance band value.
 INVERSE_DISTANCE or INVERSE_DISTANCE_SQUARED
When zero is entered for the Distance Band or Threshold Distance parameter, all features are considered neighbors of all other
features; when this parameter is left blank, the default distance will be applied.


Weights for distances less than 1 become unstable when they are inverted. Consequently, features separated
by less than 1 unit of distance are given a weight of 1.
For the inverse distance options (INVERSE_DISTANCE, INVERSE_DISTANCE_SQUARED, or ZONE_OF_INDIFFERENCE), any two points
that are coincident will be given a weight of one to avoid zero division. This assures features are not excluded from analysis.
 Additional options for the Conceptualization of Spatial Relationships parameter, including space-time relationships, are available
using the Generate Spatial Weights Matrix or Generate Network Spatial Weights tools. To take advantage of these additional options,
use one of these tools to construct the spatial weights matrix file prior to analysis; select GET_SPATIAL_WEIGHTS_FROM_FILE for the
Conceptualization of Spatial Relationships parameter; and for the Weights Matrix File parameter, specify the path to the spatial
weights file you created.
 More information about space-time cluster analysis is provided in the Space-Time Analysis documentation.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.
 If you provide a Weights Matrix File with an .swm extension, this tool is expecting a spatial weights matrix file created using either
the Generate Spatial Weights Matrix or Generate Network Spatial Weights tools; otherwise, this tool is expecting an ASCII-formatted
spatial weights matrix file. In some cases, behavior is different depending on which type of spatial weights matrix file you use:
 ASCII-formatted spatial weights matrix files:
 Weights are used as is. Missing feature-to-feature relationships are treated as zeros.

 If the weights are row standardized, results will likely be incorrect for analyses on selection sets. If you need to run your
analysis on a selection set, convert the ASCII spatial weights file to an SWM file by reading the ASCII data into a table, then
using the CONVERT_TABLE option with the Generate Spatial Weights Matrix tool.
 SWM-formatted spatial weights matrix file:
 If the weights are row standardized, they will be restandardized for selection sets; otherwise, weights are used as is.

 Running your analysis with an ASCII-formatted spatial weights matrix file is memory intensive. For analyses on more than 5,000
features, consider converting your ASCII-formatted spatial weights matrix file into an SWM-formatted file. First put your ASCII
weights into a formatted table (using Excel, for example). Next, run the Generate Spatial Weights Matrix tool using CONVERT_TABLE
for the Conceptualization of Spatial Relationships parameter. The output will be an SWM-formatted spatial weights matrix file.
 When this tool runs in ArcMap, the Output Feature Class is automatically added to the Table of Contents (TOC) with default rendering
applied to the COType field. The rendering applied is defined by a layer file in
<ArcGIS>/Desktop10.x/ArcToolbox/Templates/Layers. You can reapply the default rendering, if needed, by importing the
template layer symbology.
 The Output Feature Class includes a SOURCE_ID field which allows you to Join it to the Input Feature Class, if needed.
 The Modeling Spatial Relationships help topic provides additional information about this tool's parameters.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Legacy: Prior to ArcGIS 10.0, the output feature class was a duplicate of the input feature class with
the COType, z-score, and p-value result fields tacked on. After ArcGIS 10.0, the output
feature class only includes the results and fields used in the analysis.

 When using this tool in Python scripts, the result object returned from tool execution has the following outputs:
Position Description Data Type
0 Output Feature Class Feature Class
1 Index field name Field
2 ZScore field name Field
3 Probability field name Field
4 COType field name Field
5 Source ID field name Field
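
For example, a script can use these positions to retrieve the output feature class and result field names after running the tool. The sketch below reuses the data from the Code Sample later in this topic; the positions follow the table above but should be confirmed against your own results.

import arcpy

arcpy.env.workspace = "c:/data/911calls"

# Run the tool (same inputs as the Code Sample later in this topic)
result = arcpy.ClustersOutliers_stats("911Count.shp", "ICOUNT", "911ClusterOutlier.shp",
                                      "GET_SPATIAL_WEIGHTS_FROM_FILE", "EUCLIDEAN_DISTANCE",
                                      "NONE", "#", "euclidean6Neighs.swm", "NO_FDR")

# Derived outputs, using the positions from the table above
out_fc = result.getOutput(0)        # output feature class
cotype_field = result.getOutput(4)  # COType field name

# Count the statistically significant high-value clusters (COType = 'HH')
where = "{0} = 'HH'".format(arcpy.AddFieldDelimiters(out_fc, cotype_field))
hh_count = sum(1 for row in arcpy.da.SearchCursor(out_fc, [cotype_field], where))
print("High-value clusters (HH): {0}".format(hh_count))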

Syntax
ClustersOutliers_stats (Input_Feature_Class, Input_Field, Output_Feature_Class, Conceptualization_of_Spatial_Relationships,
Distance_Method, Standardization, {Distance_Band_or_Threshold_Distance}, {Weights_Matrix_File},
{Apply_False_Discovery_Rate__FDR__Correction})

Input_Feature_Class (Feature Layer)
    The feature class for which cluster/outlier analysis will be performed.

Input_Field (Field)
    The numeric field to be evaluated.

Output_Feature_Class (Feature Class)
    The output feature class to receive the results fields.

Conceptualization_of_Spatial_Relationships (String)
    Specifies how spatial relationships among features are defined.
     INVERSE_DISTANCE —Nearby neighboring features have a larger influence on the computations for a target feature than
      features that are far away.
     INVERSE_DISTANCE_SQUARED —Same as INVERSE_DISTANCE except that the slope is sharper, so influence drops off more
      quickly, and only a target feature's closest neighbors will exert substantial influence on computations for that feature.
     FIXED_DISTANCE_BAND —Each feature is analyzed within the context of neighboring features. Neighboring features inside
      the specified critical distance (Distance_Band_or_Threshold) receive a weight of one and exert influence on computations
      for the target feature. Neighboring features outside the critical distance receive a weight of zero and have no influence on
      a target feature's computations.
     ZONE_OF_INDIFFERENCE —Features within the specified critical distance (Distance_Band_or_Threshold) of a target feature
      receive a weight of one and influence computations for that feature. Once the critical distance is exceeded, weights (and
      the influence a neighboring feature has on target feature computations) diminish with distance.
     CONTIGUITY_EDGES_ONLY —Only neighboring polygon features that share a boundary or overlap will influence computations
      for the target polygon feature.
     CONTIGUITY_EDGES_CORNERS —Polygon features that share a boundary, share a node, or overlap will influence
      computations for the target polygon feature.
     GET_SPATIAL_WEIGHTS_FROM_FILE —Spatial relationships are defined by a specified spatial weights file. The path to the
      spatial weights file is specified by the Weights_Matrix_File parameter.

Distance_Method (String)
    Specifies how distances are calculated from each feature to neighboring features.
     EUCLIDEAN_DISTANCE —The straight-line distance between two points (as the crow flies)
     MANHATTAN_DISTANCE —The distance between two points measured along axes at right angles (city block); calculated by
      summing the (absolute) difference between the x- and y-coordinates

Standardization (String)
    Row standardization is recommended whenever the distribution of your features is potentially biased due to sampling design
    or an imposed aggregation scheme.
     NONE —No standardization of spatial weights is applied.
     ROW —Spatial weights are standardized; each weight is divided by its row sum (the sum of the weights of all neighboring
      features).

Distance_Band_or_Threshold_Distance (Optional) (Double)
    Specifies a cutoff distance for Inverse Distance and Fixed Distance options. Features outside the specified cutoff for a target
    feature are ignored in analyses for that feature. However, for Zone of Indifference, the influence of features outside the given
    distance is reduced with distance, while those inside the distance threshold are equally considered. The distance value entered
    should match that of the output coordinate system.
    For the Inverse Distance conceptualizations of spatial relationships, a value of 0 indicates that no threshold distance is applied;
    when this parameter is left blank, a default threshold value is computed and applied. This default value is the Euclidean
    distance that ensures every feature has at least one neighbor.
    This parameter has no effect when Polygon Contiguity or Get Spatial Weights From File spatial conceptualizations are selected.

Weights_Matrix_File (Optional) (File)
    The path to a file containing weights that define spatial, and potentially temporal, relationships among features.

Apply_False_Discovery_Rate__FDR__Correction (Optional) (Boolean)
     APPLY_FDR —Statistical significance will be based on the False Discovery Rate correction for a 95 percent confidence level.
     NO_FDR —Features with p-values less than 0.05 will appear in the COType field reflecting statistically significant clusters or
      outliers at a 95 percent confidence level (default).

Code Sample
ClusterandOutlierAnalysis example 1 (Python window)
The following Python window script demonstrates how to use the ClusterandOutlierAnalysis tool.

import arcpy
arcpy.env.workspace = "c:/data/911calls"
arcpy.ClustersOutliers_stats("911Count.shp", "ICOUNT", "911ClusterOutlier.shp",
                             "GET_SPATIAL_WEIGHTS_FROM_FILE", "EUCLIDEAN_DISTANCE",
                             "NONE", "#", "euclidean6Neighs.swm", "NO_FDR")

ClusterandOutlierAnalysis example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the ClusterandOutlierAnalysis tool.


# Analyze the spatial distribution of 911 calls in a metropolitan area
# using the Cluster-Outlier Analysis Tool (Anselin's Local Moran's I)

# Import system modules
import arcpy

# Set geoprocessor object property to overwrite outputs if they already exist
arcpy.gp.OverwriteOutput = True

# Local variables...
workspace = r"C:\Data\911Calls"

try:
    # Set the current workspace
    # (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Copy the input feature class and integrate the points to snap
    # together at 500 feet
    # Process: Copy Features and Integrate
    cf = arcpy.CopyFeatures_management("911Calls.shp", "911Copied.shp",
                                       "#", 0, 0, 0)

    integrate = arcpy.Integrate_management("911Copied.shp #", "500 Feet")

    # Use Collect Events to count the number of calls at each location
    # Process: Collect Events
    ce = arcpy.CollectEvents_stats("911Copied.shp", "911Count.shp", "Count", "#")

    # Add a unique ID field to the count feature class
    # Process: Add Field and Calculate Field
    af = arcpy.AddField_management("911Count.shp", "MyID", "LONG", "#", "#", "#", "#",
                                   "NON_NULLABLE", "NON_REQUIRED", "#")

    cf = arcpy.CalculateField_management("911Count.shp", "MyID", "[FID]", "VB")

    # Create Spatial Weights Matrix for Calculations
    # Process: Generate Spatial Weights Matrix...
    swm = arcpy.GenerateSpatialWeightsMatrix_stats("911Count.shp", "MYID",
                                                   "euclidean6Neighs.swm",
                                                   "K_NEAREST_NEIGHBORS",
                                                   "#", "#", "#", 6)

    # Cluster/Outlier Analysis of 911 Calls
    # Process: Local Moran's I
    clusters = arcpy.ClustersOutliers_stats("911Count.shp", "ICOUNT",
                                            "911ClusterOutlier.shp",
                                            "GET_SPATIAL_WEIGHTS_FROM_FILE",
                                            "EUCLIDEAN_DISTANCE", "NONE",
                                            "#", "euclidean6Neighs.swm", "NO_FDR")

except:
    # If an error occurred when running the tool, print out the error message.
    print(arcpy.GetMessages())

Environments
Current_workspace, Scratch_workspace, Output_coordinate_system, Geographic_transformations, Qualified_field_names,
Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance, Output_has_M_values, M_resolution, M_tolerance,
XY_resolution, XY_tolerance

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis, so values entered for the Distance Band or
Threshold Distance parameter should match those specified in the Output Coordinate System. All mathematical computations are
based on the spatial reference of the Output Coordinate System. When the Output Coordinate System is based on degrees, minutes,
and seconds, geodesic distances are estimated using chordal distances in meters.

Related Topics
Modeling spatial relationships
What is a z-score? What is a p-value?
Spatial weights
An overview of the Mapping Clusters toolset
Spatial Autocorrelation (Global Moran's I)
Hot Spot Analysis (Getis-Ord Gi*)
How Cluster and Outlier Analysis (Anselin Local Moran's I) works
Optimized Hot Spot Analysis
Incremental Spatial Autocorrelation
Calculate Distance Band from Neighbor Count
Collect Events


Copyright © 1995-2014 Esri. All rights reserved.

Hot Spot Analysis (Getis-Ord Gi*) (Spatial Statistics)


Summary
Given a set of weighted features, identifies statistically significant hot spots and cold spots using the Getis-Ord Gi* statistic.
Learn more about how Hot Spot Analysis (Getis-Ord Gi*) works

Illustration

Usage
 This tool identifies statistically significant spatial clusters of high values (hot spots) and low values (cold spots). It creates a new
Output Feature Class with a z-score, p-value, and confidence level bin (Gi_Bin) for each feature in the Input Feature Class.

 The z-scores and p-values are measures of statistical significance which tell you whether or not to reject the null hypothesis, feature
by feature. In effect, they indicate whether the observed spatial clustering of high or low values is more pronounced than one would
expect in a random distribution of those same values. The z-score and p-value fields do not reflect any kind of FDR (False Discovery
Rate) correction.
 The Gi_Bin field identifies statistically significant hot and cold spots regardless of whether or not the FDR correction is applied.
Features in the +/-3 bins reflect statistical significance with a 99 percent confidence level; features in the +/-2 bins reflect a 95
percent confidence level; features in the +/-1 bins reflect a 90 percent confidence level; and the clustering for features in bin 0 is not
statistically significant. Without FDR correction, statistical significance is based on the p-value and z-score fields. When you check the
optional Apply False Discovery Rate (FDR) Correction parameter, the critical p-values determining confidence levels are reduced to
account for multiple testing and spatial dependence.
 A high z-score and small p-value for a feature indicates a spatial clustering of high values. A low negative z-score and small p-value
indicates a spatial clustering of low values. The higher (or lower) the z-score, the more intense the clustering. A z-score near zero
indicates no apparent spatial clustering.
 The z-score is based on the randomization null hypothesis computation. For more information on z-scores, see What is a z-score?
What is a p-value?
 When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the
output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal
distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances,
at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points
on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to
connect those two points. Chordal distances are reported in meters.

Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal
distances are not a good estimate of geodesic distances beyond 30 degrees.

 When chordal distances are used in the analysis, the Distance Band or Threshold Distance parameter, if specified, should be given in
meters.
 Prior to ArcGIS 10.2.1, you would see a warning message if the parameters and environment settings you selected would result in
calculations being performed using Geographic Coordinates (degrees, minutes, seconds). This warning advised you to project your
data into a Projected Coordinate System so that distance calculations would be accurate. Beginning at 10.2.1, however, this tool
calculates chordal distances whenever Geographic Coordinate System calculations are required.

Caution: Because of this change, there is a small chance that you will need to modify models that
incorporate this tool if your models were created prior to ArcGIS 10.2.1 and if your models
include hard-coded Geographic Coordinate System parameter values. If, for example, a
distance parameter is set to something like 0.0025 degrees, you will need to convert that
fixed value from degrees to meters and resave your model.

 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 The Input Field should contain a variety of values. The math for this statistic requires some variation in the variable being analyzed;
it cannot solve if all input values are 1, for example. If you want to use this tool to analyze the spatial pattern of incident data,
consider aggregating your incident data or using the Optimized Hot Spot Analysis tool.

Note: Incident data are points representing events (crime, traffic accidents) or objects (trees,
stores) where your focus is on presence or absence rather than some measured attribute
associated with each point.

 The Optimized Hot Spot Analysis tool interrogates your data to automatically select parameter settings that will optimize your hot spot
results. It will aggregate incident data, select an appropriate scale of analysis, and adjust results for multiple testing and spatial
dependence. The parameter options it selects are reported to the Results window, and these may help you refine your parameter
choices when you use this tool. This tool gives you full control and flexibility over your parameter settings.
 Your choice for the Conceptualization of Spatial Relationships parameter should reflect inherent relationships among the features you
are analyzing. The more realistically you can model how features interact with each other in space, the more accurate your results
will be. Recommendations are outlined in Selecting a Conceptualization of Spatial Relationships: Best Practices. Here are some
additional tips:
 FIXED_DISTANCE_BAND
The default Distance Band or Threshold Distance will ensure each feature has at least one neighbor, and this is important. But
often, this default will not be the most appropriate distance to use for your analysis. Additional strategies for selecting an
appropriate scale (distance band) for your analysis are outlined in Selecting a fixed distance band value.
 INVERSE_DISTANCE or INVERSE_DISTANCE_SQUARED
When zero is entered for the Distance Band or Threshold Distance parameter, all features are considered neighbors of all other
features; when this parameter is left blank, the default distance will be applied.
Weights for distances less than 1 become unstable when they are inverted. Consequently, features separated
by less than 1 unit of distance are given a weight of 1.
For the inverse distance options (INVERSE_DISTANCE, INVERSE_DISTANCE_SQUARED, or ZONE_OF_INDIFFERENCE), any two points
that are coincident will be given a weight of one to avoid zero division. This assures features are not excluded from analysis.
 Additional options for the Conceptualization of Spatial Relationships parameter, including space-time relationships, are available
using the Generate Spatial Weights Matrix or Generate Network Spatial Weights tools. To take advantage of these additional options,
use one of these tools to construct the spatial weights matrix file prior to analysis; select GET_SPATIAL_WEIGHTS_FROM_FILE for the
Conceptualization of Spatial Relationships parameter; and for the Weights Matrix File parameter, specify the path to the spatial
weights file you created.
 More information about space-time cluster analysis is provided in the Space-Time Analysis documentation.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.
 If you provide a Weights Matrix File with an SWM extension, this tool is expecting a spatial weights matrix file created using either
the Generate Spatial Weights Matrix or Generate Network Spatial Weights tools; otherwise, this tool is expecting an ASCII formatted
spatial weights matrix file. In some cases, behavior is different depending on which type of spatial weights matrix file you use:
 ASCII-formatted spatial weights matrix files:
 Weights are used as is. Missing feature-to-feature relationships are treated as zeros.

 The default weight for self potential is zero, unless you specify a Self Potential Field value or include self potential weights
explicitly.
 If the weights are row standardized, results will likely be incorrect for analyses on selection sets. If you need to run your
analysis on a selection set, convert the ASCII spatial weights file to an SWM file by reading the ASCII data into a table, then
using the CONVERT_TABLE option with the Generate Spatial Weights Matrix tool.
 SWM-formatted spatial weights matrix file:
 If the weights are row standardized, they will be restandardized for selection sets; otherwise, weights are used as is.

 The default weight for self potential is one, unless you specify a Self Potential Field value.
 Running your analysis with an ASCII-formatted spatial weights matrix file is memory intensive. For analyses on more than 5,000
features, consider converting your ASCII-formatted spatial weights matrix file into an SWM-formatted file. First put your ASCII
weights into a formatted table (using Excel, for example). Next, run the Generate Spatial Weights Matrix tool using CONVERT_TABLE
for the Conceptualization of Spatial Relationships parameter. The output will be an SWM-formatted spatial weights matrix file. A
sketch of this conversion appears below, after the Legacy notes.
 When this tool runs in ArcMap, the Output Feature Class is automatically added to the table of contents with default rendering applied
to the Gi_Bin field. The hot-to-cold rendering applied is defined by a layer file in
<ArcGIS>/Desktop10.x/ArcToolbox/Templates/Layers. You can reapply the default rendering, if needed, by importing the
template layer symbology.
 The Output Feature Class includes a SOURCE_ID field, which allows you to join it to the Input Feature Class, if needed (a sketch
using SOURCE_ID appears below).
 The Modeling Spatial Relationships help topic provides additional information about this tool's parameters.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Legacy: Prior to ArcGIS 10.0, the output feature class was a duplicate of the input feature class with
the z-score and p-value results fields added. Beginning with ArcGIS 10.0, the output feature class only
includes the z-score and p-value fields as well as the fields input for the analysis. To join
other input fields to the output feature class, use the SOURCE_ID field to join the fields
using tools in the Joins toolset.

Legacy: Row Standardization has no impact on this tool: results from Hot Spot Analysis (the Getis-
Ord Gi* statistic) would be identical with or without row standardization. The parameter is
consequently disabled; it remains as a tool parameter only to support backwards
compatibility.
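
The ASCII-to-SWM conversion recommended above can be scripted. The following is only a sketch: the table name, its field layout, and
the feature class are assumptions, and it presumes the ASCII weights have already been loaded into a table whose rows hold each
feature ID, its neighbor ID, and the weight.

import arcpy
arcpy.env.workspace = "C:/data"  # hypothetical workspace

# Convert a table of weights (previously read in from the ASCII file) to an SWM file.
arcpy.GenerateSpatialWeightsMatrix_stats("911Count.shp", "MyID",
                                         "convertedWeights.swm",
                                         "CONVERT_TABLE", "#", "#", "#", "#",
                                         "NO_STANDARDIZATION",
                                         "asciiWeights.dbf")

# The SWM file can then be used for analyses, including analyses on selection sets.
arcpy.HotSpots_stats("911Count.shp", "ICOUNT", "911HotSpots.shp",
                     "GET_SPATIAL_WEIGHTS_FROM_FILE", "EUCLIDEAN_DISTANCE",
                     "NONE", "#", "#", "convertedWeights.swm", "NO_FDR")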

 When using this tool in Python scripts, the result object returned from tool execution has the following outputs:
Position Description Data Type
0 Output Feature Class Feature Class
1 Results field name (GiZScore) Field
2 Probability field name (GiPValue) Field
3 Source ID field name (SOURCE_ID) Field
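A minimal sketch of reading these positional outputs and then using the SOURCE_ID field to join an input attribute back onto the
output (the workspace, dataset, and the MyID field are placeholders, not requirements of the tool):

import arcpy
arcpy.env.workspace = "C:/data"  # hypothetical workspace

result = arcpy.HotSpots_stats("911Count.shp", "ICOUNT", "911HotSpots.shp",
                              "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE",
                              "NONE", "#", "#", "#", "NO_FDR")

outputFC = result.getOutput(0)       # Output Feature Class
zScoreField = result.getOutput(1)    # GiZScore
pValueField = result.getOutput(2)    # GiPValue
sourceIDField = result.getOutput(3)  # SOURCE_ID

# Join a hypothetical attribute from the input back onto the output using SOURCE_ID;
# shapefile inputs are keyed by FID (use OBJECTID for geodatabase feature classes).
arcpy.JoinField_management(outputFC, sourceIDField, "911Count.shp", "FID", ["MyID"])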


Syntax
HotSpots_stats (Input_Feature_Class, Input_Field, Output_Feature_Class, Conceptualization_of_Spatial_Relationships, Distance_Method,
Standardization, {Distance_Band_or_Threshold_Distance}, {Self_Potential_Field}, {Weights_Matrix_File},
{Apply_False_Discovery_Rate__FDR__Correction})
Parameter Explanation Data Type
Input_Feature_Class Feature Layer
The feature class for which hot spot analysis will be performed.
Input_Field Field
The numeric field (number of victims, crime rate, test scores, and
so on) to be evaluated.
Output_Feature_Class Feature Class
The output feature class to receive the z-score and p-value
results.
Conceptualization_of_Spatial_Relationships String
Specifies how spatial relationships among features are defined.
 INVERSE_DISTANCE —Nearby neighboring features have a larger influence on the computations for a target feature than
features that are far away.
 INVERSE_DISTANCE_SQUARED —Same as INVERSE_DISTANCE
except that the slope is sharper, so influence drops off more
quickly, and only a target feature's closest neighbors will exert
substantial influence on computations for that feature.
 FIXED_DISTANCE_BAND —Each feature is analyzed within the
context of neighboring features. Neighboring features inside
the specified critical distance (Distance_Band_or_Threshold)
receive a weight of one and exert influence on computations
for the target feature. Neighboring features outside the critical
distance receive a weight of zero and have no influence on a
target feature's computations.
 ZONE_OF_INDIFFERENCE —Features within the specified
critical distance (Distance_Band_or_Threshold) of a target
feature receive a weight of one and influence computations for
that feature. Once the critical distance is exceeded, weights
(and the influence a neighboring feature has on target feature
computations) diminish with distance.
 CONTIGUITY_EDGES_ONLY —Only neighboring polygon
features that share a boundary or overlap will influence
computations for the target polygon feature.
 CONTIGUITY_EDGES_CORNERS —Polygon features that share
a boundary, share a node, or overlap will influence
computations for the target polygon feature.
 GET_SPATIAL_WEIGHTS_FROM_FILE —Spatial relationships
are defined by a specified spatial weights file. The path to the
spatial weights file is specified by the Weights_Matrix_File
parameter.
Distance_Method String
Specifies how distances are calculated from each feature to
neighboring features.
 EUCLIDEAN_DISTANCE —The straight-line distance between
two points (as the crow flies)
 MANHATTAN_DISTANCE —The distance between two points
measured along axes at right angles (city block); calculated
by summing the (absolute) difference between the x- and y-
coordinates
Standardization String
Row standardization has no impact on this tool: results from Hot Spot Analysis (the Getis-Ord Gi* statistic) would be identical
with or without row standardization. The parameter is disabled; it remains as a tool parameter only to support backwards
compatibility.
 NONE —No standardization of spatial weights is applied.
 ROW —No standardization of spatial weights is applied.
Distance_Band_or_Threshold_Distance (Optional) Double
Specifies a cutoff distance for Inverse Distance and Fixed Distance options. Features outside the specified cutoff for a target
feature are ignored in analyses for that feature. However, for Zone of Indifference, the influence of features outside the given
distance is reduced with distance, while those inside the distance threshold are equally considered. The distance value entered
should match that of the output coordinate system.
For the Inverse Distance conceptualizations of spatial relationships, a value of 0 indicates that no threshold distance is
applied; when this parameter is left blank, a default threshold value is computed and applied. This default value is the
Euclidean distance that ensures every feature has at least one neighbor.
This parameter has no effect when Polygon Contiguity or Get Spatial Weights From File spatial conceptualizations are selected.
Self_Potential_Field (Optional) Field
The field representing self potential: the distance or weight between a feature and itself.
Weights_Matrix_File (Optional) File
The path to a file containing weights that define spatial, and potentially temporal, relationships among features.
Apply_False_Discovery_Rate__FDR__Correction (Optional) Boolean
 APPLY_FDR —Statistical significance will be based on the False Discovery Rate correction.
 NO_FDR —Statistical significance will be based on the p-value and z-score fields (default).

Code Sample
HotSpotAnalysis example 1 (Python window)
The following Python window script demonstrates how to use the HotSpotAnalysis tool.


import arcpy
arcpy.env.workspace = "C:/data"
arcpy.HotSpots_stats("911Count.shp", "ICOUNT", "911HotSpots.shp",
"GET_SPATIAL_WEIGHTS_FROM_FILE", "EUCLIDEAN_DISTANCE",
"NONE","#", "#", "euclidean6Neighs.swm","NO_FDR")

HotSpotAnalysis example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the HotSpotAnalysis tool.

# Analyze the spatial distribution of 911 calls in a metropolitan area


# using the Hot-Spot Analysis Tool (Local Gi*)

# Import system modules


import arcpy

# Set geoprocessor object property to overwrite existing output, by default
arcpy.env.overwriteOutput = True

# Local variables...
workspace = "C:/Data"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Copy the input feature class and integrate the points to snap
    # together at 500 feet
    # Process: Copy Features and Integrate
    cf = arcpy.CopyFeatures_management("911Calls.shp", "911Copied.shp",
                                       "#", 0, 0, 0)

    integrate = arcpy.Integrate_management("911Copied.shp #", "500 Feet")

    # Use Collect Events to count the number of calls at each location
    # Process: Collect Events
    ce = arcpy.CollectEvents_stats("911Copied.shp", "911Count.shp")

    # Add a unique ID field to the count feature class
    # Process: Add Field and Calculate Field
    af = arcpy.AddField_management("911Count.shp", "MyID", "LONG", "#", "#", "#", "#",
                                   "NON_NULLABLE", "NON_REQUIRED", "#")

    cf = arcpy.CalculateField_management("911Count.shp", "MyID", "[FID]", "VB")

    # Create Spatial Weights Matrix for Calculations
    # Process: Generate Spatial Weights Matrix...
    swm = arcpy.GenerateSpatialWeightsMatrix_stats("911Count.shp", "MyID",
                                                   "euclidean6Neighs.swm",
                                                   "K_NEAREST_NEIGHBORS",
                                                   "#", "#", "#", 6,
                                                   "NO_STANDARDIZATION")

    # Hot Spot Analysis of 911 Calls
    # Process: Hot Spot Analysis (Getis-Ord Gi*)
    hs = arcpy.HotSpots_stats("911Count.shp", "ICOUNT", "911HotSpots.shp",
                              "GET_SPATIAL_WEIGHTS_FROM_FILE",
                              "EUCLIDEAN_DISTANCE", "NONE",
                              "#", "#", "euclidean6Neighs.swm", "NO_FDR")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Qualified_field_names,
Output_has_M_values, M_resolution, M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance,
XY_resolution, XY_tolerance

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis, so values entered for the Distance Band or
Threshold Distance parameter should match those specified in the Output Coordinate System. All mathematical computations are
based on the spatial reference of the Output Coordinate System. When the Output Coordinate System is based on degrees, minutes,
and seconds, geodesic distances are estimated using chordal distances in meters.

Related Topics
Modeling spatial relationships
What is a z-score? What is a p-value?


Spatial weights
An overview of the Mapping Clusters toolset
Spatial Autocorrelation (Global Moran's I)
Cluster and Outlier Analysis (Anselin Local Moran's I)
How Hot Spot Analysis (Getis-Ord Gi*) works
Incremental Spatial Autocorrelation
Optimized Hot Spot Analysis

Copyright © 1995-2014 Esri. All rights reserved.

Grouping Analysis (Spatial Statistics)

Locate topic

Summary
Groups features based on feature attributes and optional spatial/temporal constraints.
Learn more about how Grouping Analysis works

Illustration

Usage
 This tool produces an output feature class with the fields used in the analysis plus a new integer field named SS_GROUP. Default
rendering is based on the SS_GROUP field and shows you which group each feature falls into. If you indicate that you want 3 groups,
for example, each record will contain a 1, 2, or 3 for the SS_GROUP field. When NO_SPATIAL_CONSTRAINT is selected for the Spatial
Constraints parameter, the output feature class will also contain a new binary field called SS_SEED. The SS_SEED field indicates which
features were used as starting points to grow groups. The number of nonzero values in the SS_SEED field will match the value you
entered for the Number of Groups parameter.
 This tool will optionally create a PDF report file when you specify a path for the Output Report File parameter. This report contains a
variety of tables and graphs to help you understand the characteristics of the groups identified. The PDF report file is accessible
through the Results window.

Note: Creating the report file can add substantial processing time. Consequently, while Grouping
Analysis will create the Output Feature Class showing group membership, the PDF report file
will not be created if you specify more than 15 groups or more than 15 variables.

 When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the
output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal
distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances,
at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points
on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to
connect those two points. Chordal distances are reported in meters.

Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal
distances are not a good estimate of geodesic distances beyond 30 degrees.

 The Unique ID Field provides a way for you to link records in the Output Feature Class back to data in the original input feature class.
Consequently, the Unique ID Field values must be unique for every feature, and typically should be a permanent field that remains
with the feature class. If you don't have a Unique ID Field in your dataset, you can easily create one by adding a new integer field to
your feature class table and calculating the field values to be equal to the FID/OID field. You cannot use the FID/OID field directly for
the Unique ID Field parameter. A sketch of this workflow appears at the end of these usage notes.
 The Analysis Fields should be numeric and should contain a variety of values. Fields with no variation (that is, the same value for
every record) will be dropped from the analysis but will be included in the Output Feature Class. Categorical fields may be used with
the Grouping Analysis tool if they are represented as dummy variables (a value of one for all features in a category and zeros for all
other features). Creating such a dummy variable is also sketched at the end of these usage notes.
 The Grouping Analysis tool will construct groups with or without space or time constraints. For some applications you may not want to
impose contiguity or other proximity requirements on the groups created. In those cases you will set the Spatial Constraints
parameter to NO_SPATIAL_CONSTRAINT.
 For some analyses, you will want groups to be spatially contiguous. The CONTIGUITY options are enabled for polygon feature classes
and indicate features can only be part of the same group if they share an edge (CONTIGUITY_EDGES_ONLY) or if they share either an
edge or a vertex (CONTIGUITY_EDGES_CORNERS) with another member of the group.


 The DELAUNAY_TRIANGULATION and K_NEAREST_NEIGHBORS options are appropriate for point or polygon features when you want to
ensure all group members are proximal. These options indicate that a feature will only be included in a group if at least one other
feature is a natural neighbor (Delaunay Triangulation) or a K Nearest Neighbor. K is the number of neighbors to consider and is
specified using the Number of Neighbors parameter.
 In order to create groups with both space and time constraints, use the Generate Spatial Weights Matrix tool to first create a spatial
weights matrix file (SWM file) defining the space-time relationships among your features. Next run Grouping Analysis setting the
Spatial Constraints parameter to GET_SPATIAL_WEIGHTS_FROM_FILE and the Spatial Weights Matrix File parameter to the SWM file
you created. A sketch of this space-time workflow appears at the end of these usage notes.
 Additional Spatial Constraints, such as Fixed Distance, may be imposed by using the Generate Spatial Weights Matrix tool to first
create an SWM file and then providing the path to that file for the Spatial Weights Matrix File parameter.

Note: Even though you may create a spatial weights matrix (SWM) file to define spatial
constraints, there is no actual weighting being applied. The SWM defines which features are
contiguous or proximal. Imposing a spatial constraint determines which features can and cannot be
members of the same group. If you select CONTIGUITY_EDGES_ONLY, for example, all the
features in a single group will have at least one edge in common with another feature in the
group. This keeps the resultant groups spatially contiguous.

 Defining a spatial constraint ensures compact, contiguous, or proximal groups. Including spatial variables in your list of Analysis
Fields can also encourage these group attributes. Examples of spatial variables would be distance to freeway on-ramps, accessibility
to job openings, proximity to shopping opportunities, measures of connectivity, and even coordinates (X, Y). Including variables
representing time, day of the week, or temporal distance can encourage temporal compactness among group members.
 When there is a distinct spatial pattern to your features (an example would be three separate, spatially distinct clusters), it can
complicate the spatially constrained grouping algorithm. Consequently, the grouping algorithm first determines if there are any
disconnected groups. If the number of disconnected groups is larger than the Number of Groups specified, the tool cannot solve and
will fail with an appropriate error message. If the number of disconnected groups is exactly the same as the Number of Groups
specified, the spatial configuration of the features alone determines group results, as shown in (A) below. If the Number of Groups
specified is larger than the number of disconnected groups, grouping begins with the disconnected groups already determined. For
example, if there are three disconnected groups and the Number of Groups specified is 4, one of the three groups will be divided to
create a fourth group, as shown in (B) below.

 In some cases, the Grouping Analysis tool will not be able to meet the spatial constraints imposed, and some features will not be
included with any group (the SS_GROUP value will be -9999 with hollow rendering). This happens if there are features with no
neighbors. To avoid this, use K_NEAREST_NEIGHBORS which ensures all features have neighbors. Increasing the Number of Neighbors
parameter will help resolve issues with disconnected groups.
 While there is a tendency to want to include as many Analysis Fields as possible, for this tool it works best to start with a single
variable and build. Results are much easier to interpret with fewer analysis fields. It is also easier to determine which variables are
the best discriminators when there are fewer fields.
 When you select NO_SPATIAL_CONSTRAINT for the Spatial Constraints parameter, you have three options for the Initialization
Method: FIND_SEED_LOCATIONS, GET_SEEDS_FROM_FIELD, and USE_RANDOM_SEEDS. Seeds are the features used to grow individual
groups. If, for example, you enter a 3 for the Number of Groups parameter, the analysis will begin with three seed features. The
default option, FIND_SEED_LOCATIONS, randomly selects the first seed and makes sure that the subsequent seeds selected represent
features that are far away from each other in data space. Selecting initial seeds that capture different areas of data space improves
performance. Sometimes, you know that specific features reflect distinct characteristics that you want represented by different
groups. In that case, create a seed field to identify those distinctive features. The seed field you create should have zeros for all but
the initial seed features; the initial seed features should have a value of 1. You will then select GET_SEEDS_FROM_FIELD for the
Initialization Method parameter. If you are interested in doing some kind of sensitivity analysis to see which features are always
found in the same group, you might select the USE_RANDOM_SEEDS option for the Initialization Method parameter. For this option, all
of the seed features are randomly selected.
 Any values of 1 in the Initialization Field will be interpreted as a seed. If there are more seed features than Number of Groups, the
seed features will be randomly selected from those identified by the Initialization Field. If there are fewer seed features than
specified by Number of Groups, the additional seed features will be selected so they are far away (in data space) from those identified
by the Initialization Field.
 Sometimes you know the Number of Groups most appropriate for your data. In the case that you don't, however, you may have to
try different numbers of groups, noting which values provide the best group differentiation. When you check the Evaluate Optimal
Number of Groups parameter, a pseudo F-statistic will be computed for grouping solutions with 2 through 15 groups. If no other
criteria guide your choice for Number of Groups, use a number associated with one of the largest pseudo F-statistic values. The
largest F-statistic values indicate solutions that perform best at maximizing both within-group similarities and between-group
differences. When you specify an optional Output Report File, that PDF report will include a graph showing the F-statistic values for
solutions with 2 through 15 groups.
 Regardless of the Number of Groups you specify, the tool will stop if division into additional groups becomes arbitrary. Suppose, for
example, that your data consists of three spatially clustered polygons and a single analysis field. If all the features in a cluster have
the same analysis field value, it becomes arbitrary how any one of the individual clusters is divided after three groups have been
created. If you specify more than three groups in this situation, the tool will still only create three groups. As long as at least one of
the analysis fields in a group has some variation of values, division into additional groups can continue.


 When you include a spatial or space-time constraint in your analysis, the pseudo F-Statistics are comparable (as long as the Input
Features and Analysis Fields don't change). Consequently, you can use the F-Statistic values to determine not only optimal Number
of Groups, but also to help you make choices about the most effective Spatial Constraints option, Distance Method, and Number of
Neighbors.
 The K-Means algorithm used to partition features into groups when NO_SPATIAL_CONSTRAINT is selected for the Spatial Constraints
parameter and FIND_SEED_LOCATIONS or USE_RANDOM_SEEDS is selected for the Initialization Method incorporates heuristics and may
return a different result each time you run the tool (even using the same data and the same tool parameters). This is because there
is a random component to finding the initial seed features used to grow the groups.
 When a spatial constraint is imposed, there is no random component to the algorithm, so a single pseudo F-Statistic can be computed
for groups 2 through 15, and the highest F-Statistic values can be used to determine the optimal Number of Groups for your analysis.
Because the NO_SPATIAL_CONSTRAINT option is a heuristic solution, however, determining the optimal number of groups is more
involved. The F-Statistic may be different each time the tool is run, due to different initial seed features. When a distinct pattern
exists in your data, however, solutions from one run to the next will be more consistent. Consequently, to help determine the optimal
number of groups when the NO_SPATIAL_CONSTRAINT option is selected, the tool solves the grouping analysis 10 times for 2, 3, 4,
and up to 15 groups. Information about the distribution of these 10 solutions is then reported (min, max, mean, and median) to help
you determine an optimal number of groups for your analysis.
 The Grouping Analysis tool returns three derived output values for potential use in custom models and scripts. These are the pseudo
F-Statistic for the Number of Groups (Output_FStat), the largest pseudo F-Statistic for groups 2 through 15 (Max_FStat), and the
number of groups associated with the largest pseudo F-Statistic value (Max_FStat_Group). When you do not elect to Evaluate
Optimal Number of Groups, all of the derived output variables are set to None. Reading these derived outputs from the result
object is sketched at the end of these usage notes.
 The group number assigned to a set of features may change from one run to the next. For example, suppose you partition features
into two groups based on an income variable. The first time you run the analysis you might see the high income features labeled as
group 2 and the low income features labeled as group 1; the second time you run the same analysis, the high income features might
be labeled as group 1. You might also see that some of the middle income features switch group membership from one run to another
when NO_SPATIAL_CONSTRAINT is specified.
 While you can select to create a very large number of different groups, in most scenarios you will likely be partitioning features into
just a few groups. Because the graphs and maps become difficult to interpret with lots of groups, no report is created when you enter
a value larger than 15 for the Number of Groups parameter or select more than 15 Analysis Fields. You can increase this limitation on
the maximum number of groups, however.

Dive-in: Because you have the Python source code for the Grouping Analysis tool, you may override
the 15 variable/15 group report limitation, if desired. This upper limit is set by two variables
in both the Partition.py script file and the tool's validation code inside the Spatial
Statistics Toolbox:

maxNumGroups = 15
maxNumVars = 15

 This tool will optionally create a PDF report summarizing results. PDF files do not automatically appear in the Catalog window. If you
want PDF files to be displayed in Catalog, open the ArcCatalog application, select the Customize menu option, click ArcCatalog
Options, and select the File Types tab. Click on the New Type button and specify PDF, as shown below, for File Extension.

 On machines configured with the ArcGIS language packages for Chinese or Japanese, you might notice missing text or formatting
problems in the PDF Output Report File . These problems can be corrected by changing the font settings.
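
A minimal sketch of the field preparation described in these notes (the dataset, field names, and category value are hypothetical):

import arcpy
arcpy.env.workspace = r"C:\GA"  # hypothetical workspace

# Create a permanent integer ID field and copy the FID values into it;
# the new field can then be supplied for the Unique ID Field parameter.
arcpy.AddField_management("Dist_Vand.shp", "MyID", "LONG")
arcpy.CalculateField_management("Dist_Vand.shp", "MyID", "!FID!", "PYTHON_9.3")

# Recode a categorical field into a dummy variable (1 for one category, 0 for
# all other features) so it can be used as an Analysis Field.
arcpy.AddField_management("Dist_Vand.shp", "IS_BURG", "SHORT")
arcpy.CalculateField_management("Dist_Vand.shp", "IS_BURG",
                                "1 if !CRIMETYPE! == 'BURGLARY' else 0",
                                "PYTHON_9.3")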
 For more information about the Output Report File, see Learn more about how Grouping Analysis works.
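
As a hedged sketch of the space-time grouping workflow and the derived outputs described above (the incident dataset, ID and date
fields, analysis fields, and the distance and time values are all assumptions; consult the Generate Spatial Weights Matrix
documentation for the exact parameter list in your release):

import arcpy
import arcpy.stats as SS
arcpy.env.workspace = r"C:\GA"  # hypothetical workspace

# Build a space-time weights matrix: neighbors must fall within 1609 units of
# the output coordinate system and within 7 days of one another.
arcpy.GenerateSpatialWeightsMatrix_stats("Incidents.shp", "MyID",
                                         "spaceTime.swm", "SPACE_TIME_WINDOW",
                                         "#", "#", 1609, "#",
                                         "NO_STANDARDIZATION", "#",
                                         "REPORT_DATE", "DAYS", 7)

# Group features subject to the space-time constraint and evaluate the optimal
# number of groups so the derived outputs are populated.
result = SS.GroupingAnalysis("Incidents.shp", "MyID", "outGroups.shp", "4",
                             "TOTPOP_CY;VACANT_CY;UNEMP_CY",
                             "GET_SPATIAL_WEIGHTS_FROM_FILE", "EUCLIDEAN", "",
                             "spaceTime.swm", "", "", "", "EVALUATE")

# Output positions can vary, so print everything the result object returns
# (the Output Feature Class plus Output_FStat, Max_FStat, and Max_FStat_Group).
for i in range(result.outputCount):
    print("Output {0}: {1}".format(i, result.getOutput(i)))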


Syntax
GroupingAnalysis_stats (Input_Features, Unique_ID_Field, Output_Feature_Class, Number_of_Groups, Analysis_Fields, Spatial_Constraints,
{Distance_Method}, {Number_of_Neighbors}, {Weights_Matrix_File}, {Initialization_Method}, {Initialization_Field}, {Output_Report_File},
{Evaluate_Optimal_Number_of_Groups})
Parameter Explanation Data Type
Input_Features Feature Layer
The feature class or feature layer for which you want to create groups.
Unique_ID_Field Field
An integer field containing a different value for every feature in the
input feature class. If you don't have a Unique ID field, you can create
one by adding an integer field to your feature class table and
calculating the field values to equal the FID or OBJECTID field.
Output_Feature_Class Feature Class
The new output feature class created containing all features, the analysis fields specified, and a field indicating which group
each feature belongs to.
Number_of_Groups Long
The number of groups to create. The Output Report parameter will be
disabled for more than 15 groups.
Analysis_Fields [analysis_field,...] Field
A list of fields you want to use to distinguish one group from another. The Output Report parameter will be disabled for more
than 15 fields.
Spatial_Constraints String
Specifies if and how spatial relationships among features should
constrain the groups created.
 CONTIGUITY_EDGES_ONLY —Groups contain contiguous polygon
features. Only polygons that share an edge can be part of the same
group.
 CONTIGUITY_EDGES_CORNERS —Groups contain contiguous
polygon features. Only polygons that share an edge or a vertex can
be part of the same group.
 DELAUNAY_TRIANGULATION —Features in the same group will
have at least one natural neighbor in common with another feature
in the group. Natural neighbor relationships are based on Delaunay
Triangulation. Conceptually, Delaunay Triangulation creates a
nonoverlapping mesh of triangles from feature centroids. Each
feature is a triangle node and nodes that share edges are
considered neighbors.
 K_NEAREST_NEIGHBORS —Features in the same group will be
near each other; each feature will be a neighbor of at least one
other feature in the group. Neighbor relationships are based on the
nearest K features where you specify an Integer value, K, for the
Number of Neighbors parameter.
 GET_SPATIAL_WEIGHTS_FROM_FILE —Spatial, and optionally
temporal, relationships are defined by a spatial weights file (.swm).
Create the spatial weights matrix file using the Generate Spatial
Weights Matrix or Generate Network Spatial Weights tool.
 NO_SPATIAL_CONSTRAINT —Features will be grouped using data
space proximity only. Features do not have to be near each other
in space or time to be part of the same group.
Distance_Method (Optional) String
Specifies how distances are calculated from each feature to neighboring features.
 EUCLIDEAN —The straight-line distance between two points (as the crow flies)
 MANHATTAN —The distance between two points measured along axes at right angles (city block); calculated by summing the
(absolute) difference between the x- and y-coordinates
Number_of_Neighbors (Optional) Long
This parameter is enabled whenever the Spatial Constraints parameter is K_NEAREST_NEIGHBORS or one of the CONTIGUITY
methods. The default number of neighbors is 8 and cannot be smaller than 2 for K_NEAREST_NEIGHBORS. This value reflects the
exact number of nearest neighbor candidates to consider when building groups. A feature will not be included in a group unless
one of the other features in that group is a K nearest neighbor. The default for CONTIGUITY_EDGES_ONLY and
CONTIGUITY_EDGES_CORNERS is 0. For the CONTIGUITY methods, this value reflects the minimum number of neighbor candidates
to consider. Additional nearby neighbors for features with less than the Number of Neighbors specified will be based on feature
centroid proximity.
Weights_Matrix_File (Optional) File
The path to a file containing spatial weights that define spatial relationships among features.
Initialization_Method (Optional) String
Specifies how initial seeds are obtained when the Spatial Constraint parameter selected is NO_SPATIAL_CONSTRAINT. Seeds are
used to grow groups. If you indicate you want 3 groups, for example, the analysis will begin with three seeds.
 FIND_SEED_LOCATIONS —Seed features will be selected to optimize performance.
 GET_SEEDS_FROM_FIELD —Nonzero entries in the Initialization Field will be used as starting points to grow groups.
 USE_RANDOM_SEEDS —Initial seed features will be randomly selected.
Initialization_Field (Optional) Field
The numeric field identifying seed features. Features with a value of 1 for this field will be used to grow groups.
Output_Report_File (Optional) File
The full path for the .pdf report file to be created summarizing group characteristics. This report provides a number of graphs
to help you compare the characteristics of each group. Creating the report file can add substantial processing time.
Evaluate_Optimal_Number_of_Groups (Optional) Boolean
 EVALUATE —Groupings from 2 to 15 will be evaluated.

 DO_NOT_EVALUATE —No evaluation of the number of groups will be performed. This is the default.

Code Sample
GroupingAnalysis example 1 (Python window)
The following Python window script demonstrates how to use the GroupingAnalysis tool.

import arcpy
import arcpy.stats as SS
arcpy.env.workspace = r"C:\GA"
SS.GroupingAnalysis("Dist_Vandalism.shp", "TARGET_FID", "outGSF.shp", "4",
"Join_Count;TOTPOP_CY;VACANT_CY;UNEMP_CY",
"NO_SPATIAL_CONSRAINT", "EUCLIDEAN", "", "", "FIND_SEED_LOCATIONS", "",
"outGSF.pdf", "DO_NOT_EVALUATE")

GroupingAnalysis example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the GroupingAnalysis tool.

# Grouping Analysis of Vandalism data in a metropolitan area


# using the Grouping Analysis Tool

# Import system modules


import arcpy, os
import arcpy.stats as SS

# Set geoprocessor object property to overwrite existing output, by default
arcpy.env.overwriteOutput = True

try:
    # Set the current workspace (to avoid having to specify the full path to
    # the feature classes each time)
    arcpy.env.workspace = r"C:\GA"

    # Join the vandalism point feature class to the reporting district polygon feature class
    # Process: Spatial Join
    fieldMappings = arcpy.FieldMappings()
    fieldMappings.addTable("ReportingDistricts.shp")
    fieldMappings.addTable("Vandalism2006.shp")

    sj = arcpy.SpatialJoin_analysis("ReportingDistricts.shp", "Vandalism2006.shp", "Dist_Vand.shp",
                                    "JOIN_ONE_TO_ONE",
                                    "KEEP_ALL",
                                    fieldMappings,
                                    "COMPLETELY_CONTAINS", "", "")

    # Use Grouping Analysis tool to create groups based on different variables or analysis fields
    # Process: Group Similar Features
    ga = SS.GroupingAnalysis("Dist_Vand.shp", "TARGET_FID", "outGSF.shp", "4",
                             "Join_Count;TOTPOP_CY;VACANT_CY;UNEMP_CY",
                             "NO_SPATIAL_CONSTRAINT", "EUCLIDEAN", "", "",
                             "FIND_SEED_LOCATIONS", "",
                             "outGSF.pdf", "DO_NOT_EVALUATE")

    # Use Summary Statistics tool to get the mean of the variables used to group
    # Process: Summary Statistics
    SumStat = arcpy.Statistics_analysis("outGSF.shp", "outSS",
                                        "Join_Count MEAN;VACANT_CY MEAN;TOTPOP_CY MEAN;UNEMP_CY MEAN",
                                        "SS_GROUP")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Qualified_field_names,
Output_has_M_values, M_resolution, M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance,
XY_resolution, XY_tolerance, Random_number_generator

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference. When the Output Coordinate System is based on degrees, minutes, and seconds,
geodesic distances are estimated using chordal distances.

Related Topics
Modeling spatial relationships
What is a z-score? What is a p-value?
Spatial weights
An overview of the Mapping Clusters toolset


Spatial Autocorrelation (Global Moran's I)


How Grouping Analysis works
Similarity Search
Directional Distribution (Standard Deviational Ellipse)

Copyright © 1995-2014 Esri. All rights reserved.

Optimized Hot Spot Analysis (Spatial Statistics)

Locate topic

Summary
Given incident points or weighted features (points or polygons), creates a map of statistically significant hot and cold spots using the Getis-
Ord Gi* statistic. It evaluates the characteristics of the input feature class to produce optimal results.
Learn more about how Optimized Hot Spot Analysis works

Illustration

Usage
 This tool identifies statistically significant spatial clusters of high values (hot spots) and low values (cold spots). It automatically
aggregates incident data, identifies an appropriate scale of analysis, and corrects for both multiple testing and spatial dependence.
This tool interrogates your data in order to determine settings that will produce optimal hot spot analysis results. If you want full
control over these settings, use the Hot Spot Analysis tool instead.

Note: Incident data are points representing events (crime, traffic accidents) or objects (trees,
stores) where your focus is on presence or absence rather than some measured attribute
associated with each point.

 The computed settings used to produce optimal hot spot analysis results are reported in the Results window. The associated
workflows and algorithms are explained in How Optimized Hot Spot Analysis works.
 This tool creates a new Output Feature Class with a z-score, p-value and confidence level bin (Gi_Bin) for each feature in the Input
Feature Class.
 The Gi_Bin field identifies statistically significant hot and cold spots, corrected for multiple testing and spatial dependence using the
False Discovery Rate (FDR) correction method. Features in the +/-3 bins (features with a Gi_Bin value of either +3 or -3) are
statistically significant at the 99 percent confidence level; features in the +/-2 bins reflect a 95 percent confidence level; features in
the +/-1 bins reflect a 90 percent confidence level; and the clustering for features with 0 for the Gi_Bin field is not statistically
significant.
 The z-score and p-value fields do not reflect any kind of FDR (False Discovery Rate) correction. For more information on z-scores and
p-values, see What is a z-score? What is a p-value?
 When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the
output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal
distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances,
at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points
on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to
connect those two points. Chordal distances are reported in meters.

Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal
distances are not a good estimate of geodesic distances beyond 30 degrees.

 The Input Features may be points or polygons. With polygons, an Analysis Field is required.
 If you provide an Analysis Field, it should contain a variety of values. The math for this statistic requires some variation in the
variable being analyzed; it cannot solve if all input values are 1, for example.
 With an Analysis Field, this tool is appropriate for all data (points or polygons) including sampled data. In fact, this tool is effective
and reliable even in cases where there is oversampling. With lots of features (oversampling) the tool has more information to
compute accurate and reliable results. With few features (undersampling), the tool will still do all it can to produce accurate and
reliable results, but there will be less information to work with.
Because the underlying Getis-Ord Gi* statistic used by this tool is asymptotically normal, even when the Analysis Field contains
skewed data, results are reliable.
 With point data you will sometimes be interested in analyzing data values associated with each point feature and will consequently
provide an Analysis Field. In other cases you will only be interested in evaluating the spatial pattern (clustering) of the point locations
or point incidents. The decision to provide an Analysis Field or not will depend on the question you are asking.
 Analyzing point features with an Analysis Field allows you to answer questions like: Where do high and low values cluster?

 The analysis field you select might represent:


 Counts (such as the number of traffic accidents at street intersections)

 Rates (such as city unemployment, where each city is represented by a point feature)
 Averages (such as the mean math test score among schools)
 Indices (such as a consumer satisfaction score for car dealerships across the country)


 Analyzing point features when there is no Analysis Field allows you to identify where point clustering is unusually (statistically
significant) intense or sparse. This type of analysis answers questions like: Where are there many points? Where are there very
few points?
 When you don't provide an Analysis Field the tool will aggregate your points in order to obtain point counts to use as an analysis
field. There are three possible aggregation schemes:
 For COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS, an appropriate polygon cell size is computed and used to create a fishnet
polygon mesh. The fishnet is positioned over the incident points and the points within each polygon cell are counted. If no
Bounding Polygons Defining Where Incidents Are Possible feature layer is provided, the fishnet cells with zero points are
removed and only the remaining cells are analyzed. When a bounding polygon feature layer is provided, all fishnet cells that fall
within the bounding polygons are retained and analyzed. The point counts for each polygon cell are used as the analysis field.
 For COUNT_INCIDENTS_WITHIN_AGGREGATION_POLYGONS, you need to provide the Polygons For Aggregating Incidents Into Counts
feature layer. The point incidents falling within each polygon will be counted and these polygons with their associated counts will
then be analyzed. The COUNT_INCIDENTS_WITHIN_AGGREGATION_POLYGONS is an appropriate aggregation strategy when points are
associated with administrative units such as tracts, counties, or school districts. You might also use this option if you want the
study area to remain fixed across multiple analyses to enhance making comparisons.
 For SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS, a snap distance is computed and used to aggregate nearby incident
points. Each aggregated point is given a count reflecting the number of incidents that were snapped together. The aggregated
points are then analyzed with the incident counts serving as the analysis field. The
SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS option is an appropriate aggregation strategy when you have many
coincident, or nearly coincident, points and want to maintain aspects of the spatial pattern of the original point data. In many
cases you will want to try both SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS and
COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS and see which result best reflects the spatial pattern of the original point data.
Fishnet solutions can artificially separate clusters of point incidents, but the output may be easier for some people to interpret
than weighted point output. A sketch of this comparison appears at the end of these usage notes.

Caution: Analysis of point data without specifying an Analysis Field only makes sense when you have
all known point incidents and when you can be confident there is no bias in the point
distribution you are analyzing. With sampled data you will almost always be including an
Analysis Field (unless you are specifically interested in the spatial pattern of your sampling
scheme).

 When you select COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS for the Incident Data Aggregation Method you may optionally provide
a Bounding Polygons Defining Where Incidents Are Possible feature layer. When no bounding polygons are provided, the tool cannot
know if a location without an incident should be a zero to indicate that an incident is possible at that location, but didn't occur, or if
the location should be removed from the analysis because incidents would never occur at that location. Consequently, when no
bounding polygons are provided, only fishnet cells with at least one incident are retained for analysis. If this isn't the behavior you
want, you can provide a Bounding Polygons Defining Where Incidents Are Possible feature layer to ensure that all locations within
the bounding polygons are retained. Fishnet cells with no underlying incidents will receive an incident count of zero.
 Any incidents falling outside the Bounding Polygons Defining Where Incidents Are Possible or the Polygons For Aggregating
Incidents Into Counts will be excluded from analysis.

 If you have the ArcGIS Spatial Analyst extension you can choose to create a Density Surface of your point Input Features. With point
Input Features , the Density Surface parameter is enabled when you specify an Analysis Field or select the
SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS for the Incident Data Aggregation Method. The output Density Surface will be
clipped to the raster analysis mask specified in the environment settings. If no raster mask is specified, the output raster layer will be
clipped to a convex hull around the Input Features. Requesting the density surface is sketched below.
 You should use the Generate Spatial Weights Matrix and Hot Spot Analysis (Getis-Ord Gi*) tools if you want to identify space-time hot
spots. More information about space-time cluster analysis is provided in the Space-Time Cluster Analysis topic.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.
 When this tool runs in ArcMap, the Output Features are automatically added to the table of contents with default rendering applied to
the Gi_Bin field. The hot-to-cold rendering applied is defined by a layer file in
<ArcGIS>/Desktop10.x/ArcToolbox/Templates/Layers. You can reapply the default rendering, if needed, by importing the
template layer symbology.
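
For example, a hedged sketch of the aggregation comparison suggested above (the 911 call dataset and output names are placeholders):

import arcpy
arcpy.env.workspace = r"C:\OHSA"  # hypothetical workspace

# Fishnet aggregation: incidents are counted within grid cells.
arcpy.OptimizedHotSpotAnalysis_stats("911Calls.shp", "ohsaFishnet.shp", "#",
                                     "COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS")

# Snap aggregation: nearby incidents are collapsed into weighted points.
arcpy.OptimizedHotSpotAnalysis_stats("911Calls.shp", "ohsaSnapped.shp", "#",
                                     "SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS")

# Compare the Gi_Bin patterns of the two outputs to see which better
# reflects the spatial pattern of the original incident points.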

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

License: The Density Surface parameter is only enabled when you have the ArcGIS Spatial Analyst
extension.
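
A minimal sketch of requesting the density surface when the Spatial Analyst extension is available (paths and names are hypothetical):

import arcpy
arcpy.env.workspace = r"C:\OHSA"  # hypothetical workspace

if arcpy.CheckExtension("Spatial") == "Available":
    arcpy.CheckOutExtension("Spatial")
    arcpy.OptimizedHotSpotAnalysis_stats("911Calls.shp", "ohsaWeighted.shp", "#",
                                         "SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS",
                                         "#", "#", "callDensity.tif")
    arcpy.CheckInExtension("Spatial")
else:
    print("Spatial Analyst is unavailable; the density surface cannot be created.")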

Syntax
OptimizedHotSpotAnalysis_stats (Input_Features, Output_Features, {Analysis_Field}, {Incident_Data_Aggregation_Method},
{Bounding_Polygons_Defining_Where_Incidents_Are_Possible}, {Polygons_For_Aggregating_Incidents_Into_Counts}, {Density_Surface})

Parameter Explanation Data Type
Input_Features Feature Layer
The point or polygon feature class for which hot spot analysis will be performed.
Output_Features Feature Class
The output feature class to receive the z-score, p-value, and Gi_Bin results.
Analysis_Field (Optional) Field
The numeric field (number of incidents, crime rates, test scores, and so on) to be evaluated.
Incident_Data_Aggregation_Method (Optional) String
The aggregation method to use to create weighted features for analysis from incident point data.
 COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS —A fishnet polygon mesh will overlay the incident point data and the number
of incidents within each polygon cell will be counted. If no bounding polygon is provided in the Bounding Polygons Defining
Where Incidents Are Possible parameter, only cells with at least one incident will be used in the analysis; otherwise, all cells
within the bounding polygons will be analyzed.
 COUNT_INCIDENTS_WITHIN_AGGREGATION_POLYGONS —You provide aggregation polygons to overlay the incident point data
in the Polygons For Aggregating Incidents Into Counts parameter. The incidents within each polygon are counted.
 SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS —Nearby incidents will be aggregated together to create a single
weighted point. The weight for each point is the number of aggregated incidents at that location.
Bounding_Polygons_Defining_Where_Incidents_Are_Possible (Optional) Feature Layer
A polygon feature class defining where the incident Input Features could possibly occur.
Polygons_For_Aggregating_Incidents_Into_Counts (Optional) Feature Layer
The polygons to use to aggregate the incident Input Features in order to get an incident count for each polygon feature.
Density_Surface (Optional) Raster Dataset
The output density surface of point input features. This parameter is only enabled when Input Features are points and you have
the ArcGIS Spatial Analyst extension. The output surface created will be clipped to the raster analysis mask specified in your
environment settings. If no raster mask is specified, the output raster layer will be clipped to a convex hull of the input features.

Code Sample
OptimizedHotSpotAnalysis example 1 (Python window)
The following Python window script demonstrates how to use the OptimizedHotSpotAnalysis tool.

import arcpy
arcpy.env.workspace = r"C:\OHSA"
arcpy.OptimizedHotSpotAnalysis_stats("911Count.shp", "911OptimizedHotSpots.shp", "#",
                                     "SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS")

OptimizedHotSpotAnalysis example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the OptimizedHotSpotAnalysis tool.

# Analyze the spatial distribution of 911 calls in a metropolitan area

# Import system modules


import arcpy

# Set geoprocessor object property to overwrite existing output, by default
arcpy.env.overwriteOutput = True

# Local variables...
workspace = r"C:\OHSA\data.gdb"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Create a polygon that defines where incidents are possible
    # Process: Minimum Bounding Geometry of 911 call data
    arcpy.MinimumBoundingGeometry_management("Calls911", "Calls911_MBG", "CONVEX_HULL", "ALL",
                                             "#", "NO_MBG_FIELDS")

    # Optimized Hot Spot Analysis of 911 call data using the fishnet aggregation method
    # with a bounding polygon of the 911 call data
    # Process: Optimized Hot Spot Analysis
    ohsa = arcpy.OptimizedHotSpotAnalysis_stats("Calls911", "Calls911_ohsaFishnet", "#",
                                                "COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS",
                                                "Calls911_MBG", "#", "#")

except:
    # If any error occurred when running the tool, print the messages
    print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Qualified_field_names,
Output_has_M_values, M_resolution, M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance,
XY_resolution, XY_tolerance, Cell_size, Mask, Snap_raster

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference. When the Output Coordinate System is based on degrees, minutes, and seconds,

geodesic distances are estimated using chordal distances.

Related Topics
Modeling spatial relationships
What is a z-score? What is a p-value?
Spatial weights
An overview of the Mapping Clusters toolset
Spatial Autocorrelation (Global Moran's I)
How Optimized Hot Spot Analysis Works
Hot Spot Analysis (Getis-Ord Gi*)
Cluster and Outlier Analysis (Anselin Local Moran's I)

Copyright © 1995-2014 Esri. All rights reserved.

Similarity Search (Spatial Statistics)

Locate topic

Summary
Identifies which candidate features are most similar or most dissimilar to one or more input features based on feature attributes.
Learn more about how Similarity Search works

Illustration

Usage
 You will provide a layer containing the Input Features To Match and a second layer containing the Candidate Features from which
matches will be obtained. Often your Input Features To Match and your Candidate Features will be in the same feature layer. While
one option is to create two separate datasets, you don't have to do this. It is much easier to create layers with two different selection
sets instead. Suppose you have a file with all crime incidents that have occurred over the past month. If you want to find all of the
crimes that are most similar to the latest carjacking, you could
 Using standard ArcMap selection tools or geoprocessing tools, select the record for the latest carjacking from the layer with all
crime incidents.
 Right-click the layer with the selection and click Selection > Create Layer From Selected Features. Use this new layer for the
Input Features To Match parameter.
 Switch the selection on the layer with all crime incidents. Use this layer for the Candidate Features parameter. This workflow
is sketched below.

Caution: A common mistake when all inputs come from a single dataset is to forget to switch the
selection so the Input Features To Match have exactly the same features as the
Candidate Features. It is very unlikely this is what you want. The most typical scenario
is to have a single Input Features To Match and many Candidate Features .
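
A hedged sketch of this selection workflow in a script (the dataset, layer names, and the query identifying the latest carjacking
are hypothetical):

import arcpy
arcpy.env.workspace = r"C:\Crime"  # hypothetical workspace

# Make a layer from the full incident dataset.
arcpy.MakeFeatureLayer_management("CrimeIncidents.shp", "crime_lyr")

# Select the record for the latest carjacking (the query is hypothetical).
arcpy.SelectLayerByAttribute_management("crime_lyr", "NEW_SELECTION", '"FID" = 1250')

# Create a layer from the selected feature; use it for Input Features To Match.
arcpy.MakeFeatureLayer_management("crime_lyr", "match_lyr")

# Switch the selection so the remaining incidents serve as the Candidate Features.
arcpy.SelectLayerByAttribute_management("crime_lyr", "SWITCH_SELECTION")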

 If there is more than one Input Features To Match, matching is based on averaged Attributes of Interest values. So, for example, if
there are two Input Features To Match and one of the Attributes of Interest is a population variable, the tool will look for Candidate
Features with populations that are most like the average population values. If the population values are 100 and 102, for example,
the tool will look for candidates with populations near 101.

Note: When you have more than one Input Features To Match , you will want to select Attributes
of Interest with similar values. If, for example, the population value for one of the inputs is
100 and the other input is 100,000, the tool will look for matches with populations near the
average of those two values: 50,050. Notice that this averaged value is nothing like the
population for either of the Input Features To Match.

 Output Features will always contain points unless the Input Features To Match and the Candidate Features are both polygons or both
polylines. Creating polygon or polyline Output Features can slow performance for large datasets, so you can check the Collapse
Output To Points parameter to force point geometries for improved performance.
 With the Most Or Least Similar parameter, you can search for features that are either MOST_SIMILAR or LEAST_SIMILAR to the Input
Features To Match. In some cases you will want to see both ends of the spectrum. If you enter 3 for the Number of Results

parameter and BOTH for the Most Or Least Similar parameter, for example, the tool will return the three most similar and the three
least similar candidate features.
 Any given solution match in the Output Features will either be a solution that is most similar or least similar to the target Input
Features To Match; a single solution cannot be both (and solution matches won't be duplicated in the Output Features ).
Consequently, when you select BOTH for the Most Or Least Similar parameter, the maximum number of resulting matches possible
(Number of Results) will be half the number of Candidate Features . When you enter a Number of Results value that is too large, the
tool will adjust it to the maximum possible.
 Sometimes, in order to explore the spatial pattern of similarity, you will want to rank similarity for all of the Candidate Features. An
easy way to indicate that you want all of the Candidate Features to be ranked is to enter zero for the Number of Results parameter.
The tool will then determine the number of valid features in the candidates dataset and write all of them to the Output Features in
rank order from most to least similar.
 For the Match Method parameter you may select ATTRIBUTE_VALUES, RANKED_ATTRIBUTE_VALUES, or ATTRIBUTE_PROFILES.
 For ATTRIBUTE_VALUES the most similar candidates will have the smallest sum of squared differences for all of the Attributes of
Interest; all values are standardized before differences are calculated.
 For RANKED_ATTRIBUTE_VALUES the most similar candidates will have the smallest sum of squared rank differences for all of the Attributes of
Interest. The Output Features reports these sums in the SIMINDEX (Sum of Squared Rank Differences) field.
 For ATTRIBUTE_PROFILES the cosine similarity is measured. Cosine similarity looks for the same relationships among standardized
attribute values rather than trying to match magnitudes. Suppose there are four Attributes of Interest called A1, A2, A3, and A4,
and that A2 is twice as large as A1, A3 is almost equal to A2, and A4 is three times larger than A3. For the ATTRIBUTE_PROFILES
Match Method the tool will be looking for candidates with those same attribute relationships: twice as large, then almost equal,
then three times larger. Because this method is looking at attribute relationships, you must specify a minimum of two Attributes
of Interest for this method. You might use the cosine similarity method (ATTRIBUTE_PROFILES) to find places like Los Angeles,
but at a smaller scale overall. The cosine similarity index ranges from 1.0 (perfect similarity) to -1.0 (perfect dissimilarity). The
cosine similarity index is written to the Output Features SIMINDEX (Cosine Similarity) field. (A small computational sketch of the match methods appears at the end of this Usage section, just before the Syntax.)
 The Attributes of Interest must be numeric and must be present (same field name and same field type) in both the Input Features To
Match and the Candidate Features datasets. For the Attributes of Interest parameter, the tool will list all numeric fields found in the
Input Features To Match dataset. If the tool doesn't find corresponding fields for the Candidate Features you will see a warning
indicating the missing attributes were dropped from the analysis. If all of the Attributes of Interest are dropped, the tool has nothing
to use for matching and you will get an error indicating the tool cannot perform the analysis.
 All of the attributes used for matching are written to the Output Features . The Fields To Append To Output parameter allows you to
include other fields in the output table, if desired. Because numeric Attributes of Interest fields are probably not effective identifiers,
you may want to append a name or other identifier field for each solution match. If you need to decide among several matching
solutions, you may want to append other nonnumeric attributes as well. If the solution you are seeking must be one of several land-
use types, for example, appending a categorical land-use attribute will help you home in on solutions that meet this requirement.
Sometimes you will want to include additional numeric attributes in the output table for reference purposes only. Suppose, for
example, you are looking for suitable habitat for a particular animal. You can use known locations where the species is successful for
the Input Features To Match. You can select Attributes of Interest that relate to species success. In addition, you might append a
numeric area attribute to the Output Features, not because you want to actually match on the area value of the target, but because
ultimately you are looking for solutions with the largest areas possible.
 All of the Input Features To Match and solution matches are written to the Output Features along with Attributes of Interest and the
Fields To Append To Output. In addition, the following fields are included in the Output Features:
MATCH_ID (alias MATCH_ID): All of the target features in the Input Features To Match layer are listed first, with their OID or FID identifier written to the MATCH_ID field. Solution matches have NULL values for this field. When the Output Features is a shapefile, NULL values are represented by a very large negative number (such as -21474836).

CAND_ID (alias CAND_ID): All of the solution matches are listed next, and this value is their OID or FID identifier. The target features in the Input Features To Match layer have NULL values for this field. When the Output Features is a shapefile, NULL values are represented by a very large negative number (such as -21474836).

SIMRANK (alias Similarity Rank): When you select MOST_SIMILAR or BOTH for the Most Or Least Similar parameter, all of the solution matches are ranked from most similar to least similar. The most similar solution match has a rank value of 1. This field is only included in the Output Features when you select MOST_SIMILAR or BOTH for the Most Or Least Similar parameter.

DSIMRANK (alias Dissimilarity Rank): When you select LEAST_SIMILAR or BOTH for the Most Or Least Similar parameter, all of the solution matches are ranked from least similar to most similar. The solution that is least similar gets a rank value of 1. This field is only included in the Output Features when you select LEAST_SIMILAR or BOTH for the Most Or Least Similar parameter.

SIMINDEX (alias Sum of Squared Value Differences, Sum of Squared Rank Differences, or Cosine Similarity): This field quantifies how similar each solution match is to the target feature. When you specify ATTRIBUTE_VALUES for the Match Method, the field alias is Sum of Squared Value Differences; for RANKED_ATTRIBUTE_VALUES it is Sum of Squared Rank Differences; for ATTRIBUTE_PROFILES it is Cosine Similarity. If there is only one Input Features To Match, the target feature is that feature; when more than one Input Features To Match is specified, the target feature is a temporary feature created with averaged values for all of the Attributes Of Interest. For more information about how these indices are computed, see How Similarity Search Works.

LABELRANK (alias Render Rank): This field is used for display purposes only. The tool uses this field to provide default rendering of the analysis results.

 When this tool runs in ArcMap, the Output Features are automatically added to the table of contents with default rendering applied to

the LABELRANK field. The rendering applied is defined by a layer file in <ArcGIS>/Desktop10.x/ArcToolbox/Templates/Layers. You
can reapply the default rendering, if needed, by importing the template layer symbology.

Note: The default sample size is 10,000 records. When the Number Of Results is larger than this
default, you will want to increase the sampling size to render all of the results.
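
As a reading aid for the match methods described above, the sketch below shows, outside of ArcGIS, how a sum of squared standardized value differences and a cosine similarity index could be computed with NumPy. The attribute values are hypothetical, and the standardization shown (z-scores over the candidates) is a simplification of the tool's internal processing; the ranked method would substitute attribute ranks for the standardized values before differencing.

# Illustrative only: simplified versions of the ATTRIBUTE_VALUES and
# ATTRIBUTE_PROFILES measures. Attribute values are hypothetical.
import numpy as np

target = np.array([12.0, 340.0, 7.5])            # averaged Attributes of Interest for the target
candidates = np.array([[10.0, 300.0, 8.0],
                       [25.0, 900.0, 2.0],
                       [13.0, 355.0, 7.0]])

# Standardize every attribute column (z-scores)
mean = candidates.mean(axis=0)
std = candidates.std(axis=0)
z_target = (target - mean) / std
z_cand = (candidates - mean) / std

# ATTRIBUTE_VALUES: smallest sum of squared standardized differences is most similar
ssd = ((z_cand - z_target) ** 2).sum(axis=1)

# ATTRIBUTE_PROFILES: cosine similarity (1.0 = identical profile, -1.0 = opposite profile)
cosine = np.dot(z_cand, z_target) / (np.linalg.norm(z_cand, axis=1) * np.linalg.norm(z_target))

print(ssd.argsort())        # candidate indices ranked most to least similar by value
print((-cosine).argsort())  # candidate indices ranked most to least similar by profile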

Syntax
SimilaritySearch_stats (Input_Features_To_Match, Candidate_Features, Output_Features, Collapse_Output_To_Points,
Most_Or_Least_Similar, Match_Method, Number_Of_Results, Attributes_Of_Interest, {Fields_To_Append_To_Output})

Input_Features_To_Match (Feature Layer)
The layer (or a selection on a layer) containing the features you want to match; you are searching for other features that look like these features. When more than one feature is provided, matching is based on attribute averages.
Tip: When your Input Features To Match and Candidate Features come from a single dataset, right-click the layer and choose Selection followed by Create Layer From Selected Features; use the new layer created for this parameter. Next, right-click the layer again and choose Selection followed by Switch Selection to get the layer you will use for your Candidate Features.

Candidate_Features (Feature Layer)
The layer (or a selection on a layer) containing candidate matching features. The tool will look for features most like (or most unlike) the Input Features To Match among these candidates.
Tip: When your Input Features To Match and Candidate Features come from a single dataset, right-click the layer and choose Selection followed by Create Layer From Selected Features; use the new layer created for the Input Features To Match parameter. Next, right-click the layer again and choose Selection followed by Switch Selection to get the layer you will use for this parameter.

Output_Features (Feature Class)
The output feature class contains a record for each of the Input Features To Match and for all of the solution matching features found.

Collapse_Output_To_Points (Boolean)
Specify whether you want the geometry for the Output_Features to be points or to match the geometry (lines or polygons) of the input features. This option is only available when the Input_Features_To_Match and the Candidate_Features are both lines or both polygons. Choosing COLLAPSE for large line or polygon datasets will improve tool performance.
NO_COLLAPSE —The output geometry will match the line or polygon geometry of the input features. This is the default.
COLLAPSE —The line and polygon features will be represented as feature centroids (points).

Most_Or_Least_Similar (String)
Choose whether you are interested in features that are most alike or most different from the Input Features To Match.
MOST_SIMILAR —Find the features that are most alike.
LEAST_SIMILAR —Find the features that are most different.
BOTH —Find both the features that are most alike and the features that are most different.

Match_Method (String)
Choose whether matching should be based on values, ranks, or cosine relationships.
ATTRIBUTE_VALUES —Similarity or dissimilarity will be based on the sum of squared standardized attribute value differences for all of the Attributes Of Interest.
RANKED_ATTRIBUTE_VALUES —Similarity or dissimilarity will be based on the sum of squared rank differences for all of the Attributes Of Interest.
ATTRIBUTE_PROFILES —Similarity or dissimilarity will be computed as a function of cosine similarity for all of the Attributes Of Interest.

Number_Of_Results (Long)
The number of solution matches to find. Entering zero or a number larger than the total number of Candidate Features will return rankings for all of the candidate features.

Attributes_Of_Interest [field,...] (Field)
A list of numeric attributes representing the matching criteria.

Fields_To_Append_To_Output [field,...] (Optional; Field)
An optional list of attributes to include with the Output Features. You might want to include a name identifier, categorical field, or date field, for example. These fields are not used to determine similarity; they are only included in the Output Features for your reference.

Code Sample
SimilaritySearch example 1 (Python window)
The following Python window script demonstrates how to use the SimilaritySearch tool.


import arcpy
import arcpy.stats as SS

arcpy.env.workspace = r"C:\Analysis"
SS.SimilaritySearch("Crime_selection", "AllCrime", "c:\\Analysis\\CrimeMatches",
                    "NO_COLLAPSE", "MOST_SIMILAR", "ATTRIBUTE_VALUES", 4,
                    "HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP", "Name;WEAPON")

SimilaritySearch example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the SimilaritySearch tool.

# Similarity Search of crime data in a metropolitan area

# Import system modules
import arcpy
import arcpy.stats as SS

# Set the geoprocessing environment to overwrite existing output
arcpy.env.overwriteOutput = True

try:
    # Set the current workspace (to avoid having to specify the full path to
    # the feature classes each time)
    arcpy.env.workspace = r"C:\Analysis"

    # Make a layer from the crime feature class
    arcpy.MakeFeatureLayer_management("AllCrime", "Crime_selection")

    # Select the target crime (the latest carjacking) to match
    # Process: Select Layer By Attribute
    arcpy.SelectLayerByAttribute_management("Crime_selection", "NEW_SELECTION",
                                            '"OBJECTID" = 1230043')

    # Use Similarity Search to find the candidate crimes most like the
    # selected incident, based on the listed Attributes of Interest
    # Process: Similarity Search
    SS.SimilaritySearch("Crime_selection", "AllCrime", "CJMatches", "NO_COLLAPSE",
                        "MOST_SIMILAR", "ATTRIBUTE_VALUES", 4,
                        "HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP", "Name;WEAPON")

except:
    # If an error occurred when running the tool, print out the error message.
    print(arcpy.GetMessages())

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Qualified_field_names,
Output_has_M_values, M_resolution, M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance,
XY_resolution, XY_tolerance

Related Topics
How Similarity Search works
Grouping Analysis

Copyright © 1995-2014 Esri. All rights reserved.

How Hot Spot Analysis (Getis-Ord Gi*) works

Locate topic

The Hot Spot Analysis tool calculates the Getis-Ord Gi* statistic (pronounced G-i-star) for each feature in a dataset. The resultant z-scores and
p-values tell you where features with either high or low values cluster spatially. This tool works by looking at each feature within the context of
neighboring features. A feature with a high value is interesting but may not be a statistically significant hot spot. To be a statistically significant
hot spot, a feature will have a high value and be surrounded by other features with high values as well. The local sum for a feature and its
neighbors is compared proportionally to the sum of all features; when the local sum is very different from the expected local sum, and when
that difference is too large to be the result of random chance, a statistically significant z-score results. When the FDR correction is applied,
statistical significance is adjusted to account for multiple testing and spatial dependency.

Calculations
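
The formula illustration from this Calculations section is not reproduced here. As a reading aid, the short NumPy sketch below computes a Gi* z-score for a single feature from hypothetical values and binary neighbor weights, following the standard published form of the statistic (Getis and Ord 1992; Ord and Getis 1995); it is an illustration only, not the tool's implementation.

# Sketch of the Getis-Ord Gi* z-score for one feature i, using hypothetical
# analysis field values and binary spatial weights. Gi* includes feature i
# in its own neighborhood.
import numpy as np

x = np.array([12.0, 15.0, 9.0, 30.0, 28.0, 31.0, 5.0, 7.0])  # analysis field values
w_i = np.zeros(x.size)
w_i[[2, 3, 4]] = 1.0          # feature i = 3 plus its neighbors (i itself included)

n = x.size
x_bar = x.mean()
S = np.sqrt((x ** 2).sum() / n - x_bar ** 2)

num = (w_i * x).sum() - x_bar * w_i.sum()
den = S * np.sqrt((n * (w_i ** 2).sum() - w_i.sum() ** 2) / (n - 1))
gi_star = num / den           # a z-score: large positive = hot spot, large negative = cold spot
print(gi_star)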


Interpretation
The Gi* statistic returned for each feature in the dataset is a z-score. For statistically significant positive z-scores, the larger the z-score is,
the more intense the clustering of high values (hot spot). For statistically significant negative z-scores, the smaller the z-score is, the more
intense the clustering of low values (cold spot). For more information about determining statistical significance and correcting for multiple
testing and spatial dependency, see What is a z-score? What is a p-value?

Output
This tool creates a new Output Feature Class with a z-score, p-value, and confidence level bin (Gi_Bin) for each feature in the Input Feature
Class. If there is a selection set applied to the Input Feature Class, only selected features will be analyzed, and only selected features will
appear in the Output Feature Class.
When this tool runs in ArcMap, the Output Feature Class is automatically added to the table of contents with default rendering applied to
the Gi_Bin field. The hot to cold rendering applied is defined by a layer file in <ArcGIS>/ArcToolbox/Templates/Layers. You can reapply
the default rendering, if needed, by importing the template layer symbology.

Hot spot analysis considerations


There are three things to consider when undertaking any hot spot analysis:
1. What is the Analysis Field (Input Field)? The hot spot analysis tool assesses whether high or low values (the number of crimes,
accident severity, or dollars spent on sporting goods, for example) cluster spatially. The field containing those values is your
Analysis Field. For point incident data, however, you may be more interested in assessing incident intensity than in analyzing the
spatial clustering of any particular value associated with the incidents. In that case, you will need to aggregate your incident data
prior to analysis. There are several ways to do this:
 If you have polygon features for your study area, you can use the Spatial Join tool to count the number of events in each
polygon. The resultant field containing the number of events in each polygon becomes the Input Field for analysis.
 Use the Create Fishnet tool to construct a polygon grid over your point features. Then use the Spatial Join tool to count the
number of events falling within each grid polygon. Remove any grid polygons that fall outside your study area. Also, in cases
where many of the grid polygons within the study area contain zeros for the number of events, increase the polygon grid
size, if appropriate, or remove those zero-count grid polygons prior to analysis.
 Alternatively, if you have a number of coincident points or points within a short distance of one another, you can use
Integrate with the Collect Events tool to (1) snap features within a specified distance of each other together, then (2) create
a new feature class containing a point at each unique location with an associated count attribute to indicate the number of
events/snapped points. Use the resultant ICOUNT field as your Input Field for analysis.

Note: If you are concerned that your coincident points may be redundant records, the Find
Identical tool can help you to locate and remove duplicates.


Strategies for aggregating incident data

2. Which Conceptualization of Spatial Relationships is appropriate? What Distance Band or Threshold Distance value is best?
The recommended (and default) Conceptualization of Spatial Relationships for the Hot Spot Analysis (Getis-Ord Gi*) tool is
Fixed Distance Band. Space-Time Window, Zone of Indifference, Contiguity, K Nearest Neighbor, and Delaunay Triangulation
may also work well. For a discussion of best practices and strategies for determining an analysis distance value, see Selecting a
Conceptualization of Spatial Relationships and Selecting a Fixed Distance. For more information about space-time hot spot
analysis, see Space-Time Analysis.
3. What is the question?
This may seem obvious, but how you construct the Input Field for analysis determines the types of questions you can ask. Are
you most interested in determining where you have lots of incidents, or where high/low values for a particular attribute cluster
spatially? If so, run Hot Spot Analysis on the raw values or raw incident counts. This type of analysis is particularly helpful for
resource allocation types of problems. Alternatively (or in addition), you may be interested in locating areas with unexpectedly
high values in relation to some other variable. If you are analyzing foreclosures, for example, you probably expect more
foreclosures in locations with more homes (said another way, at some level, you expect the number of foreclosures to be a
function of the number of houses). If you divide the number of foreclosures by the number of homes, then run the Hot Spot
Analysis tool on this ratio, you are no longer asking Where are there lots of foreclosures?; instead, you are asking Where are
there unexpectedly high numbers of foreclosures, given the number of homes? By creating a rate or ratio prior to analysis, you
can control for certain expected relationships (for example, the number of crimes is a function of population; the number of
foreclosures is a function of housing stock) and identify unexpected hot/cold spots. A short scripting sketch of this rate-based workflow follows this list.
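
As a sketch of the rate-based workflow described in item 3 above, the script below adds a rate field and then runs Hot Spot Analysis on it. The feature class and field names (Tracts, FORECLOSURES, HOMES) and the output name are hypothetical, only the first few Hot Spot Analysis parameters are shown, and the fixed distance is left for the tool to determine; see the tool's documentation for the full parameter list.

# Illustrative only: create a foreclosure rate field, then analyze the rate.
# "Tracts", "FORECLOSURES", "HOMES", and the output name are hypothetical.
import arcpy

arcpy.env.workspace = r"C:\Analysis"

arcpy.AddField_management("Tracts", "FC_RATE", "DOUBLE")
# float() avoids integer division when both fields are integer types
arcpy.CalculateField_management("Tracts", "FC_RATE",
                                "float(!FORECLOSURES!) / !HOMES!", "PYTHON_9.3")

# Hot spots of the rate answer: Where are foreclosures unexpectedly high,
# given the number of homes?
arcpy.HotSpots_stats("Tracts", "FC_RATE", "Tracts_rate_hotspots",
                     "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "NONE")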

Best practice guidelines


 Does the Input Feature Class contain at least 30 features? Results aren't reliable with less than 30 features.
 Is the Conceptualization of Spatial Relationships you selected appropriate? For this tool, the Fixed Distance Band method is
recommended. For space-time hot spot analysis, see Selecting a Conceptualization of Spatial Relationships.
 Is the Distance Band or Threshold Distance appropriate? See Selecting a Fixed Distance.
 All features should have at least one neighbor.

 No feature should have all other features as neighbors.


 Especially if the values for the Input Field are skewed, you want features to have about eight neighbors each.

Potential applications
Applications can be found in crime analysis, epidemiology, voting pattern analysis, economic geography, retail analysis, traffic incident
analysis, and demographics. Some examples include the following:
 Where is the disease outbreak concentrated?
 Where are kitchen fires a larger than expected proportion of all residential fires?
 Where should the evacuation sites be located?
 Where/When do peak intensities occur?
 To which locations, and during what time periods, should we allocate more of our resources?

Additional resources
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.
Getis, A. and J.K. Ord. 1992. "The Analysis of Spatial Association by Use of Distance Statistics" in Geographical Analysis 24(3).
Ord, J.K. and A. Getis. 1995. "Local Spatial Autocorrelation Statistics: Distributional Issues and an Application" in Geographical Analysis 27
(4).
The spatial statistics resource page has short videos, tutorials, web seminars, articles and a variety of other materials to help you get
started with spatial statistics.
Scott, L. and N. Warmerdam. Extend Crime Analysis with ArcGIS Spatial Statistics Tools in ArcUser Online, April–June 2005.

Copyright © 1995-2014 Esri. All rights reserved.

How Cluster and Outlier Analysis (Anselin Local Moran's I) works


Locate topic

Given a set of features (Input Feature Class) and an analysis field (Input Field), the Cluster and Outlier Analysis tool identifies spatial clusters
of features with high or low values. The tool also identifies spatial outliers. To do this, the tool calculates a local Moran's I value, a z-score, a p-
value, and a code representing the cluster type for each statistically significant feature. The z-scores and p-values represent the statistical
significance of the computed index values.

Calculations

View additional mathematics for the local Moran's I statistic.
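
The detailed mathematics are available through the link above and are not repeated here. As a rough reading aid only, the sketch below evaluates a local Moran's I value for one feature using a common textbook formulation (Anselin 1995) with hypothetical values and binary weights; the exact variance scaling used by the tool may differ, so rely on the linked topic for the precise formulas.

# Rough sketch of a local Moran's I value for feature i (Anselin 1995 form);
# values and binary weights are hypothetical, and the tool's exact scaling may differ.
import numpy as np

x = np.array([12.0, 15.0, 9.0, 30.0, 28.0, 31.0, 5.0, 7.0])
i = 3
w_i = np.zeros(x.size)
w_i[[2, 4]] = 1.0                      # feature i's neighbors (i itself is excluded)

z = x - x.mean()                       # deviations from the global mean
m2 = (z ** 2).sum() / x.size           # second moment used to scale the index

I_i = (z[i] / m2) * (w_i * z).sum()
print(I_i)  # positive: i resembles its neighbors (cluster); negative: i is an outlier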

Interpretation
A positive value for I indicates that a feature has neighboring features with similarly high or low attribute values; this feature is part of a
cluster. A negative value for I indicates that a feature has neighboring features with dissimilar values; this feature is an outlier. In either
instance, the p-value for the feature must be small enough for the cluster or outlier to be considered statistically significant. For more
information on determining statistical significance, see What is a z-score? What is a p-value? Note that the local Moran's I index (I) is a
relative measure and can only be interpreted within the context of its computed z-score or p-value. The z-scores and p-values reported in
the output feature class are uncorrected for multiple testing or spatial dependency.
The cluster/outlier type (COType) field distinguishes between a statistically significant cluster of high values (HH), cluster of low values (LL),
outlier in which a high value is surrounded primarily by low values (HL), and outlier in which a low value is surrounded primarily by high
values (LH). Statistical significance is set at the 95 percent confidence level. When no FDR correction is applied, features with p-values
smaller than 0.05 are considered statistically significant. The FDR correction reduces this p-value threshold from 0.05 to a value that better
reflects the 95 percent confidence level given multiple testing.

Output
This tool creates a new output feature class with the following attributes for each feature in the input feature class: local Moran's I index, z-
score, p-value, and COType.
When this tool runs in ArcMap, the output feature class is automatically added to the table of contents (TOC) with default rendering applied
to the COType field. The rendering applied is defined by a layer file in <ArcGIS>/ArcToolbox/Templates/Layers. You can reapply the
default rendering, if needed, by importing the template layer symbology.

Best practice guidelines


 Results are only reliable if the input feature class contains at least 30 features.
 This tool requires an input field such as a count, rate, or other numeric measurement. If you are analyzing point data, where each point
represents a single event or incident, you might not have a specific numeric attribute to evaluate (a severity ranking, count, or other
measurement). If you are interested in finding locations with many incidents (hot spots) and/or locations with very few incidents (cold
spots), you will need to aggregate your incident data prior to analysis. The Hot Spot Analysis (Getis-Ord Gi*) tool is also effective for
finding hot and cold spots. Only the Cluster and Outlier Analysis (Anselin Local Moran's I) tool, however, will identify statistically
significant spatial outliers (a high value surrounded by low values or a low value surrounded by high values).
 Select an appropriate conceptualization of spatial relationships.
 When you select the SPACE_TIME_WINDOW conceptualization, you can identify space-time clusters and outliers. See Space-Time Analysis
for more information.
 Select an appropriate distance band or threshold distance.
 All features should have at least one neighbor.

 No feature should have all other features as a neighbor.


 Especially if the values for the input field are skewed, each feature should have about eight neighbors.

Potential applications
The Cluster and Outlier Analysis (Anselin Local Moran's I) tool identifies concentrations of high values, concentrations of low values, and
spatial outliers. It can help you answer questions such as these:
 Where are the sharpest boundaries between affluence and poverty in a study area?
 Are there locations in a study area with anomalous spending patterns?
 Where are the unexpectedly high rates of diabetes across the study area?
Applications can be found in many fields including economics, resource management, biogeography, political geography, and demographics.

Additional resources
Anselin, Luc. "Local Indicators of Spatial Association—LISA," Geographical Analysis 27(2): 93–115, 1995.
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

How Grouping Analysis works

Locate topic

Whenever we look at the world around us, it is very natural for us to organize, group, differentiate, and catalog what we see to help us make
better sense of it; this type of mental classification process is fundamental to learning and comprehension. Similarly, to help you learn about
and better comprehend your data, you can use the Grouping Analysis tool. It performs a classification procedure that tries to find natural
clusters in your data. Given the number of groups to create, it will look for a solution where all the features within each group are as similar as
possible, and all the groups themselves are as different as possible. Feature similarity is based on the set of attributes that you specify for the
Analysis Fields parameter and may optionally incorporate spatial properties or space-time properties. When space or space-time Spatial
Constraints is specified, the algorithm employs a connectivity graph (minimum spanning tree) to find natural groupings. When
NO_SPATIAL_CONSTRAINT is specified, the Grouping Analysis tool uses a K Means algorithm.
While hundreds of cluster analysis algorithms such as these exist, the problem they solve is NP-hard. This means that the only way to ensure
that a solution perfectly maximizes both within-group similarity and between-group difference is to try every possible combination of the
features you want to group. While this might be feasible with a handful of features, the problem quickly becomes intractable.
Not only is it intractable to ensure that you've found an optimal solution, it is also unrealistic to try to identify a grouping algorithm that will
perform best for all possible data scenarios. Groups come in all different shapes, sizes, and densities; attribute data can include a variety of
ranges, symmetry, continuity, and measurement units. This explains why so many different cluster analysis algorithms have been developed
over the past 50 years. It is most appropriate, therefore, to think of Grouping Analysis as an exploratory tool that can help you learn more
about underlying structures in your data.

Potential applications
Some of the ways that this tool might be applied are listed here:
 Suppose you have salmonella samples from farms around your state and attributes including the type/class, location, and date/time. To
better understand how the bacteria are transmitted and spread, you can use the Grouping Analysis tool to partition the samples into
individual "outbreaks". You might decide to use a space-time constraint because samples for the same outbreak would be near each
other in both space and time and would also be associated with the same type or class of bacteria. Once the groups are determined,
you can use other spatial pattern analysis tools such as Standard Deviational Ellipse, Mean Center, or Near to analyze each outbreak.
 If you've collected data on animal sightings to better understand their territories, the Grouping Analysis tool might be helpful.
Understanding where and when salmon at different life stages congregate, for example, could assist with designing protected areas that
may help ensure successful breeding.
 As an agronomist, you may want to classify different types of soils in your study area. Using Grouping Analysis on the soil
characteristics found for a series of samples can help you identify clusters of distinct, spatially contiguous soil types.
 Grouping customers by their buying patterns, demographic characteristics, and travel patterns may help you design an efficient
marketing strategy for your company's products.
 Urban planners often need to divide cities into distinct neighborhoods to efficiently locate public facilities and promote local activism and
community engagement. Using Grouping Analysis on the physical and demographic characteristics of city blocks can help planners
identify spatially contiguous areas of their city that have similar physical and demographic characteristics.
 Ecological Fallacy is a well-known problem for statistical inference whenever analysis is performed on aggregated data. Often, the
aggregation scheme used for analysis has nothing to do with what we want to analyze. Census data, for example, is aggregated based
on population distributions that may not be the best choice for analyzing wildfires. Partitioning the smallest aggregation units possible
into homogeneous regions for a set of attributes that accurately relate to the analytic questions at hand is an effective method for
reducing aggregation bias and avoiding Ecological Fallacy.

Inputs
This tool takes point, polyline, or polygon Input Features, a unique ID field, a path for the Output Feature Class, one or more Analysis
Fields, an integer value representing the Number of Groups to create, and the type of Spatial Constraint—if any—that should be applied
within the grouping algorithm. There are also a number of optional parameters including one that allows you to create a PDF Output Report
File.


Analysis fields
Select numeric fields reflecting ratio, interval, or ordinal measurement systems. While nominal data may be represented using
dummy (binary) variables, these generally do not work as well as the other numeric variable types. For example, you could create a
variable called Rural and assign to each feature (each census tract, for example) a 1 if it is mostly rural and a 0 if it is mostly urban. A
better representation for this variable for use with Grouping Analysis, however, would be the amount or proportion of rural acreage
associated with each feature.
You should select variables that you think will distinguish one group of features from another. Suppose, for example, you are interested
in grouping school districts by student performance on standardized achievement tests. You might select Analysis Fields that include
overall test scores, results for particular subjects like math or reading, the proportion of students meeting some minimum test score
threshold, and so forth. When you run the Grouping Analysis tool, an R2 value is computed for each variable. In the summary below, for
example, school districts are grouped based on student test scores, the percentage of adults in the area who didn't finish high school, per
student spending, and average student-to-teacher ratios. Notice that the TestScores variable has the highest R2 value. This indicates that
this variable divides the school districts into groups most effectively. The R2 value reflects how much of the variation in the original
TestScores data was retained after the grouping process, so the larger the R2 value is for a particular variable, the better that variable is
at discriminating among your features.

Dive-in: R2 is computed as:


(TSS - ESS) / TSS
where TSS is the total sum of squares and ESS is the explained sum of squares. TSS is
calculated by squaring and then summing each value's deviation from the global mean of the
variable. ESS is calculated the same way, except that deviations are computed group by group:
the mean of the group a value belongs to is subtracted from that value, and the results are
squared and summed.
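
To make the definition concrete, here is a minimal sketch computing R2 for one variable from a hypothetical grouping; the values and group assignments below are made up.

# R2 = (TSS - ESS) / TSS for one analysis field, using hypothetical values and groups.
import numpy as np

values = np.array([52.0, 55.0, 47.0, 81.0, 78.0, 85.0, 30.0, 28.0])
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2])

tss = ((values - values.mean()) ** 2).sum()
ess = sum(((values[groups == g] - values[groups == g].mean()) ** 2).sum()
          for g in np.unique(groups))

r2 = (tss - ess) / tss
print(r2)   # closer to 1 means this variable discriminates well among the groups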

Number of groups
Sometimes you will know the number of groups most appropriate to your question or problem. If you have five sales managers and want
to assign each to their own contiguous region, for example, you would use 5 for the Number of Groups parameter. In many cases,
however, you won't have any criteria for selecting a specific number of groups; instead, you just want the number that best distinguishes
feature similarities and differences. To help you in this situation, you can check on the Evaluate Optimal Number of Groups parameter
and let the Grouping Analysis tool assess the effectiveness of dividing your features into 2, 3, 4, and up to 15 groups. Grouping
effectiveness is measured using the Calinski-Harabasz pseudo F-statistic, which is a ratio reflecting within-group similarity and between-
group difference:
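
The illustration of this ratio is not reproduced here. In its conventional Calinski-Harabasz form, and using an R2 computed as in the dive-in above but pooled over all the Analysis Fields, the pseudo F-statistic can be written as:

(R2 / (nc - 1)) / ((1 - R2) / (n - nc))

where n is the number of features and nc is the number of groups; larger values indicate groupings with greater within-group similarity and between-group difference. Treat this as a reference form rather than a restatement of the tool's exact computation.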

Suppose you want to create four spatially contiguous groups. In this case, the tool will create a minimum spanning tree reflecting both
the spatial structure of your features and their associated analysis field values. The tool then determines the best place to cut the tree to
create two separate groupings. Next, it decides which one of the two resultant groups should be divided to yield the best three group
solution. One of the two groups will be divided, the other group remains intact. Finally, it determines which of the resultant three
groupings should be divided in order to provide the best four group solutions. For each division, the best solution is the one that
maximizes both within-group similarity and between-group difference. A group can no longer be divided (except arbitrarily) when the
analysis field values for all the features within that group are identical. In the case where all resultant groups have features within them
that are identical, the Grouping Analysis tool stops creating new groups even if it has not yet reached the Number of Groups you have
specified. There is no basis for dividing a group when all of the Analysis Fields have identical values.

Spatial constraint
If you want the resultant groups to be spatially proximal, specify a spatial constraint. The CONTIGUITY options are enabled for polygon
feature classes and indicate that features can only be part of the same group if they share an edge (CONTIGUITY_EDGES_ONLY) or if they
share either an edge or a vertex (CONTIGUITY_EDGES_CORNERS) with another member of the group. The polygon contiguity options are
not good choices, however, if your dataset includes clusters of discontiguous polygons or polygons with no contiguous neighbors at all:

The DELAUNAY_TRIANGULATION and K_NEAREST_NEIGHBORS options are both appropriate for point or polygon features; these options
indicate that a feature will only be included in a group if at least one other group member is a natural neighbor (Delaunay Triangulation)
or a K Nearest Neighbor. If you select K_NEAREST_NEIGHBORS and enter 12 for the Number of Neighbors parameter, for example, every
feature in a group will be among the 12 nearest neighbors of at least one other feature in the group.
The DELAUNAY_TRIANGULATION option shouldn't be used for datasets with coincident features. Also, because the Delaunay Triangulation
method converts features to Thiessen polygons to determine neighbor relationships, especially with polygon features and sometimes with
peripheral features in your dataset, the results from using this option may not always be what you expect. In the illustration below,
notice that some of the grouped original polygons are not contiguous; when they are converted to Thiessen polygons, however, all the
grouped features do, in fact, share an edge:

For Delaunay Triangulation, Thiessen polygon contiguity defines neighbor relationships.

If you want the resultant groups to be both spatially and temporally proximal, create a spatial weights matrix file (SWM) using the
Generate Spatial Weights Matrix tool and select SPACE_TIME_WINDOW for the Conceptualization of Spatial Relationships parameter. You
can then specify the SWM file you created with the Generate Spatial Weights Matrix tool for the Weights Matrix File parameter when you
run Grouping Analysis.


Note: While the spatial relationships among your features are stored in an SWM file and used by the
Grouping Analysis tool to impose spatial constraints, there is no actual weighting involved in
the grouping process. The SWM file is only used to keep track of which features can and cannot
be included in the same group.

For many analyses, imposing a spatial or space-time constraint is neither required nor helpful. Suppose, for example, you want to group
crime incidents by perpetrator attributes (height, age, severity of the crime, and so forth). While crimes committed by the same person
may tend to be proximal, it is unlikely that you would find that all the crimes in a particular area were committed by the same person.
For this type of analysis, you would select NO_SPATIAL_CONSTRAINT for the Spatial Constraints parameter. You might, however, elect to
include some spatial variables (proximity to banks, for example) in your list of Analysis Fields to capture some of the spatial aspects of
the crimes you're analyzing.

K Means
When you select NO_SPATIAL_CONSTRAINT for the Spatial Constraints parameter, a K Means algorithm is used for grouping. The goal of the
K Means algorithm is to partition features so the differences among the features in a group, over all groups, are minimized. Because the
algorithm is NP-hard, a greedy heuristic is employed to group features. The greedy algorithm will always converge to a local minimum but
will not always find the global (most optimal) minimum.
The K Means algorithm works by first identifying seed features used to grow each group. Consequently, the number of seeds will always
match the Number of Groups. The first seed is selected randomly. Selection of remaining seeds, however, while still employing a random
component, applies a weighting that favors selection of subsequent seeds farthest in data space from the existing set of seed features (this
part of the algorithm is called K Means++). Because of the random component in finding seed features, whenever you select
FIND_SEED_LOCATIONS or USE_RANDOM_SEEDS for the Initialization Method, you might get variations in grouping results from one run of the
tool to the next.
Once the seed features are identified, all features are assigned to the closest seed feature (closest in data space). For each cluster of
features, a mean data center is computed, and each feature is reassigned to the closest center. The process of computing a mean data
center for each group and then reassigning features to the closest center continues until group membership stabilizes (up to a maximum
number of 100 iterations).
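
A bare-bones sketch of the assign-and-update loop described above is shown below, using NumPy on hypothetical standardized data. It uses plain random seeding rather than the K Means++ weighting and is an illustration only, not the tool's implementation.

# Bare-bones K Means loop: assign each feature to the nearest center in data space,
# recompute group centers, and repeat until membership stabilizes.
import numpy as np

rng = np.random.RandomState(0)
data = rng.rand(200, 3)          # 200 hypothetical features, 3 standardized analysis fields
k = 4                            # Number of Groups

centers = data[rng.choice(len(data), k, replace=False)]   # random seed features
for _ in range(100):             # the tool also caps iterations at 100
    dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    new_centers = np.array([data[labels == g].mean(axis=0) if np.any(labels == g)
                            else centers[g] for g in range(k)])
    if np.allclose(new_centers, centers):
        break                    # membership (and therefore the centers) has stabilized
    centers = new_centers

print(np.bincount(labels))       # number of features assigned to each group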

Minimum spanning tree


When you specify a spatial constraint to limit group membership to contiguous or proximal features, the tool first constructs a connectivity
graph representing the neighborhood relationships among features. From the connectivity graph, a minimum spanning tree is devised that
summarizes both feature spatial relationships and feature data similarity. Features become nodes in the minimum spanning tree connected
by weighted edges. The weight for each edge is proportional to the similarity of the objects it connects. After building the minimum
spanning tree, a branch (edge) in the tree is pruned, creating two minimum spanning trees. The edge to be pruned is selected so that it
minimizes dissimilarity in the resultant groups, while avoiding (if possible) singletons (groups with only one feature). At each iteration one
of the minimum spanning trees is divided by this pruning process until the Number of Groups specified is obtained. The published method
employed is called SKATER (Spatial "K"luster Analysis by Tree Edge Removal). While the branch that optimizes group similarity is selected
for pruning at each iteration, there is no guarantee that the final result will be optimal.
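
As an illustration of the structure this method starts from, the sketch below builds a small connectivity graph whose edge weights reflect attribute dissimilarity and reduces it to a minimum spanning tree with SciPy. The feature values and contiguity edges are hypothetical, and the SKATER pruning step itself is not shown.

# Build the structure SKATER starts from: a connectivity graph whose edge weights
# reflect attribute dissimilarity, reduced to a minimum spanning tree.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

values = np.array([10.0, 12.0, 30.0, 31.0, 11.0])           # one standardized analysis field
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 4)]    # which features are contiguous

n = len(values)
graph = np.zeros((n, n))
for i, j in edges:
    a, b = min(i, j), max(i, j)
    graph[a, b] = abs(values[i] - values[j]) + 1e-9          # dissimilarity as a nonzero edge weight

mst = minimum_spanning_tree(csr_matrix(graph))
print(mst.toarray())   # n-1 retained edges; pruning one edge splits the tree into two groups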

Outputs
A number of outputs are created by the Grouping Analysis tool. All these (including the optional PDF report file) can be accessed from the
Results window. If you disable background processing, results will also be written to the Progress dialog box. These messages (shown
below) summarize information presented in the optional PDF report (described below).


The default output for the Grouping Analysis tool is a new Output Feature Class containing the fields used in the analysis plus a new Integer
field named SS_GROUP identifying which group each feature belongs to. This output feature class is added to the table of contents with a
unique color rendering scheme applied to the SS_GROUP field. Hollow rendering indicates features that could not be added to any group,
usually because they have no neighboring features. If you specify NO_SPATIAL_CONSTRAINT for the Spatial Constraints parameter, an
additional field, SS_SEED, is added to the output feature class to indicate which seed features were used to grow groups.

Grouping with Contiguity Spatial Constraint

Grouping analysis report file


If you specify a path for the Output Report File parameter, a PDF is created summarizing the groups that were created.

Note: Creating the optional report file can add substantial processing time. Consequently, while
Grouping Analysis will always create an output feature class showing group membership, the
PDF report file will not be created if you specify more than 15 groups or more than 15
variables.

Box plots are included throughout the report, so the first element in the report is a graphic showing you how to interpret them (see below).
The box plots in the Grouping Analysis report graphically depict nine summary values for each analysis field and group: minimum data
value, lower quartile, median, upper quartile, maximum data value, data outliers (values smaller or larger than 1.5 times the interquartile
range), group minimum, group mean, and group maximum. Any + marks falling outside the upper or lower whisker represent data
outliers.

Dive-in: The interquartile range (IQR) is the upper quartile minus the lower quartile. Low outliers
are values less than Q1 - 1.5*IQR, and high outliers are values greater than
Q3 + 1.5*IQR. Outliers appear in the box plots as + symbols.
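
A quick check of these definitions on hypothetical values:

# Box plot outlier fences from the definitions above, using hypothetical values.
import numpy as np

v = np.array([3.0, 4.0, 5.0, 5.5, 6.0, 7.0, 8.0, 25.0])
q1, q3 = np.percentile(v, [25, 75])
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(v[(v < low_fence) | (v > high_fence)])   # values plotted as + outliers (here, 25.0)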

The first page of the report compares the variables (the Analysis Fields) within each group to each other. In the report below, for
example, Grouping Analysis was performed on census tracts to create four groups. Summary statistics for each group are printed using a
different color (blue, red, green, and gold). The first set of summary statistics are printed in black because these are the global Mean,
Standard Deviation (Std.Dev.), Minimum, Maximum, and R2 values for all data in each analysis field. The larger the R2 value is for a
particular variable, the better that variable is at discriminating among your features. After the global summaries, the Mean, Standard
Deviation, Minimum, Maximum, and Share values are reported for each variable in each group. In the report below, for example, you see
that Group 1 (Blue) contains 52 percent of the range of values in the global AGE_UNDER5 variable; the global range of values is from 0
to 1,453 children under the age of 5, and the blue group contains tracts with from 488 to 1,246 children under the age of 5. The mean
number of children under 5 for the tracts in the blue group is 805.3750. The box plot to the right of the blue group statistical summary
shows how the group's values relate to the global values for that same analysis field. Notice that the blue dot on the box plot falls outside
the upper quartile and that the first blue vertical line (representing the minimum value for the blue group tracts) is above the global
mean for this field. In fact, looking at where the blue dots fall in the box plots for all the variables, you can see that, except for the
MEDIANRENT variable, the mean values for all the analysis fields are above the upper quartile. This group has the highest range of values
compared to the other groups.

Dive-in: The Share value is the ratio of the group and global range. For group 1 and the AGE_UNDER5
variable, for example, the 52 percent share is obtained by dividing the group range (1246-
488=758) by the global range (1453-0=1453), yielding 0.52 when rounded to two significant
digits.


Section 1 of the output report

The second section of the report compares the variable ranges for each group, one analysis field (variable) at a time. With this view of
the data, it is easy to see which group has the highest and lowest range of values within each variable. The group minimum, mean, and
maximum values are superimposed on top of the box plot reflecting all values. Notice that group 4 (orange) has the lowest values for the
MEDIANRENT variable. The minimum, mean, and maximum values for this group are lower than for any other group.


Section 2 of the output report

The parallel box plot graph summarizes both the groups and the variables within them. From the graph below, notice that group 1 (blue)
reflects tracts with average rents, the highest values for female-headed households with children (FHH_CHILD), the highest values for
number of housing units (HSE_UNITS), and the highest values for children under the age of 5. Group 2 (red) reflects tracts with the
highest median rents, lowest number of female-headed households with children, more than the average number of housing units
(though fewer than the tracts in groups 1 or 3), and the fewest children under the age of 5.

Parallel box plot in the output report

When you check on the Evaluate Optimal Number of Groups parameter, the PDF report file will include a graph of pseudo F-statistic
values. The circled point on the graph is the largest F-statistic, indicating how many groups will be most effective at distinguishing the
features and variables you specified. In the graph below, the F-statistic associated with four groups is highest. Five groups, with a high
pseudo F-statistic, would also be a good choice.


Pseudo F-statistic plot in the output report

Best practices
While there is a tendency to want to include as many Analysis Fields as possible, for Grouping Analysis, it works best to start with a single
variable and build. Results are much easier to interpret with fewer analysis fields. It is also easier to determine which variables are the best
discriminators when there are fewer fields.
In many scenarios, you will likely run the Grouping Analysis tool a number of times looking for the optimal Number of Groups, most
effective Spatial Constraints, and the combination of Analysis Fields that best separate your features into groups. Because creating the
Output Report can add substantial processing time, you will likely not want to create the report while you are experimenting with different
input parameters.

Additional Resources
Duque, J. C., R. Ramos, and J. Surinach. 2007. "Supervised Regionalization Methods: A Survey" in International Regional Science Review
30: 195–220.
Assuncao, R. M., M. C. Neves, G. Camara, and C. Da Costa Freitas. 2006. "Efficient Regionalisation Techniques for Socio-economic
Geographical Units using Minimum Spanning Trees" in International Journal of Geographical Information Science 20 (7): 797–811.
Jain, A. K. 2009. "Data Clustering: 50 years beyond K-Means." Pattern Recognition Letters.
Hinde, A., T. Whiteway, R. Ruddick, and A. D. Heap. 2007. "Seascapes of the Australian Margin and adjacent sea floor: Keystroke
Methodology." in Geoscience Australia, Record 2007/10, 58pp.

Copyright © 1995-2014 Esri. All rights reserved.

How Optimized Hot Spot Analysis Works

Locate topic

Optimized Hot Spot Analysis executes the Hot Spot Analysis (Getis-Ord Gi*) tool using parameters derived from characteristics of your input
data. Similar to the way that the automatic setting on a digital camera will use lighting and subject versus ground readings to determine an
appropriate aperture, shutter speed, and focus, the Optimized Hot Spot Analysis tool interrogates your data to obtain the settings that will
yield optimal hot spot results. If, for example, the Input Features dataset contains incident point data, the tool will aggregate the incidents into
weighted features. Using the distribution of the weighted features, the tool will identify an appropriate scale of analysis. The statistical
significance reported in the Output Features will be automatically adjusted for multiple testing and spatial dependence using the False
Discovery Rate (FDR) correction method.
Each of the decisions the tool makes in order to give you the best results possible is reported to the Results window and an explanation for
these decisions is documented below. Right-clicking on the Messages entry in the Results window and selecting View will display this tool
runtime information in a Message dialog box.
Just like your camera has a manual mode that allows you to override the automatic settings, the Hot Spot Analysis (Getis-Ord Gi*) tool gives
you full control over all parameter options. Running the Optimized Hot Spot Analysis tool and noting the parameter settings it uses may help
you refine the parameters you provide to the full control Hot Spot Analysis (Getis-Ord Gi*) tool.
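
For orientation, a minimal scripted call is sketched below. The dataset and output names are hypothetical, only the required parameters (plus an optional Analysis Field, commented out) are shown, and all other parameters are left at their defaults.

# Minimal Optimized Hot Spot Analysis call; "Calls911" and the output name are hypothetical.
# With no Analysis Field supplied, the tool aggregates the incident points itself.
import arcpy

arcpy.env.workspace = r"C:\Analysis"
arcpy.OptimizedHotSpotAnalysis_stats("Calls911", "Calls911_hotspots")

# With a field to analyze (polygon or weighted point data), pass it as the third argument:
# arcpy.OptimizedHotSpotAnalysis_stats("Tracts", "Tracts_hotspots", "RATE")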
The workflow for the Optimized Hot Spot Analysis tool includes the following components. The calculations and algorithms used within each of
these components are described below.

Initial data assessment


In this component, the Input Features and the optional Analysis Field, Bounding Polygons Defining Where Incidents Are Possible, and
Polygons For Aggregating Incidents Into Points are scrutinized to ensure there are sufficient features and adequate variation in the values
to be analyzed. If the tool encounters records with corrupt or missing geometry, or if an Analysis Field is specified and null values are
present, the associated records will be listed as bad records and excluded from analysis.
The Optimized Hot Spot Analysis tool uses the Getis-Ord Gi* (pronounced Gee Eye Star) statistic and, similar to many statistical methods,
the results are not reliable when there are less than 30 features. If you provide polygon Input Features or point Input Features and an
Analysis Field, you will need a minimum of 30 features to use this tool. The minimum number of Polygons For Aggregating Incidents Into
Points is also 30. The feature layer representing Bounding Polygons Defining Where Incidents Are Possible may include one or more
polygons.


The Gi* statistic also requires values to be associated with each feature it analyzes. When the Input Features you provide represent
incident data (when you don't provide an Analysis Field), the tool will aggregate the incidents and the incident counts will serve as the
values to be analyzed. After the aggregation process completes, there still must be a minimum of 30 features, so with incident data you will
want to start with more than 30 features. The table below documents the minimum number of features for each Incident Data Aggregation
Method:

Aggregation Method: COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS, without specifying Bounding Polygons Defining Where Incidents Are Possible
Minimum Number of Incidents: 60; Minimum Number of Features After Aggregation: 30

Aggregation Method: COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS, when you do provide a feature class for the Bounding Polygons Defining Where Incidents Are Possible parameter
Minimum Number of Incidents: 30; Minimum Number of Features After Aggregation: 30

Aggregation Method: COUNT_INCIDENTS_WITHIN_AGGREGATION_POLYGONS
Minimum Number of Incidents: 30; Minimum Number of Features After Aggregation: 30

Aggregation Method: SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS
Minimum Number of Incidents: 60; Minimum Number of Features After Aggregation: 30

The Gi* statistic was also designed for an Analysis Field with a variety of different values. The statistic is not appropriate for binary data, for
example. The Optimized Hot Spot Analysis tool will check the Analysis Field to make sure that the values have at least some variation.
If you specify a path for the Density Surface, this component of the tool workflow will also check the raster analysis mask environment
setting. If no raster analysis mask is set, it will construct a convex hull around the incident points to use for clipping the output Density
Surface raster layer. The Density Surface parameter is only enabled when your Input Features are points and you have the ArcGIS Spatial
Analyst extension. It is disabled for all but the SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS Incident Data Aggregation Method.
Locational outliers are features that are much farther away from neighboring features than the majority of features in the dataset. Think of
an urban environment with large, densely populated cities in the center, and smaller, less densely populated cities at the periphery. If you
computed the average nearest neighbor distance for these cities you would find that the result would be smaller if you excluded the
peripheral locational outliers and focused only on the cities near the urban center. This is an example of how locational outliers can have a
strong impact on spatial statistics such as Average Nearest Neighbor. Since the Optimized Hot Spot Analysis tool uses the average and the
median nearest neighbor calculations for aggregation and also to identify an appropriate scale of analysis, the Initial Data Assessment
component of the tool will also identify any locational outliers in the Input Features or Polygons For Aggregating Incidents Into Points and
will report the number it encounters. To do this, the tool computes each feature's average nearest neighbor distance and evaluates the
distribution of all of these distances. Features that are more than a three standard deviation distance away from their closest noncoincident
neighbor are considered locational outliers.
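
The following plain-Python sketch illustrates one reading of this rule; the point list, helper names, and the brute-force nearest neighbor search are illustrative assumptions, not the tool's internal implementation:

import math

def nearest_noncoincident_distances(points):
    # For each point, find the distance to its closest noncoincident neighbor.
    # Assumes every point has at least one noncoincident neighbor.
    result = []
    for i, (x1, y1) in enumerate(points):
        best = None
        for j, (x2, y2) in enumerate(points):
            d = math.hypot(x2 - x1, y2 - y1)
            if i == j or d == 0:
                continue
            if best is None or d < best:
                best = d
        result.append(best)
    return result

def locational_outliers(points):
    # Flag features whose nearest neighbor distance is more than three
    # standard deviations above the mean nearest neighbor distance.
    d = nearest_noncoincident_distances(points)
    mean = sum(d) / len(d)
    sd = (sum((v - mean) ** 2 for v in d) / len(d)) ** 0.5
    return [i for i, v in enumerate(d) if v > mean + 3 * sd]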

Incident Aggregation
For incident data the next component in the workflow aggregates your data. There are three possible approaches based on the Incident
Data Aggregation Method you select. The algorithms for each of these approaches are described below.

 COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS:
1. Collapse coincident points yielding a single point at each unique location in the dataset, using the same method employed by
the Collect Events tool.
2. Compute both the average and median nearest neighbor distances on all of the unique location points, excluding locational
outliers. The average nearest neighbor distance (ANN) is computed by summing the distance to each feature's nearest
neighbor and dividing by the number of features (N). The median nearest neighbor distance (MNN) is computed by sorting
the nearest neighbor distances smallest to largest and selecting the distance that falls in the middle of the sorted list.
3. Set the initial cell size (CS) to the larger of either ANN or MNN.
4. Adjust the cell size to account for coincident points: Smaller = MIN(ANN, MNN); Larger = MAX(ANN, MNN); Scalar = MAX((Larger/Smaller), 2). The adjusted cell size becomes CS * Scalar (see the sketch after this list).
5. Construct a fishnet polygon mesh using the adjusted cell size and overlay the mesh with the incident points.
6. Count the incidents in each polygon cell.
7. When you provide Bounding Polygons Defining Where Incidents Are Possible, all polygon cells within the bounding polygons
are retained. When you do not provide Bounding Polygons Defining Where Incidents Are Possible, polygon cells with zero
incidents are removed.
8. If the aggregation process results in fewer than 30 polygon cells or if the counts in all the polygon cells are identical, you will
get a message indicating the Input Features you provided are not appropriate for the Incident Data Aggregation Method
selected; otherwise, the aggregation component for this method completes successfully.

 COUNT_INCIDENTS_WITHIN_AGGREGATION_POLYGONS:
1. For this Incident Data Aggregation Method, a Polygons For Aggregating Incidents Into Points feature layer is required.
These aggregation polygons overlay the incident points.
2. Count the incidents within each polygon.
3. Ensure there is sufficient variation in the incident counts for analysis. If the aggregation process results in all polygons
having the same number of incidents, you will get a message indicating the data is not appropriate for the Incident Data
Aggregation Method you selected.

 SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS:
1. Collapse coincident points yielding a single point at each unique location in the dataset, using the same method employed by
the Collect Events tool. Count the number of unique location features (UL).
2. Compute both the average and the median nearest neighbor distances on all of the unique location points, excluding
locational outliers. The average nearest neighbor distance (ANN) is computed by summing the distance to each feature's
nearest neighbor and dividing by the number of features (N). The median nearest neighbor distance (MNN) is computed by
sorting the nearest neighbor distances smallest to largest and selecting the distance that falls in the middle of the sorted list.
3. Set the initial snap distance (SD) to the smaller of either ANN or MNN.
4. Adjust the snap distance to account for coincident points: Scalar = (UL/N), where N is the number of features in the Input Features layer. The adjusted snap distance becomes SD * Scalar (see the sketch after this list).
5. Integrate the incident points in three iterations: first using the adjusted snap distance times 0.10, then using the adjusted snap distance times 0.25, and finally integrating with a snap distance equal to the fully adjusted snap distance. Performing
the integrate step in three passes minimizes distortion of the original point locations.
6. Collapse the snapped points yielding a single point at each location with a weight to indicate the number of incidents that
were snapped together. This part of the aggregation process uses the Collect_Events method.
7. If the aggregation process results in fewer than 30 weighted points or if the counts for all of the points are identical, you will
get a message indicating the Input Features you provided are not appropriate for the Incident Data Aggregation Method
selected; otherwise, the aggregation component for this method completes successfully.
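
The cell size and snap distance adjustments described in the steps above can be summarized with a short sketch; ANN, MNN, UL, and N stand for the quantities defined in the steps, and the function names are illustrative:

def fishnet_cell_size(ann, mnn):
    # COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS, steps 3 and 4: start from the larger
    # nearest neighbor distance, then scale it up to account for coincident points.
    cs = max(ann, mnn)
    smaller, larger = min(ann, mnn), max(ann, mnn)
    scalar = max(larger / smaller, 2.0)
    return cs * scalar

def snap_distance(ann, mnn, ul, n):
    # SNAP_NEARBY_INCIDENTS_TO_CREATE_WEIGHTED_POINTS, steps 3 and 4: start from the
    # smaller nearest neighbor distance, then scale it by the proportion of unique locations.
    sd = min(ann, mnn)
    return sd * (float(ul) / n)

# Example: ANN = 120.0 m, MNN = 80.0 m, 400 unique locations out of 500 incidents.
# fishnet_cell_size(120.0, 80.0) returns 240.0; snap_distance(120.0, 80.0, 400, 500) returns 64.0.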

Scale of analysis
This next component of the Optimized Hot Spot Analysis workflow is applied to weighted features either because you provided Input
Features with an Analysis Field or because the Incident Aggregation procedure has created weights from incident counts. The next step is
to identify an appropriate scale of analysis. The ideal scale of analysis is a distance that matches the scale of the question you are asking (if
you are looking for hot spots of a disease outbreak and know that the mosquito vector has a range of 10 miles, for example, using a 10-
mile distance would be most appropriate). When you can't justify any specific distance to use for your scale of analysis, there are some
strategies to help with this. The Optimized Hot Spot Analysis tool employs these strategies.
The first strategy tried is Incremental Spatial Autocorrelation. Whenever you see spatial clustering in the landscape, you are seeing
evidence of underlying spatial processes at work. The Incremental Spatial Autocorrelation tool performs the Global Moran's I statistic for a
series of increasing distances, measuring the intensity of spatial clustering for each distance. The intensity of clustering is determined by
the z-score returned. Typically, as the distance increases, so does the z-score, indicating intensification of clustering. At some particular
distance, however, the z-score generally peaks. Peaks reflect distances where the spatial processes promoting clustering are most
pronounced. The Optimized Hot Spot Analysis tool looks for peak distances using Incremental Spatial Autocorrelation. If a peak distance is
found, this distance becomes the scale for analysis. If multiple peak distances are found, the first peak distance is selected.
When no peak distance is found, Optimized Hot Spot Analysis examines the spatial distribution of the features and computes the average
distance that would yield K neighbors for each feature. K is computed as 0.05 * N, where N is the number of features in the Input Features
layer. K will be adjusted so that it is never smaller than three or larger than 30. If the average distance that would yield K neighbors
exceeds one standard distance, the scale of analysis will be set to one standard distance; otherwise, it will reflect the K neighbor average
distance.
The Incremental Spatial Autocorrelation step can take a long time to finish for large, dense datasets. Consequently, when a feature with
500 or more neighbors is encountered, the incremental analysis is skipped, and the average distance that would yield 30 neighbors is
computed and used for the scale of analysis.
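
The decision logic described above can be sketched as follows; the peak distance, the K-neighbor distance function, the standard distance, and the maximum neighbor count are placeholders for values the tool computes internally:

def scale_of_analysis(n, peak_distance, k_neighbor_distance, standard_distance, max_neighbors):
    # peak_distance: first peak found by Incremental Spatial Autocorrelation, or None
    # k_neighbor_distance: function returning the average distance that yields K neighbors
    # standard_distance: one standard distance of the weighted features
    # max_neighbors: largest neighbor count encountered for any single feature
    if max_neighbors >= 500:
        # Large, dense dataset: skip the incremental analysis and use the
        # average distance that yields 30 neighbors.
        return k_neighbor_distance(30)
    if peak_distance is not None:
        return peak_distance
    # No peak found: use the distance yielding K = 0.05 * N neighbors,
    # bounded to the range [3, 30] and capped at one standard distance.
    k = min(max(int(round(0.05 * n)), 3), 30)
    return min(k_neighbor_distance(k), standard_distance)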
The distance reflecting the scale of analysis will be reported to the Results window and will be used to perform the hot spot analysis. If you
provide a path for the Density Surface parameter, this optimal distance will also serve as the search radius with the Kernel Density tool.
This distance corresponds to the Distance Band or Threshold Distance parameter used by the Hot Spot Analysis (Getis-Ord Gi*) tool.

Hot spot analysis


At this point in the Optimized Hot Spot Analysis workflow all of the checks and parameter settings have been made. The next step is to run
the Getis-Ord Gi* statistic. Details about the mathematics for this statistic are outlined in How Hot Spot Analysis (Getis-Ord Gi*) works.
Results from the Gi* statistic will be automatically corrected for multiple testing and spatial dependence using the False Discovery Rate
(FDR) correction method. Messages to the Results window summarize the number of features identified as statistically significant hot or
cold spots, after the FDR correction is applied.
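
For reference, the whole workflow can be launched from Python with a single geoprocessing call. The sketch below is hedged: the dataset names are placeholders, and the positional parameter order is taken from the 10.3 tool reference for Optimized Hot Spot Analysis.

import arcpy
arcpy.env.workspace = r"C:\data"

# Incident points with no Analysis Field: the tool aggregates the incidents itself
# using fishnet polygon cells and runs the FDR-corrected Gi* statistic on the counts.
arcpy.OptimizedHotSpotAnalysis_stats("911_Calls.shp", "911_Calls_OHSA.shp", "#",
                                     "COUNT_INCIDENTS_WITHIN_FISHNET_POLYGONS")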

Output
The last component of the Optimized Hot Spot Analysis tool is to create the Output Features and, if specified, the Density Surface raster
layer. If the Input Features represent incident data requiring aggregation, the Output Features will reflect the aggregated weighted features
(fishnet polygon cells, the aggregation polygons you provided for the Polygons For Aggregating Incidents Into Points parameter, or
weighted points). Each feature will have a z-score, p-value, and Gi_Bin result.
When specified, the Density Surface is created using the Kernel Density tool. The search radius for this tool is the same as the scale of
analysis distance used for hot spot analysis. The default rendering is stretched values along a gray scale color ramp. If a raster analysis
mask is specified in the environment settings, the output Density Surface will be clipped to the analysis mask. If the raster analysis mask
isn't specified, the Density Surface will be clipped to a convex hull around the Input Features centroids.

License: The Kernel Density tool is used to create the density surface; because this tool is part of the
ArcGIS Spatial Analyst extension, the Density Surface parameter remains disabled if you don't
have this extension.
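
If you want to reproduce the Density Surface step yourself, the equivalent Spatial Analyst call would look roughly like the following; the layer name, cell size, and search radius are placeholders, with the search radius standing in for the scale of analysis distance reported in the Results window (in practice you would also supply the count field of the weighted points as the population field).

import arcpy
from arcpy.sa import KernelDensity

arcpy.CheckOutExtension("Spatial")
arcpy.env.workspace = r"C:\data"

# Density of the weighted incident points; 1500 stands in for the reported scale of analysis.
density = KernelDensity("weighted_points.shp", "NONE", 50, 1500)
density.save("call_density.tif")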

Additional resources
Getis, A. and J.K. Ord. 1992. "The Analysis of Spatial Association by Use of Distance Statistics" in Geographical Analysis 24(3).
Ord, J.K. and A. Getis. 1995. "Local Spatial Autocorrelation Statistics: Distributional Issues and an Application" in Geographical Analysis 27
(4).
The spatial statistics resource page has short videos, tutorials, web seminars, articles and a variety of other materials to help you get
started with spatial statistics.

Copyright © 1995-2014 Esri. All rights reserved.

How Similarity Search works

Locate topic

The Similarity Search tool identifies which Candidate Features are most similar (or most dissimilar) to one or more Input Features To Match.
Similarity is based on a specified list of numeric attributes (Attributes Of Interest). If more than one Input Features To Match is specified,
similarity is based on averages for each of the Attributes Of Interest. The output feature class (Output Features) will contain the Input
Features To Match along with all of the matching Candidate Features that were found, ordered by similarity (as specified by the Most Or Least
Similar parameter). The number of matches returned is based on the value for the Number Of Results parameter.


Potential applications
 You might use the Similarity Search tool to find other cities that are just like your own city in terms of population, education, and
proximity to specific recreational opportunities.
 Local officials may want to promote their city to potential businesses in order to increase tax-based revenues. The Similarity Search tool
will help them identify other cities like theirs so they can compare themselves with regard to attractor attributes (such as low crime and
high growth). These officials might also be interested in finding locations just like them, but either larger or smaller (cosine similarity).
Finding they are similar to smaller or larger places that have been attractive to the businesses they want to entice will allow them to
point out the similarities while either emphasizing the advantages of being smaller (less congestion, small town flavor) or of being larger
(more potential customers). These officials might also be interested in cities that are least like them. If any of the least similar places
represent competition for the businesses they want to attract, this analysis will provide information they need to present a comparison.
 A human resources manager may want to be able to justify company salary ranges. Once she identifies cities that are similar in terms
of size, cost of living, and amenities, she can examine the salary ranges for those cities to see if they are in line.
 A crime analyst wants to search the database to see if a crime is part of a larger pattern or trend.
 An after-school fitness program was extremely successful in Town A. Promoters want to find other towns with similar characteristics as
candidates for program expansion.
 A law enforcement agency has uncovered areas where drugs are being grown or manufactured. Identifying locations with similar
characteristics may help them target future searches.
 A large retailer has several successful stores and a few underperformers. Finding locations with similar demographic and contextual
characteristics (accessibility, visibility, complementary businesses, and so on) will help identify the best locations for a new store.

Matching methods
Matching may be based on attribute values, ranked attribute values, or attribute profiles (cosine similarity). The algorithm employed for
each of these methods is described below. For all methods, if there is more than one Input Features To Match, the attributes for all features
are averaged to create a composite target feature to use for the matching process:

Attribute values
When you select ATTRIBUTE_VALUES for the Match Method parameter, the tool first standardizes all of the Attributes of Interest. For each
candidate it then subtracts the standardized values from those of the target, squares the differences, and adds the squared differences
together. This sum becomes the similarity index for that candidate. Once all candidates have been processed, candidates are ranked from
smallest index (most similar) to largest index (least similar).

Dive-in: Standardization of the attribute values involves a z-transform: the mean of all values is subtracted from each value, and the result is divided by the standard deviation of all values.
Standardization puts all of the attributes on the same scale even when they are represented by
very different types of numbers: rates (numbers from 0 to 1.0), population (with values larger
than 1 million), and distances (kilometers, for example).
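
A minimal plain-Python sketch of the ATTRIBUTE_VALUES calculation follows; the standardization helper and the vectors it operates on are illustrative, and the tool itself works directly against feature class attributes:

def z_standardize(column):
    # Subtract the column mean from each value and divide by the column standard deviation.
    mean = sum(column) / float(len(column))
    sd = (sum((v - mean) ** 2 for v in column) / float(len(column))) ** 0.5
    return [(v - mean) / sd for v in column]

def attribute_value_index(target, candidate):
    # Sum of squared differences between the standardized target and candidate attributes;
    # smaller index values indicate greater similarity.
    return sum((t - c) ** 2 for t, c in zip(target, candidate))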

Ranked attribute values


When you select RANKED_ATTRIBUTE_VALUES for the Match Method parameter, the tool will begin by ranking each of the Attributes of
Interest both for the target feature and all of the candidates. For each candidate it then sums the squared difference for each attribute in
relation to the target feature. If the population value for the target is the 10th largest among all candidates, and the population for the
candidate being considered is 15th largest, the rank difference for population is 10 - 15 = -5, and the squared difference is (-5)**2 = 25. The sum of these squared rank differences across all of the Attributes of Interest becomes the similarity index for this candidate.
Once all candidates have been processed, candidates are ranked from smallest index (most similar) to largest index (least similar).
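
The RANKED_ATTRIBUTE_VALUES index can be sketched the same way, substituting ranks for standardized values; tie handling and the tool's exact ranking rules are glossed over here:

def ranks(values):
    # Rank each value from 1 (largest) to N (smallest), ignoring ties for simplicity.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def ranked_attribute_index(target_ranks, candidate_ranks):
    # Sum of squared rank differences; ranks 10 and 15 contribute (10 - 15) ** 2 = 25.
    return sum((t - c) ** 2 for t, c in zip(target_ranks, candidate_ranks))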

Attribute profiles
When you select ATTRIBUTE_PROFILES for the Match Method parameter, the tool first standardizes all of the Attributes of Interest (a
minimum of two Attributes of Interest is required for this method). It then uses cosine similarity mathematics to compare the vector of
standardized attributes for each candidate to the vector of standardized attributes for the target feature being matched. The cosine
similarity of two vectors, A and B, is computed as the dot product of A and B divided by the product of their magnitudes: cos(theta) = (A . B) / (||A|| ||B||).

Cosine similarity is not concerned with matching attribute magnitudes; rather, this method focuses on the relationships among the attributes. If you created a profile (line graph) of the standardized attributes in the vectors being compared (the target and one of the
candidates), you might see very similar profiles or very different profiles:

The cosine similarity index ranges from 1.0 (perfect similarity) to -1.0 (perfect dissimilarity) and is reported in the SIMINDEX (Cosine
Similarity) field. You would use this similarity method to find places that have the same characteristics but perhaps at a larger or smaller
scale.
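
A short sketch of the cosine similarity calculation on standardized attribute vectors (plain Python; the example vectors are illustrative):

import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): 1.0 for identical profiles, -1.0 for opposite profiles.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# A candidate with the same profile as the target at twice the magnitude scores 1.0:
# cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]) returns 1.0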

Best practices

Mapping similarity patterns


If you set the Number of Results parameter to zero, the tool will rank all of the candidate features. The output for this analysis will show
you the spatial pattern of similarity. Notice that when you rank all candidates you get information about similarity and about dissimilarity.

Including spatial variables


Suppose you know the locations (polygon areas) where a particular endangered species is doing well and you want to find other locations
where it might also thrive. You would be looking for locations similar to the successful ones, but might also need locations large enough
and compact enough to ensure species success. For this analysis you could compute a compactness metric for each polygon area
(common compactness measurements are based on the area of a polygon in relation to the area of a circle with the same perimeter). You
could then include your compactness measurement and an attribute reflecting polygon size (Shape_Area) in the Fields To Append To
Output parameter when you run the Similarity Search tool. Sorting the top ten solution matches in terms of both compactness and area


will help you identify the most appropriate locations for species reintroduction.
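
One common compactness metric of this kind divides the polygon's area by the area of a circle with the same perimeter, which simplifies to 4 * pi * area / perimeter**2 (1.0 for a circle, smaller for elongated shapes). A hedged sketch for adding such a field with arcpy follows; the feature class path and field name are placeholders:

import arcpy

fc = r"C:\data\habitat_polygons.shp"  # placeholder feature class
arcpy.AddField_management(fc, "COMPACT", "DOUBLE")

# Compactness = polygon area / area of a circle with the same perimeter
#             = 4 * pi * area / perimeter ** 2
code_block = """import math
def compactness(area, perimeter):
    return 4 * math.pi * area / perimeter ** 2"""
arcpy.CalculateField_management(fc, "COMPACT",
                                "compactness(!shape.area!, !shape.length!)",
                                "PYTHON_9.3", code_block)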
Perhaps you are a retailer interested in expanding. If you have existing stores that have been successful you can use attributes reflecting
the key characteristics of success to help you find candidate locations for expansion. Suppose that the products you sell will be most
attractive to college students and that you want to avoid locations near your current stores or near competitors. Before running the
Similarity Search tool you would use the Near tool to create your spatial variables: distance to colleges or places with high densities of
college students, distance to existing stores, and distance to competitors. You could then include these spatial variables in the Fields To
Append To Output parameter when you run the Similarity Search tool.

Copyright © 1995-2014 Esri. All rights reserved.

Space-Time Cluster Analysis

Locate topic

Data has both a spatial and a temporal context: everything happens someplace and occurs at some point in time. Several tools, including Hot
Spot Analysis, Cluster and Outlier Analysis, and Grouping Analysis, allow you to usefully exploit those aspects of your data. When you consider
both the spatial and the temporal context of your data, you can answer questions like the following:
 Where are the space-time crime hot spots? If you are a crime analyst, you might use the results from space-time Hot Spot Analysis to
make sure that your police resources are allocated as effectively as possible. You want those resources to be in the right places at the
right times.
 Where are the spending anomalies? In an effort to identify fraud, you might use Cluster and Outlier Analysis to scrutinize spending
behaviors looking for outliers in space and time. A sudden change in spending patterns or frequency could suggest suspicious activity.
 What are the characteristics of bacteria outbreaks? Suppose you are studying salmonella samples taken from dairy farms in your state.
To characterize individual outbreaks, you can run Grouping Analysis on your sample data, constraining group membership in both space
and time. Samples close in time and space are most likely to be associated with the same outbreak.
Several tools in the Spatial Statistics toolbox work by assessing each feature within the context of their neighboring features. When neighbor
relationships are defined in terms of both space and time, traditional spatial analyses become space-time analyses. To define neighbor
relationships using both spatial and temporal parameters, use the Generate_Spatial_Weights_Matrix tool and select the SPACE_TIME_WINDOW
option for the Conceptualization of Spatial Relationships parameter. Then specify both a Threshold Distance and a time interval (Date/Time
Interval Type and Date/Time Interval Value ). If, for example, you provide a distance of 1 kilometer and a time interval of 7 days, features
found within 1 kilometer that also have a date/time stamp within 7 days of each other will be analyzed together. Similarly, proximal features
within 1 kilometer of each other that do not fall within the 7-day time interval of each other will not be considered neighboring features.
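
For example, a space-time weights matrix using a 1 kilometer, 7-day window might be built as follows; the dataset, unique ID field, and date field names are placeholders, and the positional parameter order is taken from the Generate Spatial Weights Matrix tool reference:

import arcpy
arcpy.env.workspace = r"C:\data"

# 1 kilometer (1000 meter) Threshold Distance combined with a 7-day Date/Time Interval.
arcpy.GenerateSpatialWeightsMatrix_stats(
    "Incidents.shp", "MyID", "incidents_1km_7days.swm",
    "SPACE_TIME_WINDOW", "EUCLIDEAN", "#", 1000, "#",
    "ROW_STANDARDIZATION", "#", "REPORT_DATE", "DAYS", 7)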

Beyond Time Snapshots


One common approach to understanding spatial and temporal trends in your data is to break it up into a series of time snapshots. You
might, for example, create separate datasets for week one, week two, week three, week four, and week five. You could then analyze each
week separately and present the results of your analysis as either a series of maps or as an animation. While this is an effective way to show
trends, how you decide to break up the data is somewhat arbitrary. If you are analyzing your data week to week, for example, how do you
decide where the break falls? Should you break the data between Sunday and Monday? Perhaps Monday through Thursday, and then again
Friday through Sunday? And is there something special about analyzing the data in week-long intervals? Might not daily analysis or monthly
analysis be more effective? The implications might be important if the division (dividing Sunday events from Monday events, for example)
separates features that really should be related. In the example below, 6 features fall within a 1 km and 7-day space-time window of the
feature labeled Jan 31; only one feature will be included as a neighbor, however, if the data is analyzed using monthly snapshots.

Data snapshots can artificially separate features close to each other in space and time.

When you define feature relationships using the SPACE_TIME_WINDOW, you are not creating snapshots of the data. Instead, all the data is
used in the analysis. Features that are near each other in space and time will be analyzed together, because all feature relationships are
assessed relative to the location and time stamp of the target feature; in the example above (A.), a 1 km, 7-day space-time window finds
six neighbors for the feature labeled Jan 31.


Suppose you were analyzing wildfires in a region. If you were to run the Hot Spot Analysis tool using the default FIXED_DISTANCE_BAND
conceptualization to define feature relationships, the result would be a map showing you locations of statistically significant wildfire hot
spots and cold spots. If you then ran the analysis again, but this time defined spatial relationships in terms of a SPACE_TIME_WINDOW, you
may find that some of the hot spot areas are seasonal. Understanding this temporal characteristic of wildfires can have important
implications for how you allocate fire resources.

Visualizing Space-Time Results


Heat maps typically show high-intensity areas (hot spots) in red and low-intensity areas (cold spots) in blue. In the graphic below, for
example, the red areas are places getting the largest number of 911 emergency calls. The blue areas are locations getting relatively few
calls. How might you add information about the temporal dimension of 911 call frequencies to the map below? How might you effectively
map things like individual outbreaks, a series of crime sprees, reverberations in the adoption of a new technology, or the seasonal
oscillations of storm patterns?

Representing three-dimensional data (x and y location, plus time) is difficult to do with a two-dimensional map. Notice that in the example
below, you can't discern that there are two distinct hot spots (near each other in space, but separated by time), until the data is viewed in
three dimensions. By extruding the features based on a time field, it becomes clearer which features are related and which are separated by
time.

There are at least two ways to visualize the output from space-time analyses. Three-dimensional visualization is effective with a smaller
study area when you have a limited number of features; this approach allows you to present space-time relationships in a single map.
Another powerful method for portraying space-time processes is through animation. The examples below focus specifically on visualization
of space-time clusters.

Animation
To animate your space-time clusters, enable time on your result features, open the Time Slider from the Tools toolbar, and click Play.
Set a time window that will allow you to see enough of your data at one time in a single step. If you are new to creating animations,
follow the links below.
What is an animation?


A quick tour of creating animations

3D
Another powerful way to visualize the results of a space-time cluster analysis is to use 3D visualization. With this method, time becomes
the third dimension, with point features extruded to reflect temporal progression. In the 3D graphic above, for example, the oldest events
are nearest to the ground, and the more recent events hover at higher elevations (appearing closer to the viewer).
To create a 3D representation of your data like the one above, you'll need to use ArcGlobe (included with the standard installation of
ArcGIS for Desktop).
First, run your space-time cluster analysis in ArcGlobe, then create a new field in the output feature class to reflect the height of each
feature. For this example, the heights will be based on the number of days that have passed since the first event in the dataset occurred.
To calculate the time lapse, you will use a VB script and the date function called DateDiff, as shown below.

Note: If you have trouble adding a new field to the output feature class because of a lock, save your
ArcGlobe document and reopen it, or export the output feature class to a new dataset, add it to
your map document, and symbolize it to match the output feature class.

Next, sort your features by date so that you can identify the earliest date. You will use this to calculate the new time lapse field values.
Right-click the new field you just created and choose Field Calculator. From the field calculator, click the Date type functions and select
DateDiff from the right-hand side of the calculator, as illustrated below. Type DateDiff ( "d", "3/1/2011", [DateField] ), replacing
the date string with the earliest date in your feature class and replacing [DateField] with the name of the date field in your feature class ("d"
indicates that the difference interval should be in days).

Dive-in: The example above uses VB to compute date/time fields. The equivalent Python statement
would be:

(datetime.datetime.strptime(!Date_Con!, "%m/%d/%Y").date() - datetime.date(2011, 3, 1)).days

The next step is to change the ArcGlobe display properties so that the features in your dataset will appear elevated. To do this, right-click
the output feature class and choose Properties . On the properties dialog box, click the Elevation tab. In the Elevation from features
section, choose Use constant value or expression, then click the Calculator button and specify the new field you created with the
DateDiff function. ArcGlobe will now elevate your features based on the time lapse field. If you find that your features are not showing
enough elevation, you may want to try multiplying the time lapse field by a constant. In the Use constant value or expression property of
the Elevation tab, this would look something like this: [TimeLapse] *100, as illustrated below.

You can then use the ArcGlobe navigation tool to tilt and view the cluster results from various angles and viewpoints. The resultant
map might look something like this:


Copyright © 1995-2014 Esri. All rights reserved.

An overview of the Measuring Geographic Distributions toolset

Locate topic

Measuring the distribution of a set of features allows you to calculate a value that represents a characteristic of the distribution, such as the
center, compactness, or orientation. You can use this value to track changes in the distribution over time or compare distributions of different
features.
The Measuring Geographic Distributions toolset addresses questions such as:
 Where's the center?
 What's the shape and orientation of the data?
 How dispersed are the features?

Tool Description

Central Feature Identifies the most centrally located feature in a point, line, or polygon feature class.

Directional Creates standard deviational ellipses to summarize the spatial characteristics of geographic features: central
Distribution tendency, dispersion, and directional trends.

Linear Directional Identifies the mean direction, length, and geographic center for a set of lines.
Mean

Mean Center Identifies the geographic center (or the center of concentration) for a set of features.

Median Center Identifies the location that minimizes overall Euclidean distance to the features in a dataset.

Standard Distance Measures the degree to which features are concentrated or dispersed around the geometric mean center.
Measuring geographic distributions tools

Related Topics
An overview of the Spatial Statistics toolbox

Copyright © 1995-2014 Esri. All rights reserved.

Central Feature (Spatial Statistics)

Locate topic

Summary
Identifies the most centrally located feature in a point, line, or polygon feature class.
Learn more about how Central Feature works


Illustration

Usage
 The feature associated with the smallest accumulated distance to all other features in the dataset is the most centrally located
feature; this feature is selected and copied to a newly created Output Feature Class.
 Accumulated distances are measured using EUCLIDEAN_DISTANCE or MANHATTAN_DISTANCE, as specified by the Distance Method
parameter.
 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.
 The Case Field is used to group features for separate Central Feature computations. The Case Field can be of integer, date, or string
type. Records with NULL values for the Case Field will be excluded from the analysis.
 Self-potential is the distance or weight between a feature and itself. Often this weight is zero, but in some cases you may want to
specify another fixed value or a different value for every feature (perhaps based on polygon size, for example).

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Syntax
CentralFeature_stats (Input_Feature_Class, Output_Feature_Class, Distance_Method, {Weight_Field}, {Self_Potential_Weight_Field},
{Case_Field})

Parameter Explanation Data Type


Input_Feature_Class Feature Layer
The feature class containing a distribution of features from which to identify
the most centrally located feature.
Output_Feature_Class Feature Class
The feature class that will contain the most centrally located feature in the
Input Feature Class.
Distance_Method String
Specifies how distances are calculated from each feature to neighboring
features.
 EUCLIDEAN_DISTANCE —The straight-line distance between two points
(as the crow flies)
 MANHATTAN_DISTANCE —The distance between two points measured
along axes at right angles (city block); calculated by summing the
(absolute) difference between the x- and y-coordinates
Weight_Field The numeric field used to weight distances in the origin-destination distance Field
(Optional) matrix.
Self_Potential_Weight_Field Field
The field representing self-potential—the distance or weight between a
(Optional) feature and itself.
Case_Field Field
Field used to group features for separate central feature computations. The
(Optional) case field can be of integer, date, or string type.

Code Sample
CentralFeature example 1 (Python window)
The following Python window script demonstrates how to use the CentralFeature tool.

import arcpy
arcpy.env.workspace = r"C:\data"
arcpy.CentralFeature_stats("coffee_shops.shp", "coffee_CENTRALFEATURE.shp", "EUCLIDEAN_DISTANCE", "NUM_EMP", "#", "#")

CentralFeature example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the CentralFeature tool.


# Measure geographic distribution characteristics of coffee house locations weighted by the number of employees

# Import system modules


import arcpy

# Local variables...
workspace = "C:/data"
input_FC = "coffee_shops.shp"
CF_output = "coffee_CENTRALFEATURE.shp"
MEAN_output = "coffee_MEANCENTER.shp"
MED_output = "coffee_MEDIANCENTER.shp"
weight_field = "NUM_EMP"

try:
    # Set the workspace to avoid having to type out full path names
    arcpy.env.workspace = workspace

    # Process: Central Feature...
    arcpy.CentralFeature_stats(input_FC, CF_output, "EUCLIDEAN_DISTANCE", weight_field, "#", "#")

    # Process: Mean Center...
    arcpy.MeanCenter_stats(input_FC, MEAN_output, weight_field, "#", "#")

    # Process: Median Center...
    arcpy.MedianCenter_stats(input_FC, MED_output, weight_field, "#", "#")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Qualified_field_names,
Output_has_M_values, M_resolution, M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance,
XY_resolution, XY_tolerance

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis.

Related Topics
An overview of the Measuring Geographic Distributions toolset
Using weights
Mean Center
Median Center

Copyright © 1995-2014 Esri. All rights reserved.

Directional Distribution (Standard Deviational Ellipse) (Spatial Statistics)

Locate topic

Summary
Creates standard deviational ellipses to summarize the spatial characteristics of geographic features: central tendency, dispersion, and
directional trends.
Learn about how Directional Distribution (Standard Deviational Ellipse) works

Illustration

Usage
 The Standard Deviational Ellipse tool creates a new Output Feature Class containing elliptical polygons, one for each case (Case Field
parameter). The attribute values for these ellipse polygons include X and Y coordinates for the mean center, two standard distances
(long and short axes), and the orientation of the ellipse. The fieldnames are CenterX, CenterY, XStdDist, YStdDist, and Rotation.
When a Case Field is provided, this field is added to the Output Feature Class, as well.
 Calculations based on either Euclidean or Manhattan distance require projected data to accurately measure distances.
 When the underlying spatial pattern of features is concentrated in the center with fewer features toward the periphery (a spatial


normal distribution), a one standard deviation ellipse polygon will cover approximately 68 percent of the features; two standard
deviations will contain approximately 95 percent of the features; and three standard deviations will cover approximately 99 percent of
the features in the cluster.
 The value in the output Rotation field represents the rotation of the long axis measured clockwise from noon.
 The Case Field is used to group features prior to analysis. When a Case Field is specified, the input features are first grouped
according to case field values, and then a standard deviational ellipse is computed for each group. The case field can be of integer,
date, or string type. Records with NULL values for the Case Field will be excluded from analysis.
 The standard deviational ellipse calculation may be based on an optional Weight Field (to get the ellipses for traffic accidents
weighted by severity, for example). The weight field should be numeric.
 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Syntax
DirectionalDistribution_stats (Input_Feature_Class, Output_Ellipse_Feature_Class, Ellipse_Size, {Weight_Field}, {Case_Field})

Parameter Explanation Data Type


Input_Feature_Class A feature class containing a distribution of features for which the standard Feature Layer
deviational ellipse will be calculated.
Output_Ellipse_Feature_Class A polygon feature class that will contain the output ellipse feature. Feature Class

Ellipse_Size The size of output ellipses in standard deviations. The default ellipse size is String
1; valid choices are 1, 2, or 3 standard deviations.
 1_STANDARD_DEVIATION
 2_STANDARD_DEVIATIONS
 3_STANDARD_DEVIATIONS
Weight_Field The numeric field used to weight locations according to their relative Field
(Optional) importance.
Case_Field Field used to group features for separate directional distribution Field
(Optional) calculations. The case field can be of integer, date, or string type.

Code Sample
DirectionalDistribution Example (Python Window)
The following Python Window script demonstrates how to use the DirectionalDistribution tool.

import arcpy
arcpy.env.workspace = r"C:\data"
arcpy.DirectionalDistribution_stats("AutoTheft.shp", "auto_theft_SE.shp", "1_STANDARD_DEVIATION", "#", "#")

DirectionalDistribution Example (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the DirectionalDistribution tool.


# Measure the geographic distribution of auto thefts

# Import system modules


import arcpy

# Local variables...
workspace = "C:/data"
locations = "AutoTheft.shp"
links = "AutoTheft_links.shp"
standardDistance = "auto_theft_SD.shp"
standardEllipse = "auto_theft_SE.shp"
linearDirectMean = "auto_theft_LDM.shp"

try:
    # Set the workspace (to avoid having to type in the full path to the data every time)
    arcpy.env.workspace = workspace

    # Process: Standard Distance of auto theft locations...
    arcpy.StandardDistance_stats(locations, standardDistance, "1_STANDARD_DEVIATION", "#", "#")

    # Process: Directional Distribution (Standard Deviational Ellipse) of auto theft locations...
    arcpy.DirectionalDistribution_stats(locations, standardEllipse, "1_STANDARD_DEVIATION", "#", "#")

    # Process: Linear Directional Mean of auto thefts...
    arcpy.DirectionalMean_stats(links, linearDirectMean, "DIRECTION", "#")

except:
    # If an error occurred while running a tool, print the messages
    print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Qualified_field_names,
Output_has_M_values, M_resolution, M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance,
XY_resolution, XY_tolerance

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference.

Related Topics
An overview of the Measuring Geographic Distributions toolset
Using weights
Standard Distance
Mean Center

Copyright © 1995-2014 Esri. All rights reserved.

Linear Directional Mean (Spatial Statistics)

Locate topic

Summary
Identifies the mean direction, length, and geographic center for a set of lines.
Learn more about how Linear Directional Mean works

Illustration

Usage
 The input must be a line feature class.
 Attribute values for the output line feature(s) include CompassA for Compass Angle (clockwise from due North), DirMean for
Directional Mean (counterclockwise from due East), CirVar for Circular Variance (an indication of how much line directions or
orientations deviate from the directional mean), AveX and AveY for Mean Center X and Y Coordinates, and AveLen for Mean Length.
When a Case Field is specified, it also will be added to the Output Feature Class.
 Analogous to a standard deviation measure, the circular variance value tells how well the directional mean vector represents the set


of input vectors. Circular variances range from 0 to 1. If all the input vectors have the exact same (or very similar) directions, the
circular variance is small (near 0). When input vector directions span the entire compass, the circular variance is large (near 1).
 The Case Field is used to group features for separate linear directional mean computations. When a Case Field is specified, the input
line features are first grouped according to case field values, and then an output line feature is created for each group. The case field
can be of integer, date, or string type. Records with NULL values for the Case Field will be excluded from analysis.
 When measuring direction, the tool only considers the first and last points in a line. The tool does not consider all of the vertices along
a line.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.
 When this tool runs in ArcMap, the output feature class is automatically added to the Table of Contents (TOC) with default rendering
(directional vectors). The rendering applied is defined by a layer file in <ArcGIS>/ArcToolbox/Templates/Layers. You can reapply
the default rendering, if needed, by importing the template layer symbology.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Syntax
DirectionalMean_stats (Input_Feature_Class, Output_Feature_Class, Orientation_Only, {Case_Field})

Parameter Explanation Data Type


Input_Feature_Class Feature Layer
The feature class containing vectors for which the mean direction will be
calculated.
Output_Feature_Class Feature Class
A line feature class that will contain the features representing the mean
directions of the input feature class.
Orientation_Only  DIRECTION —The From and To nodes are utilized in calculating the mean Boolean
(default).
 ORIENTATION_ONLY —The From and To node information is ignored.
Case_Field Field
Field used to group features for separate directional mean calculations. The
(Optional) case field can be of integer, date, or string type.

Code Sample
LinearDirectionalMean Example (Python Window)
The following Python Window script demonstrates how to use the LinearDirectionalMean tool.

import arcpy
arcpy.env.workspace = r"C:\data"
arcpy.DirectionalMean_stats("AutoTheft_links.shp", "auto_theft_LDM.shp", "DIRECTION", "#")

LinearDirectionalMean Example (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the LinearDirectionalMean tool.

# Measure the geographic distribution of auto thefts

# Import system modules


import arcpy

# Local variables...
workspace = "C:/data"
locations = "AutoTheft.shp"
links = "AutoTheft_links.shp"
standardDistance = "auto_theft_SD.shp"
standardEllipse = "auto_theft_SE.shp"
linearDirectMean = "auto_theft_LDM.shp"

try:
    # Set the workspace (to avoid having to type in the full path to the data every time)
    arcpy.env.workspace = workspace

    # Process: Standard Distance of auto theft locations...
    arcpy.StandardDistance_stats(locations, standardDistance, "1_STANDARD_DEVIATION", "#", "#")

    # Process: Directional Distribution (Standard Deviational Ellipse) of auto theft locations...
    arcpy.DirectionalDistribution_stats(locations, standardEllipse, "1_STANDARD_DEVIATION", "#", "#")

    # Process: Linear Directional Mean of auto thefts...
    arcpy.DirectionalMean_stats(links, linearDirectMean, "DIRECTION", "#")

except:
    # If an error occurred while running a tool, print the messages
    print arcpy.GetMessages()

Environments


Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Qualified_field_names,
Output_has_M_values, M_resolution, M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance,
XY_resolution, XY_tolerance

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference.

Related Topics
An overview of the Measuring Geographic Distributions toolset
Using weights
Mean Center

Copyright © 1995-2014 Esri. All rights reserved.

Mean Center (Spatial Statistics)

Locate topic

Summary
Identifies the geographic center (or the center of concentration) for a set of features.
Learn more about how Mean Center works

Illustration

Usage
 The mean center is a point constructed from the average x and y values for the input feature centroids.
 Use projected data with this tool to accurately measure distances.
 The x and y values for the mean center point features are attributes in the Output Feature Class. The values are stored in the fields
XCOORD and YCOORD.
 The Case Field is used to group features for separate mean center computations. When a Case Field is specified, the input features
are first grouped according to case field values, and then a mean center is calculated for each group. The case field can be of integer,
date, or string type. Records with NULL values for the Case Field will be excluded from analysis.
 The Dimension Field is any numeric field in the input feature class. The Mean Center tool will compute the average for all values in
that field and include the result in the output feature class.
 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Syntax
MeanCenter_stats (Input_Feature_Class, Output_Feature_Class, {Weight_Field}, {Case_Field}, {Dimension_Field})
Parameter Explanation Data Type
Input_Feature_Class Feature Layer
A feature class for which the mean center will be calculated.
Output_Feature_Class Feature Class
A point feature class that will contain the features representing the mean
centers of the input feature class.
Weight_Field The numeric field used to create a weighted mean center. Field
(Optional)
Case_Field Field
Field used to group features for separate mean center calculations. The case
(Optional) field can be of integer, date, or string type.
Dimension_Field Field
A numeric field containing attribute values from which an average value will
(Optional) be calculated.

Code Sample


MeanCenter Example (Python Window)


The following Python Window script demonstrates how to use the MeanCenter tool.

import arcpy
arcpy.env.workspace = r"C:\data"
arcpy.MeanCenter_stats("coffee_shops.shp", "coffee_MEANCENTER.shp", "NUM_EMP", "#", "#")

MeanCenter Example (Stand-alone Python script)


The following stand-alone Python script demonstrates how to use the MeanCenter tool.

# Measure geographic distribution characteristics of coffee house locations weighted by the number of employees

# Import system modules


import arcpy

# Local variables...
workspace = "C:/data"
input_FC = "coffee_shops.shp"
CF_output = "coffee_CENTRALFEATURE.shp"
MEAN_output = "coffee_MEANCENTER.shp"
MED_output = "coffee_MEDIANCENTER.shp"
weight_field = "NUM_EMP"

try:
    # Set the workspace to avoid having to type out full path names
    arcpy.env.workspace = workspace

    # Process: Central Feature...
    arcpy.CentralFeature_stats(input_FC, CF_output, "EUCLIDEAN_DISTANCE", weight_field, "#", "#")

    # Process: Mean Center...
    arcpy.MeanCenter_stats(input_FC, MEAN_output, weight_field, "#", "#")

    # Process: Median Center...
    arcpy.MedianCenter_stats(input_FC, MED_output, weight_field, "#", "#")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Qualified_field_names,
Output_has_M_values, M_resolution, M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance,
XY_resolution, XY_tolerance

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference.

Related Topics
An overview of the Measuring Geographic Distributions toolset
Using weights
Central Feature
Median Center
Directional Distribution (Standard Deviational Ellipse)
Standard Distance

Copyright © 1995-2014 Esri. All rights reserved.

Median Center (Spatial Statistics)

Locate topic

Summary
Identifies the location that minimizes overall Euclidean distance to the features in a dataset.
Learn more about how Median Center works

Illustration


Usage
While the Mean Center tool returns a point at the average X and average Y coordinate for all feature centroids, the median center
uses an iterative algorithm to find the point that minimizes Euclidean distance to all features in the dataset.
Both the Mean Center and Median Center are measures of central tendency. The algorithm for the Median Center tool is less
influenced by data outliers.
 Calculations based on feature distances require projected data to accurately measure distances.
 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 The Case Field is used to group features for separate median center computations. When a case field is specified, the input features
are first grouped according to case field values, and then a median center is calculated for each group. The case field can be of
integer, date, or string type, and will appear as an attribute in the Output Feature Class. Records with NULL values for the Case Field
will be excluded from analysis.
 The x and y values for the median center feature(s) are attributes in the output feature class. The values are stored in the fields
XCOORD and YCOORD.
 The data median will be computed for all fields specified in the Attribute Field parameter.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Syntax
MedianCenter_stats (Input_Feature_Class, Output_Feature_Class, {Weight_Field}, {Case_Field}, {Attribute_Field})

Parameter (Data Type): Explanation

Input_Feature_Class (Feature Layer): A feature class for which the median center will be calculated.
Output_Feature_Class (Feature Class): A point feature class that will contain the features representing the median centers of the input feature class.
Weight_Field (Field, Optional): The numeric field used to create a weighted median center.
Case_Field (Field, Optional): Field used to group features for separate median center calculations. The case field can be of integer, date, or string type.
Attribute_Field (Field, Optional): Numeric field(s) for which the data median value will be computed.

Code Sample
MedianCenter Example (Python Window)
The following Python Window script demonstrates how to use the MedianCenter tool.

import arcpy
arcpy.env.workspace = r"C:\data"
arcpy.MedianCenter_stats("coffee_shops.shp", "coffee_MEDIANCENTER.shp", "NUM_EMP", "#", "#")

MedianCenter Example (Stand-alone Python script)


The following stand-alone Python script demonstrates how to use the MedianCenter tool.


# Measure geographic distribution characteristics of coffee house locations weighted by the number of employees

# Import system modules


import arcpy

# Local variables...
workspace = "C:/data"
input_FC = "coffee_shops.shp"
CF_output = "coffee_CENTRALFEATURE.shp"
MEAN_output = "coffee_MEANCENTER.shp"
MED_output = "coffee_MEDIANCENTER.shp"
weight_field = "NUM_EMP"

try:
    # Set the workspace to avoid having to type out full path names
    arcpy.env.workspace = workspace

    # Process: Central Feature...
    arcpy.CentralFeature_stats(input_FC, CF_output, "Euclidean Distance", weight_field, "#", "#")

    # Process: Mean Center...
    arcpy.MeanCenter_stats(input_FC, MEAN_output, weight_field, "#", "#")

    # Process: Median Center...
    arcpy.MedianCenter_stats(input_FC, MED_output, weight_field, "#", "#")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Qualified_field_names,
Output_has_M_values, M_resolution, M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance,
XY_resolution, XY_tolerance

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference.

Related Topics
An overview of the Measuring Geographic Distributions toolset
Using weights
Mean Center
Central Feature

Copyright © 1995-2014 Esri. All rights reserved.

Standard Distance (Spatial Statistics)

Locate topic

Summary
Measures the degree to which features are concentrated or dispersed around the geometric mean center.
Learn more about how Standard Distance works

Illustration

Usage
 The standard distance is a useful statistic as it provides a single summary measure of feature distribution around their center (similar
to the way a standard deviation measures the distribution of data values around the statistical mean).
 The Standard Distance tool creates a new feature class containing a circle polygon centered on the mean for each case. Each circle
polygon is drawn with a radius equal to the standard distance. The attribute value for each circle polygon is its standard distance
value.
 The Case Field is used to group features prior to analysis. When a Case Field is specified, the input features are first grouped
according to case field values, and then a standard distance circle is computed for each group. The case field can be of integer, date,
or string type, and will appear as an attribute in the Output Feature Class. Records with NULL values for the Case Field will be
excluded from analysis.
 The standard distance calculation may be based on an optional Weight Field (to get the standard distance of businesses weighted by
employees, for example). The Weight Field should be numeric.
 If the underlying spatial pattern of the input features is concentrated in the center with fewer features toward the periphery (spatial
normal distribution), a one standard deviation circle polygon will cover approximately 68 percent of the features; a two standard
deviation circle will contain approximately 95 percent of the features; and three standard deviations will cover approximately 99
percent of the features in the cluster.
 Calculations based on either Euclidean or Manhattan distance require projected data to accurately measure distances.
 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Syntax
StandardDistance_stats (Input_Feature_Class, Output_Standard_Distance_Feature_Class, Circle_Size, {Weight_Field}, {Case_Field})

Parameter (Data Type): Explanation

Input_Feature_Class (Feature Layer): A feature class containing a distribution of features for which the standard distance will be calculated.
Output_Standard_Distance_Feature_Class (Feature Class): A polygon feature class that will contain a circle polygon for each input center. These circle polygons graphically portray the standard distance at each center point.
Circle_Size (String): The size of output circles in standard deviations. The default circle size is 1; valid choices are 1, 2, or 3 standard deviations.
 1_STANDARD_DEVIATION
 2_STANDARD_DEVIATIONS
 3_STANDARD_DEVIATIONS
Weight_Field (Field, Optional): The numeric field used to weight locations according to their relative importance.
Case_Field (Field, Optional): Field used to group features for separate standard distance calculations. The case field can be of integer, date, or string type.

Code Sample
StandardDistance Example (Python Window)
The following Python Window script demonstrates how to use the StandardDistance tool.

import arcpy
arcpy.env.workspace = r"C:\data"
arcpy.StandardDistance_stats("AutoTheft.shp", "auto_theft_SD.shp", "1_STANDARD_DEVIATION", "#", "#")

StandardDistance Example (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the StandardDistance tool.


# Measure the geographic distribution of auto thefts

# Import system modules


import arcpy

# Local variables...
workspace = "C:/data"
locations = "AutoTheft.shp"
links = "AutoTheft_links.shp"
standardDistance = "auto_theft_SD.shp"
standardEllipse = "auto_theft_SE.shp"
linearDirectMean = "auto_theft_LDM.shp"

try:
    # Set the workspace (to avoid having to type in the full path to the data every time)
    arcpy.env.workspace = workspace

    # Process: Standard Distance of auto theft locations...
    arcpy.StandardDistance_stats(locations, standardDistance, "1_STANDARD_DEVIATION", "#", "#")

    # Process: Directional Distribution (Standard Deviational Ellipse) of auto theft locations...
    arcpy.DirectionalDistribution_stats(locations, standardEllipse, "1_STANDARD_DEVIATION", "#", "#")

    # Process: Linear Directional Mean of auto thefts...
    arcpy.DirectionalMean_stats(links, linearDirectMean, "DIRECTION", "#")

except:
    # If an error occurred while running a tool, print the messages
    print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Qualified_field_names,
Output_has_M_values, M_resolution, M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance,
XY_resolution, XY_tolerance

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference.

Related Topics
An overview of the Measuring Geographic Distributions toolset
Using weights
Directional Distribution (Standard Deviational Ellipse)
Mean Center

Copyright © 1995-2014 Esri. All rights reserved.

Using weights

Locate topic

Should you use weights when measuring the distribution of features?


If you want to measure characteristics of feature locations, run your analysis without a Weight Field. The unweighted analysis is often used
for incidents or events that occur at a particular place and time. Examples would be analysis of crime events or disease incidents.

When some features are more important than others, you can use the Weight Field to reflect those feature differences. Suppose you want
to find the best location for a new grocery store warehouse. You want a location that is central, but you also want to find a location that is
most convenient to stores with the highest sales volumes. In this case, an attribute reflecting sales volumes (e.g., store revenues or
perhaps a proxy like store size) can be used as a weight in statistical calculations; stores with larger sales volumes will have a stronger
influence on statistical results than stores with smaller sales volumes. In the illustration below, the larger points have larger sales volumes.


Weighted analysis is common for analyses of stationary features, such as stores or pollution monitoring stations. Unlike incidents or events
(such as crime), the distribution of stationary features is usually predetermined, since they have been placed in their location for a reason.
Consequently, performing an unweighted analysis that only looks at fixed feature locations may not be very meaningful. Measuring the
spatial characteristics of features when they are weighted by an attribute, however, can be very useful. For example, you might use the
location of pollution monitoring stations and the readings of ozone at each for a given period to calculate the center of highest ozone
concentration.

Specifying a weight
Weights are numeric attributes associated with the features in your dataset. The higher the numeric value, the greater the weight for that
feature. For example, if you wanted to find the most accessible location to hold a seminar for workers in the financial sector, you could
calculate the weighted center of businesses using the number of employees as the Weight Field. Or, an environmental analyst could
compute the weighted mean center for different pollutants using air pollution readings from monitoring stations. The information might be
useful for comparing pollutant centers to potential sources such as factories or truck depots.
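
In practice, the difference between an unweighted and a weighted analysis is simply whether a Weight Field is supplied to the tool. The following is a minimal sketch comparing the two for a set of business locations; the workspace, shapefile name, and NUM_EMP field are hypothetical stand-ins.

import arcpy

arcpy.env.workspace = r"C:\data"

# Unweighted: every business location counts equally
arcpy.MeanCenter_stats("businesses.shp", "center_unweighted.shp", "#", "#", "#")

# Weighted: locations with more employees pull the center toward them
arcpy.MeanCenter_stats("businesses.shp", "center_weighted.shp", "NUM_EMP", "#", "#")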

Related Topics
Mean Center
Central Feature
Directional Distribution (Standard Deviational Ellipse)
Standard Distance
Median Center

Copyright © 1995-2014 Esri. All rights reserved.

How Central Feature works

Locate topic

The Central Feature tool identifies the most centrally located feature in a point, line, or polygon input feature class. Distances from each
feature centroid to every other feature centroid in the dataset are calculated and summed. Then the feature associated with the shortest
accumulative distance to all other features (weighted if a weight is specified) is selected and copied to a newly created output feature class.
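
The accumulated-distance logic described above can be illustrated with a small, plain Python sketch on point coordinates. This is only an illustration, not the tool's source code; weights, if supplied, simply multiply each distance to the feature being measured to.

from math import hypot

def central_feature(points, weights=None):
    # Return the index of the point with the smallest summed (weighted) distance to all points.
    if weights is None:
        weights = [1.0] * len(points)
    best_index, best_total = None, float("inf")
    for i, (xi, yi) in enumerate(points):
        total = sum(w * hypot(xi - xj, yi - yj)
                    for (xj, yj), w in zip(points, weights))
        if total < best_total:
            best_index, best_total = i, total
    return best_index

pts = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(central_feature(pts))   # index of the most central point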

Output
The Central Feature tool creates a new feature class containing the most centrally located feature. For example, the following illustration
identifies the most centrally located distribution warehouse. All the points below are warehouses, but the red point is most central. The
output feature class for this analysis would contain one extracted record: the warehouse that is most central.

When a value for Case Field is specified, the output feature class contains a feature for each case. Each record in the output feature class is
a copy of the features found to be most central in the input feature class.

Potential applications
If you wanted to build a performing arts center, for example, you could calculate the central feature for a block group feature class,
weighted by population, to identify which part of town is most accessible and make that census block a top candidate. The Central Feature
tool is useful for finding the center when you want to minimize distance (Euclidean or Manhattan distance) for all features to the center.


Additional resources
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

How Directional Distribution (Standard Deviational Ellipse) works

Locate topic

A common way of measuring the trend for a set of points or areas is to calculate the standard distance separately in the x- and y-directions.
These two measures define the axes of an ellipse encompassing the distribution of features. The ellipse is referred to as the standard
deviational ellipse, since the method calculates the standard deviation of the x-coordinates and y-coordinates from the mean center to define
the axes of the ellipse. The ellipse allows you to see if the distribution of features is elongated and hence has a particular orientation.
While you can get a sense of the orientation by drawing the features on a map, calculating the standard deviational ellipse makes the trend
clear. You can calculate the standard deviational ellipse using either the locations of the features or the locations influenced by an attribute
value associated with the features. The latter is termed a weighted standard deviational ellipse.
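
As a rough sketch of the idea in the paragraph above, the two ellipse axes come from the standard deviations of the x- and y-coordinates about the mean center. The plain Python example below (not the tool's source code) computes those two values; the additional step the tool performs to rotate the axes to the orientation of the data is omitted.

from math import sqrt

def sde_axes(points):
    # Mean center plus the standard deviation of x and y about that center.
    n = float(len(points))
    mx = sum(x for x, y in points) / n
    my = sum(y for x, y in points) / n
    sigma_x = sqrt(sum((x - mx) ** 2 for x, y in points) / n)
    sigma_y = sqrt(sum((y - my) ** 2 for x, y in points) / n)
    return (mx, my), sigma_x, sigma_y

print(sde_axes([(0, 0), (2, 1), (4, 2), (6, 3)]))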

Calculations

Output and interpretation


The Directional Distribution (Standard Deviational Ellipse) tool creates a new feature class containing an elliptical polygon centered on the
mean center for all features (or for all cases when a value is specified for Case Field). The attribute values for these output ellipse polygons
include two standard distances (long and short axes); the orientation of the ellipse; and the case field, if specified. The orientation
represents the rotation of the long axis measured clockwise from noon. You can also specify the number of standard deviations to represent
(1, 2, or 3). When the features have a spatially normal distribution (meaning they are densest in the center and become increasingly less
dense toward the periphery), one standard deviation (the default value) will encompass approximately 68 percent of all input feature
centroids. Two standard deviations will encompass approximately 95 percent of all features, and three standard deviations will cover
approximately 99 percent of all feature centroids.

Potential applications
 Mapping the distributional trend for a set of crimes might identify a relationship to particular physical features (a string of bars or
restaurants, a particular boulevard, and so on).
 Mapping groundwater well samples for a particular contaminant might indicate how the toxin is spreading and, consequently, may be
useful in deploying mitigation strategies.
 Comparing the size, shape, and overlap of ellipses for various racial or ethnic groups may provide insights regarding racial or ethnic
segregation.
 Plotting ellipses for a disease outbreak over time may be used to model its spread.

Additional resources
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

How Linear Directional Mean works

Locate topic

The trend for a set of line features is measured by calculating the average angle of the lines. The statistic used to calculate the trend is known
as the directional mean. While the statistic itself is termed the directional mean, it is used to measure either direction or orientation.
Many linear features point in a direction—they have a beginning point and an end point. Such lines often represent the paths of objects that
move, such as hurricanes. Other linear features, such as fault lines, have no start or end point. These features are said to have an orientation
but no direction. For example, a fault line might have a northwest–southeast orientation. The Linear Directional Mean tool lets you calculate
the mean direction or the mean orientation for a set of lines.

Measuring direction or orientation


In a GIS, every line is assigned a start and end point and has a direction. The direction is set when the line feature is created by digitizing
or by importing a list of coordinates. You can see the direction of each line by displaying it with an arrowhead symbol. If you're calculating
the mean direction, ensure that the directions of the lines are correct. If you're calculating the mean orientation, the direction of the lines is
ignored.
The mean direction is calculated for features that move from a starting point to an end point, such as storms, while mean orientation is
calculated for stationary features, such as fault lines. There may be situations where you'll want to calculate the mean orientation of lines
that represent movement. A wildlife biologist interested in where elk start and end their seasonal migration would calculate the mean
direction of the paths the elk take during each season. However, the biologist would calculate the mean orientation if he or she were
interested in the characteristics of the migration routes themselves to determine what makes a good route, rather than where the elk start
and end. The biologist could calculate the mean orientation using the elk paths in both directions (coming and going) and capture more
information about their movement.
It is important to remember that while most lines have many vertices between the starting point and the ending point, this tool uses only
the start point and the end point to determine direction.
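
The sketch below (plain Python, not the tool's source code) shows the standard circular-statistics computation of a mean direction from each line's start and end points, along with the circular variance; the handling of orientation-only (direction-ignored) data is omitted.

from math import atan2, degrees, sin, cos, sqrt

def directional_mean(lines):
    # lines: sequence of ((x1, y1), (x2, y2)) start/end pairs.
    angles = [atan2(y2 - y1, x2 - x1) for (x1, y1), (x2, y2) in lines]
    sum_sin = sum(sin(a) for a in angles)
    sum_cos = sum(cos(a) for a in angles)
    mean_angle = degrees(atan2(sum_sin, sum_cos))       # counterclockwise from due east
    resultant = sqrt(sum_sin ** 2 + sum_cos ** 2)
    circular_variance = 1.0 - resultant / len(angles)   # 0 when all lines point the same way
    return mean_angle, circular_variance

print(directional_mean([((0, 0), (1, 1)), ((0, 0), (1, 0.5))]))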

Calculations


Output
The Linear Directional Mean tool creates a new output feature class containing a line feature centered on the mean center for all input
vector centroids, with length equal to the mean length of all input vectors and with either the mean orientation or the mean direction of all
input vectors. Attribute values for the new line features include Compass Angle (clockwise from due North), Directional Mean
(counterclockwise from due east), Circular Variance (an indication of how much directions/orientations deviate from the directional mean),
Mean Center X and Y Coordinates, and Mean Length.

Potential applications
 Comparing two or more sets of lines—For example, a wildlife biologist studying the movement of elk and moose in a stream valley could
calculate the directional trend of migration routes for the two species.
 Comparing features for different time periods—For example, an ornithologist could calculate the trend for falcon migration month by
month. The directional mean summarizes the flight paths of several individuals and smooths out daily movements. That makes it easy
to see during which month the birds travel farthest and when the migration ends.
 Evaluating felled trees in a forest to understand wind patterns and direction.
 Analyzing glacial striations, which are indicative of glacial movement.
 Identifying the general direction of auto thefts and stolen vehicle recoveries.

Additional resources
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

How Mean Center works

Locate topic

The mean center is the average x- and y-coordinate of all the features in the study area. It's useful for tracking changes in the distribution or
for comparing the distributions of different types of features.
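
A minimal sketch of that calculation on plain coordinate pairs (not the tool's source code) is simply the optionally weighted average of the x- and y-values.

def mean_center(points, weights=None):
    # (Optionally weighted) average of the x- and y-coordinates of all feature centroids.
    if weights is None:
        weights = [1.0] * len(points)
    total = float(sum(weights))
    mx = sum(w * x for (x, y), w in zip(points, weights)) / total
    my = sum(w * y for (x, y), w in zip(points, weights)) / total
    return mx, my

print(mean_center([(0, 0), (2, 0), (4, 6)]))             # (2.0, 2.0)
print(mean_center([(0, 0), (2, 0), (4, 6)], [1, 1, 4]))  # pulled toward the heavily weighted point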

Calculations

Output
The Mean Center tool creates a new point feature class where each feature represents a mean center (one for each case when a Case Field
is specified). The X and Y mean center values, case, and mean dimension field are included as output feature attributes. The Dimension
Field is any numeric field in the dataset; the Dimension value in the Output Feature Class is the mean for the values in that field. In the
illustration below, the mean center is computed for disease cases in order to identify the possible origin for an epidemic.

Potential applications
 A crime analyst might want to see if the mean center for burglaries shifts when evaluating daytime versus nighttime incidents. This can
help police departments better allocate resources.
 A wildlife biologist can calculate the mean center of elk observations within a park over several years to see where elk congregate in
summer and winter to provide better information to park visitors.
 A GIS analyst can assess level of service by comparing the mean center for 911 emergency calls to the location of emergency response
stations. Or, the analyst can evaluate the mean center weighted by individuals over the age of 65 to determine ideal locations for senior
services.

Additional resources
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

How Median Center works

Locate topic

The Median Center tool is a measure of central tendency that is robust to outliers. It identifies the location that minimizes travel from it to all
other features in the dataset. For example, if you were to compute the Mean Center for a compact cluster of points, the result would be a
location at the center of the cluster. If you then added a new point far away from the cluster and recomputed the Mean Center, you would
notice that the result would be pulled toward the new outlier. If you were to perform this same experiment using the Median Center tool,
however, you would see that the new outlier has a much smaller impact on the result location. The Median Center tool allows you to specify a
Weight Field. You can think of weights as the number of trips associated with each feature (for example, if the weight for a feature is 3.2, the
number of trips would be 3.2). The weighted median center is the location that minimizes distance for all trips.
The method used to calculate the Median Center is an iterative procedure introduced by Kuhn and Kuenne (1962) and further outlined in Burt
and Barber (1996). At each step (t) in the algorithm, a candidate Median Center (Xt, Yt) is found and then refined until it represents the
location that minimizes the Euclidean distance d to all features i (or all weighted features) in the dataset.
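
The sketch below illustrates a Kuhn-Kuenne (Weiszfeld-style) iteration of this kind on plain coordinate pairs. It is an illustration of the general procedure, not the tool's source code, and it handles the case where the candidate lands exactly on a feature only crudely, by skipping that feature.

from math import hypot

def median_center(points, weights=None, tolerance=1e-9, max_iter=1000):
    if weights is None:
        weights = [1.0] * len(points)
    # Start the iteration at the weighted mean center.
    total = float(sum(weights))
    x = sum(w * px for (px, py), w in zip(points, weights)) / total
    y = sum(w * py for (px, py), w in zip(points, weights)) / total
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for (px, py), w in zip(points, weights):
            d = hypot(px - x, py - y)
            if d == 0.0:      # candidate coincides with a feature; skip it in this sketch
                continue
            num_x += w * px / d
            num_y += w * py / d
            denom += w / d
        new_x, new_y = num_x / denom, num_y / denom
        if hypot(new_x - x, new_y - y) < tolerance:
            break
        x, y = new_x, new_y
    return x, y

print(median_center([(0, 0), (1, 0), (0, 1), (10, 10)]))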

Calculations

Note: While the Median Center tool only returns a single point, there may be more than one location
(solution) that would minimize the distance to all features.

Output
The Median Center tool creates a new Output Feature Class with a single median center point feature, or single point feature for each case
when a Case Field is specified. The X and Y median center values, case, and attribute field median values (one field for each Attribute Field
specified) are attributes in the Output Feature Class. The value for each Attribute Field is the computed median for all the field values. The
median for a set of numbers is the middle value: 1/2 of the values in the dataset are smaller, and 1/2 are larger.

Potential applications
You would use the Median Center tool when you want a measure of central tendency that is robust to spatial outliers. You might use it to
compute the Median Center of fire activities when you don't want rare peripheral fire events to pull the result center location away from
core fire activities. Often, it is interesting to compare Mean Center to Median Center results to see the impact peripheral features have on
your result. For many applications, the Median Center is a more representative measure of central tendency than the Mean Center is.

Additional resources
The following references have further information about this tool:
Burt, J. E., and G. Barber. (1996). Elementary statistics for geographers. Guilford, New York.
Kuhn, H. W., and R. E. Kuenne (1962). An efficient algorithm for the numerical solution of the Generalized Weber Problem in spatial
economics. Journal of Regional Science, 4(2):21–33.

Copyright © 1995-2014 Esri. All rights reserved.

How Standard Distance works

Locate topic

Measuring the compactness of a distribution provides a single value representing the dispersion of features around the center. The value is a
distance, so the compactness of a set of features can be represented on a map by drawing a circle with the radius equal to the standard
distance value. The Standard Distance tool creates a circle polygon.
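
For unweighted points, the standard distance described above reduces to the square root of the average squared deviation of the coordinates from the mean center, as in this minimal sketch (not the tool's source code):

from math import sqrt

def standard_distance(points):
    n = float(len(points))
    mx = sum(x for x, y in points) / n
    my = sum(y for x, y in points) / n
    return sqrt(sum((x - mx) ** 2 + (y - my) ** 2 for x, y in points) / n)

print(standard_distance([(0, 0), (2, 0), (0, 2), (2, 2)]))  # about 1.414 for this square of points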

Calculations

Output
The Standard Distance tool creates a new feature class containing a circle polygon centered on the mean center (one center and one circle
per case, if a Case Field is specified). Each circle polygon is drawn with a radius equal to the standard distance value. Attribute values for
each circle polygon are the circle mean center x-coordinate, mean center y-coordinate, and standard distance (circle radius).


Potential applications
 You can use the values for two or more distributions to compare them. A crime analyst, for example, could compare the compactness of
assaults and auto thefts. Knowing how the different types of crimes are distributed may help police develop strategies for addressing
the crime. If the distribution of crimes in a particular area is compact, stationing a single car near the center of the area might suffice. If
the distribution is dispersed, having several police cars patrol the area might be more effective in responding to the crimes.
 You can also compare the same type of feature over different time periods—for example, a crime analyst could compare daytime and
nighttime burglaries to see if burglaries are more dispersed or more compact during the day than at night.
 You can also compare the distributions of features to stationary features. For example, you could measure the distribution of emergency
calls over several months for each responding fire station in a region and compare them to see which stations respond over a wider
area.

Additional resources
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

An overview of the Modeling Spatial Relationships toolset

Locate topic

Beyond analyzing spatial patterns, GIS analysis can be used to examine or quantify relationships among features. The Modeling Spatial
Relationships tools construct spatial weights matrices or model spatial relationships using regression analyses.
Tools that construct spatial weights matrix files measure how features in a dataset relate to each other in space. A spatial weights matrix is a
representation of the spatial structure of your data: the spatial relationships that exist among the features in your dataset.
True spatial statistics integrate information about space and spatial relationships into their mathematics. Some of the tools in the Spatial
Statistics toolbox that accept a spatial weights matrix file are Spatial Autocorrelation (Global Moran's I), Cluster and Outlier Analysis (Anselin
Local Moran's I), and Hot Spot Analysis (Getis-Ord Gi*).
The regression tools provided in the Spatial Statistics Toolbox model relationships among data variables associated with geographic features,
allowing you to make predictions for unknown values or to better understand key factors influencing a variable you are trying to model.
Regression methods allow you to verify relationships and to measure how strong those relationships are. Exploratory Regression allows you to
quickly examine a large number of Ordinary Least Squares (OLS) models, summarize variable relationships, and determine whether any
combination of candidate explanatory variables satisfies all of the requirements of the OLS method.

Tool: Description

Exploratory Regression: The Exploratory Regression tool evaluates all possible combinations of the input candidate explanatory variables, looking for OLS models that best explain the dependent variable within the context of user-specified criteria.

Generate Network Spatial Weights: Constructs a spatial weights matrix file (.swm) using a Network dataset, defining feature spatial relationships in terms of the underlying network structure.

Generate Spatial Weights Matrix: Constructs a spatial weights matrix (.swm) file to represent the spatial relationships among features in a dataset.

Geographically Weighted Regression: Performs Geographically Weighted Regression (GWR), a local form of linear regression used to model spatially varying relationships.

Ordinary Least Squares: Performs global Ordinary Least Squares (OLS) linear regression to generate predictions or to model a dependent variable in terms of its relationships to a set of explanatory variables.

Modeling spatial relationships tools

Related Topics
An overview of the Spatial Statistics toolbox

Copyright © 1995-2014 Esri. All rights reserved.

Exploratory Regression (Spatial Statistics)

Locate topic


Summary
The Exploratory Regression tool evaluates all possible combinations of the input candidate explanatory variables, looking for OLS models
that best explain the dependent variable within the context of user-specified criteria.
You can access the results of this tool (including the optional report file) from the Results window. If you disable background processing,
results will also be written to the Progress dialog box.
Learn more about how Exploratory Regression works

Illustration

Given a set of candidate explanatory variables, finds properly specified OLS models.

Usage
 The primary output for this tool is a report file which is written to the Results window. Right-clicking on the Messages entry in the
Results window and selecting View will display the Exploratory Regression summary report in a Message dialog box.

 This tool will optionally create a text file report summarizing results. This report file will be added to the table of contents (TOC) and
may be viewed in ArcMap by right-clicking on it and selecting Open.
 This tool also produces an optional table of all models meeting your maximum coefficient p-value cutoff and Variance Inflation Factor
(VIF) value criteria. A full explanation of the report elements and table is provided in Interpreting Exploratory Regression Results.
 This tool uses Ordinary Least Squares (OLS) and Spatial Autocorrelation (Global Moran's I). The optional spatial weights matrix file is
used with the Spatial Autocorrelation (Global Moran's I) tool to assess model residuals; it is not used by the OLS tool at all.
 This tool tries every combination of the Candidate Explanatory Variables entered, looking for a properly specified OLS model. Only
when it finds a model that meets your threshold criteria for Minimum Acceptable Adj R Squared, Maximum Coefficient p-value Cutoff,
Maximum VIF Value Cutoff, and Minimum Acceptable Jarque-Bera p-value will it run the Spatial Autocorrelation (Global Moran's I) tool
on the model residuals to see whether the under/over-predictions are clustered or not. In order to provide at least some information about
residual clustering in the case where none of the models pass all of these criteria, the Spatial Autocorrelation (Global Moran's I) test
is also applied to the residuals for the three models that have the highest Adjusted R-squared values and the three models that have the
largest Jarque-Bera p-values.
 Especially when there is strong spatial structure in your dependent variable, you will want to try to come up with as many candidate
spatial explanatory variables as you can. Some examples of spatial variables would be distance to major highways, accessibility to job
opportunities, number of local shopping opportunities, connectivity measurements, or densities. Until you find explanatory variables
that capture the spatial structure in your dependent variable, model residuals will likely not pass the spatial autocorrelation test.
Significant clustering in regression residuals, as determined by the Spatial Autocorrelation (Global Moran's I) tool, indicates model
misspecification. Strategies for dealing with misspecification are outlined in What they don't tell you about regression analysis.
 Because the Spatial Autocorrelation (Global Moran's I) is not run for all of the models tested (see the previous usage tip), the optional
Output Results Table will have missing data for the SA (Spatial Autocorrelation) field. Because .dbf files do not store null values,
these appear as very, very small (negative) numbers (something like -1.797693e+308). For geodatabase tables, these missing
values appear as null values. A missing value indicates that the residuals for the associated model were not tested for spatial
autocorrelation because the model did not pass all of the other model search criteria.


 The default spatial weights matrix file used to run the Spatial Autocorrelation (Global Moran's I) tool is based on an 8 nearest
neighbor conceptualization of spatial relationships. This default was selected primarily because it executes fairly quickly. To define
neighbor relationships differently, however, you can simply create your own spatial weights matrix file using the Generate Spatial
Weights Matrix File tool, then specify the name of that file for the Input Spatial Weights Matrix File parameter. Inverse Distance,
Polygon Contiguity, or K Nearest Neighbors are all appropriate Conceptualizations of Spatial Relationships for testing regression
residuals.

Note: The spatial weights matrix file is only used to test model residuals for spatial structure.
When a model is properly specified, the residuals are spatially random (large residuals are
intermixed with small residuals; large residuals do not cluster together spatially).

Note: When there are 8 or fewer features in the Input Features, the default spatial weights matrix
file used to run the Spatial Autocorrelation (Global Moran's I) tool is based on K nearest
neighbors where K is the number of features minus 2. In general, you will want to have a
minimum of 30 features when you use this tool.

Syntax
ExploratoryRegression_stats (Input_Features, Dependent_Variable, Candidate_Explanatory_Variables, {Weights_Matrix_File},
{Output_Report_File}, {Output_Results_Table}, {Maximum_Number_of_Explanatory_Variables},
{Minimum_Number_of_Explanatory_Variables}, {Minimum_Acceptable_Adj_R_Squared}, {Maximum_Coefficient_p_value_Cutoff},
{Maximum_VIF_Value_Cutoff}, {Minimum_Acceptable_Jarque_Bera_p_value}, {Minimum_Acceptable_Spatial_Autocorrelation_p_value})

Parameter (Data Type): Explanation

Input_Features (Feature Layer): The feature class or feature layer containing the dependent and candidate explanatory variables to analyze.
Dependent_Variable (Field): The numeric field containing the observed values you want to model using OLS.
Candidate_Explanatory_Variables [Candidate_Explanatory_Variables,...] (Field): A list of fields to try as OLS model explanatory variables.
Weights_Matrix_File (File, Optional): A file containing spatial weights that define the spatial relationships among your input features. This file is used to assess spatial autocorrelation among regression residuals. You can use the Generate Spatial Weights Matrix File tool to create this. When you do not provide a spatial weights matrix file, residuals are assessed for spatial autocorrelation based on each feature's 8 nearest neighbors. Note: The spatial weights matrix file is only used to analyze spatial structure in model residuals; it is not used to build or to calibrate any of the OLS models.
Output_Report_File (File, Optional): The report file contains tool results, including details about any models found that passed all the search criteria you entered. This output file also contains diagnostics to help you fix common regression problems in the case that you don't find any passing models.
Output_Results_Table (Table, Optional): The optional output table created containing the explanatory variables and diagnostics for all of the models within the Coefficient p-value and VIF value cutoffs.
Maximum_Number_of_Explanatory_Variables (Long, Optional): All models with explanatory variables up to the value entered here will be assessed. If, for example, the Minimum_Number_of_Explanatory_Variables is 2 and the Maximum_Number_of_Explanatory_Variables is 3, the Exploratory Regression tool will try all models with every combination of two explanatory variables, and all models with every combination of three explanatory variables.
Minimum_Number_of_Explanatory_Variables (Long, Optional): This value represents the minimum number of explanatory variables for models evaluated. If, for example, the Minimum_Number_of_Explanatory_Variables is 2 and the Maximum_Number_of_Explanatory_Variables is 3, the Exploratory Regression tool will try all models with every combination of two explanatory variables, and all models with every combination of three explanatory variables.
Minimum_Acceptable_Adj_R_Squared (Double, Optional): This is the lowest Adjusted R-Squared value you consider a passing model. If a model passes all of your other search criteria, but has an Adjusted R-Squared value smaller than the value entered here, it will not show up as a Passing Model in the Output Report File. Valid values for this parameter range from 0.0 to 1.0. The default value is 0.5, indicating that passing models will explain at least 50 percent of the variation in the dependent variable.
Maximum_Coefficient_p_value_Cutoff (Double, Optional): For each model evaluated, OLS computes explanatory variable coefficient p-values. The cutoff p-value you enter here represents the confidence level you require for all coefficients in the model in order to consider the model passing. Small p-values reflect a stronger confidence level. Valid values for this parameter range from 1.0 down to 0.0, but will most likely be 0.1, 0.05, 0.01, 0.001, and so on. The default value is 0.05, indicating passing models will only contain explanatory variables whose coefficients are statistically significant at the 95 percent confidence level (p-values smaller than 0.05). To relax this default you would enter a larger p-value cutoff, such as 0.1. If you are getting lots of passing models, you will likely want to make this search criterion more stringent by decreasing the default p-value cutoff from 0.05 to 0.01 or smaller.
Maximum_VIF_Value_Cutoff (Double, Optional): This value reflects how much redundancy (multicollinearity) among model explanatory variables you will tolerate. When the VIF (Variance Inflation Factor) value is higher than about 7.5, multicollinearity can make a model unstable; consequently, 7.5 is the default value here. If you want your passing models to have less redundancy, you would enter a smaller value, such as 5.0, for this parameter.
Minimum_Acceptable_Jarque_Bera_p_value (Double, Optional): The p-value returned by the Jarque-Bera diagnostic test indicates whether the model residuals are normally distributed. If the p-value is statistically significant (small), the model residuals are not normal and the model is biased. Passing models should have large Jarque-Bera p-values. The default minimum acceptable p-value is 0.1. Only models returning p-values larger than this minimum will be considered passing. If you are having trouble finding unbiased passing models, and decide to relax this criterion, you might enter a smaller minimum p-value such as 0.05.
Minimum_Acceptable_Spatial_Autocorrelation_p_value (Double, Optional): For models that pass all of the other search criteria, the Exploratory Regression tool will check model residuals for spatial clustering using Global Moran's I. When the p-value for this diagnostic test is statistically significant (small), it indicates the model is very likely missing key explanatory variables (it isn't telling the whole story). Unfortunately, if you have spatial autocorrelation in your regression residuals, your model is misspecified, so you cannot trust your results. Passing models should have large p-values for this diagnostic test. The default minimum p-value is 0.1. Only models returning p-values larger than this minimum will be considered passing. If you are having trouble finding properly specified models because of this diagnostic test, and decide to relax this search criterion, you might enter a smaller minimum such as 0.05.

Code Sample
ExploratoryRegression example 1 (Python window)
The following Python window script demonstrates how to use the ExploratoryRegression tool.

import arcpy, os
arcpy.env.workspace = r"C:\ER"
arcpy.ExploratoryRegression_stats("911CallsER.shp", "Calls",
                                  "Pop;Jobs;LowEduc;Dst2UrbCen;Renters;Unemployed;Businesses;NotInLF;"
                                  "ForgnBorn;AlcoholX;PopDensity;MedIncome;CollGrads;PerCollGrd;"
                                  "PopFY;JobsFY;LowEducFY",
                                  "BG_911Calls.swm", "BG_911Calls.txt", "",
                                  "MAX_NUMBER_ONLY", "5", "1", "0.5", "0.05", "7.5", "0.1", "0.1")

ExploratoryRegression example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the ExploratoryRegression tool.


# Exploratory Regression of 911 calls in a metropolitan area


# using the Exploratory Regression Tool

# Import system modules


import arcpy, os

# Set geoprocessor object property to overwrite existing output, by default


arcpy.gp.overwriteOutput = True

try:
    # Set the current workspace (to avoid having to specify the full path to
    # the feature classes each time)
    arcpy.env.workspace = r"C:\ER"

    # Join the 911 Call Point feature class to the Block Group Polygon feature class
    # Process: Spatial Join
    fieldMappings = arcpy.FieldMappings()
    fieldMappings.addTable("BlockGroups.shp")
    fieldMappings.addTable("911Calls.shp")

    sj = arcpy.SpatialJoin_analysis("BlockGroups.shp", "911Calls.shp", "BG_911Calls.shp",
                                    "JOIN_ONE_TO_ONE", "KEEP_ALL", fieldMappings,
                                    "COMPLETELY_CONTAINS", "", "")

    # Delete extra fields to clean up the data
    # Process: Delete Field
    arcpy.DeleteField_management("BG_911Calls.shp",
                                 "OBJECTID;INC_NO;DATE_;MONTH_;STIME;"
                                 "SD_T;DISP_REC;NFPA_TYP;CALL_TYPE;RESP_COD;NFPA_SF;"
                                 "SIT_FND;FMZ_Q;FMZ;RD;JURIS;COMPANY;COMP_COD;RESP_YN;"
                                 "DISP_DT;DAY_;D1_N2;RESP_DT;ARR_DT;TURNOUT;TRAVEL;"
                                 "RESP_INT;ADDRESS_ID;CITY;CO;AV_STATUS;AV_SCORE;"
                                 "AV_SIDE;Season;DayNight")

    # Create Spatial Weights Matrix for Calculations
    # Process: Generate Spatial Weights Matrix
    swm = arcpy.GenerateSpatialWeightsMatrix_stats("BG_911Calls.shp", "TARGET_FID",
                                                   "BG_911Calls.swm",
                                                   "CONTIGUITY_EDGES_CORNERS",
                                                   "EUCLIDEAN", "1", "", "",
                                                   "ROW_STANDARDIZATION", "", "")

    # Exploratory Regression Analysis for 911 Calls
    # Process: Exploratory Regression
    er = arcpy.ExploratoryRegression_stats("BG_911Calls.shp", "Calls",
                                           "Pop;Jobs;LowEduc;Dst2UrbCen;Renters;Unemployed;Businesses;NotInLF;"
                                           "ForgnBorn;AlcoholX;PopDensity;MedIncome;CollGrads;PerCollGrd;"
                                           "PopFY;JobsFY;LowEducFY",
                                           "BG_911Calls.swm", "BG_911Calls.txt", "",
                                           "MAX_NUMBER_ONLY", "5", "1", "0.5", "0.05", "7.5", "0.1", "0.1")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Current_workspace, Scratch_workspace

Related Topics
An overview of the Modeling Spatial Relationships toolset
How Exploratory Regression works
Regression analysis basics
What they don't tell you about regression analysis
Interpreting Exploratory Regression results
Interpreting OLS results
Geographically Weighted Regression (GWR)
Spatial Autocorrelation (Global Moran's I)
What is a z-score? What is a p-value?
How OLS regression works

Copyright © 1995-2014 Esri. All rights reserved.

Generate Network Spatial Weights (Spatial Statistics)

Locate topic

Summary


Constructs a spatial weights matrix file (.swm) using a Network dataset, defining feature spatial relationships in terms of the underlying
network structure.
Learn more about how Generate Network Spatial Weights works

Illustration

Usage
 Output from this tool is a spatial weights matrix file (.swm). Tools that require you to specify a Conceptualization of Spatial
Relationships option will accept a spatial weights matrix file; select GET_SPATIAL_WEIGHTS_FROM_FILE for the Conceptualization of
Spatial Relationships parameter and, for the Weights Matrix File parameter, specify the full path to the spatial weights file created
using this tool.
 This tool was designed to work with point Input Feature Class data only.
 A spatial weights matrix quantifies the spatial relationships that exist among the features in your dataset. Many tools in the Spatial
Statistics toolbox evaluate each feature within the context of its neighboring features. The spatial weights matrix file defines those
neighbor relationships. For this tool, neighbor relationships are based on the time or distance between features, in the case where
travel is restricted to a network. For more information about spatial weights and spatial weights matrix files, see Spatial weights.

Tip: ESRI Data & Maps, free to ArcGIS users, contains StreetMap data including a prebuilt
network dataset in SDC format. The coverage for this dataset is the United States and
Canada. These network datasets can be used directly by this tool.

 The Unique ID field is linked to feature relationships derived from running this tool. Consequently, the Unique ID values must be
unique for every feature and typically should be in a permanent field that remains with the feature class. If you don't have a unique
ID field, you can create one by adding a new integer field (Add Field) to your feature class table and calculating the field values to be
equal to the FID or OBJECTID field (Calculate Field). Because the FID and OBJECTID field values may change when you copy or edit a
feature class, you cannot use these fields directly for the Unique ID parameter.
 The Maximum Number of Neighbors parameter for this tool specifies the exact number of neighbors that will be associated with each
feature. The Impedance Cutoff overrides the number of neighbors parameter, so some features may have fewer neighbors if the
number of neighbors specified cannot be found within the cutoff distance/time.
 You can define spatial relationships using the hierarchy in the network dataset, if it has one, by checking the Use Hierarchy in
Analysis parameter. The hierarchy classifies network edges into primary, secondary, and local roads. When using the hierarchy of the
network to create spatial relationships among features, preference will be given to travel on primary roads more than secondary
roads and secondary roads more than local roads.
 This tool does not honor the output coordinate system environment setting. All feature geometry is projected to match the spatial
reference associated with the Network Dataset prior to analysis. The resultant spatial weights matrix file created by this tool will
reflect spatial relationships defined using the Network Dataset spatial reference. It is recommended that when performing analyses
using a network spatial weights matrix file, the input feature class be projected to match the coordinate system of the network
dataset used to create the network SWM.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Syntax
GenerateNetworkSpatialWeights_stats (Input_Feature_Class, Unique_ID_Field, Output_Spatial_Weights_Matrix_File, Input_Network,
Impedance_Attribute, {Impedance_Cutoff}, {Maximum_Number_of_Neighbors}, {Barriers}, {U-turn_Policy}, {Restrictions},
{Use_Hierarchy_in_Analysis}, {Search_Tolerance}, {Conceptualization_of_Spatial_Relationships}, {Exponent}, {Row_Standardization})

Parameter (Data Type): Explanation

Input_Feature_Class (Feature Class): The point feature class for which network spatial relationships among features will be assessed.
Unique_ID_Field (Field): An integer field containing a different value for every feature in the input feature class. If you don't have a Unique ID field, you can create one by adding an integer field to your feature class table and calculating the field values to equal the FID or OBJECTID field.
Output_Spatial_Weights_Matrix_File (File): The output network spatial weights matrix (.swm) file.
Input_Network (Network Dataset Layer): The network dataset for which spatial relationships among features in the input feature class will be defined.
Impedance_Attribute (String): The type of cost units to use as impedance in the analysis.
Impedance_Cutoff (Double, Optional): Specifies a cutoff value for INVERSE and FIXED conceptualizations of spatial relationships. Enter this value using the units specified by the Impedance_Attribute parameter. A value of zero indicates that no threshold is applied. When this parameter is left blank, a default threshold value is computed based on input feature class extent and the number of features.
Maximum_Number_of_Neighbors (Long, Optional): An integer reflecting the maximum number of neighbors to find for each feature.
Barriers (Feature Layer, Optional): The name of a point feature class with features representing blocked intersections, road closures, accident sites, or other locations where travel is blocked along the network.
U-turn_Policy (String, Optional): Specifies optional U-turn restrictions.
 ALLOW_UTURNS —U-turns will be possible anywhere. This is the default.
 NO_UTURNS —No U-turns will be allowed during navigation.
 ALLOW_DEAD_ENDS_ONLY —U-turns will be possible only at the dead ends (that is, single-valent junctions).
Restrictions [Restriction,...] (String, Optional): A list of restrictions. Check the restrictions to be honored in spatial relationship computations.
Use_Hierarchy_in_Analysis (Boolean, Optional): Specifies whether or not to use a hierarchy in the analysis.
 USE_HIERARCHY —Will use the network dataset's hierarchy attribute in a heuristic path algorithm to speed analysis.
 NO_HIERARCHY —Will use an exact path algorithm instead. If there is no hierarchy attribute, this option does not affect analysis.
Search_Tolerance (Linear unit, Optional): The search threshold used to locate features in the Input_Feature_Class onto the network dataset. This parameter includes a search value and the units for the tolerance.
Conceptualization_of_Spatial_Relationships (String, Optional): Specifies how the weighting associated with each spatial relationship is specified.
 INVERSE —Features farther away have a smaller weight than features nearby.
 FIXED —Features within the Impedance_Cutoff are neighbors (weight of 1); features outside the Impedance_Cutoff are not weighted (weight of 0).
Exponent (Double, Optional): Parameter for the INVERSE Conceptualization_of_Spatial_Relationships calculation. Typical values are 1 or 2. Weights drop off quicker with distance as this exponent value increases.
Row_Standardization (Boolean, Optional): Row standardization is recommended whenever feature distribution is potentially biased due to sampling design or to an imposed aggregation scheme.
 ROW_STANDARDIZATION —Spatial weights are standardized by row. Each weight is divided by its row sum.
 NO_STANDARDIZATION —No standardization of spatial weights is applied.

Code Sample
GenerateNetworkSpatialWeights example 1 (Python window)
The following Python window script demonstrates how to use the GenerateNetworkSpatialWeights tool.

import arcpy
arcpy.env.workspace = "c:/data"
arcpy.GenerateNetworkSpatialWeights_stats("Hospital.shp", "MyID", "network6Neighs.swm",
                                          "Streets_ND", "MINUTES", 10, 6, "#",
                                          "ALLOW_UTURNS", "#", "USE_HIERARCHY",
                                          "#", "INVERSE", 1, "ROW_STANDARDIZATION")

GenerateNetworkSpatialWeights example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the GenerateNetworkSpatialWeights tool.


# Create a Spatial Weights Matrix based on Network Data

# Import system modules


import arcpy

# Set the geoprocessor object property to overwrite existing output


arcpy.gp.overwriteOutput = True

# Check out the ArcGIS Network Analyst extension (required for the Generate Network Spatial Weights tool)
arcpy.CheckOutExtension("Network")

# Local variables...
workspace = r"C:\Data"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Create Spatial Weights Matrix based on Network Data
    # Process: Generate Network Spatial Weights...
    nwm = arcpy.GenerateNetworkSpatialWeights_stats("Hospital.shp", "MyID",
                                                    "network6Neighs.swm", "Streets_ND",
                                                    "MINUTES", 10, 6, "#", "ALLOW_UTURNS",
                                                    "#", "USE_HIERARCHY", "#", "INVERSE",
                                                    1, "ROW_STANDARDIZATION")

    # Create Spatial Weights Matrix based on Euclidean Distance
    # Process: Generate Spatial Weights Matrix...
    swm = arcpy.GenerateSpatialWeightsMatrix_stats("Hospital.shp", "MYID",
                                                   "euclidean6Neighs.swm",
                                                   "K_NEAREST_NEIGHBORS",
                                                   "#", "#", "#", 6)

    # Calculate Moran's Index of Spatial Autocorrelation for
    # average hospital visit times using Network Spatial Weights
    # Process: Spatial Autocorrelation (Morans I)...
    moransINet = arcpy.SpatialAutocorrelation_stats("Hospital.shp", "VisitTime",
                                                    "NO_REPORT", "GET_SPATIAL_WEIGHTS_FROM_FILE",
                                                    "EUCLIDEAN_DISTANCE", "NONE", "#",
                                                    "network6Neighs.swm")

    # Calculate Moran's Index of Spatial Autocorrelation for
    # average hospital visit times using Euclidean Spatial Weights
    # Process: Spatial Autocorrelation (Morans I)...
    moransIEuc = arcpy.SpatialAutocorrelation_stats("Hospital.shp", "VisitTime",
                                                    "NO_REPORT", "GET_SPATIAL_WEIGHTS_FROM_FILE",
                                                    "EUCLIDEAN_DISTANCE", "NONE", "#",
                                                    "euclidean6Neighs.swm")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Current_workspace, Scratch_workspace

Related Topics
An overview of the Modeling Spatial Relationships toolset
Spatial Autocorrelation (Global Moran's I)
High/Low Clustering (Getis-Ord General G)
Cluster and Outlier Analysis (Anselin Local Moran's I)
Hot Spot Analysis (Getis-Ord Gi*)
Grouping Analysis
Spatial weights
Modeling spatial relationships
Generate Spatial Weights Matrix
What is a network dataset?

Copyright © 1995-2014 Esri. All rights reserved.

Generate Spatial Weights Matrix (Spatial Statistics)

Locate topic

Summary
Constructs a spatial weights matrix (.swm) file to represent the spatial relationships among features in a dataset.
Learn more about how Generate Spatial Weights Matrix works


Illustration

Spatial relationships based on polygon contiguity, Queen's case: shared edges or nodes.

Usage
 Output from this tool is a spatial weights matrix file (.swm). Tools, such as Hot_Spot_Analysis, that require you to specify a
Conceptualization of Spatial Relationships will accept a spatial weights matrix file; select GET_SPATIAL_WEIGHTS_FROM_FILE for the
Conceptualization of Spatial Relationships parameter, and for the Weights Matrix File parameter, specify the full path to the spatial
weights file you create using this tool.
 This tool also reports characteristics of the resultant spatial weights matrix file: number of features, connectivity, minimum,
maximum and average number of neighbors. This summary is accessible from the Results window and may be viewed by right-
clicking on the Messages entry in the Results window and selecting View. Using this summary, ensure that all features have at least
1 neighbor. In general, especially with large datasets, a minimum of 8 neighbors and a low value for feature connectivity is desirable.
 For space/time analyses, select SPACE_TIME_WINDOW for the Conceptualization of Spatial Relationships parameter. You define
space by specifying a Threshold Distance value; you define time by specifying a Date/Time Field and both a Date/Time Type (such as
HOURS or DAYS) and a Date/Time Interval Value. The Date/Time Interval Value is an integer. For example, if you enter 1000 feet,
select HOURS, and provide a Date/Time Interval Value of 3, features within 1,000 feet and occurring within 3 hours of each other
would be considered neighbors (see the example sketch at the end of this section).
 The spatial weights matrix file (.swm) was designed to allow you to generate, store, reuse, and share your conceptualization of the
relationships among a set of features. To improve performance the file is created in a binary file format. Feature relationships are
stored as a sparse matrix, so only nonzero relationships are written to the SWM file. In general, tools will perform well even when the
SWM file contains more than 15 million nonzero relationships. If a memory error is encountered when using the SWM file, however,
you should revisit how you are defining your feature relationships. As a rule of thumb, you should aim for a spatial weights matrix
where every feature has at least 1 neighbor, most have about 8 neighbors, and no feature has more than about 1,000 neighbors.
 When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the
output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal
distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances,
at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points
on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to
connect those two points. Chordal distances are reported in meters.

Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal
distances are not a good estimate of geodesic distances beyond 30 degrees.

 When chordal distances are used in the analysis, the Threshold Distance parameter, if specified, should be given in meters.
 Prior to ArcGIS 10.2.1, you would see a warning message if the parameters and environment settings you selected would result in
calculations being performed using Geographic Coordinates (degrees, minutes, seconds). This warning advised you to project your
data into a Projected Coordinate System so that distance calculations would be accurate. Beginning at 10.2.1, however, this tool
calculates chordal distances whenever Geographic Coordinate System calculations are required.

Caution: Because of this change, there is a small chance that you will need to modify models that
incorporate this tool if your models were created prior to ArcGIS 10.2.1 and if your models
include hard-coded Geographic Coordinate System parameter values. If, for example, a
distance parameter is set to something like 0.0025 degrees, you will need to convert that
fixed value from degrees to meters and resave your model.

 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 The Unique ID field is linked to feature relationships derived from running this tool. Consequently, the Unique ID values must be
unique for every feature and typically should be in a permanent field that remains with the feature class. If you don't have a unique
ID field, you can create one by adding a new integer field (Add Field) to your feature class table and calculating the field values to be
equal to the FID or OBJECTID field (Calculate Field). Because the FID and OBJECTID field values may change when you copy or edit a
feature class, you cannot use these fields directly for the Unique ID parameter.
 The Number of Neighbors parameter may override the Threshold Distance parameter for Inverse or Fixed Distance
Conceptualizations of Spatial Relationships. If you specify a threshold distance of 10 miles and 3 for the number of neighbors, all
features will receive a minimum of 3 neighbors even if the threshold has to be increased to find them. The threshold distance is only
increased in those cases where the minimum number of neighbors is not met.
 The CONVERT_TABLE option for the Conceptualization of Spatial Relationships parameter may be used to convert an ASCII spatial
weights matrix file to a SWM formatted spatial weights matrix file. First, you will need to put your ASCII weights into a formatted
table (using Excel, for example).

Caution: If your table includes weights for self-potential, they will be omitted from the SWM output
file, and the default self-potential value will be used in analyses. The default self-potential
value for the Hot_Spot_Analysis tool is one, but this value can be overwritten by specifying
a Self-Potential Field value; for all other tools, the default self-potential value is zero.

 For polygon features, you will almost always want to choose ROW for the Row Standardization parameter. Row Standardization
mitigates bias when the number of neighbors each feature has is a function of the aggregation scheme or sampling process, rather
than reflecting the actual spatial distribution of the variable you are analyzing.
 The Modeling Spatial Relationships help topic provides additional information about this tool's parameters.


 The tools that can use a spatial weights matrix file project feature geometry to the output coordinate system prior to analysis, and all
mathematical computations are based on the output coordinate system. Consequently, for every analysis that uses a spatial weights
matrix file, make sure the output coordinate system environment matches the setting that was in effect when the spatial weights
matrix file was created, or project the input feature class so that it matches the spatial reference associated with the spatial weights
matrix file.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.
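The example sketch referenced above (Python window style) pulls several of these usage notes together. It is only a sketch: Incidents.shp, the CALL_TIME and SEVERITY fields, the 1,000-unit space window, and the 3-hour time window are placeholder names and values; substitute your own data, and enter the threshold in the units of the output coordinate system (meters when chordal distances apply).

import arcpy
arcpy.env.workspace = "c:/data"

# Add and populate a permanent unique ID field (FID/OBJECTID cannot be used directly)
arcpy.AddField_management("Incidents.shp", "MyID", "LONG")
arcpy.CalculateField_management("Incidents.shp", "MyID", "[FID]", "VB")

# Space-time weights: neighbors are features within 1000 units of each other
# and occurring within 3 hours of each other
arcpy.GenerateSpatialWeightsMatrix_stats("Incidents.shp", "MyID", "spaceTime.swm",
                                         "SPACE_TIME_WINDOW", "EUCLIDEAN", "#",
                                         1000, "#", "ROW_STANDARDIZATION", "#",
                                         "CALL_TIME", "HOURS", 3)

# Use the .swm file with Hot Spot Analysis via GET_SPATIAL_WEIGHTS_FROM_FILE
arcpy.HotSpots_stats("Incidents.shp", "SEVERITY", "IncidentHotSpots.shp",
                     "GET_SPATIAL_WEIGHTS_FROM_FILE", "EUCLIDEAN_DISTANCE",
                     "NONE", "#", "#", "spaceTime.swm")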

Syntax
GenerateSpatialWeightsMatrix_stats (Input_Feature_Class, Unique_ID_Field, Output_Spatial_Weights_Matrix_File,
Conceptualization_of_Spatial_Relationships, {Distance_Method}, {Exponent}, {Threshold_Distance}, {Number_of_Neighbors},
{Row_Standardization}, {Input_Table}, {Date_Time_Field}, {Date_Time_Interval_Type}, {Date_Time_Interval_Value})
Parameter Explanation Data Type
Input_Feature_Class The feature class for which spatial relationships of features will be Feature Class
assessed.
Unique_ID_Field An integer field containing a different value for every feature in the Field
input feature class. If you don't have a Unique ID field, you can
create one by adding an integer field to your feature class table and
calculating the field values to equal the FID or OBJECTID field.
Output_Spatial_Weights_Matrix_File File
The full path for the spatial weights matrix file (.swm) you want to
create.
Conceptualization_of_Spatial_Relationships String
Specifies how spatial relationships among features are
conceptualized.
 INVERSE_DISTANCE —The impact of one feature on another
feature decreases with distance.
 FIXED_DISTANCE —Everything within a specified critical distance
of each feature is included in the analysis; everything outside
the critical distance is excluded.
 K_NEAREST_NEIGHBORS —The closest k features are included
in the analysis; k is a specified numeric parameter.
 CONTIGUITY_EDGES_ONLY —Polygon features that share a
boundary are neighbors.
 CONTIGUITY_EDGES_CORNERS —Polygon features that share a
boundary and/or share a node are neighbors.
 DELAUNAY_TRIANGULATION —A mesh of nonoverlapping
triangles is created from feature centroids; features associated
with triangle nodes that share edges are neighbors.
 SPACE_TIME_WINDOW —Features within a specified critical
distance and specified time interval of each other are neighbors.
 CONVERT_TABLE —Spatial relationships are defined in a table.
Note: Polygon Contiguity methods are only available with an ArcGIS
for Desktop Advanced license.
Distance_Method Specifies how distances are calculated from each feature to String
(Optional) neighboring features.
 EUCLIDEAN —The straight-line distance between two points (as
the crow flies)
 MANHATTAN —The distance between two points measured along
axes at right angles (city block); calculated by summing the
(absolute) difference between the x- and y-coordinates
Exponent Double
Parameter for inverse distance calculation. Typical values are 1 or 2.
(Optional)
Threshold_Distance Specifies a cutoff distance for Inverse Distance and Fixed Distance Double
(Optional) conceptualizations of spatial relationships. Enter this value using the
units specified in the environment output coordinate system. Defines
the size of the Space window for the Space Time Window
conceptualization of spatial relationships.
A value of zero indicates that no threshold distance is applied. When
this parameter is left blank, a default threshold value is computed
based on output feature class extent and the number of features.
Number_of_Neighbors Long
(Optional) An integer reflecting either the minimum or the exact number of
neighbors. For K Nearest Neighbors, each feature will have exactly
this specified number of neighbors. For Inverse Distance or Fixed
Distance each feature will have at least this many neighbors (the
threshold distance will be temporarily extended to ensure this many
neighbors, if necessary). When one of the contiguity
Conceptualizations of Spatial Relationships is selected, then each
polygon will be assigned this minimum number of neighbors. For
polygons with fewer than this number of contiguous neighbors,
additional neighbors will be based on feature centroid proximity.
Row_Standardization Boolean
Row standardization is recommended whenever feature distribution
(Optional) is potentially biased due to sampling design or to an imposed
aggregation scheme.
 ROW_STANDARDIZATION —Spatial weights are standardized by
row. Each weight is divided by its row sum.
 NO_STANDARDIZATION —No standardization of spatial weights
is applied.


Input_Table Table
A table containing numeric weights relating every feature to every
(Optional) other feature in the input feature class. Required fields are the Input
Feature Class Unique ID field, NID (neighbor ID), and WEIGHT.
Date_Time_Field A date field with a timestamp for each feature. Field
(Optional)
Date_Time_Interval_Type String
The units to use for measuring time.
(Optional)
 SECONDS —Seconds
 MINUTES —Minutes
 HOURS —Hours
 DAYS —Days
 WEEKS —Weeks
 MONTHS —Months (treated as 30 days)
 YEARS —Years
Date_Time_Interval_Value An Integer reflecting the number of time units comprising the time Long
(Optional) window.
For example, if you select HOURS for the Date/Time Interval Type
and 3 for the Date/Time Interval Value, the time window would be 3
hours; features within the specified space window and within the
specified time window would be neighbors.

Code Sample
GenerateSpatialWeightsMatrix example 1 (Python window)
The following Python window script demonstrates how to use the GenerateSpatialWeightsMatrix tool.

import arcpy
arcpy.env.workspace = "C:/data"
arcpy.GenerateSpatialWeightsMatrix_stats("911Count.shp", "MYID", "euclidean6Neighs.swm", "K_NEAREST_NEIGHBORS", "#", "#", "#", 6)

GenerateSpatialWeightsMatrix example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the GenerateSpatialWeightsMatrix tool.


# Analyze the spatial distribution of 911 calls in a metropolitan area


# using the Hot-Spot Analysis Tool (Local Gi*)

# Import system modules


import arcpy

# Set geoprocessor object property to overwrite existing output, by default


arcpy.gp.overwriteOutput = True

# Local variables...
workspace = "C:/Data"

try:
# Set the current workspace (to avoid having to specify the full path to the feature classes each time)
arcpy.env.workspace = workspace

# Copy the input feature class and integrate the points to snap
# together at 500 feet
# Process: Copy Features and Integrate
cf = arcpy.CopyFeatures_management("911Calls.shp", "911Copied.shp",
"#", 0, 0, 0)

integrate = arcpy.Integrate_management("911Copied.shp #", "500 Feet")

# Use Collect Events to count the number of calls at each location


# Process: Collect Events
ce = arcpy.CollectEvents_stats("911Copied.shp", "911Count.shp", "Count", "#")

# Add a unique ID field to the count feature class


# Process: Add Field and Calculate Field
af = arcpy.AddField_management("911Count.shp", "MyID", "LONG", "#", "#", "#", "#",
"NON_NULLABLE", "NON_REQUIRED", "#",
"911Count.shp")

cf = arcpy.CalculateField_management("911Count.shp", "MyID", "[FID]", "VB")

# Create Spatial Weights Matrix for Calculations


# Process: Generate Spatial Weights Matrix...
swm = arcpy.GenerateSpatialWeightsMatrix_stats("911Count.shp", "MYID",
"euclidean6Neighs.swm",
"K_NEAREST_NEIGHBORS",
"#", "#", "#", 6,
"NO_STANDARDIZATION")

# Hot Spot Analysis of 911 Calls


# Process: Hot Spot Analysis (Getis-Ord Gi*)
hs = arcpy.HotSpots_stats("911Count.shp", "ICOUNT", "911HotSpots.shp",
"GET_SPATIAL_WEIGHTS_FROM_FILE",
"EUCLIDEAN_DISTANCE", "NONE",
"#", "#", "euclidean6Neighs.swm")

except:
# If an error occurred when running the tool, print out the error message.
print arcpy.GetMessages()

Environments
Current_workspace, Scratch_workspace, Output_coordinate_system, Geographic_transformations

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis, so values entered for the Threshold Distance
parameter should match those specified in the Output Coordinate System. All mathematical computations are based on the spatial
reference of the Output Coordinate System. When the Output Coordinate System is based on degrees, minutes, and seconds,
geodesic distances are estimated using chordal distances in meters.
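As a hedged illustration, the Output Coordinate System environment can be set from Python before running the tool so that the Threshold Distance is interpreted in known units; the UTM zone (WKID 26911) and the 5,000-meter threshold below are placeholders only.

import arcpy
arcpy.env.workspace = "c:/data"

# Project all calculations to a projected coordinate system (here NAD 1983 UTM Zone 11N)
# so the 5000 entered below is interpreted as 5,000 meters
arcpy.env.outputCoordinateSystem = arcpy.SpatialReference(26911)

arcpy.GenerateSpatialWeightsMatrix_stats("Counties.shp", "MyID", "fixed5km.swm",
                                         "FIXED_DISTANCE", "EUCLIDEAN", "#", 5000)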

Related Topics
An overview of the Modeling Spatial Relationships toolset
Spatial Autocorrelation (Global Moran's I)
High/Low Clustering (Getis-Ord General G)
Cluster and Outlier Analysis (Anselin Local Moran's I)
Hot Spot Analysis (Getis-Ord Gi*)
Grouping Analysis
Spatial weights
Modeling spatial relationships
Generate Network Spatial Weights

Copyright © 1995-2014 Esri. All rights reserved.

Geographically Weighted Regression (GWR) (Spatial Statistics)


Locate topic

Summary
Performs Geographically Weighted Regression (GWR), a local form of linear regression used to model spatially varying relationships.
Learn more about how Geographically Weighted Regression works

Illustration

GWR is a local regression model. Coefficients are allowed to vary.

Usage
 GWR constructs a separate equation for every feature in the dataset, incorporating the dependent and explanatory variables of
features falling within the bandwidth of each target feature. The shape and extent of the bandwidth depend on user input for the
Kernel type, Bandwidth method, Distance, and Number of neighbors parameters, with one restriction: when the number of
neighboring features would exceed 1,000, only the closest 1,000 are incorporated into each local equation.
 GWR should be applied to datasets with several hundred features for best results. It is not an appropriate method for small datasets.
The tool does not work with multipoint data.

Note: The GWR tool produces a variety of different outputs. Right-clicking on the Messages entry
in the Results window and selecting View will display a GWR tool execution summary
report.

The GWR tool also produces an Output feature class and a table with the tool execution summary report diagnostic values. The name
of this table is automatically generated using the output feature class name with _supp suffix. The Output feature class is automatically
added to the table of contents with a hot/cold rendering scheme applied to model residuals. A full explanation of each output is
provided in Interpreting GWR results.
 The _supp file is always created in the same location as the Output feature class unless the output feature class is created inside a
Feature Dataset. When the output feature class is inside a feature dataset, the _supp table is created in the geodatabase containing
the feature dataset.
 Using projected data is always recommended; it is especially important whenever distance is a component of the analysis, as it is for
GWR when you select Fixed for Kernel type. It is strongly recommended that your data is projected using a Projected Coordinate
System (rather than a Geographic Coordinate System).
 Some of the computations done by the GWR tool take advantage of multiple CPUs in order to increase performance and will
automatically use up to 8 threads/CPUs for processing.
 You should always begin regression analysis with Ordinary Least Squares (OLS) regression. First find a properly specified OLS model,
then use the same explanatory variables to run GWR (excluding any "dummy" explanatory variables representing different spatial
regimes). A minimal workflow sketch is included at the end of this section.
 Dependent and Explanatory variables should be numeric fields containing a variety of values. Linear regression methods, like GWR,
are not appropriate for predicting binary outcomes (e.g., all of the values for the dependent variable are either 1 or 0).
 In global regression models, such as Ordinary Least Squares Regression (OLS), results are unreliable when two or more variables
exhibit multicollinearity (when two or more variables are redundant or together tell the same "story"). GWR builds a local regression
equation for each feature in the dataset. When the values for a particular explanatory variable cluster spatially, you will very likely
have problems with local multicollinearity. The Condition Number field (COND) in the output feature class indicates when results are
unstable due to local multicollinearity. As a rule of thumb, do not trust results for features with a condition number larger than 30,
equal to Null or, for shapefiles, equal to -1.7976931348623158e+308.
 Caution should be used when including nominal or categorical data in a GWR model. Where categories cluster spatially, there is
strong risk of encountering local multicollinearity issues. The condition number included in the GWR output indicates when local
collinearity is a problem (a condition number less than zero, greater than 30, or set to Null). Results in the presence of local
multicollinearity are unstable.
 Do not use "dummy" explanatory variables to represent different spatial regimes in a GWR model (e.g., census tracts outside the
urban core are assigned a value of 1, while all others are assigned a value of 0). Because GWR allows explanatory variable
coefficients to vary, these spatial regime explanatory variables are unnecessary, and if included, will create problems with local
multicollinearity.
 To better understand regional variation among the coefficients of your explanatory variables, examine the optional raster coefficient
surfaces created by GWR. These raster surfaces are created in the Coefficient raster workspace, if you specify one. For polygon data,
you can use graduated color or cold-to-hot rendering on each coefficient field in the Output feature class to examine changes across
your study area.
 You may use GWR for prediction by supplying a Predictions locations feature class (often this feature class is the same as the Input
feature class ), the Prediction explanatory variables, and an Output prediction feature class . There must be a one to one
correspondence between the fields used to calibrate the regression model (the values entered for the Explanatory variables field) and
the fields used for prediction (the values entered for the Prediction explanatory variables field). The order of these variables must be
the same. Suppose, for example, you are modeling traffic accidents as a function of speed limits, road conditions, number of lanes,
and number of cars. You can predict the impact that changing speed limits or improving roads might have on traffic accidents by
creating new variables with the amended speed limits and road conditions. The existing variables would be used to calibrate the
regression model and would be used for the Explanatory variables parameter. The amended variables would be used for predictions
and would be entered as your Prediction explanatory variables.
 If a Prediction locations feature class is provided, but no Prediction explanatory variables are specified, the Output prediction feature
class is created with computed coefficients for each location only (no predictions).
 A regression model is misspecified if it is missing a key explanatory variable. Statistically significant spatial autocorrelation of the
regression residuals and/or unexpected spatial variation among the coefficients of one or more explanatory variables suggests that
your model is misspecified. You should make every effort (through OLS residual analysis and GWR coefficient variation analysis, for
example) to discover what these key missing variables are so they may be included in the model.
 Always question whether or not it makes sense for an explanatory variable to be nonstationary. For example, suppose you are
modeling the density of a particular plant species as a function of several variables including ASPECT. If you find that the coefficient
for the ASPECT variable changes across the study area, you are likely seeing evidence of a key missing explanatory variable (perhaps
prevalence of competing vegetation, for example). You should make every effort to include all key explanatory variables in your
regression model.

Caution: Whenever using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may, consequently, store null
values as zero or as some very small negative number (-DBL_MAX = -
1.7976931348623158e+308). This can lead to unexpected results. See also: Geoprocessing
considerations for shapefile output.

 When the result of a computation is infinity or undefined, the result for nonshapefiles will be Null; for shapefiles the result will be -
DBL_MAX = -1.7976931348623158e+308.
 When you select either Akaike Information Criterion or Cross Validation (AICc or CV in Python) for the Bandwidth Method parameter,
GWR will find the optimal distance (for a fixed kernel) or optimal number of neighbors (for an adaptive kernel). Problems with local
multicollinearity, however, will prevent both the Akaike Information Criterion and Cross Validation bandwidth methods from resolving
an optimal distance or number of neighbors. If you get an error indicating severe model design problems, try specifying a particular
distance or neighbor count, then examine the condition numbers in the Output feature class to see which features are associated
with local multicollinearity problems.
 Severe model design errors, or errors indicating local equations do not include enough neighbors, often indicate a problem with
global or local multicollinearity. To determine where the problem is, run your model using OLS and examine the VIF value for each
explanatory variable. If some of the VIF values are large (above 7.5, for example), global multicollinearity is preventing GWR from
solving. More likely, however, local multicollinearity is the problem. Try creating a thematic map for each explanatory variable. If the
map reveals spatial clustering of identical values, consider removing those variables from the model or combining those variables with
other explanatory variables in order to increase value variation. If, for example, you are modeling home values and have variables for
both bedrooms and bathrooms, you may want to combine these to increase value variation, or to represent them as
bathroom/bedroom square footage. Avoid using spatial regime dummy variables, spatially clustering categorical/nominal variables, or
variables with very few possible values when constructing GWR models.
 GWR is a linear model subject to the same requirements as OLS. Review the How Regression Models Go Bad section in the Regression
Analysis Basics document as a check that your GWR model is properly specified.
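The minimal workflow sketch referenced above shows one way to apply these usage notes: calibrate OLS first, reuse the same explanatory variables in GWR, and then scan the condition number (COND) field for unstable local results. The feature class, field names, and the 30-neighbor adaptive bandwidth are placeholders, not recommendations.

import arcpy
arcpy.env.workspace = "c:/data"

# 1. Calibrate a properly specified global model with OLS first
arcpy.OrdinaryLeastSquares_stats("Tracts.shp", "MyID", "olsOut.shp", "Crime",
                                 "Income;Density;Vacancy")

# 2. Run GWR with the same explanatory variables (no spatial regime dummy variables)
arcpy.GeographicallyWeightedRegression_stats("Tracts.shp", "Crime",
                                             "Income;Density;Vacancy", "gwrOut.shp",
                                             "ADAPTIVE", "BANDWIDTH_PARAMETER",
                                             "#", 30)

# 3. Flag features whose local results may be unstable: condition numbers above 30,
#    Null, or the large negative shapefile null placeholder
with arcpy.da.SearchCursor("gwrOut.shp", ["OID@", "COND"]) as rows:
    for oid, cond in rows:
        if cond is None or cond > 30 or cond < 0:
            print "Check local multicollinearity near feature %s" % oid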

Syntax
GeographicallyWeightedRegression_stats (in_features, dependent_field, explanatory_field, out_featureclass, kernel_type,
bandwidth_method, {distance}, {number_of_neighbors}, {weight_field}, {coefficient_raster_workspace}, {cell_size},
{in_prediction_locations}, {prediction_explanatory_field}, {out_prediction_featureclass})
Parameter Explanation Data Type
in_features Feature Layer
The feature class containing the dependent and independent variables.
dependent_field Field
The numeric field containing values for what you are trying to model.
explanatory_field Field
A list of fields representing independent explanatory variables in your
[explanatory_field,...] regression model.
out_featureclass Feature Class
The output feature class to receive dependent variable estimates and
residuals.
kernel_type String
Specifies if the kernel is constructed as a fixed distance, or if it is allowed
to vary in extent as a function of feature density.
 FIXED —The spatial context (the Gaussian kernel) used to solve each
local regression analysis is a fixed distance.
 ADAPTIVE —The spatial context (the Gaussian kernel) is a function of a
specified number of neighbors. Where feature distribution is dense, the
spatial context is smaller; where feature distribution is sparse, the
spatial context is larger.
bandwidth_method String
Specifies how the extent of the kernel should be determined. When Akaike
Information Criterion or Cross Validation are selected (AICc or CV in
Python) the tool will find the optimal distance or number of neighbors for
you. Typically you will select either Akaike Information Criterion or Cross
Validation if you don't know what to use for the Distance or Number of
neighbors parameter.

 AICc —The extent of the kernel is determined using the Akaike


Information Criterion (AICc).
 CV —The extent of the kernel is determined using Cross Validation.
 BANDWIDTH_PARAMETER —The extent of the kernel is determined by
a fixed distance or a fixed number of neighbors. You must specify a
value for either the Distance or Number of Neighbors parameters.
distance The distance whenever the kernel_type is FIXED and bandwidth_method Double
(Optional) is BANDWIDTH_PARAMETER.
number_of_neighbors Long
The exact number of neighbors to include in the local bandwidth of the
(Optional) Gaussian kernel when kernel_type is ADAPTIVE and the
bandwidth_method is BANDWIDTH_PARAMETER.
weight_field Field
The numeric field containing a spatial weighting for individual features. This
(Optional) weight field allows some features to be more important in the model
calibration process than others. Primarily useful when the number of
samples taken at different locations varies, values for the dependent and
independent variables are averaged, and places with more samples are
more reliable (should be weighted higher). If you have an average of 25
different samples for one location, but an average of only 2 samples for
another location, you can use the number of samples as your weight field
so that locations with more samples have a larger influence on model
calibration than locations with few samples.
coefficient_raster_workspace Folder
A full pathname to the workspace where all of the coefficient rasters will be
(Optional) created. When this workspace is provided, rasters are created for the
intercept and every explanatory variable.
cell_size Analysis Cell Size
The cell size (a number) or reference to the cell size (a pathname to a
(Optional) raster dataset) to use when creating the coefficient rasters.
The default cell size is the shortest of the width or height of the extent
specified in the geoprocessing environment output coordinate system,
divided by 250.
in_prediction_locations Feature Layer
A feature class containing features representing locations where estimates
(Optional) should be computed. Each feature in this dataset should contain values for
all of the explanatory variables specified; the dependent variable for these
features will be estimated using the model calibrated for the input feature
class data.
prediction_explanatory_field Field
A list of fields representing explanatory variables in the Prediction locations
[prediction_explanatory_field,...] feature class. These field names should be provided in the same order (a
(Optional) one-to-one correspondence) as those listed for the input feature class
Explanatory variables parameter. If no prediction explanatory variables are
given, the output prediction feature class will only contain computed
coefficient values for each prediction location.
out_prediction_featureclass Feature Class
The output feature class to receive dependent variable estimates for each
(Optional) feature in the Prediction locations feature class.

Code Sample
GeographicallyWeightedRegression Example (Python Window)
The following Python Window script demonstrates how to use the GeographicallyWeightedRegression tool.

import arcpy
arcpy.env.workspace = "c:/data"
arcpy.GeographicallyWeightedRegression_stats("CallData.shp", "Calls","BUS_COUNT;RENTROCC00;NoHSDip",
"CallsGWR.shp", "ADAPTIVE", "BANDWIDTH PARAMETER",
"#", "25", "#","CoefRasters", "135", "PredictionPoints"
"#", "GWRCallPredictions.shp")

GeographicallyWeightedRegression Example (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the GeographicallyWeightedRegression tool.


# Model 911 emergency calls using GWR

# Import system modules


import arcpy

# Set the geoprocessor object property to overwrite existing outputs


arcpy.gp.overwriteOutput = True

# Local variables...
workspace = r"C:\Data"

try:
# Set the current workspace (to avoid having to specify the full path to the feature classes each time)
arcpy.env.workspace = workspace

# 911 Calls as a function of {number of businesses, number of rental units,


# number of adults who didn't finish high school}
# Process: Geographically Weighted Regression...
gwr = arcpy.GeographicallyWeightedRegression_stats("CallData.shp", "Calls",
"BUS_COUNT;RENTROCC00;NoHSDip",
"CallsGWR.shp", "ADAPTIVE", "BANDWIDTH PARAMETER","#", "25", "#",
"CoefRasters", "135", "PredictionPoints", "#", "GWRCallPredictions.shp")

# Create Spatial Weights Matrix to use with Global Moran's I tool


# Process: Generate Spatial Weights Matrix...
swm = arcpy.GenerateSpatialWeightsMatrix_stats("CallsGWR.shp", "UniqID",
"CallData25Neighs.swm",
"K_NEAREST_NEIGHBORS",
"#", "#", "#", 25)

# Calculate Moran's Index of Spatial Autocorrelation for


# OLS Residuals using a SWM File.
# Process: Spatial Autocorrelation (Morans I)...
moransI = arcpy.SpatialAutocorrelation_stats("CallsGWR.shp", "StdResid",
"NO_REPORT", "GET_SPATIAL_WEIGHTS_FROM_FILE",
"EUCLIDEAN_DISTANCE", "NONE", "#",
"CallData25Neighs.swm")

except:
# If an error occurred when running the tool, print out the error message.
print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Cell_size, Snap_raster

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System after analysis is complete. Consequently, the value entered for the
Distance parameter should be specified in the same units as the Input feature class. Values entered for the Output cell size should
be specified in the same units as the Output Coordinate System.

Related Topics
An overview of the Modeling Spatial Relationships toolset
Ordinary Least Squares (OLS)
Regression analysis basics
Interpreting GWR results
How GWR works
Exploratory Regression

Copyright © 1995-2014 Esri. All rights reserved.

Ordinary Least Squares (OLS) (Spatial Statistics)

Locate topic

Summary
Performs global Ordinary Least Squares (OLS) linear regression to generate predictions or to model a dependent variable in terms of its
relationships to a set of explanatory variables.
You can access the results of this tool (including the optional report file) from the Results window. If you disable background processing,
results will also be written to the Progress dialog box.
Learn more about how Ordinary Least Squares regression works

Illustration


Ordinary Least Squares Regression: predicted values in relation to observed values

Usage
Note: The primary output for this tool is the OLS summary report which is written to the Results
window or optionally written, with additional graphics, to the Output Report File you specify.
Double-clicking the PDF report file in the Results window will open it. Right-clicking on the
Messages entry in the Results window and selecting View will also display the OLS summary
report in a Message dialog box.

The OLS tool also produces an output feature class and optional tables with coefficient information and diagnostics. All of these are
accessible from the Results window. The output feature class is automatically added to the table of contents, with a hot/cold rendering
scheme applied to model residuals. A full explanation of each output is provided in Interpreting_OLS_results.

Note: If this tool is part of a custom model tool, the optional tables will only appear in the Results
window if they are set as model parameters prior to running the tool.

 Results from OLS regression are only trustworthy if your data and regression model satisfy all of the assumptions inherently required
by this method. Consult the table Common Regression Problems, Consequences, and Solutions in Regression Analysis Basics to
ensure your model is properly specified.
 Dependent and Explanatory variables should be numeric fields containing a variety of values. OLS cannot solve when variables have
all the same value (all the values for a field are 9.0, for example). Linear regression methods, like OLS, are not appropriate for
predicting binary outcomes (for example, all of the values for the dependent variable are either 1 or 0).
 The Unique ID field links model predictions to each feature. Consequently, the Unique ID values must be unique for every feature,
and typically should be a permanent field that remains with the feature class. If you don't have a Unique ID field, you can easily
create one by adding a new integer field to your feature class table and calculating the field values to be equal to the FID/OID field.
You cannot use the FID/OID field directly for the Unique ID parameter.
 Whenever there is statistically significant spatial autocorrelation of the regression residuals the OLS model will be considered
misspecified and, consequently, results from OLS regression are unreliable. Be sure to run the Spatial Autocorrelation tool on your
regression residuals to assess this potential problem. Statistically significant spatial autocorrelation of regression residuals almost
always indicates one or more key explanatory variables are missing from the model.
 You should visually inspect the over- and underpredictions evident in your regression residuals to see if they provide clues about
potential missing variables from your regression model. It sometimes helps to run Hot Spot Analysis on the residuals to help you
visualize spatial clustering of the over- and underpredictions (see the sketch at the end of this section).
 When misspecification is the result of trying to model nonstationary variables using a global model (OLS is a global model), then
Geographically Weighted Regression may be used to improve predictions and to better understand the nonstationarity (regional
variation) inherent in your explanatory variables.
 When the result of a computation is infinity or undefined, the output for nonshapefiles will be Null; for shapefiles the output will be -
DBL_MAX (-1.7976931348623158e+308, for example).
 Model summary diagnostics are written to the OLS summary report and the optional diagnostic output table. Both include diagnostics
for the corrected Akaike Information Criterion (AICc), Coefficient of Determination, Joint F statistic, Wald statistic, Koenker's Breusch-
Pagan statistic, and the Jarque-Bera statistic. The diagnostic table also includes uncorrected AIC and Sigma-squared values.
 The optional coefficient and/or diagnostic output tables, if they already exist, will be overwritten when the Geoprocessing Option to
overwrite the outputs of geoprocessing operations is checked ON.
 This tool will optionally create a PDF report summarizing results. PDF files do not automatically appear in the Catalog window. If you
want PDF files to be displayed in Catalog, open the ArcCatalog application, select the Customize menu option, click ArcCatalog
Options, and select the File Types tab. Click on the New Type button and specify PDF, as shown below, for File Extension.


 On machines configured with the ArcGIS language packages for Chinese or Japanese, you might notice missing text or formatting
problems in the PDF Output Report File. These problems can be corrected by changing the font settings.
 Map layers can be used to define the Input Feature Class. When using a layer with a selection, only the selected features are included
in the analysis.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.
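The sketch below, referenced in the usage notes above, runs Hot Spot Analysis on OLS residuals to look for spatial clustering of over- and underpredictions. The olsResults.shp output and its Residual field follow the examples in this topic; substitute your own OLS output.

import arcpy
arcpy.env.workspace = "c:/data"

# Cluster analysis of the OLS residuals: statistically significant hot and cold
# spots of residuals suggest a key explanatory variable is missing from the model
arcpy.HotSpots_stats("olsResults.shp", "Residual", "olsResidHotSpots.shp",
                     "FIXED_DISTANCE_BAND", "EUCLIDEAN_DISTANCE", "NONE")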

Syntax
OrdinaryLeastSquares_stats (Input_Feature_Class, Unique_ID_Field, Output_Feature_Class, Dependent_Variable, Explanatory_Variables,
{Coefficient_Output_Table}, {Diagnostic_Output_Table}, {Output_Report_File})

Parameter Explanation Data Type


Input_Feature_Class Feature Layer
The feature class containing the dependent and independent variables for
analysis.
Unique_ID_Field An integer field containing a different value for every feature in the Input Field
Feature Class.
Output_Feature_Class The output feature class to receive dependent variable estimates and Feature Class
residuals.
Dependent_Variable Field
The numeric field containing values for what you are trying to model.
Explanatory_Variables Field
A list of fields representing explanatory variables in your regression model.
[Explanatory_Variables,...]
Coefficient_Output_Table The full path to an optional table that will receive model coefficients, Table
(Optional) standardized coefficients, standard errors, and probabilities for each
explanatory variable.
Diagnostic_Output_Table Table
The full path to an optional table that will receive model summary
(Optional) diagnostics.
Output_Report_File File
The path to the optional PDF file you want the tool to create. This report file
(Optional) includes model diagnostics, graphs, and notes to help you interpret the OLS
results.

Code Sample
OrdinaryLeastSquares example 1 (Python window)
The following Python window script demonstrates how to use the OrdinaryLeastSquares tool.

import arcpy
arcpy.env.workspace = r"c:\data"
arcpy.OrdinaryLeastSquares_stats("USCounties.shp", "MYID","olsResults.shp",
"GROWTH","LOGPCR69;SOUTH;LPCR_SOUTH;PopDen69",
"olsCoefTab.dbf","olsDiagTab.dbf")

OrdinaryLeastSquares example 2 (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the OrdinaryLeastSquares tool.


# Analyze the growth of regional per capita incomes in US


# Counties from 1969 -- 2002 using Ordinary Least Squares Regression

# Import system modules


import arcpy

# Set the geoprocessor object property to overwrite existing outputs


arcpy.gp.overwriteOutput = True

# Local variables...
workspace = r"C:\Data"

try:
# Set the current workspace (to avoid having to specify the full path to the feature classes each time)
arcpy.env.workspace = workspace

# Growth as a function of {log of starting income, dummy for South


# counties, interaction term for South counties, population density}
# Process: Ordinary Least Squares...
ols = arcpy.OrdinaryLeastSquares_stats("USCounties.shp", "MYID",
"olsResults.shp", "GROWTH",
"LOGPCR69;SOUTH;LPCR_SOUTH;PopDen69",
"olsCoefTab.dbf",
"olsDiagTab.dbf")

# Create Spatial Weights Matrix (Can be based off input or output FC)
# Process: Generate Spatial Weights Matrix...
swm = arcpy.GenerateSpatialWeightsMatrix_stats("USCounties.shp", "MYID",
"euclidean6Neighs.swm",
"K_NEAREST_NEIGHBORS",
"#", "#", "#", 6)

# Calculate Moran's Index of Spatial Autocorrelation for


# OLS Residuals using a SWM File.
# Process: Spatial Autocorrelation (Morans I)...
moransI = arcpy.SpatialAutocorrelation_stats("olsResults.shp", "Residual",
"NO_REPORT", "GET_SPATIAL_WEIGHTS_FROM_FILE",
"EUCLIDEAN_DISTANCE", "NONE", "#",
"euclidean6Neighs.swm")

except:
# If an error occurred when running the tool, print out the error message.
print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Qualified_field_names,
Output_has_M_values, M_resolution, M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance,
XY_resolution, XY_tolerance

Related Topics
An overview of the Modeling Spatial Relationships toolset
Regression analysis basics
Interpreting OLS results
Geographically Weighted Regression (GWR)
Spatial Autocorrelation (Global Moran's I)
Hot Spot Analysis (Getis-Ord Gi*)
What is a z-score? What is a p-value?
How OLS regression works
What they don't tell you about regression analysis
Exploratory Regression
Interpreting Exploratory Regression results
How Exploratory Regression works

Copyright © 1995-2014 Esri. All rights reserved.

Regression analysis basics

Locate topic

The Spatial Statistics toolbox provides effective tools for quantifying spatial patterns. Using the Hot Spot Analysis tool, for example, you can
ask questions like these:
 Are there places in the United States where people are persistently dying young?
 Where are the hot spots for crime, 911 emergency calls (see graphic below), or fires?
 Where do we find a higher than expected proportion of traffic accidents in a city?


Analysis of 911 emergency call data showing call hot spots (red), call cold spots (blue), and locations of the fire/police units
responsible for responding (green crosses)

Each of the questions above asks "where?" The next logical question for the types of analyses above involves "why?"
 Why are there places in the United States where people persistently die young? What might be causing this?
 Can we model the characteristics of places that experience a lot of crime, 911 calls, or fire events to help reduce these incidents?
 What are the factors contributing to higher than expected traffic accidents? Are there policy implications or mitigating actions that might
reduce traffic accidents across the city and/or in particular high accident areas?
Tools in the Modeling Spatial Relationships toolset help you answer this second set of why questions. These tools include Ordinary Least
Squares (OLS) regression and Geographically Weighted Regression.

Spatial relationships
Regression analysis allows you to model, examine, and explore spatial relationships and can help explain the factors behind observed
spatial patterns. You may want to understand why people are persistently dying young in certain regions of the country or what factors
contribute to higher than expected rates of diabetes. By modeling spatial relationships, however, regression analysis can also be used for
prediction. Modeling the factors that contribute to college graduation rates, for example, enables you to make predictions about upcoming
workforce skills and resources. You might also use regression to predict rainfall or air quality in cases where interpolation is insufficient due
to a scarcity of monitoring stations (for example, rain gauges are often lacking along mountain ridges and in valleys).
OLS is the best known of all regression techniques. It is also the proper starting point for all spatial regression analyses. It provides a
global model of the variable or process you are trying to understand or predict (early death/rainfall); it creates a single regression equation
to represent that process. Geographically weighted regression (GWR) is one of several spatial regression techniques, increasingly used in
geography and other disciplines. GWR provides a local model of the variable or process you are trying to understand/predict by fitting a
regression equation to every feature in the dataset. When used properly, these methods provide powerful and reliable statistics for
examining and estimating linear relationships.
Linear relationships are either positive or negative. If you find that the number of search and rescue events increases when daytime
temperatures rise, the relationship is said to be positive; there is a positive correlation. Another way to express this positive relationship is
to say that search and rescue events decrease as daytime temperatures decrease. Conversely, if you find that the number of crimes goes
down as the number of police officers patrolling an area goes up, the relationship is said to be negative. You can also express this negative
relationship by stating that the number of crimes increases as the number of patrolling officers decreases. The graphic below depicts both
positive and negative relationships, as well as the case where there is no relationship between two variables:

Scatterplots: a positive relationship, a negative relationship, and a case where two variables are unrelated

Correlation analyses, and their associated graphics depicted above, test the strength of the relationship between two variables. Regression
analyses, on the other hand, make a stronger claim: they attempt to demonstrate the degree to which one or more variables potentially
promote positive or negative change in another variable.

Regression analysis applications


Regression analysis can be used for a large variety of applications:
 Modeling high school retention rates to better understand the factors that help keep kids in school.
 Modeling traffic accidents as a function of speed, road conditions, weather, and so forth, to inform policy aimed at decreasing accidents.
 Modeling property loss from fire as a function of variables such as degree of fire department involvement, response time, or property
values. If you find that response time is the key factor, you might need to build more fire stations. If you find that involvement is the
key factor, you may need to increase equipment and the number of officers dispatched.


There are three primary reasons you might want to use regression analysis:
 To model some phenomenon to better understand it and possibly use that understanding to effect policy or make decisions about
appropriate actions to take. The basic objective is to measure the extent that changes in one or more variables jointly affect changes in
another. Example: Understand the key characteristics of the habitat for some particular endangered species of bird (perhaps
precipitation, food sources, vegetation, predators) to assist in designing legislation aimed at protecting that species.
 To model some phenomenon to predict values at other places or other times. The basic objective is to build a prediction model that is
both consistent and accurate. Example: Given population growth projections and typical weather conditions, what will the demand for
electricity be next year?
 You can also use regression analysis to explore hypotheses. Suppose you are modeling residential crime to better understand it and
hopefully implement policy that might prevent it. As you begin your analysis, you probably have questions or hypotheses that you want
to examine:
 "Broken window theory" indicates that defacement of public property (graffiti, damaged structures, and so on) invite other crimes.
Will there be a positive relationship between vandalism incidents and residential burglary?
 Is there a relationship between illegal drug use and burglary (might drug addicts steal to support their habits)?
 Are burglars predatory? Might there be more incidents in residential neighborhoods with higher proportions of elderly or female-
headed households?
 Are persons at greater risk for burglary if they live in a rich or a poor neighborhood?
You can use regression analysis to explore these relationships and answer your questions.

Regression analysis terms and concepts


It is impossible to discuss regression analysis without first becoming familiar with a few terms and basic concepts specific to regression
statistics:
Regression equation: This is the mathematical formula applied to the explanatory variables to best predict the dependent variable you
are trying to model. Unfortunately for those in the geosciences who think of x and y as coordinates, the notation in regression equations for
the dependent variable is always y and for the independent or explanatory variables is always X. Each independent variable is associated
with a regression coefficient describing the strength and the sign of that variable's relationship to the dependent variable. A regression
equation might look like this (y is the dependent variable, the Xs are the explanatory variables, and the βs are regression coefficients; each
of these components of the regression equation are explained further below):

Elements of an OLS regression equation
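For reference, the generic form of the equation described by that illustration and by the bullets below is the standard OLS expression (not copied from the graphic):

y = β0 + β1X1 + β2X2 + ... + βnXn + ε

where ε is the random error term (the residual), the portion of the dependent variable left unexplained by the model.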

 Dependent variable (y): This is the variable representing the process you are trying to predict or understand (residential burglary,
foreclosure, rainfall). In the regression equation, it appears on the left side of the equal sign. While you can use regression to predict
the dependent variable, you always start with a set of known y-values and use these to build (or to calibrate) the regression model. The
known y-values are often referred to as observed values.
 Independent/Explanatory variables (X): These are the variables used to model or to predict the dependent variable values. In the
regression equation, they appear on the right side of the equal sign and are often referred to as explanatory variables. The dependent
variable is a function of the explanatory variables. If you are interested in predicting annual purchases for a proposed store, you might
include in your model explanatory variables representing the number of potential customers, distance to competition, store visibility,
and local spending patterns, for example.
 Regression coefficients (β): Coefficients are computed by the regression tool. They are values, one for each explanatory variable,
that represent the strength and the type of relationship the explanatory variable has to the dependent variable. Suppose you are
modeling fire frequency as a function of solar radiation, vegetation, precipitation, and aspect. You might expect a positive relationship
between fire frequency and solar radiation (in other words, the more sun, the more frequent the fire incidents). When the relationship is
positive, the sign for the associated coefficient is also positive. You might expect a negative relationship between fire frequency and
precipitation (in other words, places with more rain have fewer fires). Coefficients for negative relationships have negative signs. When
the relationship is a strong one, the coefficient is relatively large (relative to the units of the explanatory variable it is associated with).
Weak relationships are associated with coefficients near zero. β0 is the regression intercept. It represents the expected value for the
dependent variable if all the independent (explanatory) variables are zero.
P-values: Most regression methods perform a statistical test to compute a probability, called a p-value, for the coefficients associated with
each independent variable. The null hypothesis for this statistical test states that a coefficient is not significantly different from zero (in
other words, for all intents and purposes, the coefficient is zero and the associated explanatory variable is not helping your model). Small
p-values reflect small probabilities and suggest that the coefficient is, indeed, important to your model with a value that is significantly
different from zero (in other words, a small p-value indicates the coefficient is not zero). You would say that a coefficient with a p-value of
0.01, for example, is statistically significant at the 99 percent confidence level; the associated variable is an effective predictor. Variables
with coefficients near zero do not help predict or model the dependent variable; they are almost always removed from the regression
equation, unless there are strong theoretical reasons to keep them.


R²/R-squared: Multiple R-squared and adjusted R-squared are both statistics derived from the regression equation to quantify model
performance. R-squared values range from 0 to 1 (often reported as 0 to 100 percent). If your model fits the observed dependent variable values perfectly, R-
squared is 1.0 (and you, no doubt, have made an error; perhaps you've used a form of y to predict y). More likely, you will see R-squared
values like 0.49, for example, which you can interpret by saying, "This model explains 49 percent of the variation in the dependent
variable". To understand what the R-squared value is getting at, create a bar graph showing both the estimated and observed y-values
sorted by the estimated values. Notice how much overlap there is. This graphic provides a visual representation of how well the model's
predicted values explain the variation in the observed dependent variable values. View an illustration. The adjusted R-squared value is
always a bit lower than the multiple R-squared value because it reflects model complexity (the number of variables) as it relates to the
data. Consequently, the adjusted R-squared value is a more accurate measure of model performance.
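For reference, these two statistics are conventionally computed as follows (standard formulas stated for background, not a claim about the tool's exact implementation):

R-squared = 1 - (residual sum of squares / total sum of squares)
Adjusted R-squared = 1 - (1 - R-squared) × (n - 1) / (n - k - 1)

where n is the number of observations and k is the number of explanatory variables. The adjustment penalizes every additional variable, which is why the adjusted value is the better yardstick when comparing models with different numbers of explanatory variables.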
Residuals: These are the unexplained portion of the dependent variable, represented in the regression equation as the random error
term ε. View an illustration. Known values for the dependent variable are used to build and to calibrate the regression model. Using known
values for the dependent variable (y) and known values for all of the explanatory variables (the Xs), the regression tool constructs an
equation that will predict those known y-values as well as possible. The predicted values will rarely match the observed values exactly,
however. The difference between the observed y-values and the predicted y-values are called the residuals. The magnitude of the residuals
from a regression equation is one measure of model fit. Large residuals indicate poor model fit.
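As a minimal illustration of these ideas (a generic sketch, not part of the toolbox; the observed and predicted values below are made up), residuals and R-squared can be computed from any pair of observed and predicted arrays with a few lines of Python:

import numpy as np

# hypothetical observed (known) and predicted values of the dependent variable
observed = np.array([12.0, 7.5, 20.1, 15.3, 9.8])
predicted = np.array([11.2, 8.1, 18.9, 16.0, 10.5])

# residuals: the unexplained portion (observed minus predicted)
residuals = observed - predicted

# R-squared: the share of variation in the observed values explained by the model
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((observed - observed.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(residuals)
print(round(r_squared, 3))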
Building a regression model is an iterative process that involves finding effective independent variables to explain the dependent variable
you are trying to model or understand, running the regression tool to determine which variables are effective predictors, then repeatedly
removing and/or adding variables until you find the best regression model possible. While the model building process is often exploratory, it
should never be a "fishing expedition". You should identify candidate explanatory variables by consulting theory, experts in the field, and
common sense. You should be able to state and justify the expected relationship between each candidate explanatory variable and the
dependent variable prior to analysis, and should question models where these relationships do not match.

Note: If you've not used regression analysis before, this would be a very good time to download the
Regression Analysis Tutorial and work through steps 1–5.

Regression analysis issues


OLS regression is a straightforward method, has well-developed theory behind it, and has a number of effective diagnostics to assist with
interpretation and troubleshooting. OLS is only effective and reliable, however, if your data and regression model meet/satisfy all the
assumptions inherently required by this method (see the table below). Spatial data often violates the assumptions and requirements of OLS
regression, so it is important to use regression tools in conjunction with appropriate diagnostic tools that can assess whether regression is
an appropriate method for your analysis, given the structure of the data and the model being implemented.
How regression models go bad
A serious violation for many regression models is misspecification. A misspecified model is one that is not complete—it is missing important
explanatory variables, so it does not adequately represent what you are trying to model or trying to predict (the dependent variable, y). In
other words, the regression model is not telling the whole story. Misspecification is evident whenever you see statistically significant spatial
autocorrelation in your regression residuals or, said another way, whenever you notice that the over- and underpredictions (residuals) from
your model tend to cluster spatially so that the overpredictions cluster in some portions of the study area and the underpredictions cluster
in others. Mapping regression residuals or the coefficients from a Geographically Weighted Regression analysis will often provide clues about what you've missed. Running Hot Spot Analysis on regression residuals might also help reveal different spatial regimes that can be modeled in OLS with regional variables or remedied using the geographically weighted regression method. Suppose that when you map your regression residuals you see that the model is always overpredicting in the mountain areas and underpredicting in the valleys; you will likely conclude that your model is missing an elevation variable. There will be times, however, when the missing variables are too complex to model, impossible to quantify, or too difficult to measure. In these cases, you may be able to move to GWR or to another spatial regression method to get a well-specified model.
The following table lists common problems with regression models and the tools available in ArcGIS to help address them:
Each entry below lists the problem, the consequence (why it is a problem), and the solution (what you can do about it):

 Problem: Omitted explanatory variables (misspecification). Consequence: When key explanatory variables are missing from a regression model, coefficients and their associated p-values cannot be trusted. Solution: Map and examine OLS residuals and GWR coefficients, or run Hot Spot Analysis on OLS regression residuals, to see if this provides clues about possible missing variables.

 Problem: Nonlinear relationships. View an illustration. Consequence: OLS and GWR are both linear methods. If the relationship between any of the explanatory variables and the dependent variable is nonlinear, the resultant model will perform poorly. Solution: Create a scatter plot matrix graph to elucidate the relationships among all variables in the model. Pay careful attention to relationships involving the dependent variable. Curvilinearity can often be remedied by transforming the variables. View an illustration. Alternatively, use a nonlinear regression method.

 Problem: Data outliers. View an illustration. Consequence: Influential outliers can pull modeled regression relationships away from their true best fit, biasing regression coefficients. Solution: Create a scatter plot matrix and other graphs (histograms) to examine extreme data values. Correct or remove outliers if they represent errors. When outliers are correct/valid values, they cannot/should not be removed. Run the regression with and without the outliers to see how much they are affecting your results.

 Problem: Nonstationarity. You might find that an income variable, for example, has strong explanatory power in region A but is insignificant or even switches signs in region B. View an illustration. Consequence: If relationships between your dependent and explanatory variables are inconsistent across your study area, computed standard errors will be artificially inflated. Solution: The OLS tool in ArcGIS automatically tests for problems associated with nonstationarity (regional variation) and computes robust standard error values. View an illustration. When the probability associated with the Koenker test is small (< 0.05, for example), you have statistically significant regional variation and should consult the robust probabilities to determine if an explanatory variable is statistically significant or not. Often you will improve model results by using the Geographically Weighted Regression tool.

 Problem: Multicollinearity. One or a combination of explanatory variables is redundant. View an illustration. Consequence: Multicollinearity leads to an overcounting type of bias and an unstable/unreliable model. Solution: The OLS tool in ArcGIS automatically checks for redundancy. Each explanatory variable is given a computed VIF value. When this value is large (> 7.5, for example), redundancy is a problem and the offending variables should be removed from the model or modified by creating an interaction variable or increasing the sample size. View an illustration.

 Problem: Inconsistent variance in residuals. It may be that the model predicts well for small values of the dependent variable but becomes unreliable for large values. View an illustration. Consequence: When the model predicts poorly for some range of values, results will be biased. Solution: The OLS tool in ArcGIS automatically tests for inconsistent residual variance (called heteroscedasticity) and computes standard errors that are robust to this problem. When the probability associated with the Koenker test is small (< 0.05, for example), you should consult the robust probabilities to determine if an explanatory variable is statistically significant or not. View an illustration.

 Problem: Spatially autocorrelated residuals. View an illustration. Consequence: When there is spatial clustering of the under-/overpredictions coming out of the model, it introduces an overcounting type of bias and renders the model unreliable. Solution: Run the Spatial Autocorrelation tool on the residuals to ensure they do not exhibit statistically significant spatial clustering. Statistically significant spatial autocorrelation is almost always a symptom of misspecification (a key variable is missing from the model). View an illustration.

 Problem: Normal distribution bias. View an illustration. Consequence: When the regression model residuals are not normally distributed with a mean of zero, the p-values associated with the coefficients are unreliable. Solution: The OLS tool in ArcGIS automatically tests whether the residuals are normally distributed. When the Jarque-Bera statistic is significant (< 0.05, for example), your model is likely misspecified (a key variable is missing from the model) or some of the relationships you are modeling are nonlinear. Examine the output residual map and perhaps GWR coefficient maps to see if this exercise reveals the key variables missing from the analysis. View scatterplot matrix graphs and look for nonlinear relationships.

Common regression problems, consequences, and solutions

It is important to test for each of the problems listed above. Results can be 100 percent wrong (180 degrees different) if these problems are ignored.

Note: If you've not used regression analysis before, this would be a very good time to download and
work through the Regression Analysis Tutorial.

Spatial regression
Spatial data exhibits two properties that make it difficult (but not impossible) to meet the assumptions and requirements of traditional
(nonspatial) statistical methods, like OLS regression:
 Geographic features are more often than not spatially autocorrelated; this means that features near each other tend to be more
similar than features that are farther away. This creates an overcount type of bias for traditional (nonspatial) regression methods.
 Geography is important, and often the processes most important to what you are modeling are nonstationary; these processes behave
differently in different parts of the study area. This characteristic of spatial data can be referred to as regional variation or
nonstationarity.
True spatial regression methods were developed to robustly manage these two characteristics of spatial data and even to incorporate these
special qualities of spatial data to improve their ability to model data relationships. Some spatial regression methods deal effectively with
the first characteristic (spatial autocorrelation), others deal effectively with the second (nonstationarity). At present, no spatial regression
methods are effective for both characteristics. For a properly specified GWR model, however, spatial autocorrelation is typically not a
problem.
Spatial autocorrelation
Traditional statisticians and spatial statisticians view spatial autocorrelation quite differently. The traditional statistician sees it as a bad thing that needs to be removed from the data (through resampling, for
example) because spatial autocorrelation violates underlying assumptions of many traditional (nonspatial) statistical methods. For the
geographer or GIS analyst, however, spatial autocorrelation is evidence of important underlying spatial processes at work; it is an integral
component of the data. Removing space removes data from its spatial context; it is like getting only half the story. The spatial processes
and spatial relationships evident in the data are a primary interest and one of the reasons GIS users get so excited about spatial data
analysis. To avoid an overcounting type of bias in your model, however, you must identify the full set of explanatory variables that will
effectively capture the inherent spatial structure in your dependent variable. If you cannot identify all of these variables, you will very likely
see statistically significant spatial autocorrelation in the model residuals. Unfortunately, you cannot trust your regression results until this is
remedied. Use the Spatial Autocorrelation tool to test for statistically significant spatial autocorrelation in your regression residuals.
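As a sketch of that test in Python (the workspace, feature class, and field names below are hypothetical, and the positional parameters follow the Spatial Autocorrelation (Global Moran's I) tool's documented order for ArcGIS 10.x, so verify them against your installation's tool help):

import arcpy

arcpy.env.workspace = r"C:\data\analysis.gdb"  # hypothetical workspace

# Run Global Moran's I on the residual field written to the OLS output feature class
result = arcpy.SpatialAutocorrelation_stats(
    "ols_results",         # hypothetical OLS output feature class
    "Residual",            # hypothetical name of the residual field
    "GENERATE_REPORT",     # create the graphical report
    "INVERSE_DISTANCE",    # conceptualization of spatial relationships
    "EUCLIDEAN_DISTANCE",  # distance method
    "ROW")                 # row standardization

# A statistically significant z-score/p-value in the messages indicates clustered
# residuals, which is almost always a symptom of a missing explanatory variable.
print(result.getMessages())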
There are at least three strategies for dealing with spatial autocorrelation in regression model residuals:
1. Resample until the input variables no longer exhibit statistically significant spatial autocorrelation. While this does not ensure the
analysis is free of spatial autocorrelation problems, they are far less likely when spatial autocorrelation is removed from the
dependent and explanatory variables. This is the traditional statistician's approach to dealing with spatial autocorrelation and is
only appropriate if spatial autocorrelation is the result of data redundancy (the sampling scheme is too fine).
2. Isolate the spatial and nonspatial components of each input variable using a spatial filtering regression method. Space is removed
from each variable, but then it is put back into the regression model as a new variable to account for spatial effects/spatial
structure. ArcGIS currently does not provide spatial filtering regression methods.
3. Incorporate spatial autocorrelation into the regression model using spatial econometric regression methods. Spatial econometric
regression methods will be added to ArcGIS in a future release.

Regional variation
Global models, like OLS regression, create equations that best describe the overall data relationships in a study area. When those
relationships are consistent across the study area, the OLS regression equation models those relationships well. When those relationships
behave differently in different parts of the study area, however, the regression equation is more of an average of the mix of relationships
present, and in the case where those relationships represent two extremes, the global average will not model either extreme well. When
your explanatory variables exhibit nonstationary relationships (regional variation), global models tend to fall apart unless robust methods
are used to compute regression results. Ideally, you will be able to identify a full set of explanatory variables to capture the regional
variation inherent in your dependent variable. If you cannot identify all of these spatial variables, however, you will again notice statistically
significant spatial autocorrelation in your model residuals and/or lower than expected R-squared values. Unfortunately, you cannot trust
your regression results until this is remedied.
There are at least four ways to deal with regional variation in OLS regression models:
1. Include a variable in the model that explains the regional variation. If you see that your model is always overpredicting in the north and underpredicting in the south, for example, add a regional variable set to 1 for northern features and set to 0 for southern features (a sketch of this approach follows this list).
2. Use methods that incorporate regional variation into the regression model such as geographically weighted regression.
3. Consult robust regression standard errors and probabilities to determine if variable coefficients are statistically significant. See
Interpreting OLS regression results. Geographically weighted regression is still recommended.
4. Redefine/Reduce the size of the study area so that the processes within it are all stationary (so they no longer exhibit regional
variation).
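As referenced in strategy 1 above, a simple regional dummy variable can be added with a short script; the feature class, field names, and the dividing-line coordinate below are all hypothetical:

import arcpy

fc = r"C:\data\analysis.gdb\census_blocks"   # hypothetical input feature class

# Add an integer field to hold the regional dummy variable
arcpy.AddField_management(fc, "NORTH", "SHORT")

# Flag features whose centroid lies north of a hypothetical dividing line with 1,
# and all other features with 0
with arcpy.da.UpdateCursor(fc, ["SHAPE@", "NORTH"]) as cursor:
    for shape, _ in cursor:
        cursor.updateRow([shape, 1 if shape.centroid.Y > 4000000 else 0])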

For more information about using the regression tools, see the following:
Learn more about OLS regression
Learn more about GWR regression
Interpreting OLS regression results
Interpreting GWR regression results

Related Topics
Interpreting OLS results
Interpreting GWR results
What they don't tell you about regression analysis
Interpreting Exploratory Regression results
How Exploratory Regression works

Copyright © 1995-2014 Esri. All rights reserved.

How OLS regression works


Regression analysis is probably the most commonly used statistical technique in the social sciences. Regression is used to evaluate relationships between
two or more feature attributes. Identifying and measuring relationships lets you better understand what's going on in a place, predict where
something is likely to occur, or begin to examine causes of why things occur where they do.
Ordinary Least Squares (OLS) is the best known of all regression techniques. It is also the proper starting point for all spatial regression
analyses. It provides a global model of the variable or process you are trying to understand or predict; it creates a single regression equation
to represent that process.
There are a number of good resources to help you learn more about both OLS regression and Geographically Weighted Regression. Start by
reading the Regression Analysis Basics documentation and/or watching the free one-hour ESRI Virtual Campus Regression Analysis Basics Web
seminar. Next, work through a Regression Analysis tutorial. Once you begin creating your own regression models, you may want to refer to the
Interpreting OLS Regression Results documentation to help you understand OLS output and diagnostics.

Additional resources
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.
Wooldridge, J. M. Introductory Econometrics: A Modern Approach. South-Western, Mason, Ohio, 2003.
Hamilton, Lawrence C. Regression with Graphics. Brooks/Cole, 1992.

Copyright © 1995-2014 Esri. All rights reserved.

Interpreting OLS results


Output generated from the OLS Regression tool includes the following:
 Output feature class


Map of OLS Residuals

 Message window report of statistical results

OLS Statistical Report

 Optional PDF report file

 Optional table of explanatory variable coefficients

Coefficient table of OLS Model Results

 Optional table of regression diagnostics

OLS Model Diagnostics Table

Each of these outputs is shown and described below as a series of steps for running OLS regression and interpreting OLS results.
(A) To run the OLS tool, provide an Input Feature Class with a Unique ID Field, the Dependent Variable you want to model/explain/predict,
and a list of Explanatory Variables. You will also need to provide a path for the Output Feature Class and, optionally, paths for the Output
Report File, Coefficient Output Table, and Diagnostic Output Table.


Ordinary Least Squares tool dialog box
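The same run can also be scripted. In the sketch below everything (workspace, dataset, and field names) is hypothetical, and only the tool's required parameters are shown; the optional report and table outputs can be supplied as additional arguments in the order given in your installation's tool help:

import arcpy

arcpy.env.workspace = r"C:\data\analysis.gdb"   # hypothetical workspace

# Required parameters: input features, unique ID field, output feature class,
# dependent variable, and a semicolon-separated list of explanatory variables
arcpy.OrdinaryLeastSquares_stats(
    "census_blocks",             # hypothetical input feature class
    "BLOCK_ID",                  # hypothetical unique ID field
    "ols_results",               # output feature class of residuals
    "BURGLARY",                  # hypothetical dependent variable
    "POP;RENTERS;MED_INCOME")    # hypothetical explanatory variables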

After OLS runs, the first thing you will want to check is the OLS summary report, which is written to the Results window. Right-clicking the
Messages entry in the Results window and selecting View will display the summary in a Message dialog box. If you execute the OLS tool in the
foreground, the summary will also be displayed in the progress dialog box.
(B) Examine the summary report using the numbered steps described below:

Components of the OLS Statistical Report

Dissecting the Statistical Report


1. Assess model performance. Both the Multiple R-Squared and Adjusted R-Squared values are measures of model performance.
Possible values range from 0.0 to 1.0. The Adjusted R-Squared value is always a bit lower than the Multiple R-Squared value,
because it reflects model complexity (the number of variables) as it relates to the data and is consequently a more accurate
measure of model performance. Adding an additional explanatory variable to the model will likely increase the Multiple R-Squared
value but may decrease the Adjusted R-Squared value. Suppose you are creating a regression model of residential burglary (the
number of residential burglaries associated with each census block is your dependent variable, y). An Adjusted R-Squared value
of 0.39 would indicate that your model (your explanatory variables modeled using linear regression) explains approximately 39
percent of the variation in the dependent variable. Said another way, your model tells approximately 39 percent of the residential
burglary story.

R-Squared Values Quantify Model Performance


2. Assess each explanatory variable in the model: Coefficient, Probability or Robust Probability, and Variance Inflation Factor (VIF).
The coefficient for each explanatory variable reflects both the strength and type of relationship the explanatory variable has to the
dependent variable. When the sign associated with the coefficient is negative, the relationship is negative (for example, the larger
the distance from the urban core, the smaller the number of residential burglaries). When the sign is positive, the relationship is
positive (for example, the larger the population, the larger the number of residential burglaries). Coefficients are given in the
same units as their associated explanatory variables (a coefficient of 0.005 associated with a variable representing population
counts may be interpreted as 0.005 people). The coefficient reflects the expected change in the dependent variable for every 1
unit change in the associated explanatory variable, holding all other variables constant (for example, a 0.005 increase in
residential burglary is expected for each additional person in the census block, holding all other explanatory variables constant).
The T test is used to assess whether or not an explanatory variable is statistically significant. The null hypothesis is that the
coefficient is, for all intents and purposes, equal to zero (and consequently is not helping the model). When the probability or
robust probability (p-value) is very small, the chance of the coefficient being essentially zero is also small. If the Koenker test
(see below) is statistically significant, use the robust probabilities to assess explanatory variable statistical significance.
Statistically significant probabilities have an asterisk (*) next to them. An explanatory variable associated with a statistically
significant coefficient is important to the regression model if theory/common sense supports a valid relationship with the
dependent variable, if the relationship being modeled is primarily linear, and if the variable is not redundant to any other
explanatory variables in the model. The VIF measures redundancy among explanatory variables. As a rule of thumb, explanatory
variables associated with VIF values larger than about 7.5 should be removed (one by one) from the regression model. If, for
example, you have a population variable (the number of people) and an employment variable (the number of employed persons)
in your regression model, you will likely find them to be associated with large VIF values indicating that both of these variables
are telling the same story; one of them should be removed from your model.

Assess which variables are statistically significant.

3. Assess model significance. Both the Joint F-Statistic and Joint Wald Statistic are measures of overall model statistical
significance. The Joint F-Statistic is trustworthy only when the Koenker (BP) statistic (see below) is not statistically significant. If
the Koenker (BP) statistic is significant, you should consult the Joint Wald Statistic to determine overall model significance. The
null hypothesis for both of these tests is that the explanatory variables in the model are not effective. For a 95 percent confidence
level, a p-value (probability) smaller than 0.05 indicates a statistically significant model.

Assess the overall statistical significance of the regression model.

4. Assess Stationarity. The Koenker (BP) Statistic (Koenker's studentized Breusch-Pagan statistic) is a test to determine whether
the explanatory variables in the model have a consistent relationship to the dependent variable both in geographic space and in
data space. When the model is consistent in geographic space, the spatial processes represented by the explanatory variables
behave the same everywhere in the study area (the processes are stationary). When the model is consistent in data space, the
variation in the relationship between predicted values and each explanatory variable does not change with changes in explanatory
variable magnitudes (there is no heteroscedasticity in the model). Suppose you want to predict crime, and one of your
explanatory variables is income. The model would have problematic heteroscedasticity if the predictions were more accurate for
locations with small median incomes than they were for locations with large median incomes. The null hypothesis for this test is
that the model is stationary. For a 95 percent confidence level, a p-value (probability) smaller than 0.05 indicates statistically
significant heteroscedasticity and/or nonstationarity. When results from this test are statistically significant, consult the robust
coefficient standard errors and probabilities to assess the effectiveness of each explanatory variable. Regression models with
statistically significant nonstationarity are often good candidates for Geographically Weighted Regression (GWR) analysis.


Assess stationarity: if the Koenker test is statistically significant (*), consult the robust probabilities to determine whether
explanatory variable coefficients are significant or not.

5. Assess model bias. The Jarque-Bera statistic indicates whether or not the residuals (the observed/known dependent variable
values minus the predicted/estimated values) are normally distributed. The null hypothesis for this test is that the residuals are
normally distributed, so if you were to construct a histogram of those residuals, they would resemble the classic bell curve, or
Gaussian distribution. When the p-value (probability) for this test is small (smaller than 0.05 for a 95 percent confidence level, for
example), the residuals are not normally distributed, indicating your model is biased. If you also have statistically significant
spatial autocorrelation of your residuals (see below), the bias may be the result of model misspecification (a key variable is
missing from the model). Results from a misspecified OLS model are not trustworthy. A statistically significant Jarque-Bera test
can also occur if you are trying to model nonlinear relationships, if your data include influential outliers, or when there is strong
heteroscedasticity.

Assess model bias.

6. Assess residual spatial autocorrelation. Always run the Spatial Autocorrelation (Moran's I) tool on the regression residuals to
ensure that they are spatially random. Statistically significant clustering of high and/or low residuals (model under- and
overpredictions) indicates a key variable is missing from the model (misspecification). OLS results cannot be trusted when the
model is misspecified.

Use the Spatial Autocorrelation tool to ensure that model residuals are not spatially autocorrelated.

7. Finally, review the section titled How Regression Models Go Bad in the Regression Analysis Basics document as a check that your
OLS regression model is properly specified. If you are having trouble finding a properly specified regression model, the
Exploratory Regression tool can be very helpful. The Notes on Interpretation at the end of the OLS summary report are there
to help you remember the purpose of each statistical test and to guide you toward a solution when your model fails one or more
of the diagnostics.

The OLS report includes Notes to help you interpret diagnostic output.

(C) If you provide a path for the optional Output Report File, a PDF will be created that contains all of the information in the summary
report plus additional graphics to help you assess your model. The first page of the report provides information about each explanatory
variable. Similar to the first section of the summary report (see number 2 above) you would use the information here to determine if the
coefficients for each explanatory variable are statistically significant and have the expected sign (+/-). If the Koenker test is statistically
significant (see number 4 above), you can only trust the robust probabilities to decide if a variable is helping your model or not. Statistically
significant coefficients will have an asterisk next to their p-values for the probabilities and/or robust probabilities columns. You can also tell
from the information on this page of the report whether any of your explanatory variables are redundant (exhibit problematic
multicollinearity). Unless theory dictates otherwise, explanatory variables with elevated Variance Inflation Factor (VIF) values should be
removed one by one until the VIF values for all remaining explanatory variables are below 7.5.

The next section in the Output Report File lists results from the OLS diagnostic checks. This page also includes Notes on Interpretation
describing why each check is important. If your model fails one of these diagnostics, refer to the table of common regression problems
outlining the severity of each problem and suggesting potential remediation. The graphs on the remaining pages of the report will also help
you identify and remedy problems with your model.

The third section of the Output Report File includes histograms showing the distribution of each variable in your model, and scatterplots
showing the relationship between the dependent variable and each explanatory variable. If you are having trouble with model bias
(indicated by a statistically significant Jarque-Bera p-value), look for skewed distributions among the histograms, and try transforming
these variables to see if this eliminates bias and improves model performance. The scatterplots show you which variables are your best
predictors. Use these scatterplots to also check for nonlinear relationships among your variables. In some cases, transforming one or more
of the variables will fix nonlinear relationships and eliminate model bias. Outliers in the data can also result in a biased model. Check both
the histograms and the scatterplots for these data values and/or data relationships. Try running the model with and without an outlier to
see how much it is impacting your results. You may discover that the outlier is invalid data (entered or recorded in error) and be able to
remove the associated feature from your dataset. If the outlier reflects valid data and is having a very strong impact on the results of your
analysis, you may decide to report your results both with and without the outlier(s).


When you have a properly specified model, the over- and underpredictions will reflect random noise. If you were to create a histogram of
random noise, it would be normally distributed (think bell curve). The fourth section of the Output Report File presents a histogram of the
model over- and underpredictions. The bars of the histogram show the actual distribution, and the blue line superimposed on top of the
histogram shows the shape the histogram would take if your residuals were, in fact, normally distributed. Perfection is unlikely, so you will
want to check the Jarque-Bera test to determine if deviation from a normal distribution is statistically significant or not.

The Koenker diagnostic tells you if the relationships you are modeling either change across the study area (nonstationarity) or vary in
relation to the magnitude of the variable you are trying to predict (heteroscedasticity). Geographically Weighted Regression will resolve
issues with nonstationarity; the graph in section 5 of the Output Report File will show you if you have a problem with heteroscedasticity.
This scatterplot graph (shown below) charts the relationship between model residuals and predicted values. Suppose you are modeling
crime rates. If the graph reveals a cone shape with the point on the left and the widest spread on the right of the graph, it indicates your
model is predicting well in locations with low rates of crime, but not doing well in locations with high rates of crime.

The last page of the report records all of the parameter settings that were used when the report was created.
(D) Examine the model residuals found in the Output Feature Class. Over- and underpredictions for a properly specified regression model
will be randomly distributed. Clustering of over- and/or underpredictions is evidence that you are missing at least one key explanatory
variable. Examine the patterns in your model residuals to see if they provide clues about what those missing variables might be. Sometimes
running Hot Spot Analysis on regression residuals helps you identify broader patterns. Additional strategies for dealing with an improperly
specified model are outlined in: What they don't tell you about regression analysis.


OLS Output: Mapped Residuals

(E) View the coefficient and diagnostic tables. Creating the coefficient and diagnostic tables is optional. While you are in the process of
finding an effective model, you may elect not to create these tables. The model-building process is iterative, and you will likely try a large
number of different models (different explanatory variables) until you settle on a few good ones. You can use the Corrected Akaike
Information Criterion (AICc) on the report to compare different models. The model with the smaller AICc value is the better model (that
is, taking into account model complexity, the model with the smaller AICc provides a better fit with the observed data).

You may use the AICc value to compare regression models.
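For background, AICc is the small-sample correction of the Akaike Information Criterion. A standard statement of the correction (offered here for context, not as the tool's exact formula) is AICc = AIC + 2k(k + 1) / (n - k - 1), where k is the number of model parameters and n is the number of observations. The correction term shrinks toward zero as the sample grows, so AICc and AIC converge for large n; either way, lower values indicate a better trade-off between fit and complexity.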

Creating the coefficient and diagnostic tables for your final OLS models captures important elements of the OLS report. The coefficient table
includes the list of explanatory variables used in the model with their coefficients, standardized coefficients, standard errors, and
probabilities. The coefficient is an estimate of how much the dependent variable would change given a 1 unit change in the associated
explanatory variable. The units for the coefficients match those of the explanatory variables. If, for example, you have an explanatory variable for
total population, the coefficient units for that variable reflect people; if another explanatory variable is distance (meters) from the train
station, the coefficient units reflect meters. When the coefficients are converted to standard deviations, they are called standardized
coefficients. You can use standardized coefficients to compare the effect diverse explanatory variables have on the dependent variable. The
explanatory variable with the largest standardized coefficient after you strip off the +/- sign (take the absolute value) has the largest effect
on the dependent variable. Interpretations of coefficients, however, can only be made in light of the standard error. Standard errors
indicate how likely you are to get the same coefficients if you could resample your data and recalibrate your model an infinite number of
times. Large standard errors for a coefficient mean the resampling process would result in a wide range of possible coefficient values; small
standard errors indicate the coefficient would be fairly consistent.
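As a point of reference (a general definition rather than a statement of the tool's internal computation), a standardized coefficient is typically obtained by rescaling the raw coefficient by a ratio of standard deviations: standardized coefficient = coefficient × (standard deviation of the explanatory variable / standard deviation of the dependent variable). This rescaling is what puts explanatory variables measured in very different units on a comparable footing.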

The coefficient table includes computed coefficients, standard errors, and variable probabilities.

The diagnostic table includes results for each diagnostic test, along with guidelines for how to interpret those results.

The diagnostic table includes notes for interpreting model diagnostic test results.

Additional resources
There are a number of good resources to help you learn more about OLS regression on the Spatial Statistics Resources page. Start by
reading the Regression Analysis Basics documentation and/or watching the free one-hour Esri Virtual Campus Regression Analysis Basics
web seminar. Next, work through a Regression Analysis tutorial. Apply regression analysis to your own data, referring to the table of
common problems and the article called What they don't tell you about regression analysis for additional strategies. If you are having
trouble finding a properly specified model, the Exploratory Regression tool can be very helpful.

Copyright © 1995-2014 Esri. All rights reserved.

What they don't tell you about regression analysis



Regression analysis is used to understand, model, predict, and/or explain complex phenomena. It helps you answer why questions like "Why
are there places in the United States with test scores that are consistently above the national average?" or "Why are there areas of the city
with such high rates of residential burglary?" You might use regression analysis to explain childhood obesity, for example, using a set of
related variables such as income, education, and accessibility to healthy food.
Typically, regression analysis helps you answer these why questions so that you can do something about them. If, for example, you discover
that childhood obesity is lower in schools that serve fresh fruits and vegetables at lunch, you can use that information to guide policy and
make decisions about school lunch programs. Likewise, knowing the variables that help explain high crime rates can allow you to make
predictions about future crime so that prevention resources can be allocated more effectively.
These are the things they do tell you about regression analysis.
What they don't tell you about regression analysis is that it isn't always easy to find a set of explanatory variables that will allow you to answer
your question or to explain the complex phenomenon you are trying to model. Childhood obesity, crime, test scores, and almost all the things
that you might want to model using regression analysis are complicated issues that rarely have simple answers. Chances are, if you have ever
tried to build your own regression model, this is nothing new to you.
Fortunately, when you run the Ordinary Least Squares (OLS) regression tool, you are presented with a set of diagnostics that can help you
figure out whether you have a properly specified model; a properly specified model is one you can trust. This document examines the six
checks you'll want to pass to have confidence in your model. Those six checks, and the techniques that you can use to solve some of the most
common regression analysis problems, are resources that can definitely make your work easier.

Tip: Once you understand the information presented below, you might decide to use the Exploratory
Regression tool to help you find a model that meets all the requirements of the ordinary least
squares method.

Getting started
Choosing the variable you want to understand, predict, or model is your first task. This variable is known as the dependent variable.
Childhood obesity, crime, and test scores would be the dependent variables being modeled in the examples described above.
Next you have to decide which factors might help explain your dependent variable. These variables are known as the explanatory variables.
In the childhood obesity example, the explanatory variables might be things such as income, education, and accessibility to healthy food.
You will need to do your research here to identify all the explanatory variables that might be important; consult theory and existing
literature, talk to experts, and always rely on your common sense. The preliminary research you do up front will greatly increase your
chances of finding a good model.
With the dependent variable and the candidate explanatory variables selected, you are ready to run your analysis. Always start your
regression analysis with Ordinary Least Squares or Exploratory Regression because these tools perform important diagnostic tests that let
you know if you've found a useful model or if you still have some work to do.
The OLS tool generates several outputs including a map of the regression residuals and a summary report. The regression residuals map
shows the under- and overpredictions from your model, and analyzing this map is an important step in finding a good model. The summary
report is largely numeric and includes all the diagnostics you will use when going through the six checks below.

Output from the OLS tool includes a summary report and residuals map.

The six checks

Check 1: Are these explanatory variables helping my model?


After consulting theory and existing research, you will have identified a set of candidate explanatory variables. You'll have good reasons
for including each one in your model. However, after running your model, you'll find that some of your explanatory variables are
statistically significant and some are not.
How will you know which explanatory variables are significant? The OLS tool calculates a coefficient for each explanatory variable in the
model and performs a statistical test to determine whether that variable is helping your model or not. The statistical test computes the
probability that the coefficient is actually zero. If the coefficient is zero (or very near zero), the associated explanatory variable is not
helping your model. When the statistical test returns a small probability (p-value) for a particular explanatory variable, on the other hand,
it indicates that it is unlikely (there is a small probability) that the coefficient is zero. When the probability is smaller than 0.05, an
asterisk next to the probability on the OLS summary report indicates the associated explanatory variable is important to your model (in
other words, its coefficient is statistically significant at the 95 percent confidence level). So you are looking for explanatory variables
associated with statistically significant probabilities (look for ones with asterisks).
The OLS tool computes both the probability and the robust probability for each explanatory variable. With spatial data, it is not unusual
for the relationships you are modeling to vary across the study area. These relationships are characterized as nonstationary. When the
relationships are nonstationary, you can only trust robust probabilities to tell you whether an explanatory variable is statistically
significant.
How will you know if the relationships in your model are nonstationary? Another statistical test included in the OLS summary report is the
Koenker (Koenker's studentized Breusch-Pagan) statistic for nonstationarity. An asterisk next to the Koenker p-value indicates the
relationships you are modeling exhibit statistically significant nonstationarity, so be sure to consult the robust probabilities.


Typically you will remove explanatory variables from your model if they are not statistically significant. However, if theory indicates a
variable is very important, or if a particular variable is the focus of your analysis, you might retain it even if it's not statistically
significant.

Note: In the process of looking for a properly specified OLS model, you will likely try a variety of
explanatory variables. Be aware that explanatory variable coefficients (and their statistical
significance) can change radically depending on the combination of variables you include in
your model.

Check 2: Are the relationships what I expected?


Not only is it important to determine whether an explanatory variable is actually helping your model, but you also want to check the sign
(+/-) associated with each coefficient to make sure the relationship is what you were expecting. The sign of the explanatory variable
coefficient indicates whether the relationship is positive or negative. Suppose you were modeling crime, for example, and one of your
explanatory variables is average neighborhood income. If the coefficient for the income variable is a negative number, it means that
crimes tend to decrease as neighborhood incomes increase (a negative relationship). If you were modeling childhood obesity and the
accessibility to fast food variable had a positive coefficient, it would indicate that childhood obesity tends to increase as access to fast
food increases (a positive relationship).
When you create your list of candidate explanatory variables, you should include for each variable the relationship (positive or negative)
you are expecting. You would have a hard time trusting a model reporting relationships that don't match with theory and/or common
sense. Suppose you were building a model to predict forest fire frequencies and your regression model returned a positive coefficient for
the precipitation variable. You probably wouldn't expect forest fires to increase in locations with lots of rain.
Unexpected coefficient signs often indicate other problems with your model that will surface as you continue working through the six
checks. You can only trust the sign and strength of your explanatory variable coefficients if your model passes all of these. If you do find
a model that passes all the checks despite the unexpected coefficient sign, you may have discovered an opportunity to learn something
new. Perhaps there is a positive relationship between forest fire frequency and precipitation because the primary source of forest fires in
your study area is lightning. It may be worthwhile to try to obtain data about lightning for your study area to see if it improves model
performance.

Check 3: Are any of the explanatory variables redundant?


When choosing explanatory variables to include in your analysis, look for variables that get at different aspects of what you are trying to
model; avoid variables that are telling the same story. For example, if you were trying to model home values, you probably wouldn't
include explanatory variables for both home square footage and number of bedrooms. Both of these variables relate to the size of the
home, and including both could make your model unstable. Ultimately, you cannot trust a model that includes redundant variables.
How will you know if two or more variables are redundant? Fortunately, whenever you have more than two explanatory variables, the
OLS tool computes a Variance Inflation Factor (VIF) for each variable. The VIF value is a measure of variable redundancy and can help
you decide which variables can be removed from your model without jeopardizing explanatory power. As a rule of thumb, a VIF value
above 7.5 is problematic. If you have two or more variables with VIF values above 7.5, you should remove them one at a time and rerun
OLS until the redundancy is gone. Keep in mind that you do not want to remove all the variables with high VIF values. In the example of
modeling home values, square footage and number of bedrooms would likely both have inflated VIF values. As soon as you remove one
of those two variables, however, the redundancy is eliminated. Including a variable that reflects home size is important; you just don't
want to model that aspect of home values redundantly.
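For background, the VIF reported for a given explanatory variable is conventionally defined as 1 / (1 - R²j), where R²j is the R-squared obtained when that variable is regressed on all the other explanatory variables. A VIF of 7.5 therefore corresponds to roughly 87 percent of a variable's variation being explained by the other variables in the model, which is why values above that threshold signal problematic redundancy.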

Check 4: Is my model biased?


This may seem like a tricky question, but the answer is actually very simple. When you have a properly specified OLS model, the model
residuals (the over- and underpredictions) are normally distributed with a mean of zero (think bell curve). When your model is biased,
however, the distribution of the residuals is unbalanced as shown below. You cannot fully trust predicted results when the model is
biased. Luckily, there are several strategies to help you correct this problem.


A statistically significant Jarque-Bera diagnostic (look for the asterisk) indicates your model is biased. Sometimes your model is doing a
good job for low values but is not predicting well for high values (or vice versa). With the childhood obesity example, this would mean
that, in locations with low childhood obesity, the model is doing a great job, but in areas with high childhood obesity, the predictions are
off. Model bias can also be the result of outliers that are influencing model estimation.
To help you resolve model bias, create a scatterplot matrix for all your model variables. A nonlinear relationship between your dependent
variable and one of your explanatory variables is a common cause of model bias. Such relationships look like curved lines in the scatterplot matrix, while linear relationships look like diagonal lines.

If you see that your dependent variable has a nonlinear relationship with one of your explanatory variables, you have some work to do.
OLS is a linear regression method that assumes the relationships you are modeling are linear. When they aren't, you can try transforming
your variables to see if this creates relationships that are more linear. Common transformations include log and exponential. Check the
Show Histograms option in the Create Scatterplot Matrix wizard to include a histogram for each variable in the
scatterplot matrix. If some of your explanatory variables are strongly skewed, you might be able to remove model bias by transforming
them as well.
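One lightweight way to experiment with a transformation is to write it into a new field and rerun OLS against that field. In the sketch below the feature class and field names are hypothetical, the natural log is used, and the value is left null when the original value is not positive (use a geodatabase field so nulls are allowed):

import arcpy
import math

fc = r"C:\data\analysis.gdb\census_blocks"   # hypothetical input feature class

# Add a field for the transformed variable and populate it with the natural log
arcpy.AddField_management(fc, "LN_INCOME", "DOUBLE")
with arcpy.da.UpdateCursor(fc, ["MED_INCOME", "LN_INCOME"]) as cursor:
    for income, _ in cursor:
        cursor.updateRow([income, math.log(income) if income and income > 0 else None])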

The scatterplot matrix will also reveal data outliers. To see whether an outlier is impacting your model, try running OLS both with and
without an outlier and check to see how much it changes model performance and whether removing it corrects model bias. In some
instances (especially if you think that the outliers represent bad data), you might be able to drop the outliers from your analysis.


Check 5: Have I found all the key explanatory variables?


Often you go into an analysis with hypotheses about which variables are going to be important predictors. Maybe you believe 5 particular
variables will produce a good model, or maybe you have a firm list of 10 variables you think might be related. While it is important to
approach regression analysis with a hypothesis, it is also important to allow your creativity and insight to help you dig deeper. Resist the
inclination to limit yourself to your initial variable list, and try to consider all the possible variables that might impact what you are
modeling. Create thematic maps of each of your candidate explanatory variables and compare those to a map of your dependent
variable. Hit the books again and scan the relevant literature. Use your intuition to look for relationships in your mapped data. Definitely
try to come up with as many candidate spatial variables as you can, such as distance from the urban center, proximity to major
highways, or access to large bodies of water. These kinds of variables will be especially important for analyses where you believe
geographic processes influence relationships in your data. Until you find explanatory variables that effectively capture the spatial
structure in your dependent variable, in fact, your model will be missing key explanatory variables and you will not be able to pass all the
diagnostic checks outlined here.
Evidence that you are missing one or more key explanatory variables is statistically significant spatial autocorrelation of your model
residuals. In regression analysis, issues with spatially autocorrelated residuals usually take the form of clustering: the overpredictions
cluster together and the underpredictions cluster together. How will you know if you have statistically significant spatial autocorrelation in
your model residuals? Running the Spatial Autocorrelation tool on your regression residuals will tell you if you have a problem with spatial
autocorrelation. A statistically significant z-score indicates you are missing key explanatory variables from your model.
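A minimal sketch of this check, assuming an ArcGIS 10.x arcpy session; the feature class, unique ID field, and variable names are hypothetical, and the residual field name written by OLS (StdResid here) should be verified in your output:

import arcpy
arcpy.env.overwriteOutput = True
fc = r"C:\Data\Analysis.gdb\Tracts"            # hypothetical input
ols_out = r"C:\Data\Analysis.gdb\Crime_OLS"
arcpy.OrdinaryLeastSquares_stats(fc, "MYID", ols_out, "CRIME_RATE",
                                 "INCOME;POP_DEN;PCT_RENTER")
# Run Global Moran's I on the OLS residuals; a statistically significant z-score
# suggests one or more key explanatory variables are missing.
arcpy.SpatialAutocorrelation_stats(ols_out, "StdResid", "GENERATE_REPORT",
                                   "INVERSE_DISTANCE", "EUCLIDEAN_DISTANCE", "ROW")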
Finding those missing explanatory variables is often as much an art as a science. Try these strategies to see if they provide any clues:

Examine the OLS residual map


The standard output from OLS is a map of the model residuals. Red areas indicate the actual values (your dependent variable) are
larger than your model predicted they would be. Blue areas show where the actual values are lower than predicted. Sometimes just
seeing the residual map will give you a clue about what might be missing. If you notice that you are consistently overpredicting in
urbanized areas, for example, you might want to consider adding a variable that reflects distance to urban centers. If it looks like
overpredictions are associated with mountain peaks or valley bottoms, perhaps you need an elevation variable. Do you see regional
clusters, or can you recognize trends in your data? If so, creating a dummy variable to capture these regional differences may be
effective. The classic example for a dummy variable is one that distinguishes urban and rural features. By assigning all rural features a
value of 1 and all other features a value of 0, you may be able to capture spatial relationships in the landscape that could be important
to your model. Sometimes creating a hot spot map of model residuals will help you visualize broad regional patterns.
Figuring out the missing spatial variables not only has the potential to improve your model, but this process can also help you better
understand the phenomenon you are modeling in new and innovative ways.
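As an illustration, a rural/urban dummy variable might be created as follows; this is a sketch assuming an arcpy session, and the POP_DEN field and the cutoff of 150 persons per square kilometer are hypothetical:

import arcpy
fc = r"C:\Data\Analysis.gdb\Tracts"   # hypothetical input feature class
arcpy.AddField_management(fc, "RURAL", "SHORT")
# Assign 1 to rural features and 0 to all other features.
arcpy.CalculateField_management(fc, "RURAL", "1 if !POP_DEN! < 150 else 0", "PYTHON_9.3")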

Note: While spatial regime dummy variables are great to include in your OLS model, you will want
to remove them when you run Geographically Weighted Regression (GWR) to avoid problems
with local multicollinearity.

Examine nonstationarity
You can also try running Geographically Weighted Regression and creating coefficient surfaces for each of your explanatory variables
and/or maps of the local R2 values. Select the OLS model that is performing well (one with a high adjusted R2 value that passes all or
most of the other diagnostic checks). Because GWR creates a regression equation for each feature in your study area, the
coefficient surfaces illustrate how the relationships between the dependent variable and each explanatory variable fluctuate
geographically; the map of local R2 values shows variations in model explanatory power. Sometimes seeing these geographic variations
will spark ideas about what variables might be missing: a dip in explanatory power near major freeways, a decline with distance from
the coast, a change in the sign of the coefficients near an industrial region, or a strong east to west trend or boundary—all these would
be clues about spatial variables that may improve your model.

When you examine the coefficient surfaces, be on the lookout for explanatory variables with coefficients that change sign from positive
to negative. This is important because OLS will likely discount the predictive potential of these highly nonstationary variables. Consider,
for example, the relationship between childhood obesity and access to healthy food options. It may be that in low-income areas with
poorer access to cars, being far away from a supermarket is a real barrier to making healthy food choices. In high-income areas with


better access to vehicles, however, having a supermarket within walking distance might actually be undesirable; the distance to the
supermarket might not act as a barrier to buying healthy foods at all. While GWR is capable of modeling these types of complex
relationships, OLS is not. OLS is a global model and expects variable relationships to be consistent (stationary) across the study
area. When coefficients change sign, they cancel each other out. Think of it as (+1) + (-1) = 0. When you find variables where the
coefficients are changing dramatically, especially if they are changing signs, you should keep them in your model even if they are not
statistically significant. These types of variables will be effective when you move to GWR.

Try fitting OLS to smaller subset study areas


GWR is tremendously useful when dealing with nonstationarity, and it can be tempting to move directly to GWR without first finding a
properly specified OLS model. Unfortunately, GWR doesn't have all the great diagnostics to help you figure out whether your
explanatory variables are statistically significant, your residuals are normally distributed, or, ultimately, you have a good model. GWR
will not fix an improperly specified model unless you can be sure that the only reason your OLS model is failing the six checks is the
direct result of nonstationarity. Evidence of nonstationarity would be finding explanatory variables that have a strong positive
relationship in some parts of the study area and a strong negative relationship in other parts. Sometimes the issue isn't with individual
explanatory variables but with the set of explanatory variables used in the model. It may be that one set of variables provides a great
model for one part of the study area, but another set of different variables works best everywhere else. To see if this is the case, you
can select several smaller subset study areas and try to fit OLS models to each of these. Select your subset areas based on the
processes you think may be related to your model (high- versus low-income areas, old versus new housing). Alternatively, select areas
based on the GWR map of local R2 values; the locations with poor model performance might be modeled better using a different set of
explanatory variables.

Tip: The Grouping Analysis tool can be very helpful for identifying subregions in your broader
study area.

If you do find properly specified OLS models in several small study areas, you can conclude that nonstationarity is the culprit and move
to GWR using the full set of explanatory variables you found from all subset area models. If you don't find properly specified models in
the smaller subset areas, it may be that you are trying to model something that is too complex to be reduced to a simple series of
numeric measurements and linear relationships. In that case, you probably need to explore alternative analytic methods.
All of this can be a bit of work, but it is also a great exercise in exploratory data analysis and will help you understand your data better,
find new variables to use, and may even result in a great model.

Check 6: How well am I explaining my dependent variable?


Now it's finally time to evaluate model performance. The adjusted R2 value is an important measure of how well your explanatory
variables are modeling your dependent variable. The R2 value is also one of the first things they tell you about regression analysis. So
why are we leaving this important check until the end? What they don't tell you is that you cannot trust your R2 value unless you have
passed all the other checks listed above. If your model is biased, it may be performing well in some areas or with a particular range of
your dependent variable values, but otherwise not performing well at all. The R2 value doesn't reflect that. Likewise, if you have spatial
autocorrelation of your residuals, you cannot trust the coefficient relationships from your model. With redundant explanatory variables
you can get extremely high R2 values, but your model will be unstable; it will not reflect the true relationships you are trying to model
and might produce completely different results with the addition of even a single observation.
Once you have gone through the other checks and feel confident that you have met all the necessary criteria, however, it is time to figure
out how well your model is explaining the values for your dependent variable by assessing the adjusted R2 value. R2 values range
between 0 and 1 and represent a percentage. Suppose you are modeling crime rates and find a model that passes all five of the previous
checks with an adjusted R2 value of 0.65. This lets you know that the explanatory variables in your model are telling 65 percent of the
crime rate story (more technically, the model is explaining 65 percent of the variation in the crime rate dependent variable). Adjusted R2
values have to be judged rather subjectively. In some areas of science, explaining 23 percent of a complex phenomenon will be very
exciting. In other fields, an R2 value may need to be closer to 80 or 90 percent before it gets anyone's attention. Either way, the adjusted
R2 value will help you judge how well your model is performing.
Another important diagnostic to help you assess model performance is the corrected Akaike's information criterion (AICc). The AICc value
is a useful measure for comparing multiple models. For example, you might want to try modeling student test scores using several
different sets of explanatory variables. In one model you might use only demographic variables, while in another model you might select
variables relating to the school and classroom, such as per-student spending and teacher-to-student ratios. As long as the dependent
variable for all the models being compared is the same (in this case, student test scores), you can use the AICc values from each model
to determine which performs better. The model with the smaller AICc value provides a better fit to the observed data.

And don't forget . . .


Keep in mind as you are going through these steps of building a properly specified regression model that the goal of your analysis is
ultimately to understand your data and use that understanding to solve problems and answer questions. The truth is that you could try a
number of models (with and without transformed variables), explore several small study areas, analyze your coefficient surfaces ...and still
not find a properly specified OLS model. But—and this is important—you will still be contributing to the body of knowledge on the
phenomenon you are modeling. If the model you hypothesized would be a great predictor turns out not to be significant at all, discovering
that is incredibly helpful information. If one of the variables you thought would be strong has a positive relationship in some areas and a
negative relationship in others, knowing about this certainly increases your understanding of the issue. The work that you do here, trying to
find a good model using OLS and then applying GWR to explore regional variation among the variables in your model, is always going to be
valuable.
For more information about regression analysis, check out the Spatial Statistics Resources page.

Copyright © 1995-2014 Esri. All rights reserved.

How GWR works

Locate topic

Geographically Weighted Regression (GWR) is one of several spatial regression techniques increasingly used in geography and other
disciplines. GWR provides a local model of the variable or process you are trying to understand/predict by fitting a regression equation to every
feature in the dataset. GWR constructs these separate equations by incorporating the dependent and explanatory variables of features falling
within the bandwidth of each target feature. The shape and size of the bandwidth is dependent on user input for the Kernel type, Bandwidth


method, Distance, and Number of neighbors parameters.

Implementation notes and tips


In global regression models, such as OLS, results are unreliable when two or more variables exhibit multicollinearity (when two or more
variables are redundant or together tell the same "story"). GWR builds a local regression equation for each feature in the dataset. When the
values for a particular explanatory variable cluster spatially, you will very likely have problems with local multicollinearity. The condition
number in the Output feature class indicates when results are unstable due to local multicollinearity. As a rule of thumb, do not trust results
for features with a condition number larger than 30; equal to Null; or, for shapefiles, equal to -1.7976931348623158e+308.
Severe model design errors often indicate a problem with global or local multicollinearity. To determine where the problem is, run the
model using OLS and examine the VIF value for each explanatory variable. If some of the VIF values are large (above 7.5, for example),
global multicollinearity is preventing GWR from solving. More likely, however, local multicollinearity is the problem. Try creating a thematic
map for each explanatory variable. If the map reveals spatial clustering of identical values, consider removing those variables from the
model or combining those variables with other explanatory variables to increase value variation. If, for example, you are modeling home
values and have variables for both bedrooms and bathrooms, you may want to combine these to increase value variation or represent them
as bathroom/bedroom square footage. Avoid using spatial regime dummy/binary variables, spatially clustering categorical/nominal
variables, or variables with very few possible values when constructing GWR models.
Problems with local multicollinearity can also prevent the AIC and CV Bandwidth method from resolving an optimal distance/number of
neighbors. Try specifying a particular distance or a specific neighbor count, then examine the condition numbers in the Output feature class
to see which features are associated with local multicollinearity problems (condition numbers larger than 30). You may want to remove
these problem features temporarily while you find an optimal distance/number of neighbors. Keep in mind that results associated with
Condition Numbers larger than 30 are not reliable.
Condition numbers indicate how sensitive a linear equation solution is to small changes in matrix coefficients. Individual feature results
when the condition number is greater than 30 are not included in the variance of the parameter estimates; this impacts standard error
diagnostics, global sigma, and standardized residuals.
The user may change this condition number threshold by resetting the registry:
[HKEY_CURRENT_USER\Software\ESRI\GeoStatisticalExtension\DefaultParams\GWR]
"ConditionNumberThreshold"="40"
Parameter estimates and predicted values for GWR are computed using the following spatial weighting function: exp(-d^2/b^2). There may
be differences in this weighting function among various GWR software implementations. Consequently, results from the ESRI GWR tool may
not match results of other GWR software packages exactly.
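For reference, a minimal sketch of this weighting function in Python (NumPy assumed available); it simply evaluates exp(-d^2/b^2) for a set of neighbor distances d and a bandwidth b:

import numpy as np

def gwr_weights(distances, bandwidth):
    # Gaussian kernel: weights decay smoothly from 1 at the target feature toward 0.
    d = np.asarray(distances, dtype=float)
    return np.exp(-(d ** 2) / (bandwidth ** 2))

# Neighbors 0, 500, and 1000 meters away with a 1000-meter bandwidth
print(gwr_weights([0, 500, 1000], 1000.0))   # approximately [1.0, 0.78, 0.37]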

Additional resources
There are a number of good resources to help you learn more about both OLS regression and Geographically Weighted Regression. Start by
reading the Regression Analysis Basics documentation and/or watching the free one-hour ESRI Virtual Campus Regression Analysis Web
seminar. Next, work through a Regression Analysis tutorial. Once you begin creating your own regression models, you may want to refer to
the Interpreting OLS Regression Results and Interpreting GWR Regression Results documentation to help you understand regression output
and diagnostics.
Other resources
Fotheringham, Stewart A., Chris Brunsdon, and Martin Charlton. Geographically Weighted Regression: the analysis of spatially varying
relationships. John Wiley & Sons, 2002.
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

Interpreting GWR results

Locate topic

Output generated from the Geographically Weighted Regression (GWR) tool includes the following:
1. Output feature class
2. Optional coefficient raster surfaces
3. Message window report of overall model results
4. Supplementary table showing model variables and diagnostic results
5. Prediction output feature class

Each of the above outputs is shown and described below as a series of steps for running GWR and interpreting GWR results. You will typically
begin your regression analysis with Ordinary Least Squares (OLS). See Regression Analysis Basics and Interpreting OLS Regression Results for
more information. A common approach to regression analysis is to identify the very best OLS model possible before moving to GWR. This
approach provides the context for the steps below.
(A) Open the Results window, if necessary. After you have identified one or more candidate regression models using the OLS regression tool,
run those models using GWR. Exclude from your GWR model any regional binary (dummy) variables, as these will create problems with local
multicollinearity and are not needed with GWR. You will need to provide an Input feature class with the Dependent variable you want to
model/explain/predict and all the model Explanatory variables. You will also need to provide a path name for the Output feature class, a
Kernel type (either Fixed or Adaptive), and a Bandwidth method (AICc, CV, or a user-provided value). If, for Bandwidth Method, you select
Bandwidth Parameter, you will need to provide a specific Distance (for FIXED Kernel Type) or a specific Number of neighbors (for ADAPTIVE
Kernel Type). You can also provide values for the optional parameters described in the GWR tool documentation. One especially interesting
optional parameter is the Coefficient raster workspace. When you provide a folder path name for this parameter, the GWR tool will create
coefficient raster surfaces (described below) for the model intercept and each explanatory variable.
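A minimal sketch of such a run from Python, assuming an ArcGIS 10.x arcpy session; the feature class, field names, and output paths are hypothetical, and "#" marks optional parameters left at their defaults:

import arcpy
fc = r"C:\Data\Analysis.gdb\Tracts"          # hypothetical input
arcpy.GeographicallyWeightedRegression_stats(
    fc, "CRIME_RATE", "INCOME;POP_DEN",      # dependent and explanatory variables
    r"C:\Data\Analysis.gdb\Crime_GWR",       # Output feature class
    "ADAPTIVE", "AICc",                      # Kernel type, Bandwidth method
    "#", "#", "#",                           # Distance, Number of neighbors, Weight field
    r"C:\Data\GWR_rasters")                  # Coefficient raster workspace (optional)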


GWR Tool Dialog Box

(B) Examine the statistical summary report written to the Results window. Right-clicking the Messages entry in the Results window and
selecting View will display the GWR summary report in a Message dialog box. If you execute this tool in the foreground, the summary
report will also be displayed in the progress dialog box. Each of the diagnostics reported is described below.
1. Bandwidth or Neighbors: This is the bandwidth or number of neighbors used for each local estimation and is perhaps the most
important parameter for Geographically Weighted Regression. It controls the degree of smoothing in the model. Typically, you will
let the program choose a bandwidth or neighbor value for you by selecting either AICc (the corrected Akaike Information Criterion)
or CV (Cross Validation) for the Bandwidth method parameter. Both of these options try to identify an optimal fixed distance or
optimal adaptive number of neighbors. Since the criteria for "optimal" are different for AICc than for CV, it is common to get a
different optimal value. You may also provide an exact fixed distance or a particular number of neighbors by selecting BANDWIDTH
PARAMETER for the Bandwidth method.
The bandwidth units depend on the specified Kernel type. If you select FIXED, the bandwidth value will reflect a distance in the
same units as the Input feature class (for example, if the input feature class is projected using UTM coordinates, the distance
reported will be in meters). If you select ADAPTIVE, the bandwidth distance will change according to the spatial density of features
in the Input feature class. The bandwidth becomes a function of the number of nearest neighbors such that each local estimation is
based on the same number of features. Instead of a specific distance, the number of neighbors used for the analysis is reported.
2. ResidualSquares: This is the sum of the squared residuals in the model (the residual being the difference between an observed y
value and its estimated value returned by the GWR model). The smaller this measure, the closer the fit of the GWR model to the
observed data. This value is used in a number of other diagnostic measures.
3. EffectiveNumber: This value reflects a tradeoff between the variance of the fitted values and the bias in the coefficient estimates
and is related to the choice of bandwidth. As the bandwidth approaches infinity, the geographic weights for every observation
approach 1, and the coefficient estimates will be very close to those for a global OLS model. For very large bandwidths, the
effective number of coefficients approaches the actual number; local coefficient estimates will have a small variance but will be
quite biased. Conversely, as the bandwidth approaches zero, the geographic weights for every observation approach zero with the
exception of the regression point itself. For extremely small bandwidths, the effective number of coefficients is the number of
observations, and the local coefficient estimates will have a large variance but a low bias. The effective number is used to compute
a number of diagnostic measures.
4. Sigma: This value is the square root of the normalized residual sum of squares, where the residual sum of squares is divided by the
effective degrees of freedom of the residual. This is the estimated standard deviation for the residuals. Smaller values of this
statistic are preferable. Sigma is used for AICc computations.
5. AICc: This is a measure of model performance and is helpful for comparing different regression models. Taking into account model
complexity, the model with the lower AICc value provides a better fit to the observed data. AICc is not an absolute measure of
goodness of fit but is useful for comparing models with different explanatory variables as long as they apply to the same dependent
variable. If the AICc values for two models differ by more than 3, the model with the lower AICc is held to be better. Comparing the
GWR AICc value to the OLS AICc value is one way to assess the benefits of moving from a global model (OLS) to a local regression
model (GWR).
6. R2: R-Squared is a measure of goodness of fit. Its value varies from 0.0 to 1.0, with higher values being preferable. It may be
interpreted as the proportion of dependent variable variance accounted for by the regression model. The denominator for the R2
computation is the sum of squared dependent variable values. Adding an extra explanatory variable to the model does not alter the
denominator but does alter the numerator; this gives the impression of improvement in model fit that may not be real. See
Adjusted R2 below.
7. R2Adjusted: Because of the problem described above for the R2 value, calculations for the adjusted R-squared value normalize the
numerator and denominator by their degrees of freedom. This has the effect of compensating for the number of variables in a
model, and consequently, the Adjusted R2 value is almost always smaller than the R2 value. However, in making this adjustment,
you lose the interpretation of the value as a proportion of the variance explained. In GWR, the effective number of degrees of
freedom is a function of the bandwidth, so the adjustment may be quite marked in comparison to a global model like OLS. For this


reason, the AICc is preferred as a means of comparing models.
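To make the relationship between these diagnostics concrete, here is a sketch of how Sigma and AICc can be computed from the residual sum of squares, the number of observations, and the effective number of parameters, using one commonly cited form of the GWR AICc (Fotheringham, Brunsdon, and Charlton, 2002); the tool's internal computation may differ in detail:

import math

def gwr_sigma_aicc(rss, n, k_eff):
    # Sigma: square root of RSS divided by the effective residual degrees of freedom.
    sigma = math.sqrt(rss / (n - k_eff))
    # Corrected AIC for GWR, with k_eff playing the role of the trace of the hat matrix.
    aicc = (2.0 * n * math.log(sigma) + n * math.log(2.0 * math.pi)
            + n * (n + k_eff) / (n - 2.0 - k_eff))
    return sigma, aicc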

GWR tool execution output

Message window diagnostics are written to a supplementary table (_supp) along with summary information about model variables and
parameters.

GWR _supp output table

(C) Examine the output feature class residuals.


Over- and underpredictions for a well-specified regression model will be randomly distributed. Clustering of over- and/or underpredictions is
evidence that you are missing at least one key explanatory variable. Examine the patterns in your OLS and GWR model residuals to see if they
provide clues about what those missing variables might be. Run the Spatial Autocorrelation (Moran's I) tool on the regression residuals to
ensure that they are spatially random. Statistically significant clustering of high and/or low residuals (model under- and overpredictions)
indicates that the GWR model is misspecified.

GWR feature class output with rendered residuals

In addition to regression residuals, the Output feature class includes fields for observed and predicted y values, condition number (cond), Local
R2, explanatory variable coefficients, and standard errors:
1. Condition Number: This diagnostic evaluates local multicollinearity. In the presence of strong local multicollinearity, results become
unstable. Results associated with condition numbers larger than 30 may be unreliable.
2. Local R2: These values range between 0.0 and 1.0 and indicate how well the local regression model fits observed y values. Very
low values indicate that the local model is performing poorly. Mapping the Local R2 values to see where GWR predicts well and
where it predicts poorly may provide clues about important variables that may be missing from the regression model.
3. Predicted: These are the estimated (or fitted) y values computed by GWR.
4. Residuals: To obtain the residual values, the fitted y values are subtracted from the observed y values. Standardized residuals have
a mean of zero and a standard deviation of 1. A cold-to-hot rendered map of standardized residuals is automatically added to the
table of contents when GWR is executed in ArcMap.
5. Coefficient Standard Error: These values measure the reliability of each coefficient estimate. Confidence in those estimates is
higher when standard errors are small in relation to the actual coefficient values. Large standard errors may indicate problems with
local multicollinearity.


Output feature class results

(D) Examine the coefficient raster surfaces created by GWR (and/or with polygon data, a graduated color rendering of the feature-level
coefficients) to better understand regional variation in the model explanatory variables. When you use GWR to model some variable (the
dependent variable) you are generally interested in predicting values or understanding the factors that contribute to dependent variable
outcomes. You are also interested, however, in examining how spatially consistent (stationary) relationships between the dependent variable
and each explanatory variable are across the study area. Examining the coefficient distribution as a surface shows where and how much
variation is present. You can use your understanding of this variation to inform policy:
 Statistically significant global variables that exhibit little regional variation inform regionwide policy.
 Statistically significant global variables that exhibit strong regional variation inform local policy.
 Some variables may not be globally significant, because in some regions, they are positively related, and in others are negatively related.

Spatial heterogeneity is revealed in output coefficient raster surfaces.

(E) Map GWR predictions. GWR can be used for prediction when it is applied to sampled data. Specify a feature class containing all the
explanatory variables for locations where the dependent variable is unknown. GWR calibrates the regression equation using known dependent
variable values from the Input feature class, then creates a new Output feature class with dependent variable estimates.

Copyright © 1995-2014 Esri. All rights reserved.

Interpreting Exploratory Regression results

Locate topic

When you run the Exploratory Regression tool, the primary output is a report. The report can be seen in the geoprocessing messages window
when you run in the foreground, or it can be accessed from the Results window. Optionally, a table will also be created that can help you
further investigate the models that have been tested. One purpose of the report is to help you figure out whether or not the candidate
explanatory variables you are considering yield any properly specified OLS models. In the event that there are no passing models (models that
meet all of the criteria you specified when you launched the Exploratory Regression tool), however, the output will also show you which
variables are consistent predictors and help you determine which diagnostics are giving you problems. Strategies for addressing problems
associated with each of the diagnostics are given in the Regression Analysis Basics document (see Common regression problems,
consequences, and solutions) and in What they don't tell you about regression analysis. For more information about how to determine whether
or not you have a properly specified OLS model, please see Regression Analysis Basics and Interpreting OLS results.

The report
The Exploratory Regression report has five distinct sections. Each section is described below.


1. Best models by number of explanatory variables


The first set of summaries in the output report is grouped by the number of explanatory variables in the models tested. If you specify a 1
for the Minimum Number of Explanatory Variables parameter, and a 5 for the Maximum Number of Explanatory Variables parameter, you
will have 5 summary sections. Each section lists the three models with the highest adjusted R2 values and all passing models. Each
summary section also includes the diagnostic values for each model listed: corrected Akaike Information Criteria - AICc, Jarque-Bera p-
value - JB, Koenker’s studentized Breusch-Pagan p-value - K(BP), the largest Variance Inflation Factor - VIF, and a measure of residual
Spatial Autocorrelation (the Global Moran’s I p-value) - SA. These summaries give you an idea of how well your models are predicting (Adj
R2 ), and if any models pass all of the diagnostic criteria you specified. If you accepted all of the default Search Criteria (Minimum
Acceptable Adj R Squared, Maximum Coefficient p-value Cutoff, Maximum VIF Value Cutoff , Minimum Acceptable Jarque Bera p-value, and
Minimum Acceptable Spatial Autocorrelation p-value parameters), any models included in the Passing Models list will be properly specified
OLS models.
If there aren’t any passing models, the rest of the output report still provides lots of good information about variable relationships, and can
help you make decisions about how to move forward.

2. Exploratory Regression Global Summary

The Exploratory Regression Global Summary section is an important place to start, especially if you haven't found any passing models,
because it shows you why none of the models are passing. This section lists the five diagnostic tests and the percentage of models that
passed each of those tests. If you don’t have any passing models, this summary will help you figure out which diagnostic test is giving you
trouble.
Often the diagnostic giving you problems will be the Global Moran’s I test for Spatial Autocorrelation (SA). When all of the models tested
have spatially autocorrelated regression residuals, it most often indicates you are missing key explanatory variables. One of the best ways
to find missing explanatory variables is to examine the map of the residuals output from the Ordinary Least Squares regression (OLS) tool.
Choose one of the exploratory regression models that performed well for all of the other criteria (use the lists of highest adjusted R-Squared
values, or select a model from those in the optional output table), and run OLS using that model. Output from the Ordinary Least Squares
regression (OLS) tool is a map of the model residuals. You should examine the residuals to see if they provide any clues about what might
be missing. Try to think of as many candidate spatial variables as you can (distance to major highways, hospitals, or other key geographic
features, for example). Consider trying spatial regime variables: if all of your underpredictions are in the rural areas, for example, create a
dummy variable to see if it improves your exploratory regression results.
The other diagnostic that is commonly problematic is the Jarque-Bera test for normally distributed residuals. When none of your models
pass the Jarque-Bera (JB) test, you are having a problem with model bias. Common sources of model bias include:
 Nonlinear relationships
 Data outliers
Viewing a scatterplot matrix of the candidate explanatory variables in relation to your dependent variable will show you if you have either of
these problems. Additional strategies are outlined in Regression Analysis Basics. If your models are failing the Spatial Autocorrelation test
(SA), fix those issues first. The bias may be the result of missing key explanatory variables.

3. Summary of Variable Significance


The Summary of Variable Significance section provides information about variable relationships and how consistent those relationships
are. Each candidate explanatory variable is listed with the proportion of times it was statistically significant. The first few variables in the list
have the largest values for the % Significant column. You can also see how stable variable relationships are by examining the % Negative
and % Positive columns. Strong predictors will be consistently significant (% Significant), and the relationship will be stable (primarily
negative or primarily positive).
This part of the report is also there to help you be more efficient. This is especially important when you are working with a lot of candidate
explanatory variables (over 50), and want to try models with five or more predictors. When you have a large number of explanatory
variables and are testing many combinations, the calculations can take a long time. In some cases, in fact, the tool won’t finish at all due to
memory errors. A good approach is to gradually increase the number of models tested: start by setting both the Minimum Number of
Explanatory Variables and the Maximum Number of Explanatory Variables to 2, then 3, then 4, and so on. With each run, remove the
variables that are rarely statistically significant in the models tested. This Summary of Variable Significance section will help you find
those variables that are consistently strong predictors. Even removing one candidate explanatory variable from your list can greatly reduce
the amount of time it takes for the Exploratory Regression tool to complete.

4. Summary of Multicollinearity

The Summary of Multicollinearity section of the report can be used in conjunction with the Summary of Variable Significance
section to understand which candidate explanatory variables may be removed from your analysis in order to improve performance. The
Summary of Multicollinearity section tells you how many times each explanatory variable was included in a model with high
multicollinearity, and the other explanatory variables that were also included in those models. When two (or more) explanatory variables
are frequently found together in models with high multicollinearity, it indicates that those variables may be telling the same story. Since you
only want to include variables that are explaining a unique aspect of the dependent variable, you may want to choose only one of the
redundant variables to include in further analysis. One approach is to use the strongest of the redundant variables based on the Summary of
Variable Significance.

5. Additional diagnostic summaries

The final diagnostic summaries show the highest Jarque-Bera p-values (Summary of Residual Normality) and the highest Global Moran’s
I p-values (Summary of Residual Autocorrelation). To pass these diagnostic tests, you are looking for large p-values.
These summaries are not especially useful when your models are passing the Jarque-Bera and Spatial Autocorrelation (Global Moran’s I)
test, because if your criterion for statistical significance is 0.1, all models with values larger than 0.1 are equally passing models. These
summaries are useful, however, when you do not have any passing models and you want to see how far you are from having normally
distributed residuals or residuals that are free from statistically significant spatial autocorrelation. For instance, if all of the p-values for the
Jarque-Bera summary are 0.000000, it's clear that you are far away from having normally distributed residuals. Alternatively, if the p-
values are 0.092, then you know you're close to having residuals that are normally distributed (in fact, depending on the level of significance


that you chose, a p-value of 0.092 might be passing). These summaries are there to demonstrate how serious the problem is and, when
none of your models are passing, which variables are associated with the models that are at least getting close to passing.

The table

If you provided a value for the Output Results Table, a table will be created containing all models that met your Maximum Coefficient p-
value Cutoff and Maximum VIF Value Cutoff criteria. Even if you do not have any passing models, there is a good chance that you will have
some models in the output table. Each row in the table represents a model meeting your criteria for coefficient and VIF values. The columns
in the table provide the model diagnostics and explanatory variables. The diagnostics listed are Adjusted R-Squared (R2), corrected Akaike
Information Criteria (AICc), Jarque-Bera p-value (JB), Koenker’s studentized Breusch-Pagan p-value (BP), Variance Inflation Factor (VIF),
and Global Moran’s I p-value (SA). You may want to sort the models by their AICc values. The lower the AICc value, the better the model
performed. You can sort the AICc values in ArcMap by double-clicking on the AICc column. If you are choosing a model to use in an OLS
analysis (in order to examine the residuals), remember to choose a model with a low AICc value and passing values for as many of the
other diagnostics as possible. For example, if you have looked at your output report and you know that Jarque-Bera was the diagnostic that
gave you trouble, you would look for the model with the lowest AICc value that met all of the criteria except for Jarque-Bera.

Additional resources
If you're new to regression analysis in ArcGIS, we strongly encourage you to watch the Free Esri Virtual Campus Training Seminar on
Regression, then run through the Regression Analysis tutorial before using Exploratory Regression.
You may also want to see:
 Learn more about how Exploratory Regression works
 What they don't tell you about regression analysis
 Regression analysis basics
 Burnham, K.P. and D.R. Anderson. 2002. Model Selection and Multimodel Inference: a practical information-theoretic approach, 2nd
Edition. New York: Springer. Section 1.5.
Also, check the Spatial Statistics Resource page for new videos, tutorials, and other training materials.

Copyright © 1995-2014 Esri. All rights reserved.

How Exploratory Regression works

Locate topic

Finding a properly specified OLS model can be difficult, especially when there are lots of potential explanatory variables you think might be
important contributing factors to the variable you are trying to model (your dependent variable). The Exploratory Regression tool can help. It is
a data mining tool that will try all possible combinations of explanatory variables to see which models pass all of the necessary OLS
diagnostics. By evaluating all possible combinations of the candidate explanatory variables, you greatly increase your chances of finding the
best model to solve your problem or answer your question. While Exploratory Regression is similar to Stepwise Regression (found in many
statistical software packages), rather than only looking for models with high Adjusted R2 values, Exploratory Regression looks for models that
meet all of the requirements and assumptions of the OLS method.

Using the Exploratory Regression tool


When you run the Exploratory Regression tool, you specify a minimum and maximum number of explanatory variables each model should
contain, along with threshold criteria for Adjusted R2, coefficient p-values, Variance Inflation Factor (VIF) values, Jarque-Bera p-values, and
spatial autocorrelation p-values. Exploratory Regression runs OLS on every possible combination of the Candidate Explanatory Variables for
models with at least the Minimum Number of Explanatory Variables and not more than the Maximum Number of Explanatory Variables.
Each model it tries is assessed against your Search Criteria. When it finds a model:

 That exceeds your specified Adjusted R2 threshold


 With coefficient p-values, for all explanatory variables, less than you specified
 With coefficient VIF values, for all explanatory variables, less than your specified threshold
 Returning a Jarque-Bera p-value larger than you specified
It then runs the Spatial Autocorrelation (Global Moran’s I) tool on that model’s residuals. If the spatial autocorrelation p-value is also larger
than you specified in the tool’s search criteria (Minimum Acceptable Spatial Autocorrelation p-value), the model is listed as a passing
model. The Exploratory Regression tool will also test regression residuals using the Spatial Autocorrelation tool for models with the three
highest Adjusted R2 results.
Models listed under Passing Models meet your specified search criteria. If you take the default values for the Maximum Coefficient p value
Cutoff, Maximum VIF Value Cutoff, Minimum Acceptable Jarque Bera p value, and Minimum Acceptable Spatial Autocorrelation p value,
your passing models will also be properly specified OLS models. A properly specified OLS model has:
 Explanatory variables where all of the coefficients are statistically significant
 Coefficients reflecting the expected, or at least a justifiable, relationship between each explanatory variable and the dependent variable
 Explanatory variables that get at different aspects of what you are trying to model (none are redundant; small VIF values less than 7.5)
 Normally distributed residuals indicating your model is free from bias (the Jarque-Bera p-value is not statistically significant)


 Randomly distributed over- and underpredictions, indicating the model residuals are free from spatial autocorrelation (the spatial
autocorrelation p-value is not statistically significant)
When you specify an Output Results Table, models that meet your Maximum VIF Value Cutoff and for which all explanatory variables meet
the Maximum Coefficient p value Cutoff will be written to a table. This table is helpful when you want to examine more than just those
models included in the text report file.
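Putting the parameters described above together, a minimal sketch of running the tool from Python, assuming an ArcGIS 10.x arcpy session; the feature class, field names, and output paths are hypothetical, and "#" leaves the optional spatial weights matrix file unset:

import arcpy
fc = r"C:\Data\Analysis.gdb\Tracts"               # hypothetical input
arcpy.ExploratoryRegression_stats(
    fc, "CRIME_RATE",                             # dependent variable
    "INCOME;POP_DEN;PCT_RENTER;DIST_URBAN",       # candidate explanatory variables
    "#",                                          # optional spatial weights matrix file
    r"C:\Temp\ExpReg_report.txt",                 # Output Report File
    r"C:\Data\Analysis.gdb\ExpReg_results")       # Output Results Table
# The minimum/maximum number of explanatory variables and the search criteria cutoffs
# (Adjusted R2, coefficient p-value, VIF, Jarque-Bera, spatial autocorrelation) are
# additional optional parameters; this sketch accepts their defaults.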

Some cautions
Please be aware that, similar to using methods such as Stepwise Regression, using the Exploratory Regression tool is controversial. While
an exaggeration, there are basically two schools of thought on this: the scientific method viewpoint and the data miner’s viewpoint.

Scientific method viewpoint


A strong proponent of the scientific method might object to exploratory regression methods. From their perspective, you should formalize
your hypotheses before exploring your data to avoid creating models that fit only your data, but don’t reflect broader processes.
Constructing models that overfit one particular dataset may not be relevant to other datasets—sometimes, in fact, even adding new
observations will cause an overfit model to become unstable (performance might decrease and/or explanatory variable coefficient
significance may wane). When your model isn’t robust, even to new observations, it certainly is not getting at the key processes for what
you are trying to model.
In addition, please realize that regression statistics are based on probability theory, and when you run thousands of models, you strongly
increase your chances of inappropriately rejecting the null hypothesis (a type 1 statistical error). When you select a 95 percent confidence
level, for example, you are accepting a particular risk; if you could resample your data 100 times, probability indicates that as many as 5
out of those 100 samples would produce false positives. P-values are computed for each coefficient; the null hypothesis is that the
coefficient is actually zero and, consequently, the explanatory variable associated with that coefficient is not helping your model.
Probability theory indicates that in as many as 5 out of 100 samples, the p-value might be statistically significant only because you just
happened to select observations that falsely support that conclusion. When you are only running one model, a 95 percent confidence level
seems conservative. As you increase the number of models you try, you diminish your ability to draw conclusions from your results. The
Exploratory Regression tool can try thousands of models in just a few minutes. The number of models tried is reported in the Global
Summary section of the Output Report File.

Data miner's viewpoint


Researchers from the data mining school of thought, on the other hand, would likely feel it is impossible to know a priori all of the factors
that contribute to any given real-world outcome. Often the questions we are trying to answer are complex, and theory on our particular
topic may not exist, or might be out of date. Data miners are big proponents of inductive analyses such as those provided by exploratory
regression. They encourage thinking outside of the box and using exploratory regression methods for hypothesis development.

Recommendations
We feel that Exploratory Regression, when used with discretion, is a valuable data mining tool that can help you find a properly specified
OLS model. Our recommendation is that you always select candidate explanatory regression variables that are supported by theory,
guidance from experts, and common sense. Calibrate your regression models using a portion of your data, and validate it with the
remainder, or validate your model on additional datasets. If you do plan to draw inferences from your results, at minimum, you will want
to perform a sensitivity analysis such as bootstrapping.
Using the Exploratory Regression tool does have advantages over using other exploratory methods that only assess model performance in
terms of Adjusted R2 values. The Exploratory Regression tool is looking for models that pass all of the OLS diagnostics described above.

Copyright © 1995-2014 Esri. All rights reserved.

How Generate Network Spatial Weights works

Locate topic

A spatial weights matrix quantifies the spatial relationships that exist among the features in your dataset. Many tools in the Spatial Statistics
toolbox evaluate each feature within the context of its neighboring features. The spatial weights matrix file defines those neighbor spatial
relationships. (For more information about spatial weights and spatial weights matrix files, see Spatial weights.)
Typically, spatial relationships among a set of features are defined using Euclidean distance measurements and contiguity, fixed, or inverse
distance weighting schemes (see Modeling spatial relationships). However, for many applications, including retail analysis, accessibility to
services, emergency response, evacuation planning, and traffic incident analyses, defining spatial relationships in terms of real-world travel
networks (roads, railways, footpaths, for example) is more appropriate. The Generate Network Spatial Weights tool allows you to model and
store spatial relationships based on time or distance between point features in the case where travel is restricted to a network dataset. This
tool requires a license for the ArcGIS Network Analyst extension.
You provide a point feature class representing both feature origins and feature destinations. You also provide an existing network dataset (see
Designing a network dataset or use one of the ready-to-use network datasets that come with ESRI Data & Maps). The Generate Network
Spatial Weights tool locates each point on the network and quantifies, in distance or time, the proximity between each and every other feature.
The resultant proximity solution for any two features may optionally consider barriers and/or restrictions (road closures, for example). These
proximity values are utilized in the mathematics of several spatial statistics tools including Spatial Autocorrelation (Global Moran's I), Hot Spot
Analysis (Getis-Ord Gi*), and Cluster and Outlier Analysis (Anselin Local Moran's I).
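A minimal sketch of calling the tool from Python, assuming an ArcGIS 10.x arcpy session with a Network Analyst license; the point feature class, unique ID field, network dataset path, and the Minutes impedance attribute are hypothetical and depend on your own network:

import arcpy
arcpy.CheckOutExtension("Network")                # requires ArcGIS Network Analyst
arcpy.GenerateNetworkSpatialWeights_stats(
    r"C:\Data\Analysis.gdb\Stores", "MYID",       # points with a unique ID field
    r"C:\Data\store_travel.swm",                  # output spatial weights matrix file
    r"C:\Data\Streets_ND",                        # existing network dataset
    "Minutes")                                    # impedance attribute on that network
# Optional parameters control the impedance cutoff, maximum number of neighbors,
# barriers, restrictions, and row standardization; defaults are used in this sketch.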

Dive-in: The proximity values within the spatial weights matrix file are stored in little endian binary format
using sparse matrix techniques to minimize use of disk space, computer memory, and the number
of required calculations.

Tip: ESRI Data & Maps, free to ArcGIS users, contains StreetMap data including a prebuilt network
dataset in SDC format. The coverage for this dataset is the United States and Canada. These
network datasets can be used directly by the Generate Network Spatial Weights tool.

Additional resources


Anselin, L. (1988). Spatial Econometrics: Methods and Models. Boston: Kluwer.


Getis, A., and Aldstadt, J. (2004). "Constructing the Spatial Weights Matrix Using a Local Statistic." Geographical Analysis 36(2):90–104.
Haining, R. (2003). Spatial Data Analysis: Theory and Practice. Cambridge, UK: Cambridge University Press.
Price, Mike. (Fall 2009). "It's all about streets." ArcUser Online. ESRI.

Copyright © 1995-2014 Esri. All rights reserved.

How Generate Spatial Weights Matrix works

Locate topic

Spatial statistics does not mean applying traditional (nonspatial) statistical methods to data that just happens to be spatial (having x- and y-
coordinates). Spatial statistics integrate space and spatial relationships directly into their mathematics (area, distance, length, and so on). For
many spatial statistics, these spatial relationships are specified formally through a spatial weights matrix file or table.
A spatial weights matrix is a representation of the spatial structure of your data. It is a quantification of the spatial relationships that exist
among the features in your dataset (or, at least, a quantification of the way you conceptualize those relationships). Because the spatial weights
matrix imposes a structure on your data, you should select a conceptualization that best reflects how features actually interact with each other
(giving thought, of course, to what it is you are trying to measure). If you are measuring clustering of a particular species of seed-propagating
tree in a forest, for example, some form of inverse distance is probably most appropriate. However, if you are assessing the geographic
distribution of a region's commuters, travel time or travel cost might be a better choice.
While physically implemented in a variety of ways, conceptually, the spatial weights matrix is an NxN table (N is the number of features in the
dataset). There is one row for every feature and one column for every feature. The cell value for any given row/column combination is the
weight that quantifies the spatial relationship between those row and column features.
At the most basic level, there are two strategies for creating weights to quantify the relationships among data features: binary or variable
weighting. For binary strategies (fixed distance, K nearest neighbors, Delaunay Triangulation, contiguity, or space-time window) a feature is
either a neighbor (1) or it is not (0). For weighted strategies (inverse distance or zone of indifference), neighboring features have a varying
amount of impact (or influence), and weights are computed to reflect that variation.
Based on your parameter specifications, the Generate Spatial Weights Matrix tool creates a spatial weights matrix (SWM) file. The spatial
relationship values in that file are stored using sparse matrix techniques to minimize disk space, computer memory, and the number of
required calculations. These relationship values are utilized in the mathematics of several spatial statistics tools including Spatial
Autocorrelation (Global Moran's I), Hot Spot Analysis (Getis-Ord Gi*), and Cluster and Outlier Analysis (Anselin Local Moran's I). While the
spatial weights matrix file can conceivably store NxN spatial relationships, in most cases, each feature should only be related to a handful of
others. The sparse methodology takes advantage of this by only storing nonzero relationships.
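A minimal sketch of creating an SWM file from Python, assuming an ArcGIS 10.x arcpy session; the feature class and MYID unique ID field are hypothetical, "#" accepts defaults, and the parameter order should be verified against the tool documentation for your release:

import arcpy
arcpy.GenerateSpatialWeightsMatrix_stats(
    r"C:\Data\Analysis.gdb\Tracts", "MYID",
    r"C:\Data\tracts_knn8.swm",
    "K_NEAREST_NEIGHBORS",                        # conceptualization of spatial relationships
    "#", "#", "#",                                # distance method, exponent, threshold
    8,                                            # number of neighbors
    "ROW_STANDARDIZATION")
# The resulting .swm file can then be supplied to tools such as Hot Spot Analysis or
# Spatial Autocorrelation through their Weights Matrix File parameter.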

Note: It is possible to run out of memory when you are using an SWM file. This generally occurs when
the Conceptualization of Spatial Relationships or the Distance Band or Threshold Distance you select results in
features having many, many neighbors, which negates the sparse nature of the SWM file.
You generally do not want to create a spatial weights matrix where every feature has thousands
of neighbors. You want all features to have at least one neighbor and almost all features to have
at least eight neighbors. You can ensure that each feature has a specified minimum number of
neighbors by entering that minimum value for the Number of Neighbors parameter.

Dive-in: The spatial weights matrix (SWM) file is written using a little endian binary file format. For more
information about how the SWM file is read and written to disk, right-click the Generate Spatial
Weights Matrix tool and choose Edit. This will display the Python source code for this tool. The
code for reading an SWM file is in the WeightsUtils.py file, which is installed in your
<ArcGIS>/ArcToolbox/Scripts folder.

Additional resources
Getis, Arthur, and Jared Aldstadt. "Constructing the Spatial Weights Matrix Using a Local Statistic." Geographical Analysis 36(2): 90–104,
2004.
Mitchell, Andy. The ESRI Guide to GIS Analysis, Volume 2. ESRI Press, 2005.

Copyright © 1995-2014 Esri. All rights reserved.

Spatial weights

Locate topic

Spatial statistics integrate space and spatial relationships directly into their mathematics (area, distance, length, or proximity, for example).
Typically, these spatial relationships are defined formally through values called spatial weights. Spatial weights are structured into a spatial
weights matrix and stored as a spatial weights matrix file.
A spatial weights matrix quantifies the spatial and temporal relationships that exist among the features in your dataset (or at least quantifies
your conceptualization of those relationships). While the physical format of the spatial weights matrix file may vary, conceptually, you can
think of the spatial weights matrix as a table with one row and one column for every feature in the dataset. The cell value for any given
row/column combination is the weight that quantifies the spatial relationship between those row and column features.
There is a multitude of weighting possibilities including inverse distance, fixed distance, space-time window, K nearest neighbors, contiguity,
and spatial interaction (these conceptual models of spatial relationships are described in Modeling Spatial Relationships). Recognize that the
conceptualization you select to model spatial relationships for a particular analysis will impose a structure onto your data. Consequently, you
will want to select a conceptualization that best reflects how the features being analyzed actually interact with each other in the real world.
At a very basic level, however, weights are either binary or variable. Binary weighting, for example, is used with fixed distance, space-time
window, K nearest neighbors, and contiguity spatial relationships. For a particular target feature, binary weighting assigns a weight of 1 to all
neighboring features and a weight of 0 to all other features. For inverse distance or inverse time spatial relationships, weights are variable.
Variable weights fall into a range from 0 to 1 so that nearby neighbors get larger weights than neighbors farther away.
Spatial weights are often row standardized, particularly with binary weighting strategies. Row standardization is used to create proportional
weights in cases where features have an unequal number of neighbors. Row standardization involves dividing each neighbor weight for a
feature by the sum of all neighbor weights for that feature and is recommended whenever the distribution of your features is potentially biased
due to sampling design or an imposed aggregation scheme. You will almost always want to apply row standardization when your features are
polygons.
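
A minimal plain-Python sketch of row standardization with binary weights (the feature IDs and neighbor lists are hypothetical): each neighbor starts with a weight of 1, and dividing by the feature's neighbor count makes every row sum to 1, so features with many neighbors are not overweighted relative to features with few.

# Binary neighbor lists for four hypothetical features.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}

# Row standardization: divide each binary weight of 1 by the feature's neighbor count.
row_standardized = {}
for feature, neighbor_list in neighbors.items():
    n = len(neighbor_list)
    row_standardized[feature] = dict((nbr, 1.0 / n) for nbr in neighbor_list)

print(row_standardized["B"])  # each of B's three neighbors gets a weight of 1/3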

Related Topics
Modeling spatial relationships
Generate Network Spatial Weights
Generate Spatial Weights Matrix
High/Low Clustering (Getis-Ord General G)
Spatial Autocorrelation (Global Moran's I)
Cluster and Outlier Analysis (Anselin Local Moran's I)
Hot Spot Analysis (Getis-Ord Gi*)
Grouping Analysis

Copyright © 1995-2014 Esri. All rights reserved.

An overview of the Utilities toolset

Locate topic

These utility scripts perform a variety of data conversion tasks. They were designed to be used in conjunction with other tools in the Spatial
Statistics toolbox.

Legacy: Because there are easier and more efficient ways to get the area of features, the Calculate Areas
tool will no longer be included with ArcGIS Pro. Use the Calculate Field tool instead of the
Calculate Areas tool in your workflows.

Tool Description

Calculate Areas
    Calculates area values for each feature in a polygon feature class.

Calculate Distance Band from Neighbor Count
    Returns the minimum, the maximum, and the average distance to the specified Nth nearest neighbor (N is an input parameter) for a set of features. Results are accessible from the Results window.

Collect Events
    Converts event data, such as crime or disease incidents, to weighted point data.

Convert Spatial Weights Matrix To Table
    Converts a binary spatial weights matrix file (.swm) to a table.

Export Feature Attributes To ASCII
    Exports feature class coordinates and attribute values to a space, comma, or semicolon-delimited ASCII text file.
Utilities tools

Related Topics
An overview of the Spatial Statistics toolbox

Copyright © 1995-2014 Esri. All rights reserved.

Calculate Areas (Spatial Statistics)

Locate topic

Summary
Calculates area values for each feature in a polygon feature class.

Legacy: Because there are easier and more efficient ways to get the area of features, the Calculate
Areas tool will no longer be included with ArcGIS Pro. Use the Calculate Field tool or the
Geometry Calculator instead of the Calculate Areas tool in your workflows and custom script or
model tools.
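
As a sketch of the suggested replacement, the following uses the Calculate Field tool with a geometry expression to populate an area field. The field name AREA_GEO and the input path are illustrative choices, not part of this tool's documentation.

import arcpy

# Add a field and fill it with each polygon's area from the shape geometry property.
arcpy.AddField_management("c:/data/tracts.shp", "AREA_GEO", "DOUBLE")
arcpy.CalculateField_management("c:/data/tracts.shp", "AREA_GEO",
                                "!shape.area!", "PYTHON_9.3")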

Illustration


Usage
 The F_AREA field created in the Output Feature Class will be populated with values for the area of each polygon feature in square
units of the Output Coordinate System.
 There are alternative methods for creating an Area field for polygon features, including the Calculate Field tool and the Geometry Calculator.
 The Output Feature Class is a copy of the Input Feature Class with the additional (or updated) F_AREA field containing polygon areas.
 This tool is useful for determining a weight for intra-zonal interaction.
 This tool can be used to calculate an Area value for a study area polygon. The Average Nearest Neighbor tool, for example, has an
Area parameter.

Caution: The F_AREA field is created in the Output Feature Class to store calculated Area values. If a
field of this name already exists in the Input Feature Class, it will be overwritten in the
Output Feature Class.

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Syntax
CalculateAreas_stats (Input_Feature_Class, Output_Feature_Class)
Input_Feature_Class (Feature Layer)
    The input polygon feature class.
Output_Feature_Class (Feature Class)
    The output feature class. This feature class is a copy of the input feature class with the field F_AREA added (or updated). The F_AREA field contains the polygon area.

Code Sample
CalculateAreas Example (Python Window)
The following Python Window script demonstrates how to use the CalculateAreas tool.

import arcpy
arcpy.env.workspace = "c:/data"
arcpy.CalculateAreas_stats("tracts.shp", "tracts_with_area_field.shp")

CalculateAreas Example (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the CalculateAreas tool.

# Calculate AREA values

# Import system modules
import arcpy

# Local variables...
workspace = "C:/data"
input = "tracts.shp"
calculate_output = "tracts_with_area_field.shp"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Process: Calculate Areas...
    arcpy.CalculateAreas_stats(input, calculate_output)

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Output_coordinate_system, Current_workspace, Scratch_workspace, Qualified_field_names, Output_has_M_values, Output_has_Z_values,
Default_output_Z_value

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis.

Copyright © 1995-2014 Esri. All rights reserved.

Calculate Distance Band from Neighbor Count (Spatial Statistics)


Locate topic

Summary
Returns the minimum, the maximum, and the average distance to the specified Nth nearest neighbor (N is an input parameter) for a set of
features. Results are accessible from the Results window.

Illustration

Usage
 Given a set of features, this tool returns three numbers: the minimum, the maximum, and the average distance to a specified
number of neighbors (N). Example: if you specify 8 for the Neighbors parameter, this tool creates a list of distances between every
feature and its 8th nearest neighbor; from this list of distances it then calculates the minimum, maximum, and average distance.
 The maximum value is the distance you would have to travel away from each feature to ensure every feature has at least N
neighbors.
 The minimum value is the distance you would travel away from each feature to ensure that at least one feature has N neighbors.
 The average value is the average distance you would travel away from each feature to find its N nearest neighbors.
 The output from this tool is written as messages to the Results window. Right-click on the Messages entry and select View to see
results in a Message dialog box.

 Some tools, such as Hot Spot Analysis (Getis-Ord Gi*) and Spatial Autocorrelation (Global Moran's I), allow you to specify a
neighborhood Distance Band or Threshold Distance value. By using the Maximum Distance output value from this tool for the
Distance Band or Threshold Distance parameter, you ensure every feature in the input feature class has at least N neighbors.

 This tool provides one strategy for determining a Distance Band or Threshold Distance value to use with tools in the Spatial Statistics
toolbox such as Hot Spot Analysis (Getis-Ord Gi*) or Cluster and Outlier Analysis (Anselin Local Moran's I). See Selecting a Fixed Distance for
additional strategies.
 The distances returned by this tool are in the units of the geoprocessing environment Output_Coordinate_System.
 When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the
output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal
distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances,
at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points
on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to
connect those two points. Chordal distances are reported in meters. (A spherical sketch of this computation follows these usage notes.)

Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal
distances are not a good estimate of geodesic distances beyond 30 degrees.

 For line and polygon features, feature centroids are used in distance computations. For multipoints, polylines, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
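
The chordal distance described above can be sketched as follows. This is a spherical approximation written for illustration only; the tool itself works from an oblate spheroid, so treat the numbers as rough estimates.

import math

EARTH_RADIUS_M = 6371000.0  # mean earth radius; an assumption for this sketch

def to_cartesian(lat_deg, lon_deg, radius=EARTH_RADIUS_M):
    # Convert a latitude/longitude pair to 3D Cartesian coordinates on the sphere.
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (radius * math.cos(lat) * math.cos(lon),
            radius * math.cos(lat) * math.sin(lon),
            radius * math.sin(lat))

def chordal_distance(p1, p2):
    # The chord is the straight line through the earth connecting the two points.
    (x1, y1, z1), (x2, y2, z2) = to_cartesian(*p1), to_cartesian(*p2)
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)

# Two points about one degree of latitude apart: the chord is close to the ~111 km geodesic.
print(chordal_distance((34.0, -118.0), (35.0, -118.0)))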

Syntax
CalculateDistanceBand_stats (Input_Features, Neighbors, Distance_Method)

Input_Features (Feature Layer)
    The feature class or layer used to calculate distance statistics.
Neighbors (Long)
    The number of neighbors (N) to consider for each feature. This number should be any integer between one and the total number of features in the feature class. A list of distances between each feature and its Nth neighbor is compiled, and the maximum, minimum, and average distances are output to the Results window.
Distance_Method (String)
    Specifies how distances are calculated from each feature to neighboring features.
    EUCLIDEAN_DISTANCE —The straight-line distance between two points (as the crow flies)
    MANHATTAN_DISTANCE —The distance between two points measured along axes at right angles (city block); calculated by summing the (absolute) difference between the x- and y-coordinates

Code Sample
CalculateDistanceBandfromNeighborCount Example (Python Window)
The following Python Window script demonstrates how to use the CalculateDistanceBandfromNeighborCount tool.

import arcpy
arcpy.env.workspace = "c:/data"
mindist, avgdist, maxdist = arcpy.CalculateDistanceBand_stats("Blocks", 10, "EUCLIDEAN_DISTANCE")

CalculateDistanceBandfromNeighborCount Example (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the CalculateDistanceBandfromNeighborCount tool.

# import module
import arcpy

# Set geoprocessing environment Workspace
arcpy.env.workspace = "c:/data"

# Set variables
infc = "Blocks"
field = "POP2000"
outfc = "PopHotSpots"
neighbors = 10

# Run the CalculateDistanceBand tool to get a distance for use with the Hot Spot tool from the tool result object
mindist, avgdist, maxdist = arcpy.CalculateDistanceBand_stats(infc, neighbors, "EUCLIDEAN_DISTANCE")

# Run the Hot Spot Analysis tool, using the maxdist output from the Calculate Distance Band tool as an input
arcpy.HotSpots_stats(infc, field, outfc, "Fixed Distance Band", "EUCLIDEAN_DISTANCE",
                     "None", maxdist)

Environments
Current_workspace, Scratch_workspace, Output_coordinate_system, Geographic_transformations

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference. When the Output Coordinate System is based on degrees, minutes, and seconds,
geodesic distances are estimated using chordal distances.

Related Topics
An overview of the Utilities toolset
Hot Spot Analysis (Getis-Ord Gi*)
Cluster and Outlier Analysis (Anselin Local Moran's I)
Spatial Autocorrelation (Global Moran's I)
High/Low Clustering (Getis-Ord General G)
Using the Results window

Copyright © 1995-2014 Esri. All rights reserved.

Collect Events (Spatial Statistics)

Locate topic

Summary
Converts event data, such as crime or disease incidents, to weighted point data.

Illustration


Usage
 Collect Events combines coincident points: it creates a new Output Feature Class containing all of the unique locations found in the
Input Feature Class. It then adds a field named ICOUNT to hold the sum of all incidents at each unique location.
 This tool will only combine features that have the exact same X and Y centroid coordinates. You may want to use the Integrate tool to
snap nearby features together prior to running the Collect Events tool.

Caution: The Integrate tool permanently alters feature geometry; always make a backup copy of
your feature class prior to using Integrate.

 The Hot Spot Analysis (Getis-Ord Gi*), Cluster and Outlier Analysis (Anselin Local Moran's I), and Spatial Autocorrelation (Global Moran's I) tools, for
example, require weighted points rather than individual incidents. Collect Events can be used to create weights when the input
feature class contains coincident features.
 Although this tool will work with polygon or line data, it is really only appropriate for event, incident, or other point feature data. For
line and polygon features, feature coincidence is based on feature true geometric centroids. For multipoint, polyline, or polygons with
multiple parts, the centroid is computed using the weighted mean center of all feature parts. The weighting for point features is 1, for
line features is length, and for polygon features is area.
 When the Input Feature Class is not projected (that is, when coordinates are given in degrees, minutes, and seconds) or when the
output coordinate system is set to a Geographic Coordinate System, distances are computed using chordal measurements. Chordal
distance measurements are used because they can be computed quickly and provide very good estimates of true geodesic distances,
at least for points within about thirty degrees of each other. Chordal distances are based on an oblate spheroid. Given any two points
on the earth's surface, the chordal distance between them is the length of a line, passing through the three-dimensional earth, to
connect those two points. Chordal distances are reported in meters.

Caution: Be sure to project your data if your study area extends beyond 30 degrees. Chordal
distances are not a good estimate of geodesic distances beyond 30 degrees.

 If you want each individual point/part of multipoint/multipart data treated as singlepart features, run the Multipart to Singlepart tool,
then run Collect Events on the single part feature class. For more information, see Processing Multipoint Data.
 In addition to the Output Feature Class, this function passes, as derived output values, the name of the count field and the maximum
count value encountered for any one location. These derived output values are helpful when you use this tool in models or scripts (a minimal sketch of retrieving them follows these usage notes).
 When this tool runs in ArcMap, the output feature class is automatically added to the Table of Contents (TOC) with default rendering
applied to the ICOUNT field. The graduated circle rendering scheme is defined by a layer file in
<ArcGIS>/ArcToolbox/Templates/Layers. You can reapply the default rendering, if needed, by importing the template layer
symbology.
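
A small sketch of picking those derived output values up from the tool's result object in a script; rather than assuming a fixed position for the count field name or the maximum count, it simply walks every output (the 911 file names match the samples below).

import arcpy
arcpy.env.workspace = "C:/Data"

result = arcpy.CollectEvents_stats("911Copied.shp", "911Count.shp")

# The first output is the weighted point feature class; the remaining outputs hold the
# derived values (the count field name and the maximum count). Print them all to inspect.
for i in range(result.outputCount):
    print(result.getOutput(i))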

Syntax
CollectEvents_stats (Input_Incident_Features, Output_Weighted_Point_Feature_Class)

Input_Incident_Features (Feature Layer)
    The features representing event or incident data.
Output_Weighted_Point_Feature_Class (Feature Class)
    The output feature class to contain the weighted point data.

Code Sample
CollectEvents Example (Python Window)
The following Python Window script demonstrates how to use the Collect Events tool.

import arcpy
arcpy.env.workspace = "C:/Data"
arcpy.CollectEvents_stats("911Copied.shp", "911Count.shp")

CollectEvents Example (Stand-alone Python script)


The following stand-alone Python script demonstrates how to use the Collect Events tool.


# Analyze the spatial distribution of 911 calls in a metropolitan area
# using the Hot-Spot Analysis Tool (Local Gi*)

# Import system modules
import arcpy

# Set the environment setting to overwrite existing output, by default
arcpy.env.overwriteOutput = True

# Local variables...
workspace = "C:/Data"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Copy the input feature class and integrate the points to snap
    # together at 500 feet
    # Process: Copy Features and Integrate
    cf = arcpy.CopyFeatures_management("911Calls.shp", "911Copied.shp",
                                       "#", 0, 0, 0)

    integrate = arcpy.Integrate_management("911Copied.shp #", "500 Feet")

    # Use Collect Events to count the number of calls at each location
    # Process: Collect Events
    ce = arcpy.CollectEvents_stats("911Copied.shp", "911Count.shp")

    # Add a unique ID field to the count feature class
    # Process: Add Field and Calculate Field
    af = arcpy.AddField_management("911Count.shp", "MyID", "LONG", "#", "#", "#", "#",
                                   "NON_NULLABLE", "NON_REQUIRED", "#")

    cf = arcpy.CalculateField_management("911Count.shp", "MyID", "[FID]", "VB")

    # Create Spatial Weights Matrix for Calculations
    # Process: Generate Spatial Weights Matrix...
    swm = arcpy.GenerateSpatialWeightsMatrix_stats("911Count.shp", "MYID",
                                                   "euclidean6Neighs.swm",
                                                   "K_NEAREST_NEIGHBORS",
                                                   "#", "#", "#", 6,
                                                   "NO_STANDARDIZATION")

    # Hot Spot Analysis of 911 Calls
    # Process: Hot Spot Analysis (Getis-Ord Gi*)
    hs = arcpy.HotSpots_stats("911Count.shp", "ICOUNT", "911HotSpots.shp",
                              "GET_SPATIAL_WEIGHTS_FROM_FILE",
                              "EUCLIDEAN_DISTANCE", "NONE",
                              "#", "#", "euclidean6Neighs.swm")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Output_coordinate_system, Geographic_transformations, Current_workspace, Scratch_workspace, Output_has_M_values, M_resolution,
M_tolerance, Output_has_Z_values, Default_output_Z_value, Z_resolution, Z_tolerance, XY_resolution, XY_tolerance

Output_coordinate_system
Feature geometry is projected to the Output Coordinate System prior to analysis. All mathematical computations are based on the
Output Coordinate System spatial reference.

Related Topics
An overview of the Utilities toolset
Hot Spot Analysis (Getis-Ord Gi*)
Cluster and Outlier Analysis (Anselin Local Moran's I)
Spatial Autocorrelation (Global Moran's I)
High/Low Clustering (Getis-Ord General G)
Integrate

Copyright © 1995-2014 Esri. All rights reserved.

Convert Spatial Weights Matrix to Table (Spatial Statistics)

Locate topic

Summary


Converts a binary spatial weights matrix file (.swm) to a table.

Illustration

SWM files may be converted to .dbf tables and edited.

Usage
 This tool allows you to edit a spatial weights matrix file, if necessary:
1. Create a spatial weights matrix file using the Generate Spatial Weights Matrix or Generate Network Spatial Weights tool.
2. Convert the resultant spatial weights matrix file to a table using this tool.
3. Edit the table and modify the spatial relationships as desired (a minimal sketch follows this list).
4. Use the Generate Spatial Weights Matrix tool to convert the modified table back to the binary spatial weights matrix file format.
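
A minimal sketch of step 3, editing the converted table with an update cursor. The field names MYID, NID, and WEIGHT and the ID values are assumptions for illustration; check the fields of your own converted table before adapting it.

import arcpy

# Double the weight stored for the neighbor pair (17, 23) in the converted table.
with arcpy.da.UpdateCursor("c:/data/euclidean6Neighs.dbf",
                           ["MYID", "NID", "WEIGHT"]) as cursor:
    for my_id, neighbor_id, weight in cursor:
        if my_id == 17 and neighbor_id == 23:
            cursor.updateRow([my_id, neighbor_id, weight * 2.0])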

Syntax
ConvertSpatialWeightsMatrixtoTable_stats (Input_Spatial_Weights_Matrix_File, Output_Table)
Input_Spatial_Weights_Matrix_File (File)
    The full pathname for the spatial weights matrix file (.swm) you want to convert.
Output_Table (Table)
    A full pathname to the table you want to create.

Code Sample
Convert Spatial Weights Matrix to Table Example (Python Window)
The following Python Window script demonstrates how to use the Convert Spatial Weights Matrix to Table tool.

import arcpy
arcpy.env.workspace = "c:/data"
arcpy.ConvertSpatialWeightsMatrixtoTable_stats("euclidean6Neighs.swm","euclidean6Neighs.dbf")

Convert Spatial Weights Matrix to Table Example (Stand-alone Python script)


The following stand-alone Python script demonstrates how to use the Convert Spatial Weights Matrix to Table tool.


# Create a Spatial Weights Matrix, convert it to a table for editing, and convert it back

# Import system modules
import arcpy

# Set the environment setting to overwrite existing output
arcpy.env.overwriteOutput = True

# Local variables...
workspace = r"C:\Data\USCounties\US"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Create Spatial Weights Matrix
    # Process: Generate Spatial Weights Matrix...
    swm = arcpy.GenerateSpatialWeightsMatrix_stats("USCounties.shp", "MYID",
                                                   "euclidean6Neighs.swm",
                                                   "K_NEAREST_NEIGHBORS",
                                                   "#", "#", "#", 6)

    # Dump Spatial Weights to Database Table
    # Process: Convert Spatial Weights Matrix to Table...
    dbf = arcpy.ConvertSpatialWeightsMatrixtoTable_stats("euclidean6Neighs.swm",
                                                         "euclidean6Neighs.dbf")

    # Now you can edit the spatial weights (add, subtract and alter
    # neighbors and weights)

    # Read weights from table back into Spatial Weights Matrix format
    # Process: Generate Spatial Weights Matrix...
    swm = arcpy.GenerateSpatialWeightsMatrix_stats("USCounties.shp", "MYID",
                                                   "euclidean6Neighs.swm",
                                                   "CONVERT_TABLE",
                                                   "#", "#", "#",
                                                   "#", "#", "#",
                                                   "euclidean6Neighs.dbf")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Current_workspace, Scratch_workspace

Related Topics
An overview of the Utilities toolset
Generate Spatial Weights Matrix
Generate Network Spatial Weights
Modeling spatial relationships
Spatial Autocorrelation (Global Moran's I)
High/Low Clustering (Getis-Ord General G)
Cluster and Outlier Analysis (Anselin Local Moran's I)
Hot Spot Analysis (Getis-Ord Gi*)

Copyright © 1995-2014 Esri. All rights reserved.

Export Feature Attribute To ASCII (Spatial Statistics)

Locate topic

Summary
Exports feature class coordinates and attribute values to a space, comma, or semicolon-delimited ASCII text file.

Illustration

Coordinates (X and Y) and user-specified feature attributes are written to an ASCII text file.


Usage
 This tool may be used to export data for analysis with external software packages (a minimal sketch of reading the export back follows these usage notes).
 The X and Y coordinate values are written to the text file with eight significant digits of precision. Floating-point attribute values are
written to the text file with six significant digits.
 If this tool is part of a custom model tool, the output text file will only appear in the Results window if it is set as a model parameter
prior to running the tool.
 When null values are encountered for a field value, they will be written to the output text file as NULL.
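
As a sketch of the external-analysis use mentioned in the first usage point, the following reads a space-delimited export created with ADD_FIELD_NAMES back into plain Python. The file name and the HEPRATE field are assumptions drawn from the samples below.

import csv

values = []
with open("c:/data/aidsbycacnty.txt") as f:
    reader = csv.DictReader(f, delimiter=" ")
    for row in reader:
        if row["HEPRATE"] == "NULL":  # null attribute values are written as NULL
            continue
        values.append(float(row["HEPRATE"]))

print(len(values))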

Caution: When using shapefiles, keep in mind that they cannot store null values. Tools or other
procedures that create shapefiles from nonshapefile inputs may store or interpret null values
as zero. In some cases, nulls are stored as very large negative values in shapefiles. This can
lead to unexpected results. See Geoprocessing considerations for shapefile output for more
information.

Syntax
ExportXYv_stats (Input_Feature_Class, Value_Field, Delimiter, Output_ASCII_File, Add_Field_Names_to_Output)

Input_Feature_Class (Feature Layer)
    The feature class from which to export feature coordinates and attribute values.
Value_Field [Value_Field,...] (Field)
    The field or fields in the input feature class containing the values to export to an ASCII text file.
Delimiter (String)
    Specifies how feature coordinates and attribute values will be separated in the output ASCII file.
    SPACE —Feature coordinates and attribute values will be separated by a space in the output.
    COMMA —Feature coordinates and attribute values will be separated by a comma in the output.
    SEMI-COLON —Feature coordinates and attribute values will be separated by a semicolon in the output.
Output_ASCII_File (File)
    The ASCII text file that will contain the feature coordinate and attribute values.
Add_Field_Names_to_Output (Boolean)
    NO_FIELD_NAMES —No field names will be included in the output text file (default).
    ADD_FIELD_NAMES —Field names will be written to the output text file.

Code Sample
ExportFeatureAttributeToASCII example (Python window)
The following Python Window script demonstrates how to use the ExportFeatureAttributeToASCII tool.

import arcpy
arcpy.env.workspace = r"c:\data"
arcpy.ExportXYv_stats("AidsByCaCnty.shp","HEPRATE", "SPACE","aidsbycacnty.txt","ADD_FIELD_NAMES")

ExportFeatureAttributeToASCII example (stand-alone Python script)


The following stand-alone Python script demonstrates how to use the ExportFeatureAttributeToASCII tool.

# Export feature locations and attributes to an ASCII text file

# Import system modules
import arcpy

# Local variables...
workspace = "c:/data"
input_features = "AidsByCaCnty.shp"
export_ASCII = "aidsbycacnty.txt"

try:
    # Set the current workspace (to avoid having to specify the full path to the feature classes each time)
    arcpy.env.workspace = workspace

    # Process: Export Feature Attribute to ASCII...
    arcpy.ExportXYv_stats(input_features, "HEPRATE", "SPACE", export_ASCII, "NO_FIELD_NAMES")

except:
    # If an error occurred when running the tool, print out the error message.
    print arcpy.GetMessages()

Environments
Current_workspace, Scratch_workspace, Output_coordinate_system, Geographic_transformations

Related Topics
An overview of the Utilities toolset


An overview of the Spatial Statistics toolbox


Add XY Coordinates

Copyright © 1995-2014 Esri. All rights reserved.
