Statistical analysis

Inherent in GIS data is information on the attributes of features as well as their locations. This information is used to create maps that can be visually analyzed. Statistical analysis helps you extract additional information from your GIS data that might not be obvious simply by looking at a map—information such as how attribute values are distributed, whether there are spatial trends in the data, or whether the features form spatial patterns. Unlike query functions—such as identify or selection, which provide information about individual features—statistical analysis reveals the characteristics of a set of features as a whole.

Some of the statistical analysis techniques described in this document are best suited to interactive applications, such as ArcMap, that allow you to select and visualize data in an ad hoc and fluid environment. Some of the methods described here are found in ArcMap's menus and toolbars and have no geoprocessing tool counterpart. Other methods, such as the spatial statistics tools, are implemented only as geoprocessing tools.

Uses of statistical analysis

Statistical analysis is often used to explore your data—for example, to examine the distribution of values for a particular attribute or to spot outliers (extreme high or low values). Having this information is useful when defining classes and ranges on a map, when reclassifying data, or when looking for data errors.

In the example below, statistics have been calculated for the distribution of senior citizens by census tract in this region (percentage of those aged 65 and over in each tract), including the mean and standard deviation, as well as a histogram showing the distribution of values. Most tracts have a lower percentage of seniors than the mean, but a few tracts have a very high percentage.

Summary statistics and histogram complement symbology
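The kind of exploration described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the ArcMap Statistics window itself; the tract percentages, the bin width, and the `histogram` helper are assumptions made for the example.

```python
import statistics

# Hypothetical percentages of residents aged 65 and over, one per census tract.
pct_seniors = [4.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0, 21.0, 25.5]

mean = statistics.mean(pct_seniors)
std_dev = statistics.pstdev(pct_seniors)  # population standard deviation


def histogram(values, bin_width):
    """Count values per bin; each bin covers [start, start + bin_width)."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


bins = histogram(pct_seniors, 5.0)

# Number of tracts whose percentage falls below the mean.
below_mean = sum(1 for v in pct_seniors if v < mean)
```

With this sample, most tracts fall below the mean because two tracts with very high percentages pull the mean upward, which is the pattern the histogram makes visible.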

Another use of statistical analysis is to summarize data. Often, this is done for categories, such as calculating the total area in each land-use category. You can also create spatial summaries, such as calculating the average elevation for each watershed. Summary data is useful for gaining a better understanding of conditions in a study area.

In the example below, summary statistics have been calculated for each land-use class showing the number of parcels in that class, the size of the smallest and largest parcel, the average parcel size, and the total area in the class.

Parcel feature size may vary with land-use class; statistics can show the pattern.
Summary statistics can reveal patterns in data.
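As a rough sketch of this per-category summary, the following Python groups parcels by land-use class and computes the same five statistics. The parcel records and the `summarize_by_class` function are hypothetical; in ArcGIS the equivalent result would come from the Summarize option or a statistics tool rather than from code like this.

```python
from collections import defaultdict

# Hypothetical parcels: (land-use class, area in hectares).
parcels = [
    ("Residential", 0.5), ("Residential", 1.0), ("Residential", 1.5),
    ("Commercial", 2.0), ("Commercial", 4.0),
    ("Agricultural", 40.0),
]


def summarize_by_class(rows):
    """Return count, min, max, mean, and sum of area for each class."""
    groups = defaultdict(list)
    for land_use, area in rows:
        groups[land_use].append(area)
    return {
        land_use: {
            "count": len(areas),
            "min": min(areas),
            "max": max(areas),
            "mean": sum(areas) / len(areas),
            "sum": sum(areas),
        }
        for land_use, areas in groups.items()
    }


summary = summarize_by_class(parcels)
```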

Statistical analysis is also used to identify and confirm spatial patterns, such as the center of a group of features, the directional trend, or whether features form clusters. While patterns may be apparent on a map, trying to draw conclusions from a map can be difficult—how you classify and symbolize the data can obscure or overemphasize patterns. Statistical functions analyze the underlying data and give you a measure that can be used to confirm the existence and strength of the pattern.

Below is an example of analyses that show the mean center of a set of burglaries, and the standard deviation ellipse for a set of moose sightings (showing the directional trend).

Spatial statistics can show geographic patterns or trends.

Below is an example of an analysis that shows statistically significant clusters of census tracts with many senior citizens (orange) or few (blue).

Spatial statistics can show geographic patterns or trends.

Types of statistical analysis

Statistical analysis functions in ArcGIS for Desktop are either nonspatial (tabular) or spatial (containing location).

Nonspatial statistics are used to analyze attribute values associated with features. The values are accessed directly from a layer's feature attribute table. Examples of nonspatial statistics include the mean and standard deviation.

In this example, the Summary Statistics tool was used to summarize the number of vacant parcels across a set of census tracts, reporting the total, the mean, and the standard deviation.

Summary statistics

Charts and graphs, such as a histogram or Q-Q plots, are another way of analyzing nonspatial data. In all cases, only the values are analyzed. The locations of the features with which the values are associated—and any spatial relationships between the features—are not considered.

In this example, the histogram shows the distribution of vacant parcels (the number of vacant parcels along the x-axis and the number of tracts in each range along the y-axis).

Histograms show the distribution of data values.

A Normal Q-Q plot is used to assess how closely the distribution of a set of values matches a standard normal distribution (the typical bell curve, when shown on a histogram). The line on the Normal Q-Q plot shows the values expected for a normal distribution: the closer the plotted values are to the line, the closer the distribution is to normal. In this example, the concentration of the element phosphorus in a set of soil samples is close to normally distributed.

A Normal Q-Q plot compares data value distributions to a normal distribution.
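To make the idea concrete, the points of a Normal Q-Q plot can be computed by pairing each sorted value with the standard normal quantile expected at its rank. The sketch below is an illustration only and does not reproduce the Geostatistical Analyst tool: the sample values are invented, the plotting position `(rank - 0.5) / n` is one common convention among several, and the reference line through the mean with the sample standard deviation as slope is an assumption of this example.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(values):
    """Pair each sorted value with the standard normal quantile for its rank."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [
        (std_normal.inv_cdf((rank - 0.5) / n), value)
        for rank, value in enumerate(ordered, start=1)
    ]


def reference_line(values):
    """Return (intercept, slope) of the line a normal sample would follow."""
    return mean(values), stdev(values)


# Hypothetical phosphorus concentrations from five soil samples.
phosphorus = [12.1, 14.8, 15.2, 16.0, 17.3]

points = normal_qq_points(phosphorus)
intercept, slope = reference_line(phosphorus)

# Vertical distance of each observed value from the reference line.
residuals = [observed - (intercept + slope * z) for z, observed in points]
```

Small residuals suggest the values are close to normally distributed; large residuals at either end of the plot point to skewed or heavy tails.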

The Normal QQ Plot tool is one of the data exploration tools available with the Geostatistical Analyst extension.

Spatial statistics, on the other hand, focus on the spatial relationships between features—how compact or dispersed the features are, whether they're oriented in a particular direction, and whether they form clusters. The spatial relationship is usually defined as distance (how far apart features are) but can also be other forms of interaction between features.

In the example below, the output of the Standard Distance tool (displayed graphically as a circle) is calculated using the distance of each wildlife sighting from the calculated center of the sightings.

Standard distance and mean center of a group of points
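The calculation behind a mean center and a standard distance is short enough to sketch directly. The Python below assumes planar (projected) coordinates and uses hypothetical sighting locations; it illustrates the arithmetic and is not the Standard Distance tool itself.

```python
import math


def mean_center(points):
    """Average x and average y of a set of (x, y) points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def standard_distance(points):
    """Root mean squared distance of the points from their mean center."""
    cx, cy = mean_center(points)
    n = len(points)
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return math.sqrt(squared / n)


# Hypothetical wildlife sightings in projected map units.
sightings = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]

center = mean_center(sightings)
radius = standard_distance(sightings)  # radius of the circle drawn on the map
```

A smaller radius means the features are more tightly concentrated around their center.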

Some spatial statistics consider both the spatial relationships of features and the values of an attribute associated with the features. These are known as weighted statistics—the spatial relationship is influenced by the values. Weighted spatial statistics are used to find out if features having similar values occur together—if, for example, schools with similarly high or low test scores form clusters.

In the example below, the center of parks is weighted by the number of visitors at each park (represented by the size of the green circles).

Weighted mean center of points
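A weighted mean center simply multiplies each coordinate by its weight before averaging. The following is a minimal sketch with made-up park locations and visitor counts; `weighted_mean_center` is a hypothetical helper, not an ArcGIS function.

```python
def weighted_mean_center(points, weights):
    """Mean center in which each point counts in proportion to its weight."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return (x, y)


# Hypothetical park locations and annual visitor counts.
parks = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
visitors = [100, 300, 600]

center = weighted_mean_center(parks, visitors)
```

Here the center is pulled toward the third park because it has the most visitors, whereas the unweighted mean center would sit at roughly (3.3, 3.3).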

Statistical functions can also be classified by whether they're descriptive or inferential. Descriptive statistics summarize some characteristic of the values or features you're analyzing—the mean value, the frequency distribution of values, or the directional trend of a group of features. Descriptive statistics are often useful for comparing two sets of features for the same area.

The example below compares the distribution of senior citizens (top) to that of children under 5 (bottom) for the same set of census tracts.

Histograms and summary statistics are a way to compare populations.

In the example below, the standard distance circles for the American Indian and African American population show that the distribution of the African American population in this area is much more compact.

Standard distance and mean centers are a way to compare populations.

Inferential statistics use probability theory either to predict the likely occurrence of values (using a set of known values) or to assess the likelihood that a pattern or trend you see in the data is not due to chance. The function provides a measure of the pattern or relationship. You then perform a statistical test on this measure to determine whether it is significant at some level of confidence. If the statistical analysis indicates that burglaries occur in clusters, you would then run a test to find the probability that the clusters arose by chance. You might find, for example, that there's a 90 percent likelihood that the clusters didn't occur by chance, indicating the burglaries may be linked in some way. Essentially, to determine the probability, the test compares the measure you get for the existing features to the measure you'd expect to get for the same number of features spread over the same area but distributed randomly.

In the example below, the map on the left shows clusters of census tracts having a high number of senior citizens (orange) or a low number (blue), at a 90 percent level of probability; the right map shows clusters at a 99 percent level of probability.

Compare the detected clustering at different levels of probability.
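One way to see how such a test works is the average nearest neighbor statistic, which compares the observed mean distance between nearest neighbors with the distance expected for the same number of points scattered randomly over the same area. The sketch below uses the commonly published formulas for the expected distance and its standard error; the burglary coordinates and the study area are invented, and the code is an illustration rather than the Average Nearest Neighbor tool's implementation.

```python
import math
from statistics import NormalDist


def average_nearest_neighbor(points, area):
    """Return (ratio, z_score, p_value) for a set of (x, y) points."""
    n = len(points)
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    observed = sum(nearest) / n
    # Expected mean distance and standard error under a random pattern.
    expected = 0.5 / math.sqrt(n / area)
    std_error = 0.26136 / math.sqrt(n * n / area)
    z_score = (observed - expected) / std_error
    # Two-tailed probability of a z-score this extreme arising by chance.
    p_value = 2 * (1 - NormalDist().cdf(abs(z_score)))
    return observed / expected, z_score, p_value


# Hypothetical burglaries forming two tight groups in a 100 x 100 study area.
burglaries = [(10, 10), (11, 10), (10, 11), (11, 11),
              (80, 80), (81, 80), (80, 81), (81, 81)]

ratio, z_score, p_value = average_nearest_neighbor(burglaries, area=100 * 100)
```

A ratio below 1 with a strongly negative z-score indicates clustering; a p-value below 0.10 corresponds to the 90 percent level mentioned above, and a p-value below 0.01 to the 99 percent level.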

Statistical analysis functions

The statistical functions in ArcGIS for Desktop are located in ArcMap, ArcCatalog, and geoprocessing, as well as within two extensions: Spatial Analyst and Geostatistical Analyst.

Table statistics

A core set of descriptive statistics that summarize the values for a single field is available from several locations in ArcGIS for Desktop—the table window in ArcMap, the table preview tab in ArcCatalog, and the Statistics toolset (within the Analysis toolbox).

| Function | Location | Statistics | Output |
|---|---|---|---|
| Statistics menu option | ArcMap table window or ArcCatalog table preview tab | Count, Minimum, Maximum, Sum, Mean, Standard Deviation, Frequency histogram | Results are displayed in a window. |
| Summary Statistics tool | Analysis Toolbox/Statistics Toolset | Minimum, Maximum, Sum, Mean, Standard Deviation, Range, First, Last | Results are written to a new table. |

Table of core summary statistics functions for a single field

To summarize a field by one or more other fields (for example, to count the number of parcels in each land-use class, sum the area in each land-use class, or find the average parcel size in each class), use the Summarize option on the ArcMap table window or the Frequency tool in the Statistics toolset in the Analysis toolbox.

| Function | Location | Statistics | Output |
|---|---|---|---|
| Summarize menu option | ArcMap table window (right-click field name) | Minimum, Maximum, Average (mean), Sum, Standard Deviation, Variance | Results are written to a new table. |
| Frequency tool | Analysis Toolbox/Statistics Toolset | Count, Sum | Results are written to a new table. |

Table of core summary statistics functions for more than one field

Spatial Statistics

The Spatial Statistics toolbox contains a number of statistical routines for analyzing the distribution of a set of features, analyzing patterns, and identifying clusters.

| Functional Area | Toolset | Tools |
|---|---|---|
| Geographic distribution measurements | Measuring Geographic Distributions | Mean Center, Central Feature, Standard Distance, Directional Distribution (Standard Deviational Ellipse), Linear Directional Mean |
| Geographic pattern analysis | Analyzing Patterns | Average Nearest Neighbor, Spatial Autocorrelation (Moran's I), High/Low Clustering (Getis-Ord General G) |
| Geographic cluster analysis | Mapping Clusters | Cluster and Outlier Analysis (Anselin Local Moran's I), Hot Spot Analysis (Getis-Ord Gi*) |
| Regression analysis | Modeling Spatial Relationships | Ordinary Least Squares, Exploratory Regression, Geographically Weighted Regression |

Spatial Statistics tool functions and locations
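To give a feel for what a pattern analysis tool measures, here is a minimal sketch of the global Moran's I index, the statistic behind the Spatial Autocorrelation tool. It uses the standard textbook formula with a simple binary neighbor matrix; the values, the weights, and the `morans_i` function are assumptions for illustration, and the sketch omits the significance test that the tool also reports.

```python
def morans_i(values, weights):
    """Global Moran's I for values with a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n) for j in range(n)
    )
    return (n / s0) * cross / sum(d * d for d in dev)


# Four features in a row; each is a neighbor of the features beside it.
neighbors = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 1, 5, 5], neighbors)    # similar values adjacent
alternating = morans_i([1, 5, 1, 5], neighbors)  # dissimilar values adjacent
```

A positive index indicates that similar values tend to occur near each other, and a negative index indicates that neighboring values tend to differ.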

Raster statistics

The Spatial Analyst extension includes several statistical functions that can be used to analyze rasters, primarily to summarize attribute values and assign the summary statistics to cells in a new raster layer. These are located in several different toolsets within the Spatial Analyst toolbox.

| Tool | Location | Input | Output | What it does |
|---|---|---|---|---|
| Cell Statistics | Local Toolset | Multiple rasters | Raster | Calculates the specified statistic for each cell based on multiple inputs |
| Focal Statistics | Neighborhood Toolset | Raster | Raster | Summarizes the values for a raster within a defined neighborhood around each cell and assigns the value to that cell in the output raster |
| Point Statistics | Neighborhood Toolset | Point features | Raster | Summarizes values for point feature attributes within a defined neighborhood and assigns values to cells in the output raster |
| Line Statistics | Neighborhood Toolset | Line features | Raster | Summarizes values for line feature attributes within a defined neighborhood and assigns values to cells in the output raster |
| Zonal Statistics | Zonal Toolset | Raster or polygon features | Raster or summary table | Summarizes values of a raster surface by categories or classes (zones) of the input raster or polygon dataset |

Raster statistics tools summary table
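The difference between a focal and a zonal summary can be sketched on a tiny grid held as nested Python lists. This is an illustration of the two ideas only, not the Spatial Analyst tools: the elevation and watershed grids are invented, the neighborhood is fixed at 3 x 3, and the choice to average only the neighbors that exist at the raster edge is an assumption of this example.

```python
def focal_mean(raster):
    """Mean of the 3 x 3 neighborhood around each cell, clipped at the edges."""
    rows, cols = len(raster), len(raster[0])
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                raster[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ]
            out_row.append(sum(window) / len(window))
        out.append(out_row)
    return out


def zonal_mean(values, zones):
    """Mean of the value raster for each zone ID in the zone raster."""
    totals = {}
    for value_row, zone_row in zip(values, zones):
        for v, z in zip(value_row, zone_row):
            running_sum, count = totals.get(z, (0.0, 0))
            totals[z] = (running_sum + v, count + 1)
    return {z: s / n for z, (s, n) in totals.items()}


# Hypothetical elevation raster and watershed (zone) raster of the same shape.
elevation = [[1.0, 2.0, 3.0],
             [4.0, 5.0, 6.0],
             [7.0, 8.0, 9.0]]
watersheds = [[1, 1, 2],
              [1, 2, 2],
              [1, 2, 2]]

smoothed = focal_mean(elevation)              # one output value per cell
by_zone = zonal_mean(elevation, watersheds)   # one output value per watershed
```

The focal result is a new raster the same size as the input, while the zonal result is one summary value per zone, matching the average elevation per watershed example given earlier.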

Data exploration tools

The Geostatistical Analyst extension, while focusing on the creation of surfaces from a set of sample points, also contains a set of tools for visually exploring data values using charts and graphs. These are often used prior to surface creation to decide which parameters to use for a specific set of data but can also be used generally to explore your dataset. The tools allow you to explore the distribution of values, whether there is a directional trend in the data, and whether there are relationships between two attributes (for example, to see whether the values vary together or inversely). The tools are available from the Explore Data option on the Geostatistical Analyst toolbar.

10/29/2012