How Grouping Analysis works

Whenever we look at the world around us, it is very natural for us to organize, group, differentiate, and catalog what we see to help us make better sense of it; this type of mental classification process is fundamental to learning and comprehension. Similarly, to help you learn about and better comprehend your data, you can use the Grouping Analysis tool. It performs a classification procedure that tries to find natural clusters in your data. Given the number of groups to create, it will look for a solution where all the features within each group are as similar as possible, and all the groups themselves are as different as possible. Feature similarity is based on the set of attributes that you specify for the Analysis Fields parameter and may optionally incorporate spatial properties or space-time properties. When space or space-time Spatial Constraints are specified, the algorithm employs a connectivity graph (minimum spanning tree) to find natural groupings. When NO_SPATIAL_CONSTRAINT is specified, the Grouping Analysis tool uses a K Means algorithm.

While hundreds of cluster analysis algorithms such as these exist, all of them are classified as NP-hard. This means that the only way to ensure that a solution will perfectly maximize both within-group similarities and between-group differences is to try every possible combination of the features you want to group. While this might be feasible with a handful of features, the problem quickly becomes intractable.

Not only is it intractable to ensure that you've found an optimal solution, it is also unrealistic to try to identify a grouping algorithm that will perform best for all possible data scenarios. Groups come in all different shapes, sizes, and densities; attribute data can include a variety of ranges, symmetry, continuity, and measurement units. This explains why so many different cluster analysis algorithms have been developed over the past 50 years. It is most appropriate, therefore, to think of Grouping Analysis as an exploratory tool that can help you learn more about underlying structures in your data.

Potential applications

Some of the ways that this tool might be applied are

Inputs

This tool takes point, polyline, or polygon Input Features, a unique ID field, a path for the Output Feature Class, one or more Analysis Fields, an integer value representing the Number of Groups to create, and the type of Spatial Constraint—if any—that should be applied within the grouping algorithm. There are also a number of optional parameters including one that allows you to create a PDF Output Report File.

Analysis Fields

Select fields that are numeric reflecting ratio, interval or ordinal measurement systems. While Nominal data may be represented using dummy (binary) variables, these generally do not work as well as the other numeric variable types. For example, you could create a variable called Rural and assign to each feature (each census tract, for example) a 1 if it is mostly rural and a 0 if it is mostly urban. A better representation for this variable for use with Grouping Analysis, however, would be the amount or proportion of rural acreage associated with each feature.

You should select variables that you think will distinguish one group of features from another. Suppose, for example, you are interested in grouping school districts by student performance on standardized achievement tests. You might select Analysis Fields that include overall test scores, results for particular subjects like math or reading, the proportion of students meeting some minimum test score threshold, and so forth. When you run the Grouping Analysis tool, an R2 value is computed for each variable. In the summary below, for example, school districts are grouped based on student test scores, the percentage of adults in the area who didn't finish high school, per student spending, and average student-to-teacher ratios. Notice that the TestScores variable has the highest R2 value. This indicates that this variable divides the school districts into groups most effectively. The R2 value reflects how much of the variation in the original TestScores data was retained after the grouping process, so the larger the R2 value is for a particular variable, the better that variable is at discriminating among your features.

Grouping Analysis Variable Summary

Dive-inDive-in:

R2 is computed as:

(TSS - ESS) / TSS

where TSS is the total sum of squares and ESS is the explained sum of squares. TSS is calculated by squaring and then summing deviations from the global mean value for a variable. ESS is calculated the same way, except deviations are group by group: every value is subtracted from the mean value for the group it belongs to, then squared and summed.

Number of Groups

Sometimes you will know the number of groups most appropriate to your question or problem. If you have five sales managers and want to assign each to their own contiguous region, for example, you would use 5 for the Number of Groups parameter. In many cases, however, you won't have any criteria for selecting a specific number of groups; instead, you just want the number that best distinguishes feature similarities and differences. To help you in this situation, you can check on the Evaluate Optimal Number of Groups parameter and let the Grouping Analysis tool assess the effectiveness of dividing your features into 2, 3, 4, and up to 15 groups. Grouping effectiveness is measured using the Calinski-Harabasz pseudo F-statistic, which is a ratio reflecting within-group similarity and between-group differences:

Calinski-Harabasz pseudo F-statistic

Spatial Constraint

If you want the resultant groups to be spatially proximal, specify a spatial constraint. The CONTIGUITY options are enabled for polygon feature classes and indicate that features can only be part of the same group if they share an edge (CONTIGUITY_EDGES_ONLY) or if they share either an edge or a vertex (CONTIGUITY_EDGES_CORNERS) with another member of the group. The polygon contiguity options are not good choices, however, if your dataset includes clusters of discontiguous polygons or polygons with no contiguous neighbors at all:

Discontiguous Polygons

The DELAUNAY_TRIANGULATION and K_NEAREST_NEIGHBORS options are both appropriate for point or polygon features; these options indicate that a feature will only be included in a group if at least one other group member is a natural neighbor (Delaunay Triangulation) or a K Nearest Neighbor. If you select K_NEAREST_NEIGHBORS and enter a 12 for the Number of Neighbors parameter, for example, every feature in a group will be within 12 nearest neighbors of at least one other feature in the group.

The DELAUNAY_TRIANGULATION option shouldn't be used for datasets with coincident features. Also, because the Delaunay Triangulation method converts features to Theissen polygons to determine neighbor relationships, especially with polygon features and sometimes with peripheral features in your dataset, the results from using this option may not always be what you expect. In the illustration below, notice that some of the grouped original polygons are not contiguous; when they are converted to Theissen polygons, however, all the grouped features do, in fact, share an edge:

Delaunay Triangulation Example
For Delaunay Triangulation, Theissen polygon contiguity defines neighbor relationships.

If you want the resultant groups to be both spatially and temporally proximal, create a spatial weights matrix file (SWM) using the Generate_Spatial_Weights_Matrix tool and select SPACE_TIME_WINDOW for the Conceptualization of Spatial Relationships parameter. You can then specify the SWM file you created with the Generate Spatial Weights Matrix tool for the Weights Matrix File parameter when you run Grouping Analysis.

NoteNote:

While the spatial relationships among your features are stored in an SWM file and used by the Grouping Analysis tool to impose spatial constraints, there is no actual weighting involved in the grouping process. The SWM file is only used to keep track of which features can and cannot be included in the same group.

For many analyses, imposing a spatial or space-time constraint is neither required nor helpful. Suppose, for example, you want to group crime incidents by perpetrator attributes (height, age, severity of the crime, and so forth). While crimes committed by the same person may tend to be proximal, it is unlikely that you would find that all the crimes in a particular area were committed by the same person. For this type of analysis, you would select NO_SPATIAL_CONSTRAINT for the Spatial Constraints parameter. You might, however, elect to include some spatial variables (proximity to banks, for example) in your list of Analysis Fields to capture some of the spatial aspects of the crimes you're analyzing.

K Means

When you select NO_SPATIAL_CONSTRAINT for the Spatial Constraints parameter, a K Means algorithm is used for grouping. The goal of the K Means algorithm is to partition features so that the differences among the features in a group, over all groups, is minimized. Because the algorithm is NP-hard, a greedy heuristic is employed to group features. The greedy algorithm will always converge to a local minimum but will not always find the global (most optimal) minimum.

The K Means algorithm works by first identifying seed features used to grow each group. Consequently, the number of seeds will always match the Number of Groups. The first seed is selected randomly. Selection of remaining seeds, however, while still employing a random component, applies a weighting that favors selection of subsequent seeds farthest in data space from the existing set of seed features (this part of the algorithm is called K Means ++). Because of the random component in finding seed features whenever you select FIND_SEED_LOCATIONS or USE_RANDOM_SEEDS for the Initialization Method, you might get variations in grouping results from one run of the tool to the next.

Once the seed features are identified, all features are assigned to the closest seed feature (closest in data space). For each cluster of features, a mean data center is computed, and each feature is reassigned to the closest center. The process of computing a mean data center for each group and then reassigning features to the closest center continues until group membership stabilizes (up to a maximum number of 100 iterations).

Minimum Spanning Tree

When you specify a spatial constraint to limit group membership to contiguous or proximal features, the tool first constructs a connectivity graph representing the neighborhood relationships among features. From the connectivity graph, a minimum spanning tree is devised that summarizes both feature spatial relationships and feature data similarity. Features become nodes in the minimum spanning tree connected by weighted edges. The weight for each edge is proportional to the similarity of the objects it connects. After building the minimum spanning tree, a branch (edge) in the tree is pruned, creating two minimum spanning trees. The edge to be pruned is selected so that it minimizes dissimilarity in the resultant groups, while avoiding (if possible) singletons (groups with only one feature). At each iteration one of the minimum spanning trees is divided by this pruning process until the Number of Groups specified is obtained. The published method employed is called SKATER (Spatial "K"luster Analysis by Tree Edge Removal). While the branch that optimizes group similarity is selected for pruning at each iteration, there is no guarantee that the final result will be optimal.

Outputs

A number of outputs are created by the Grouping Analysis tool. All these (including the optional PDF report file) can be accessed from the Results window. If you disable background processing, results will also be written to the Progress dialog box. These messages (shown below) summarize information presented in the optional PDF report (described below).

Grouping Analysis Progress Window

The default output for the Grouping Analysis tool is a new Output Feature Class containing the fields used in the analysis plus a new Integer field called SS_GROUP identifying which group each feature belongs to. This output feature class is added to the table of contents with a unique color rendering scheme applied to the SS_GROUP field. Hollow rendering indicates features that could not be added to any group, usually because they have no neighboring features. If you specify NO_SPATIAL_CONSTRAINT for the Spatial Constraints parameter, an additional field, SS_SEED, is added to the output feature class to indicate which seed features were used to grow groups.

Grouping Analysis Output Feature Class
Grouping with Contiguity Spatial Constraint

Grouping Analysis Report File

If you specify a path for the Output Report File parameter, a PDF is created summarizing the groups that were created.

NoteNote:

Creating the optional report file can add substantial processing time. Consequently, while Grouping Analysis will always create an output feature class showing group membership, the PDF report file will not be created if you specify more than 15 groups or more than 15 variables.

Box plots are included throughout the report, so the first element in the report is a graphic showing you how to interpret them (see below). The box plots in the Grouping Analysis report graphically depict nine summary values for each analysis field and group: minimum data value, lower quartile, median, upper quartile, maximum data value, data outliers (values smaller or larger than 1.5 times the interquartile range), group minimum, group mean, and group maximum. Any "+" marks falling outside the upper or lower whisker represent data outliers.
Dive-inDive-in:

The interquartile range (IQR) is the upper quartile minus the lower quartile. Low outliers would be values less than 1.5*IQR (Q1-1.5*IQR), and high outliers would be values greater than 1.5*IQR (Q3+1.5*IQR). Outliers appear in the box plots as "+" symbols.

The first page of the report compares the variables (the Analysis Fields) within each group to each other. In the report below, for example, Grouping Analysis was performed on census tracts to create four groups. Summary statistics for each group are printed using a different color (blue, red, green, and gold). The first set of summary statistics are printed in black because these are the global Mean, Standard Deviation (Std.Dev.), Minimum, Maximum, and R2 values for all data in each analysis field. The larger the R2 value is for a particular variable, the better that variable is at discriminating among your features. After the global summaries, the Mean, Standard Deviation, Minimum, Maximum, and Share values are reported for each variable in each group. In the report below, for example, you see that Group 1 (Blue) contains 52 percent of the range of values in the global AGE_UNDER5 variable; the global range of values is from 0 to 1,453 children under the age of 5, and the blue group contains tracts with from 488 to 1,246 children under the age of 5. The mean number of children under 5 for the tracts in the blue group is 805.3750. The box plot to the right of the blue group statistical summary shows how the group's values relate to the global values for that same analysis field. Notice that the blue dot on the box plot falls outside the upper quartile and that the first blue vertical line (representing the minimum value for the blue group tracts) is above the global mean for this field. In fact, looking at where the blue dots fall in the box plots for all the variables, you can see that, except for the MEDIANRENT variable, the mean values in all the analysis fields is above the upper quartile. This group has the highest range of values compared to the other groups.

Dive-inDive-in:

The Share value is the ratio of the group and global range. For group 1 and the AGE_UNDER5 variable, for example, the 52 percent share is obtained by dividing the group range (1246-488=758) by the global range (1453-0=1453), yielding 0.52 when rounded to two significant digits.

Group Summary
Section 1 of the output report

The second section of the report compares the variable ranges for each group, one analysis field (variable) at a time. With this view of the data, it is easy to see which group has the highest and lowest range of values within each variable. The group minimum, mean, and maximum values are superimposed on top of the box plot reflecting all values. Notice that group 4 (orange) has the lowest values for the MEDIANRENT variable. The minimum, mean, and maximum values for this group are lower than for any other group.

Variable Summary
Section 2 of the output report

The parallel box plot graph summarizes both the groups and the variables within them. From the graph below, notice that group 1 (blue) reflects tracts with average rents, the highest values for female-headed households with children (FHH_CHILD), the highest values for number of housing units (HSE_UNITS), and the highest values for children under the age of 5. Group 2 (red) reflects tracts with the highest median rents, lowest number of female-headed households with children, more than the average number of housing units (though fewer than the tracts in groups 1 or 3), and the fewest children under the age of 5.

Grouping Analysis Box Plot
Parallel box plot in the output report

When you check on the Evaluate Optimal Number of Groups parameter, the PDF report file will include a graph of pseudo F-statistic values. The circled point on the graph is the largest F-statistic, indicating how many groups will be most effective at distinguishing the features and variables you specified. In the graph below, the F-statistic associated with four groups is highest. Five groups, with a high pseudo F-statistic, would also be a good choice.

pseudo F-statistic Graph
Pseudo F-statistic plot in the output report

Best practices

While there is a tendency to want to include as many Analysis Fields as possible, for Grouping Analysis, it works best to start with a single variable and build. Results are much easier to interpret with fewer analysis fields. It is also easier to determine which variables are the best discriminators when there are fewer fields.

In many scenarios, you will likely run the Grouping Analysis tool a number of times looking for the optimal Number of Groups, most effective Spatial Constraints, and the combination of Analysis Fields that best separate your features into groups. Because creating the Output Report can add substantial processing time, you will likely not want to create the report while you are experimenting with different input parameters.

Additional Resources

Duque, J. C., R. Ramos, and J. Surinach. 2007. "Supervised Regionalization Methods: A Survey" in International Regional Science Review 30: 195–220.

Assuncao, R. M., M. C. Neves, G. Camara, and C. Da Costa Freitas. 2006. "Efficient Regionalisation Techniques for Socio-economic Geographical Units using Minimum Spanning Trees" in International Journal of Geographical Information Science 20 (7): 797–811.

Jain, A. K. 2009. "Data Clustering: 50 years beyond K-Means." Pattern Recognition Letters.

Hinde, A., T. Whiteway, R. Ruddick, and A. D. Heap. 2007. "Seascapes of the Australian Margin and adjacent sea floor: Keystroke Methodology." in Geoscience Australia, Record 2007/10, 58pp.

4/18/2013