Similarity Search (Spatial Statistics)
Summary
Identifies which candidate features are most similar or most dissimilar to one or more input features based on feature attributes.
Usage
You provide a layer containing the Input Features To Match and a second layer containing the Candidate Features from which matches will be obtained. Often your Input Features To Match and your Candidate Features will be in the same feature layer. While one option is to create two separate datasets, you don't have to do this; it is much easier to create layers with two different selection sets instead. Suppose you have a file with all crime incidents that have occurred over the past month. If you want to find all of the crimes that are most similar to the latest carjacking, you could do the following:
- Using standard ArcMap selection tools or geoprocessing tools, select the record for the latest carjacking from the layer with all crime incidents.
- Right-click the layer with the selection and click Selection > Create Layer From Selected Features. Use this new layer for the Input Features To Match parameter.
- Switch the selection on the layer with all crime incidents. Use this layer for the Candidate Features parameter.
Caution: A common mistake when all inputs come from a single dataset is to forget to switch the selection, so that the Input Features To Match and the Candidate Features contain exactly the same features. It is very unlikely this is what you want. The most typical scenario is to have a single Input Features To Match and many Candidate Features.
If there is more than one Input Features To Match, matching is based on averaged Attributes of Interest values. So, for example, if there are two Input Features To Match and one of the Attributes of Interest is a population variable, the tool will look for Candidate Features with populations that are most like the average population values. If the population values are 100 and 102, for example, the tool will look for candidates with populations near 101.
Note: When you have more than one Input Features To Match, you will want to select Attributes of Interest with similar values. If, for example, the population value for one of the inputs is 100 and the other input is 100,000, the tool will look for matches with populations near the average of those two values: 50,050. Notice that this averaged value is nothing like the population for either of the Input Features To Match.
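The averaging described above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the tool's implementation; the function name average_target and the POP field are hypothetical.

```python
def average_target(input_features, attributes):
    """Average each attribute of interest across the input features.

    input_features is a list of dicts (one per feature) and attributes is
    the list of field names to average. Both are hypothetical stand-ins
    for the Input Features To Match and the Attributes of Interest.
    """
    count = float(len(input_features))
    return {name: sum(f[name] for f in input_features) / count
            for name in attributes}


# Two inputs with similar populations give a representative target.
close = average_target([{"POP": 100}, {"POP": 102}], ["POP"])

# Two inputs with very different populations give a target that
# resembles neither of them.
far_apart = average_target([{"POP": 100}, {"POP": 100000}], ["POP"])
```

Comparing the two results shows why widely different inputs are a poor choice: the second target sits far from both of the features that produced it.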
Output Features will always contain points unless the Input Features To Match and the Candidate Features are both polygons or both polylines. Creating polygon or polyline Output Features can slow performance for large datasets, so you can check the Collapse Output To Points parameter to force point geometries for improved performance.
With the Most Or Least Similar parameter, you can search for features that are either MOST_SIMILAR or LEAST_SIMILAR to the Input Features To Match. In some cases you will want to see both ends of the spectrum. If you enter 3 for the Number of Results parameter and BOTH for the Most Or Least Similar parameter, for example, the tool will return the three most similar and the three least similar candidate features.
Any given solution match in the Output Features will either be a solution that is most similar or least similar to the target Input Features To Match; a single solution cannot be both (and solution matches won't be duplicated in the Output Features). Consequently, when you select BOTH for the Most Or Least Similar parameter, the maximum number of resulting matches possible (Number of Results) will be half the number of Candidate Features. When you enter a Number of Results value that is too large, the tool will adjust it to the maximum possible.
Sometimes, in order to explore the spatial pattern of similarity, you will want to rank similarity for all of the Candidate Features. An easy way to indicate that you want all of the Candidate Features to be ranked is to enter zero for the Number of Results parameter. The tool will then determine the number of valid features in the candidates dataset and write all of them to the Output Features in rank order from most to least similar.
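The rules for Number of Results can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's code: the function names are hypothetical, scores are assumed to behave like a sum of squared differences (smaller means more similar), the half-of-candidates limit for BOTH is assumed to round down, and entering zero with BOTH is assumed to return that same maximum.

```python
def resolve_number_of_results(requested, n_candidates, most_or_least):
    """Adjust the requested result count to what the candidates allow."""
    if most_or_least == "BOTH":
        # A candidate cannot be both most and least similar, so each end
        # of the ranking holds at most half of the candidates
        # (rounding down is an assumption).
        maximum = n_candidates // 2
    else:
        maximum = n_candidates
    # Zero means "rank everything"; values that are too large are adjusted.
    if requested == 0 or requested > maximum:
        return maximum
    return requested


def pick_matches(scores, requested, most_or_least):
    """Return (most_similar_ids, least_similar_ids) from {cand_id: score}."""
    n = resolve_number_of_results(requested, len(scores), most_or_least)
    ranked = sorted(scores, key=scores.get)  # most similar first
    most = ranked[:n] if most_or_least in ("MOST_SIMILAR", "BOTH") else []
    least = ranked[::-1][:n] if most_or_least in ("LEAST_SIMILAR", "BOTH") else []
    return most, least


# Hypothetical candidate scores keyed by candidate ID.
scores = {"a": 0.1, "b": 0.5, "c": 2.0, "d": 3.5, "e": 9.0, "f": 12.0, "g": 20.0}
most, least = pick_matches(scores, 3, "BOTH")
```

With seven candidates and BOTH, a request for 10 results would be adjusted to 3 in this sketch, and no candidate appears in both lists.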
For the Match Method parameter you may select ATTRIBUTE_VALUES, RANKED_ATTRIBUTE_VALUES, or ATTRIBUTE_PROFILES.
- For ATTRIBUTE_VALUES the most similar candidates will have the smallest sum of squared differences for all of the Attributes of Interest; all values are standardized before differences are calculated.
- For RANKED_ATTRIBUTE_VALUES the most similar candidates will have the smallest sum of squared rank differences for all of the Attributes of Interest. The Output Features reports these sums in the SIMINDEX (Sum of Squared Rank Differences) field.
- For ATTRIBUTE_PROFILES the cosine similarity is measured. Cosine similarity looks for the same relationships among standardized attribute values rather than trying to match magnitudes. Suppose there are four Attributes of Interest called A1, A2, A3, and A4, and that A2 is twice as large as A1, A3 is almost equal to A2, and A4 is three times larger than A3. For the ATTRIBUTE_PROFILES Match Method the tool will be looking for candidates with those same attribute relationships: twice as large, then almost equal, then three times larger. Because this method is looking at attribute relationships, you must specify a minimum of two Attributes of Interest for this method. You might use the cosine similarity method (ATTRIBUTE_PROFILES) to find places like Los Angeles, but at a smaller scale overall. The cosine similarity index ranges from 1.0 (perfect similarity) to -1.0 (perfect dissimilarity). The cosine similarity index is written to the Output Features SIMINDEX (Cosine Similarity) field.
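The three measures above can be sketched in plain Python. This is an illustration only, assuming the attribute values have already been standardized (or ranked) before they are passed in; how the tool standardizes and ranks internally is not shown here, and the function names are hypothetical.

```python
import math


def sum_squared_differences(target, candidate):
    """Sum of squared differences between two equal-length sequences.

    Pass standardized values to mimic ATTRIBUTE_VALUES, or ranks to mimic
    RANKED_ATTRIBUTE_VALUES. Smaller results mean more similar.
    """
    return sum((t - c) ** 2 for t, c in zip(target, candidate))


def cosine_similarity(target, candidate):
    """Cosine of the angle between two attribute vectors.

    Returns a value between -1.0 and 1.0. Vectors with the same attribute
    relationships score near 1.0 regardless of overall magnitude.
    """
    dot = sum(t * c for t, c in zip(target, candidate))
    norm_t = math.sqrt(sum(t * t for t in target))
    norm_c = math.sqrt(sum(c * c for c in candidate))
    if norm_t == 0 or norm_c == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_t * norm_c)


# A1..A4 profile: A2 is twice A1, A3 equals A2, A4 is three times A3.
target = [1.0, 2.0, 2.0, 6.0]
same_profile_larger_scale = [10.0, 20.0, 20.0, 60.0]

profile_score = cosine_similarity(target, same_profile_larger_scale)
value_score = sum_squared_differences(target, same_profile_larger_scale)
```

The two candidates share a profile, so the cosine score is essentially 1.0 even though the sum of squared differences is large. That contrast is why ATTRIBUTE_PROFILES suits searches for places that look alike at a different scale.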
The Attributes of Interest must be numeric and must be present (same field name and same field type) in both the Input Features To Match and the Candidate Features datasets. For the Attributes of Interest parameter, the tool will list all numeric fields found in the Input Features To Match dataset. If the tool doesn't find corresponding fields for the Candidate Features you will see a warning indicating the missing attributes were dropped from the analysis. If all of the Attributes of Interest are dropped, the tool has nothing to use for matching and you will get an error indicating the tool cannot perform the analysis.
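The field check described above might look like the following sketch. It is not the tool's code: the function name, the dictionaries mapping field names to type strings, and the message wording are all hypothetical.

```python
import warnings


def usable_attributes(requested, input_fields, candidate_fields):
    """Keep only attributes present with the same type in both datasets.

    input_fields and candidate_fields map field name -> field type string.
    Dropped attributes trigger a warning; dropping all of them is an error.
    """
    kept, dropped = [], []
    for name in requested:
        in_both = name in input_fields and name in candidate_fields
        if in_both and input_fields[name] == candidate_fields[name]:
            kept.append(name)
        else:
            dropped.append(name)
    if dropped:
        warnings.warn("Dropped from the analysis: " + ", ".join(dropped))
    if not kept:
        raise ValueError("No Attributes of Interest remain for matching")
    return kept
```

For example, requesting POP, INCOME, and AREA when the candidates lack AREA and store INCOME with a different type would leave only POP in this sketch.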
All of the attributes used for matching are written to the Output Features. The Fields To Append To Output parameter allows you to include other fields in the output table, if desired. Because numeric Attributes of Interest fields are probably not effective identifiers, you may want to append a name or other identifier field for each solution match. If you need to decide among several matching solutions, you may want to append other nonnumeric attributes as well. If the solution you are seeking must be one of several land-use types, for example, appending a categorical land-use attribute will help you home in on solutions that meet this requirement. Sometimes you will want to include additional numeric attributes in the output table for reference purposes only. Suppose, for example, you are looking for suitable habitat for a particular animal. You can use known locations where the species is successful for the Input Features To Match. You can select Attributes of Interest that relate to species success. In addition, you might append a numeric area attribute to the Output Features, not because you want to actually match on the area value of the target, but because ultimately you are looking for solutions with the largest areas possible.
All of the Input Features To Match and solution matches are written to the Output Features along with Attributes of Interest and the Fields To Append To Output. In addition, the following fields are included in the Output Features:
Field Name | Field Alias | Description | Notes |
MATCH_ID | MATCH_ID | All of the target features in the Input Features To Match layer are listed first with their OID or FID identifier written to the MATCH_ID field. Solution matches have NULL values for this field. | When the Output Features is a shapefile, NULL values are represented by a very large negative number (such as -21474836). |
CAND_ID | CAND_ID | All of the solution matches are listed next and this value is their OID or FID identifier. The target features in the Input Features To Match layer have NULL values for this field. | When the Output Features is a shapefile, NULL values are represented by a very large negative number (such as -21474836). |
SIMRANK | Similarity Rank | When you select MOST_SIMILAR or BOTH for the Most Or Least Similar parameter, all of the solution matches are ranked from most similar to least similar. The most similar solution match has a rank value of 1. | This field is only included in the Output Features when you select MOST_SIMILAR or BOTH for the Most Or Least Similar parameter. |
DSIMRANK | Dissimilarity Rank | When you select LEAST_SIMILAR or BOTH for the Most Or Least Similar parameter, all of the solution matches are ranked from least similar to most similar. The solution that is least similar gets a rank value of 1. | This field is only included in the Output Features when you select LEAST_SIMILAR or BOTH for the Most Or Least Similar parameter. |
SIMINDEX | Sum of Squared Value Differences, Sum of Squared Rank Differences, or Cosine Similarity | This field quantifies how similar each solution match is to the target feature. The field alias is Sum of Squared Value Differences when you specify ATTRIBUTE_VALUES for the Match Method, Sum of Squared Rank Differences when you specify RANKED_ATTRIBUTE_VALUES, and Cosine Similarity when you specify ATTRIBUTE_PROFILES. | If there is only one Input Features To Match, the target feature is this feature. When more than one Input Features To Match is specified, the target feature is a temporary feature created with averaged values for all of the Attributes Of Interest. |
LABELRANK | Render Rank | This field is used for display purposes only. The tool uses this field to provide default rendering of the analysis results. | |
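When you read the Output Features in a script, the MATCH_ID and CAND_ID fields let you separate target features from solution matches. The following is a minimal sketch over plain dictionaries rather than a real cursor. It assumes that a NULL arrives either as None or, for shapefiles, as a negative number (valid OID and FID values are not negative); the function names and sample rows are hypothetical.

```python
def is_null_id(value):
    """Treat None and negative sentinel values as NULL identifiers."""
    return value is None or value < 0


def split_output_rows(rows):
    """Split output rows into (targets, matches) using the ID fields."""
    targets = [r for r in rows if not is_null_id(r["MATCH_ID"])]
    matches = [r for r in rows if not is_null_id(r["CAND_ID"])]
    return targets, matches


# Hypothetical rows as they might be read from a shapefile output.
rows = [
    {"MATCH_ID": 1230043, "CAND_ID": -21474836},
    {"MATCH_ID": -21474836, "CAND_ID": 17},
    {"MATCH_ID": -21474836, "CAND_ID": 42},
]
targets, matches = split_output_rows(rows)
```

The same split works for geodatabase output in this sketch, because None is handled before any numeric comparison is made.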
When this tool runs in ArcMap, the Output Features are automatically added to the table of contents with default rendering applied to the LABELRANK field. The rendering applied is defined by a layer file in <ArcGIS>/Desktop10.x/ArcToolbox/Templates/Layers. You can reapply the default rendering, if needed, by importing the template layer symbology.
Note: The default sample size is 10,000 records. When the Number Of Results is larger than this default, you will want to increase the sampling size to render all of the results.
Syntax
Parameter | Explanation | Data Type |
Input_Features_To_Match | The layer (or a selection on a layer) containing the features you want to match; you are searching for other features that look like these features. When more than one feature is provided, matching is based on attribute averages. Tip: When your Input Features To Match and Candidate Features come from a single dataset, | Feature Layer |
Candidate_Features | The layer (or a selection on a layer) containing candidate matching features. The tool will look for features most like (or most unlike) the Input Features To Match among these candidates. Tip: When your Input Features To Match and Candidate Features come from a single dataset, | Feature Layer |
Output_Features | The output feature class contains a record for each of the Input Features To Match and for all of the solution matching features found. | Feature Class |
Collapse_Output_To_Points | Specify whether you want the geometry for the Output_Features to be points or to match the geometry (lines or polygons) of the input features. This option is only available when the Input_Features_To_Match and the Candidate_Features are both lines or both polygons. Choosing COLLAPSE for large line or polygon datasets will improve tool performance. | Boolean |
Most_Or_Least_Similar | Choose whether you are interested in features that are most alike or most different to the Input Features To Match. | String |
Match_Method | Choose whether matching should be based on values, ranks, or cosine relationships. | String |
Number_Of_Results | The number of solution matches to find. Entering zero or a number larger than the total number of Candidate Features will return rankings for all of the candidate features. | Long |
Attributes_Of_Interest [field,...] | A list of numeric attributes representing the matching criteria. | Field |
Fields_To_Append_To_Output [field,...] (Optional) | An optional list of attributes to include with the Output Features. You might want to include a name identifier, categorical field, or date field, for example. These fields are not used to determine similarity; they are only included in the Output Features for your reference. | Field |
Code Sample
The following Python window script demonstrates how to use the SimilaritySearch tool.
import arcpy
import arcpy.stats as SS
arcpy.env.workspace = r"C:\Analysis"
SS.SimilaritySearch("Crime_selection", "AllCrime", "c:\\Analysis\\CrimeMatches",
                    "NO_COLLAPSE", "MOST_SIMILAR", "ATTRIBUTE_VALUES", 4,
                    "HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP", "Name;WEAPON")
The following stand-alone Python script demonstrates how to use the SimilaritySearch tool.
# Similarity Search of crime data in a metropolitan area
# Import system modules
import arcpy
import arcpy.stats as SS
# Set the environment to overwrite existing output
arcpy.env.overwriteOutput = True
try:
    # Set the current workspace (to avoid having to specify the full path to
    # the feature classes each time)
    arcpy.env.workspace = r"C:\Analysis"
    # Make a layer from the crime feature class
    arcpy.MakeFeatureLayer_management("AllCrime", "Crime_selection")
    # Select the target crime to match
    # Process: Select By Attribute
    arcpy.SelectLayerByAttribute_management("Crime_selection", "NEW_SELECTION",
                                            '"OBJECTID" = 1230043')
    # Use Similarity Search to find the four crimes most similar to the
    # selected target, based on the attributes of interest
    # Process: Similarity Search
    SS.SimilaritySearch("Crime_selection", "AllCrime", "CJMatches", "NO_COLLAPSE",
                        "MOST_SIMILAR", "ATTRIBUTE_VALUES", 4,
                        "HEIGHT;WEIGHT;SEVERITY;DST2CHPSHP", "Name;WEAPON")
except arcpy.ExecuteError:
    # If an error occurred when running the tool, print out the error message.
    print(arcpy.GetMessages())