Exploratory Regression (Spatial Statistics)
Summary
The Exploratory Regression tool evaluates all possible combinations of the input candidate explanatory variables, looking for OLS models that best explain the dependent variable within the context of user-specified criteria.
You can access the results of this tool (including the optional report file) from the Results window. If you disable background processing, results will also be written to the Progress dialog box.
Usage

The primary output for this tool is a report file which is written to the Results window. Right-clicking on the Messages entry in the Results window and selecting View will display the Exploratory Regression summary report in a Message dialog box.
This tool will optionally create a text file report summarizing results. This report file will be added to the table of contents (TOC) and may be viewed in ArcMap by right-clicking on it and selecting Open.
This tool also produces an optional table of all models meeting your maximum coefficient p-value cutoff and Variance Inflation Factor (VIF) value criteria. A full explanation of the report elements and table is provided in Interpreting Exploratory Regression Results.
This tool uses Ordinary Least Squares (OLS) and Spatial Autocorrelation (Global Moran's I). The optional spatial weights matrix file is used with the Spatial Autocorrelation (Global Moran's I) tool to assess model residuals; it is not used by the OLS tool at all.
This tool tries every combination of the Candidate Explanatory Variables entered, looking for a properly specified OLS model. Only when it finds a model that meets your threshold criteria for Minimum Acceptable Adj R Squared, Maximum Coefficient p-value Cutoff, Maximum VIF Value Cutoff, and Minimum Acceptable Jarque-Bera p-value will it run the Spatial Autocorrelation (Global Moran's I) tool on the model residuals to see whether the under- and overpredictions are clustered. To provide at least some information about residual clustering when none of the models pass all of these criteria, the Spatial Autocorrelation (Global Moran's I) test is also applied to the residuals for the three models that have the highest Adjusted R-Squared values and the three models that have the largest Jarque-Bera p-values.
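To get a feel for how quickly the number of models grows, the following is a minimal sketch of the enumeration described above, written in plain Python rather than with the tool itself. The function names candidate_models and model_count are hypothetical helpers, not part of arcpy, and the field names are borrowed from the code sample later on this page.

```python
from itertools import combinations
from math import comb


def candidate_models(variables, min_vars=1, max_vars=5):
    """Yield every combination of explanatory variables within the size limits."""
    upper = min(max_vars, len(variables))
    for size in range(min_vars, upper + 1):
        for combo in combinations(variables, size):
            yield combo


def model_count(n_variables, min_vars=1, max_vars=5):
    """Count the combinations without enumerating them."""
    upper = min(max_vars, n_variables)
    return sum(comb(n_variables, k) for k in range(min_vars, upper + 1))


# Hypothetical example: four candidate fields, models of two or three variables
fields = ["Pop", "Jobs", "LowEduc", "Dst2UrbCen"]
models = list(candidate_models(fields, min_vars=2, max_vars=3))
print(len(models))            # 10 (6 two-variable models + 4 three-variable models)
print(model_count(17, 1, 5))  # 9401 models for 17 candidates, 1 to 5 variables
```

With 17 candidate variables and models of one to five variables, the sketch counts 9,401 combinations, which is why narrowing the candidate list or the minimum and maximum number of explanatory variables can shorten run time considerably.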
Especially when there is strong spatial structure in your dependent variable, you will want to try to come up with as many candidate spatial explanatory variables as you can. Some examples of spatial variables would be distance to major highways, accessibility to job opportunities, number of local shopping opportunities, connectivity measurements, or densities. Until you find explanatory variables that capture the spatial structure in your dependent variable, model residuals will likely not pass the spatial autocorrelation test. Significant clustering in regression residuals, as determined by the Spatial Autocorrelation (Global Moran's I) tool, indicates model misspecification. Strategies for dealing with misspecification are outlined in What they don't tell you about regression analysis.
Because the Spatial Autocorrelation (Global Moran's I) tool is not run for all of the models tested (see the previous usage tip), the table will have missing data for the SA (Spatial Autocorrelation) field. Because .dbf files do not store null values, these appear as extremely large negative numbers (something like -1.797693e+308). For geodatabase tables, these missing values appear as null values. A missing value indicates that the residuals for the associated model were not tested for spatial autocorrelation because the model did not pass all of the other model search criteria.
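When post-processing the output table, it can help to map both kinds of missing value to a single representation. The following is a minimal sketch in plain Python, assuming the .dbf placeholder is a very large negative double; the helper name sa_or_none and the threshold of -1e300 are illustrative choices, not part of the tool.

```python
import sys

# Assumption: the .dbf placeholder is close to the most negative finite double.
DBF_NULL_SENTINEL = -sys.float_info.max  # approximately -1.797693e+308


def sa_or_none(value, threshold=-1e300):
    """Return None for a missing SA value, otherwise return the value unchanged.

    Handles both geodatabase nulls (None) and the large negative
    placeholder written to .dbf tables.
    """
    if value is None or value <= threshold:
        return None
    return value


print(sa_or_none(-1.797693e+308))  # None
print(sa_or_none(None))            # None
print(sa_or_none(0.27))            # 0.27
```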
The default spatial weights matrix file used to run the Spatial Autocorrelation (Global Moran's I) tool is based on an 8 nearest neighbor conceptualization of spatial relationships. This default was selected primarily because it executes fairly quickly. To define neighbor relationships differently, however, you can simply create your own spatial weights matrix file using the Generate Spatial Weights Matrix File tool, then specify the name of that file for the Input Spatial Weights Matrix File parameter. Inverse Distance, Polygon Contiguity, and K Nearest Neighbors are all appropriate Conceptualizations of Spatial Relationships for testing regression residuals.
Note: The spatial weights matrix file is only used to test model residuals for spatial structure. When a model is properly specified, the residuals are spatially random (large residuals are intermixed with small residuals; large residuals do not cluster together spatially).
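For intuition about clustered versus intermixed residuals, the following is a minimal pure-Python sketch of the Global Moran's I index. It computes only the index, not the z-score or p-value that the Spatial Autocorrelation tool reports, and the chain_weights layout (features along a line, each neighboring the next) and the residual values are hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and weights given as {(i, j): w_ij}."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(weights.values())  # sum of all spatial weights
    cross = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    denom = sum(d * d for d in dev)
    return (n / s0) * (cross / denom)


def chain_weights(n):
    """Binary weights linking each feature to its neighbors along a line (illustrative only)."""
    w = {}
    for i in range(n - 1):
        w[(i, i + 1)] = 1.0
        w[(i + 1, i)] = 1.0
    return w


w = chain_weights(6)
clustered = [3.0, 2.5, 2.0, -2.0, -2.5, -3.0]  # large residuals sit next to each other
mixed = [3.0, -3.0, 2.5, -2.5, 2.0, -2.0]      # over- and underpredictions alternate

print(morans_i(clustered, w) > 0)  # True: positive index, similar residuals are adjacent
print(morans_i(mixed, w) < 0)      # True: negative index, neighbors are dissimilar
```

An index near zero is what spatially random residuals would tend to produce; the tool's p-value is what determines whether a departure from zero is statistically significant.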
Syntax
Parameter  Explanation  Data Type 
Input_Features 
The feature class or feature layer containing the dependent and candidate explanatory variables to analyze.  Feature Layer 
Dependent_Variable 
The numeric field containing the observed values you want to model using OLS.  Field 
Candidate_Explanatory_Variables [Candidate_Explanatory_Variables,...] 
A list of fields to try as OLS model explanatory variables.  Field 
Weights_Matrix_File (Optional) 
A file containing spatial weights that define the spatial relationships among your input features. This file is used to assess spatial autocorrelation among regression residuals. You can use the Generate Spatial Weights Matrix File tool to create this. When you do not provide a spatial weights matrix file, residuals are assessed for spatial autocorrelation based on each feature's 8 nearest neighbors. Note: The spatial weights matrix file is only used to analyze spatial structure in model residuals; it is not used to build or to calibrate any of the OLS models.  File 
Output_Report_File (Optional) 
The report file contains tool results, including details about any models found that passed all the search criteria you entered. This output file also contains diagnostics to help you fix common regression problems in the case that you don't find any passing models.  File 
Output_Results_Table (Optional) 
An optional output table containing the explanatory variables and diagnostics for all of the models that fall within the Coefficient p-value and VIF value cutoffs.  Table 
Maximum_Number_of_Explanatory_Variables (Optional) 
All models with explanatory variables up to the value entered here will be assessed. If, for example, the Minimum_Number_of_Explanatory_Variables is 2 and the Maximum_Number_of_Explanatory_Variables is 3, the Exploratory Regression tool will try all models with every combination of two explanatory variables, and all models with every combination of three explanatory variables.  Long 
Minimum_Number_of_Explanatory_Variables (Optional) 
This value represents the minimum number of explanatory variables for models evaluated. If, for example, the Minimum_Number_of_Explanatory_Variables is 2 and the Maximum_Number_of_Explanatory_Variables is 3, the Exploratory Regression tool will try all models with every combination of two explanatory variables, and all models with every combination of three explanatory variables.  Long 
Minimum_Acceptable_Adj_R_Squared (Optional) 
This is the lowest Adjusted R-Squared value you consider a passing model. If a model passes all of your other search criteria, but has an Adjusted R-Squared value smaller than the value entered here, it will not show up as a Passing Model in the Output Report File. Valid values for this parameter range from 0.0 to 1.0. The default value is 0.5, indicating that passing models will explain at least 50 percent of the variation in the dependent variable.  Double 
Maximum_Coefficient_p_value_Cutoff (Optional) 
For each model evaluated, OLS computes explanatory variable coefficient p-values. The cutoff p-value you enter here represents the confidence level you require for all coefficients in the model in order to consider the model passing. Small p-values reflect a stronger confidence level. Valid values for this parameter range from 1.0 down to 0.0, but will most likely be 0.1, 0.05, 0.01, 0.001, and so on. The default value is 0.05, indicating passing models will only contain explanatory variables whose coefficients are statistically significant at the 95 percent confidence level (p-values smaller than 0.05). To relax this default you would enter a larger p-value cutoff, such as 0.1. If you are getting lots of passing models, you will likely want to make this search criterion more stringent by decreasing the default p-value cutoff from 0.05 to 0.01 or smaller.  Double 
Maximum_VIF_Value_Cutoff (Optional) 
This value reflects how much redundancy (multicollinearity) among model explanatory variables you will tolerate. When the VIF (Variance Inflation Factor) value is higher than about 7.5, multicollinearity can make a model unstable; consequently, 7.5 is the default value here. If you want your passing models to have less redundancy, you would enter a smaller value, such as 5.0, for this parameter.  Double 
Minimum_Acceptable_Jarque_Bera_p_value (Optional) 
The p-value returned by the Jarque-Bera diagnostic test indicates whether the model residuals are normally distributed. If the p-value is statistically significant (small), the model residuals are not normal and the model is biased. Passing models should have large Jarque-Bera p-values. The default minimum acceptable p-value is 0.1. Only models returning p-values larger than this minimum will be considered passing. If you are having trouble finding unbiased passing models, and decide to relax this criterion, you might enter a smaller minimum p-value such as 0.05.  Double 
Minimum_Acceptable_Spatial_Autocorrelation_p_value (Optional) 
For models that pass all of the other search criteria, the Exploratory Regression tool will check model residuals for spatial clustering using Global Moran's I. When the p-value for this diagnostic test is statistically significant (small), it indicates the model is very likely missing key explanatory variables (it isn't telling the whole story). Unfortunately, if you have spatial autocorrelation in your regression residuals, your model is misspecified, so you cannot trust your results. Passing models should have large p-values for this diagnostic test. The default minimum p-value is 0.1. Only models returning p-values larger than this minimum will be considered passing. If you are having trouble finding properly specified models because of this diagnostic test, and decide to relax this search criterion, you might enter a smaller minimum such as 0.05.  Double 
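The five threshold parameters above combine into a single pass or fail decision for each model. The following is a minimal sketch of that logic in plain Python, not the tool's implementation. The function name failed_criteria and the dictionary keys are hypothetical, and the handling of values that fall exactly on a threshold is an assumption based on the wording of the parameter descriptions.

```python
def failed_criteria(diag, min_adj_r2=0.5, max_coef_p=0.05, max_vif=7.5,
                    min_jb_p=0.1, min_sa_p=0.1):
    """Return the names of the search criteria a model fails (empty list = passing).

    diag is a hypothetical dictionary of model diagnostics. Boundary handling
    (< versus <=) is an assumption, not documented tool behavior.
    """
    failures = []
    if diag["adj_r2"] < min_adj_r2:
        failures.append("adj_r2")
    if any(p >= max_coef_p for p in diag["coef_p_values"]):
        failures.append("coef_p")
    if diag["max_vif"] > max_vif:
        failures.append("vif")
    if diag["jb_p"] <= min_jb_p:
        failures.append("jb_p")
    # Residual spatial autocorrelation is only checked once the other criteria pass.
    if not failures:
        sa_p = diag.get("sa_p")
        if sa_p is None or sa_p <= min_sa_p:
            failures.append("sa_p")
    return failures


good = {"adj_r2": 0.72, "coef_p_values": [0.001, 0.03], "max_vif": 2.4,
        "jb_p": 0.35, "sa_p": 0.22}
bad = {"adj_r2": 0.41, "coef_p_values": [0.001, 0.2], "max_vif": 9.1,
       "jb_p": 0.02, "sa_p": None}

print(failed_criteria(good))  # []
print(failed_criteria(bad))   # ['adj_r2', 'coef_p', 'vif', 'jb_p']
```

Returning the list of failed criteria, rather than a single boolean, mirrors the way the report file helps you see which diagnostic is preventing models from passing.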
Code Sample
The following Python window script demonstrates how to use the ExploratoryRegression tool.
import arcpy
arcpy.env.workspace = r"C:\ER"
# Arguments follow the parameter order listed under Syntax.
# Adjacent string literals are joined so no stray spaces end up in the field list.
arcpy.ExploratoryRegression_stats("911CallsER.shp",
                                  "Calls",
                                  "Pop;Jobs;LowEduc;Dst2UrbCen;Renters;Unemployed;Businesses;NotInLF;"
                                  "ForgnBorn;AlcoholX;PopDensity;MedIncome;CollGrads;PerCollGrd;"
                                  "PopFY;JobsFY;LowEducFY",
                                  "BG_911Calls.swm", "BG_911Calls.txt", "",
                                  "5", "1", "0.5", "0.05", "7.5", "0.1", "0.1")
The following standalone Python script demonstrates how to use the ExploratoryRegression tool.
# Exploratory Regression of 911 calls in a metropolitan area
# using the Exploratory Regression tool

# Import system modules
import arcpy

# Overwrite existing output by default
arcpy.env.overwriteOutput = True

try:
    # Set the current workspace (to avoid having to specify the full path to
    # the feature classes each time)
    arcpy.env.workspace = r"C:\ER"

    # Join the 911 Call Point feature class to the Block Group Polygon feature class
    # Process: Spatial Join
    fieldMappings = arcpy.FieldMappings()
    fieldMappings.addTable("BlockGroups.shp")
    fieldMappings.addTable("911Calls.shp")
    sj = arcpy.SpatialJoin_analysis("BlockGroups.shp", "911Calls.shp", "BG_911Calls.shp",
                                    "JOIN_ONE_TO_ONE",
                                    "KEEP_ALL",
                                    fieldMappings,
                                    "COMPLETELY_CONTAINS", "", "")

    # Delete extra fields to clean up the data
    # Process: Delete Field
    arcpy.DeleteField_management("BG_911Calls.shp",
                                 "OBJECTID;INC_NO;DATE_;MONTH_;STIME;"
                                 "SD_T;DISP_REC;NFPA_TYP;CALL_TYPE;RESP_COD;NFPA_SF;"
                                 "SIT_FND;FMZ_Q;FMZ;RD;JURIS;COMPANY;COMP_COD;RESP_YN;"
                                 "DISP_DT;DAY_;D1_N2;RESP_DT;ARR_DT;TURNOUT;TRAVEL;"
                                 "RESP_INT;ADDRESS_ID;CITY;CO;AV_STATUS;AV_SCORE;"
                                 "AV_SIDE;Season;DayNight")

    # Create Spatial Weights Matrix for Calculations
    # Process: Generate Spatial Weights Matrix
    swm = arcpy.GenerateSpatialWeightsMatrix_stats("BG_911Calls.shp", "TARGET_FID", "BG_911Calls.swm",
                                                   "CONTIGUITY_EDGES_CORNERS",
                                                   "EUCLIDEAN", "1", "", "", "ROW_STANDARDIZATION", "", "", "", "")

    # Exploratory Regression Analysis for 911 Calls
    # Process: Exploratory Regression
    # Arguments follow the parameter order listed under Syntax.
    er = arcpy.ExploratoryRegression_stats("BG_911Calls.shp",
                                           "Calls",
                                           "Pop;Jobs;LowEduc;Dst2UrbCen;Renters;Unemployed;Businesses;NotInLF;"
                                           "ForgnBorn;AlcoholX;PopDensity;MedIncome;CollGrads;PerCollGrd;"
                                           "PopFY;JobsFY;LowEducFY",
                                           "BG_911Calls.swm", "BG_911Calls.txt", "",
                                           "5", "1", "0.5", "0.05", "7.5", "0.1", "0.1")

except Exception:
    # If an error occurred when running the tool, print out the error message.
    print(arcpy.GetMessages())