Use Moran’s I on residuals to detect geographic clustering of model errors, and pair it with spatially weighted RMSE to surface hidden regional bias. Run both diagnostics inside your Spatial Cross-Validation Strategies loop so problems are caught during training, not after deployment.
Part of: Spatial Cross-Validation Strategies
Why Standard Metrics Fail Geospatial ML Pipelines
Standard metrics such as RMSE, MAE, and F1-score treat every observation as independent. In geospatial data that assumption is wrong by design. Tobler’s First Law of Geography — nearby locations are more similar than distant ones — means that training and test observations sampled from adjacent areas share latent environmental, topographic, or socioeconomic drivers. A model can achieve a high aggregate R² simply by interpolating between spatially clustered training points, never learning the causal feature relationships needed to generalise to new regions.
The failure is silent. You receive a convincing validation score, ship the model, and discover systematic collapse in the deployment region months later. Two specific failure modes are common:
- Spatial leakage masking: When folds are assigned randomly, nearby training and test points share variance. The model memorises location rather than features. Reducing Spatial Leakage in Model Training addresses the partitioning side of this problem; the metrics below address the detection side.
- Regional bias invisibility: Aggregate RMSE averages over all locations. A model may perform acceptably on flat agricultural plains while failing badly on coastal margins or high-elevation terrain. Unless you stratify and weight your metrics geographically, this bias is invisible until production.
Both failure modes are detectable — and correctable — if you apply spatially explicit diagnostics at evaluation time.
Core Spatial Metrics Principles
- Residual-first: Compute prediction residuals before any metric. Residuals carry the full spatial signal; aggregating too early destroys it.
- Weights matrix before statistics: Every spatial metric depends on a definition of neighbourhood. Build the spatial weights matrix explicitly and inspect it for disconnected components before computing any statistic.
- Permutation significance, not asymptotic: Moran’s I permutation p-values are more reliable than asymptotic approximations on irregular spatial point patterns. Use
permutations=999minimum. - Weight by operational cost, not area: Spatially weighted RMSE should reflect where errors matter, not which cells are largest. In public-health models, weight by population density; in infrastructure risk models, weight by asset criticality.
- Stratify at deployment resolution: Hexagonal grids at the scale your model will be deployed to are more honest than administrative boundaries, which often mask intra-region variance.
- Track drift over time: Log spatial metrics per model version. A sudden rise in residual Moran’s I in production often precedes broader model degradation before any aggregate metric reacts.
Spatial Metrics Reference
| Metric | Purpose | Actionable Threshold |
|---|---|---|
| Moran’s I on residuals | Quantifies spatial autocorrelation in model errors | |I| > 0.1 with p_sim < 0.05 — investigate missing spatial covariates or tighten CV blocking |
| Spatially weighted RMSE | Penalises errors in high-priority or data-sparse zones | SW-RMSE substantially above standard RMSE signals error concentration in weighted regions |
| Geographically stratified MAE | Computes MAE within spatial strata | Any stratum exceeding 2× median MAE warrants targeted data collection or sub-model tuning |
| Boundary-weighted IoU | Evaluates raster segmentation or land-cover mapping | Penalises fragmented, misaligned boundaries; more robust than pixel accuracy for spatial objects |
| Spatial concordance index | Rank-order agreement accounting for spatial lag | Use for ordinal environmental variables where relative ordering matters more than absolute error |
Production-Ready Code
The function below computes Moran’s I on residuals and spatially weighted RMSE from a GeoDataFrame containing geometry, ground truth, and predictions. It uses geopandas, libpysal, esda, and scikit-learn.
import logging
import warnings
from typing import Optional
import numpy as np
import geopandas as gpd
from sklearn.metrics import mean_squared_error
from libpysal.weights import KNN
from esda.moran import Moran
logger = logging.getLogger(__name__)
def evaluate_spatial_metrics(
gdf: gpd.GeoDataFrame,
y_true_col: str,
y_pred_col: str,
k_neighbors: int = 4,
permutations: int = 999,
custom_weights: Optional[np.ndarray] = None,
) -> dict:
"""
Compute spatially explicit evaluation metrics for geospatial ML models.
Parameters
----------
gdf : GeoDataFrame
Must contain geometry, ground truth, and predictions.
Point geometries recommended; polygon centroids used otherwise.
y_true_col : str
Column name for observed values.
y_pred_col : str
Column name for predicted values.
k_neighbors : int
Number of nearest neighbours for spatial weights construction.
permutations : int
Permutations for Moran's I significance testing.
Use 999 for development, 9999 for production reporting.
custom_weights : np.ndarray, optional
Domain-specific observation weights (e.g., population density).
If None, defaults to inverse cardinality decay from KNN graph.
Returns
-------
dict with keys: rmse, sw_rmse, moran_i, moran_p_sim, significant_bias
"""
if gdf.crs is None:
raise ValueError(
"GeoDataFrame must have a defined CRS. "
"Call gdf.set_crs() or gdf.to_crs() before evaluation."
)
if not gdf.crs.is_projected:
raise ValueError(
f"CRS {gdf.crs} is geographic (degrees). "
"Reproject to a metric CRS (e.g. EPSG:32633) before computing "
"distance-based spatial weights."
)
missing = {y_true_col, y_pred_col} - set(gdf.columns)
if missing:
raise KeyError(f"Columns not found in GeoDataFrame: {missing}")
# Use point geometry; fall back to centroid for polygon/line inputs
geom_type = gdf.geometry.geom_type.iloc[0]
if geom_type != "Point":
logger.warning(
"Non-point geometry detected (%s). Using centroids for spatial weights.",
geom_type,
)
gdf = gdf.copy()
gdf["geometry"] = gdf.geometry.centroid
residuals = gdf[y_true_col].values - gdf[y_pred_col].values
standard_rmse = float(
np.sqrt(mean_squared_error(gdf[y_true_col], gdf[y_pred_col]))
)
logger.info("Standard RMSE: %.4f", standard_rmse)
# Build KNN spatial weights matrix
w = KNN.from_dataframe(gdf, k=k_neighbors)
if w.n_components > 1:
warnings.warn(
f"Spatial weights contain {w.n_components} disconnected components. "
"Increase k_neighbors or verify geometry topology.",
stacklevel=2,
)
w.transform = "r" # Row-standardise before computing Moran's I
moran_result = Moran(residuals, w, permutations=permutations)
logger.info(
"Moran's I: %.4f p_sim: %.4f", moran_result.I, moran_result.p_sim
)
# Spatially weighted RMSE
if custom_weights is not None:
if len(custom_weights) != len(gdf):
raise ValueError(
f"custom_weights length {len(custom_weights)} does not match "
f"GeoDataFrame length {len(gdf)}."
)
sw_weights = np.asarray(custom_weights, dtype=float)
else:
# Approximate inverse-distance decay via KNN cardinality
sw_weights = np.array([1.0 / (1.0 + d) for d in w.cardinality])
sw_rmse = float(np.sqrt(np.average(residuals ** 2, weights=sw_weights)))
logger.info("SW-RMSE: %.4f", sw_rmse)
return {
"rmse": standard_rmse,
"sw_rmse": sw_rmse,
"moran_i": float(moran_result.I),
"moran_p_sim": float(moran_result.p_sim),
"significant_bias": bool(moran_result.p_sim < 0.05),
}A second helper computes geographically stratified MAE, which is useful for spotting regional underperformance that aggregate metrics hide.
import pandas as pd
def stratified_spatial_mae(
gdf: gpd.GeoDataFrame,
y_true_col: str,
y_pred_col: str,
strata_col: str,
) -> pd.DataFrame:
"""
Compute MAE within each spatial stratum.
Parameters
----------
gdf : GeoDataFrame
Must contain strata_col, y_true_col, y_pred_col.
strata_col : str
Column identifying spatial strata (e.g., admin region, H3 cell, fold ID).
Returns
-------
DataFrame with columns: stratum, n_samples, mae, relative_mae
"""
gdf = gdf.copy()
gdf["_abs_err"] = (gdf[y_true_col] - gdf[y_pred_col]).abs()
summary = (
gdf.groupby(strata_col)["_abs_err"]
.agg(n_samples="count", mae="mean")
.reset_index()
.rename(columns={strata_col: "stratum"})
)
global_mae = gdf["_abs_err"].mean()
summary["relative_mae"] = summary["mae"] / global_mae
return summary.sort_values("relative_mae", ascending=False)Step-by-Step Walkthrough
The following steps apply both functions to a concrete evaluation scenario: a random-forest regressor trained on a regional soil-carbon dataset.
1. Load and project the evaluation GeoDataFrame.
Start from a GeoDataFrame that contains geometry, ground truth in soc_true, and predictions in soc_pred. Reproject to a metric CRS immediately — distance-based spatial weights require it.
gdf = gpd.read_file("soil_carbon_test.geojson")
gdf = gdf.to_crs("EPSG:32632") # UTM zone 32N, metres2. Confirm predictions are aligned.
Before computing any spatial metric, verify that the column values look sane. A single misconfigured preprocessing step can silently shift residuals by an order of magnitude.
assert len(gdf) > 0, "Empty GeoDataFrame after loading"
assert gdf["soc_true"].notna().all(), "NaN values in ground truth"
assert gdf["soc_pred"].notna().all(), "NaN values in predictions"
print(gdf[["soc_true", "soc_pred"]].describe())3. Run the core spatial diagnostics.
metrics = evaluate_spatial_metrics(
gdf,
y_true_col="soc_true",
y_pred_col="soc_pred",
k_neighbors=6,
permutations=999,
)
print(metrics)
# Example output:
# {'rmse': 3.14, 'sw_rmse': 4.71, 'moran_i': 0.23, 'moran_p_sim': 0.001,
# 'significant_bias': True}In this example the SW-RMSE is 50% higher than standard RMSE and Moran’s I is significant. Both signals point to the same problem: errors are not random — they cluster in a specific part of the study area, and that area carries high operational weight.
4. Stratify by region to isolate the problematic zone.
Use a region column derived from administrative boundaries or from the Implementing SpatialKFold in Python block IDs used during training.
region_mae = stratified_spatial_mae(
gdf,
y_true_col="soc_true",
y_pred_col="soc_pred",
strata_col="region",
)
print(region_mae.head(5))
# stratum n_samples mae relative_mae
# NW_highlands 34 6.2 2.31
# ...Any stratum with relative_mae > 2.0 is a candidate for targeted data augmentation or a sub-model.
5. Apply custom operational weights if available.
Pass a custom_weights array derived from population density, asset density, or any domain cost surface to align SW-RMSE with what actually matters in deployment.
import numpy as np
# Synthetic example: weight by normalised population density
pop_weights = gdf["pop_density"].values.astype(float)
pop_weights /= pop_weights.sum()
metrics_pop = evaluate_spatial_metrics(
gdf,
y_true_col="soc_true",
y_pred_col="soc_pred",
custom_weights=pop_weights,
)
print(f"Population-weighted RMSE: {metrics_pop['sw_rmse']:.3f}")Verification
Confirm the diagnostics are producing credible outputs before using them to gate deployments.
# 1. Shuffle residuals to verify Moran's I approaches 0 under the null
import numpy as np
shuffled_gdf = gdf.copy()
shuffled_gdf["soc_pred"] = np.random.permutation(gdf["soc_pred"].values)
null_metrics = evaluate_spatial_metrics(
shuffled_gdf, "soc_true", "soc_pred", permutations=199
)
assert abs(null_metrics["moran_i"]) < 0.1, (
f"Moran's I on shuffled residuals is {null_metrics['moran_i']:.3f}; "
"expected near zero. Check weights matrix construction."
)
print("Null check passed — Moran's I on shuffled residuals is near zero.")
# 2. Confirm SW-RMSE >= standard RMSE when weights are uniform
uniform_weights = np.ones(len(gdf)) / len(gdf)
uniform_metrics = evaluate_spatial_metrics(
gdf, "soc_true", "soc_pred", custom_weights=uniform_weights
)
assert uniform_metrics["sw_rmse"] >= uniform_metrics["rmse"] * 0.99, (
"SW-RMSE with uniform weights should match standard RMSE."
)
print("Weight check passed.")The SVG below shows how spatial diagnostic output — Moran’s I and SW-RMSE — feeds back into the training loop, alongside the conventional metrics path.
FAQ
How do I interpret a Moran’s I value of 0.23 on my residuals?
A value of 0.23 with p_sim < 0.05 is actionable. It means roughly 23% of the variance in your residuals can be explained by geographic proximity — your model is missing spatially structured signals. First, check whether your training data covers the high-error regions; sparse sampling often drives this. Then consider adding spatial covariates (elevation, land cover, distance to water), applying a spatial lag feature (see Handling Spatial Autocorrelation), or tightening the blocking radius in your CV splitter.
My SW-RMSE is much higher than standard RMSE. What does that mean?
It means model errors concentrate in the locations your weight vector identifies as important. If you are using population-density weights, errors are occurring in populated areas. Investigate whether those areas are underrepresented in training, have different feature distributions, or contain a distinct land-use regime. Sub-region fine-tuning or targeted data collection for those areas is usually the next step.
Can I use these metrics with raster outputs, not just point data?
Yes, but you need to vectorise the residual surface first. Convert raster predictions and observations to a point GeoDataFrame using rasterio with a regular sampling grid, then apply evaluate_spatial_metrics as above. For boundary-quality evaluation of categorical outputs (e.g., land-cover maps), compute boundary-weighted IoU directly on the raster using scikit-image morphological operations before converting to points.
Related
- Implementing SpatialKFold in Python — build the custom CV splitter that generates the fold structure these metrics evaluate
- Reducing Spatial Leakage in Model Training — address partitioning-side leakage that inflates the metrics this page detects
- Handling Spatial Autocorrelation — understand the spatial dependence structure that drives residual clustering
- Spatial Cross-Validation Strategies — the parent page covering the full taxonomy of spatial CV methods
Up: Spatial Cross-Validation Strategies | Training Geospatial Predictive Models in Python