Evaluating Model Performance with Spatial Metrics

Evaluating model performance with spatial metrics requires moving beyond global accuracy scores to quantify spatial autocorrelation in residuals, account for geographic heterogeneity, and validate predictions against location-aware baselines. Standard metrics like RMSE or F1-score treat observations as independent, which violates the fundamental structure of geospatial data. When training and test sets share geographic proximity, conventional validation inflates performance by rewarding spatial interpolation rather than true generalization. To build robust geospatial ML systems, you must pair Spatial Cross-Validation Strategies with spatially explicit evaluation techniques that detect leakage, quantify regional bias, and align error penalties with operational priorities.

Why Standard Metrics Fail Geospatial Data

Spatial data inherently violates the independent and identically distributed (IID) assumption. Tobler’s First Law of Geography states that nearby locations are more similar than distant ones, meaning proximal observations share latent environmental, socioeconomic, or physical drivers. Predictive models often exploit these spatial dependencies instead of learning causal feature relationships, effectively memorizing location. This spatial leakage produces overoptimistic validation scores that collapse when deployed to unseen regions.

Detecting this bias requires computing prediction residuals, constructing a spatial weights matrix, and testing for residual clustering. A statistically insignificant global Moran’s I on residuals confirms spatial independence, while significant positive values indicate systematic geographic bias. For authoritative guidance on spatial weights construction and autocorrelation testing, consult the PySAL documentation.

Core Spatial Metrics for Geospatial ML

Metric	Purpose	Interpretation
Moran’s I on Residuals	Quantifies spatial autocorrelation in model errors	`+1` = clustered errors (systematic bias), `0` = random, `-1` = dispersed. Use permutation `p_sim < 0.05` for significance.
Spatially Weighted RMSE (SW-RMSE)	Penalizes errors in high-priority or data-sparse zones	Higher than standard RMSE indicates error concentration in weighted regions. Aligns evaluation with business or ecological impact.
Geographically Stratified Metrics	Computes precision, recall, or MAE within spatial blocks	Exposes regional performance variance. Use hexagonal grids, watersheds, or administrative boundaries for stratification.
Boundary-Weighted F1 & Spatial IoU	Evaluates raster segmentation or land-cover mapping	Penalizes fragmented predictions and misaligned boundaries. More robust than pixel-level accuracy for spatial objects.
Spatial Concordance Index	Measures rank-order agreement accounting for spatial lag	Useful for ordinal or continuous environmental variables where relative ordering matters more than absolute error.

Implementation Workflow

Integrating these metrics into a Training Geospatial Predictive Models in Python pipeline requires a consistent evaluation routine. The workflow follows three steps: compute residuals, define a spatial weights matrix, and calculate spatial diagnostics. Below is a production-ready function that computes Moran’s I on residuals and SW-RMSE using geopandas, libpysal, esda, and scikit-learn.

import numpy as np
import geopandas as gpd
from sklearn.metrics import mean_squared_error
from libpysal.weights import KNN
from esda.moran import Moran
import warnings

def evaluate_spatial_metrics(
    gdf: gpd.GeoDataFrame,
    y_true_col: str,
    y_pred_col: str,
    k_neighbors: int = 4,
    permutations: int = 999,
    custom_weights: np.ndarray = None
) -> dict:
    """
    Compute spatially explicit evaluation metrics for geospatial ML models.

    Parameters
    ----------
    gdf : GeoDataFrame
        Must contain geometry, ground truth, and predictions. Assumes point data.
    y_true_col : str
        Column name for observed values.
    y_pred_col : str
        Column name for predicted values.
    k_neighbors : int
        Number of nearest neighbors for spatial weights construction.
    permutations : int
        Number of permutations for Moran's I significance testing.
    custom_weights : np.ndarray, optional
        Domain-specific weights (e.g., population density, cost surfaces).
        If None, defaults to inverse-distance decay from KNN.

    Returns
    -------
    dict
        Dictionary containing standard RMSE, SW-RMSE, Moran's I, and p-value.
    """
    if gdf.crs is None:
        raise ValueError("GeoDataFrame must have a defined CRS for spatial operations.")

    residuals = gdf[y_true_col] - gdf[y_pred_col]
    standard_rmse = np.sqrt(mean_squared_error(gdf[y_true_col], gdf[y_pred_col]))

    # Build spatial weights (KNN recommended for point datasets)
    w = KNN.from_dataframe(gdf, k=k_neighbors)

    if w.n_components > 1:
        warnings.warn("Spatial weights contain disconnected components. "
                      "Increase k_neighbors or verify geometry topology.")

    w.transform = "R"  # Row-standardize for Moran's I

    # Compute Moran's I on residuals
    moran_result = Moran(residuals.values, w, permutations=permutations)

    # Compute Spatially Weighted RMSE
    if custom_weights is not None:
        sw_weights = custom_weights
    else:
        # Default: inverse-distance decay approximation using KNN cardinality
        sw_weights = np.array([1.0 / (1.0 + d) for d in w.cardinality])

    sw_rmse = np.sqrt(np.average(residuals**2, weights=sw_weights))

    return {
        "rmse": standard_rmse,
        "sw_rmse": sw_rmse,
        "moran_i": moran_result.I,
        "moran_p_sim": moran_result.p_sim,
        "significant_bias": moran_result.p_sim < 0.05
    }

Interpretation & MLOps Integration

After computing these metrics, interpret them in context. A significant positive Moran’s I (p_sim < 0.05) signals that errors cluster geographically, often due to unmodeled spatial trends, missing covariates, or inadequate cross-validation. SW-RMSE values substantially higher than standard RMSE indicate that errors concentrate in high-weight zones, requiring targeted data collection or model recalibration. For continuous monitoring, log spatial metrics alongside traditional scores in your MLflow or Weights & Biases dashboard. Track drift by comparing baseline spatial autocorrelation against production residuals; sudden shifts in Moran’s I often precede model degradation in environmental or urban applications.

When deploying models at scale, automate spatial validation gates. Reject deployments if stratified MAE exceeds regional thresholds or if residual clustering emerges in newly observed territories. Pair these checks with spatial cross-validation during training to ensure your evaluation reflects real-world geographic generalization. For comprehensive guidance on structuring these pipelines, refer to the official scikit-learn cross-validation documentation and adapt it with geospatial-aware splitters.

Thresholds & Best Practices

Moran’s I Thresholds: Treat |I| > 0.1 with p < 0.05 as actionable spatial bias. Investigate feature engineering or add spatial lag terms.
SW-RMSE Calibration: Align weights with operational cost functions. In public health, weight by population density; in infrastructure, weight by asset criticality.
Stratification Granularity: Use hexagonal grids (h3 or geopandas tessellation) at scales matching your deployment footprint. Avoid arbitrary administrative boundaries that mask intra-region variance.
Permutation Count: Use permutations=999 for development, 9999 for production reporting. Higher counts stabilize p_sim but increase compute time.
Raster vs. Vector: For raster outputs, convert predictions to polygons or use rasterio + scikit-image for boundary-aware IoU before applying spatial metrics.

Evaluating model performance with spatial metrics transforms geospatial ML from black-box interpolation into auditable, production-ready systems. By quantifying residual autocorrelation, weighting errors by operational priority, and stratifying performance across geographic zones, you eliminate proximity bias and build models that generalize beyond training footprints. Integrate these diagnostics early in your validation workflow to catch spatial leakage before deployment and maintain robust performance across dynamic landscapes.