Evaluating Model Performance with Spatial Metrics

Evaluate geospatial ML performance with spatial metrics: quantify residual autocorrelation, detect regional bias, and apply spatial validation techniques in Python.

Use Moran’s I on residuals to detect geographic clustering of model errors, and pair it with spatially weighted RMSE to surface hidden regional bias. Run both diagnostics inside your Spatial Cross-Validation Strategies loop so problems are caught during training, not after deployment.

Part of: Spatial Cross-Validation Strategies

Why Standard Metrics Fail Geospatial ML Pipelines

Standard metrics such as RMSE, MAE, and F1-score treat every observation as independent. In geospatial data that assumption is wrong by design. Tobler’s First Law of Geography — nearby locations are more similar than distant ones — means that training and test observations sampled from adjacent areas share latent environmental, topographic, or socioeconomic drivers. A model can achieve a high aggregate R² simply by interpolating between spatially clustered training points, never learning the causal feature relationships needed to generalise to new regions.

The failure is silent. You receive a convincing validation score, ship the model, and discover systematic collapse in the deployment region months later. Two specific failure modes are common:

  • Spatial leakage masking: When folds are assigned randomly, nearby training and test points share variance. The model memorises location rather than features. Reducing Spatial Leakage in Model Training addresses the partitioning side of this problem; the metrics below address the detection side.
  • Regional bias invisibility: Aggregate RMSE averages over all locations. A model may perform acceptably on flat agricultural plains while failing badly on coastal margins or high-elevation terrain. Unless you stratify and weight your metrics geographically, this bias is invisible until production.

Both failure modes are detectable — and correctable — if you apply spatially explicit diagnostics at evaluation time.

Core Spatial Metrics Principles

  • Residual-first: Compute prediction residuals before any metric. Residuals carry the full spatial signal; aggregating too early destroys it.
  • Weights matrix before statistics: Every spatial metric depends on a definition of neighbourhood. Build the spatial weights matrix explicitly and inspect it for disconnected components before computing any statistic.
  • Permutation significance, not asymptotic: Moran’s I permutation p-values are more reliable than asymptotic approximations on irregular spatial point patterns. Use permutations=999 minimum.
  • Weight by operational cost, not area: Spatially weighted RMSE should reflect where errors matter, not which cells are largest. In public-health models, weight by population density; in infrastructure risk models, weight by asset criticality.
  • Stratify at deployment resolution: Hexagonal grids at the scale your model will be deployed to are more honest than administrative boundaries, which often mask intra-region variance.
  • Track drift over time: Log spatial metrics per model version. A sudden rise in residual Moran’s I in production often precedes broader model degradation before any aggregate metric reacts.

Spatial Metrics Reference

Metric Purpose Actionable Threshold
Moran’s I on residuals Quantifies spatial autocorrelation in model errors |I| > 0.1 with p_sim < 0.05 — investigate missing spatial covariates or tighten CV blocking
Spatially weighted RMSE Penalises errors in high-priority or data-sparse zones SW-RMSE substantially above standard RMSE signals error concentration in weighted regions
Geographically stratified MAE Computes MAE within spatial strata Any stratum exceeding 2× median MAE warrants targeted data collection or sub-model tuning
Boundary-weighted IoU Evaluates raster segmentation or land-cover mapping Penalises fragmented, misaligned boundaries; more robust than pixel accuracy for spatial objects
Spatial concordance index Rank-order agreement accounting for spatial lag Use for ordinal environmental variables where relative ordering matters more than absolute error

Production-Ready Code

The function below computes Moran’s I on residuals and spatially weighted RMSE from a GeoDataFrame containing geometry, ground truth, and predictions. It uses geopandas, libpysal, esda, and scikit-learn.

import logging
import warnings
from typing import Optional

import numpy as np
import geopandas as gpd
from sklearn.metrics import mean_squared_error
from libpysal.weights import KNN
from esda.moran import Moran

logger = logging.getLogger(__name__)


def evaluate_spatial_metrics(
    gdf: gpd.GeoDataFrame,
    y_true_col: str,
    y_pred_col: str,
    k_neighbors: int = 4,
    permutations: int = 999,
    custom_weights: Optional[np.ndarray] = None,
) -> dict:
    """
    Compute spatially explicit evaluation metrics for geospatial ML models.

    Parameters
    ----------
    gdf : GeoDataFrame
        Must contain geometry, ground truth, and predictions.
        Point geometries recommended; polygon centroids used otherwise.
    y_true_col : str
        Column name for observed values.
    y_pred_col : str
        Column name for predicted values.
    k_neighbors : int
        Number of nearest neighbours for spatial weights construction.
    permutations : int
        Permutations for Moran's I significance testing.
        Use 999 for development, 9999 for production reporting.
    custom_weights : np.ndarray, optional
        Domain-specific observation weights (e.g., population density).
        If None, defaults to inverse cardinality decay from KNN graph.

    Returns
    -------
    dict with keys: rmse, sw_rmse, moran_i, moran_p_sim, significant_bias
    """
    if gdf.crs is None:
        raise ValueError(
            "GeoDataFrame must have a defined CRS. "
            "Call gdf.set_crs() or gdf.to_crs() before evaluation."
        )

    if not gdf.crs.is_projected:
        raise ValueError(
            f"CRS {gdf.crs} is geographic (degrees). "
            "Reproject to a metric CRS (e.g. EPSG:32633) before computing "
            "distance-based spatial weights."
        )

    missing = {y_true_col, y_pred_col} - set(gdf.columns)
    if missing:
        raise KeyError(f"Columns not found in GeoDataFrame: {missing}")

    # Use point geometry; fall back to centroid for polygon/line inputs
    geom_type = gdf.geometry.geom_type.iloc[0]
    if geom_type != "Point":
        logger.warning(
            "Non-point geometry detected (%s). Using centroids for spatial weights.",
            geom_type,
        )
        gdf = gdf.copy()
        gdf["geometry"] = gdf.geometry.centroid

    residuals = gdf[y_true_col].values - gdf[y_pred_col].values
    standard_rmse = float(
        np.sqrt(mean_squared_error(gdf[y_true_col], gdf[y_pred_col]))
    )
    logger.info("Standard RMSE: %.4f", standard_rmse)

    # Build KNN spatial weights matrix
    w = KNN.from_dataframe(gdf, k=k_neighbors)

    if w.n_components > 1:
        warnings.warn(
            f"Spatial weights contain {w.n_components} disconnected components. "
            "Increase k_neighbors or verify geometry topology.",
            stacklevel=2,
        )

    w.transform = "r"  # Row-standardise before computing Moran's I

    moran_result = Moran(residuals, w, permutations=permutations)
    logger.info(
        "Moran's I: %.4f  p_sim: %.4f", moran_result.I, moran_result.p_sim
    )

    # Spatially weighted RMSE
    if custom_weights is not None:
        if len(custom_weights) != len(gdf):
            raise ValueError(
                f"custom_weights length {len(custom_weights)} does not match "
                f"GeoDataFrame length {len(gdf)}."
            )
        sw_weights = np.asarray(custom_weights, dtype=float)
    else:
        # Approximate inverse-distance decay via KNN cardinality
        sw_weights = np.array([1.0 / (1.0 + d) for d in w.cardinality])

    sw_rmse = float(np.sqrt(np.average(residuals ** 2, weights=sw_weights)))
    logger.info("SW-RMSE: %.4f", sw_rmse)

    return {
        "rmse": standard_rmse,
        "sw_rmse": sw_rmse,
        "moran_i": float(moran_result.I),
        "moran_p_sim": float(moran_result.p_sim),
        "significant_bias": bool(moran_result.p_sim < 0.05),
    }

A second helper computes geographically stratified MAE, which is useful for spotting regional underperformance that aggregate metrics hide.

import pandas as pd


def stratified_spatial_mae(
    gdf: gpd.GeoDataFrame,
    y_true_col: str,
    y_pred_col: str,
    strata_col: str,
) -> pd.DataFrame:
    """
    Compute MAE within each spatial stratum.

    Parameters
    ----------
    gdf : GeoDataFrame
        Must contain strata_col, y_true_col, y_pred_col.
    strata_col : str
        Column identifying spatial strata (e.g., admin region, H3 cell, fold ID).

    Returns
    -------
    DataFrame with columns: stratum, n_samples, mae, relative_mae
    """
    gdf = gdf.copy()
    gdf["_abs_err"] = (gdf[y_true_col] - gdf[y_pred_col]).abs()

    summary = (
        gdf.groupby(strata_col)["_abs_err"]
        .agg(n_samples="count", mae="mean")
        .reset_index()
        .rename(columns={strata_col: "stratum"})
    )

    global_mae = gdf["_abs_err"].mean()
    summary["relative_mae"] = summary["mae"] / global_mae

    return summary.sort_values("relative_mae", ascending=False)

Step-by-Step Walkthrough

The following steps apply both functions to a concrete evaluation scenario: a random-forest regressor trained on a regional soil-carbon dataset.

1. Load and project the evaluation GeoDataFrame.

Start from a GeoDataFrame that contains geometry, ground truth in soc_true, and predictions in soc_pred. Reproject to a metric CRS immediately — distance-based spatial weights require it.

gdf = gpd.read_file("soil_carbon_test.geojson")
gdf = gdf.to_crs("EPSG:32632")  # UTM zone 32N, metres

2. Confirm predictions are aligned.

Before computing any spatial metric, verify that the column values look sane. A single misconfigured preprocessing step can silently shift residuals by an order of magnitude.

assert len(gdf) > 0, "Empty GeoDataFrame after loading"
assert gdf["soc_true"].notna().all(), "NaN values in ground truth"
assert gdf["soc_pred"].notna().all(), "NaN values in predictions"

print(gdf[["soc_true", "soc_pred"]].describe())

3. Run the core spatial diagnostics.

metrics = evaluate_spatial_metrics(
    gdf,
    y_true_col="soc_true",
    y_pred_col="soc_pred",
    k_neighbors=6,
    permutations=999,
)
print(metrics)
# Example output:
# {'rmse': 3.14, 'sw_rmse': 4.71, 'moran_i': 0.23, 'moran_p_sim': 0.001,
#  'significant_bias': True}

In this example the SW-RMSE is 50% higher than standard RMSE and Moran’s I is significant. Both signals point to the same problem: errors are not random — they cluster in a specific part of the study area, and that area carries high operational weight.

4. Stratify by region to isolate the problematic zone.

Use a region column derived from administrative boundaries or from the Implementing SpatialKFold in Python block IDs used during training.

region_mae = stratified_spatial_mae(
    gdf,
    y_true_col="soc_true",
    y_pred_col="soc_pred",
    strata_col="region",
)
print(region_mae.head(5))
# stratum  n_samples  mae   relative_mae
# NW_highlands  34   6.2   2.31
# ...

Any stratum with relative_mae > 2.0 is a candidate for targeted data augmentation or a sub-model.

5. Apply custom operational weights if available.

Pass a custom_weights array derived from population density, asset density, or any domain cost surface to align SW-RMSE with what actually matters in deployment.

import numpy as np

# Synthetic example: weight by normalised population density
pop_weights = gdf["pop_density"].values.astype(float)
pop_weights /= pop_weights.sum()

metrics_pop = evaluate_spatial_metrics(
    gdf,
    y_true_col="soc_true",
    y_pred_col="soc_pred",
    custom_weights=pop_weights,
)
print(f"Population-weighted RMSE: {metrics_pop['sw_rmse']:.3f}")

Verification

Confirm the diagnostics are producing credible outputs before using them to gate deployments.

# 1. Shuffle residuals to verify Moran's I approaches 0 under the null
import numpy as np

shuffled_gdf = gdf.copy()
shuffled_gdf["soc_pred"] = np.random.permutation(gdf["soc_pred"].values)

null_metrics = evaluate_spatial_metrics(
    shuffled_gdf, "soc_true", "soc_pred", permutations=199
)
assert abs(null_metrics["moran_i"]) < 0.1, (
    f"Moran's I on shuffled residuals is {null_metrics['moran_i']:.3f}; "
    "expected near zero. Check weights matrix construction."
)
print("Null check passed — Moran's I on shuffled residuals is near zero.")

# 2. Confirm SW-RMSE >= standard RMSE when weights are uniform
uniform_weights = np.ones(len(gdf)) / len(gdf)
uniform_metrics = evaluate_spatial_metrics(
    gdf, "soc_true", "soc_pred", custom_weights=uniform_weights
)
assert uniform_metrics["sw_rmse"] >= uniform_metrics["rmse"] * 0.99, (
    "SW-RMSE with uniform weights should match standard RMSE."
)
print("Weight check passed.")

The SVG below shows how spatial diagnostic output — Moran’s I and SW-RMSE — feeds back into the training loop, alongside the conventional metrics path.

Spatial Metrics Feedback Loop A flow diagram with two parallel paths from Model Output. The left path shows Standard Metrics (RMSE, F1) flowing to Aggregate Score and then to Model Registry. The right path shows Spatial Diagnostics (Moran's I, SW-RMSE) flowing to Spatial Bias Detected, which splits into a pass path to Model Registry and a fail path looping back to Feature Engineering and then back to Model Training, which feeds into Model Output again. Model Output Standard Metrics (RMSE, F1) Spatial Diagnostics (Moran's I, SW-RMSE) Aggregate Score Spatial Bias Detected? p_sim < 0.05 or SW-RMSE ↑ Model Registry Feature Engineering (add spatial covariates) pass pass fail

FAQ

How do I interpret a Moran’s I value of 0.23 on my residuals?

A value of 0.23 with p_sim < 0.05 is actionable. It means roughly 23% of the variance in your residuals can be explained by geographic proximity — your model is missing spatially structured signals. First, check whether your training data covers the high-error regions; sparse sampling often drives this. Then consider adding spatial covariates (elevation, land cover, distance to water), applying a spatial lag feature (see Handling Spatial Autocorrelation), or tightening the blocking radius in your CV splitter.

My SW-RMSE is much higher than standard RMSE. What does that mean?

It means model errors concentrate in the locations your weight vector identifies as important. If you are using population-density weights, errors are occurring in populated areas. Investigate whether those areas are underrepresented in training, have different feature distributions, or contain a distinct land-use regime. Sub-region fine-tuning or targeted data collection for those areas is usually the next step.

Can I use these metrics with raster outputs, not just point data?

Yes, but you need to vectorise the residual surface first. Convert raster predictions and observations to a point GeoDataFrame using rasterio with a regular sampling grid, then apply evaluate_spatial_metrics as above. For boundary-quality evaluation of categorical outputs (e.g., land-cover maps), compute boundary-weighted IoU directly on the raster using scikit-image morphological operations before converting to points.

Up: Spatial Cross-Validation Strategies | Training Geospatial Predictive Models in Python