Reducing Spatial Leakage in Model Training

TL;DR: Replace random train/validation splits with spatially aware partitioning that enforces a minimum geographic separation between training and validation samples. Use grid-based blocking with a block size of at least 2× your target variable’s spatial correlation range, passed to scikit-learn’s GroupKFold as a group array. This is part of the broader Handling Spatial Autocorrelation workflow — without it, performance metrics will be systematically inflated by geographic proximity rather than genuine predictive signal.

Why Random Splits Fail in Geospatial ML Pipelines

Spatial leakage is one of the most reliably silent failure modes in geospatial machine learning. When a dataset has spatial autocorrelation — nearby locations tend to share similar target values — a random 80/20 split places many validation points within the spatial correlation range of their nearest training neighbors. The model does not need to generalise; it interpolates.

The consequence is a validation R² or AUC that looks excellent in development but collapses the moment the model is deployed to a new watershed, city boundary, or satellite tile it has never seen. A random split that ignores a 3 km correlation range will place validation points an average of far less than 3 km from a training point, so the model never encounters a genuinely novel spatial regime during evaluation. This affects every downstream decision: hyperparameter selection, feature importance ranking, and the threshold at which you decide the model is production-ready.

The failure is compounded when spatial lag features — kernel-smoothed values computed across the full dataset before splitting — are included as model inputs. These features encode neighbor information that would not exist at inference time for unseen regions, creating a second, subtler leakage pathway alongside the split geometry problem.

Core Principles for Leak-Free Spatial Partitioning

Apply these principles before writing any CV code:

Project to a metric CRS first. All distance thresholds — block sizes, buffer radii — must be in metres. Degree-based grids produce wildly variable distances at different latitudes.
Calibrate block size to the empirical correlation range. Fit a semivariogram or compute Moran’s I at multiple lag distances; set block size to at least 2× the range where semivariance plateaus.
Never compute spatial lag features before splitting. Fit any neighbor-averaging transform on the training fold only, then apply it to the validation fold using only training-side neighbors.
Use groups, not stratification. scikit-learn’s GroupKFold guarantees that all observations in the same group go to the same fold. Feed it block or tile IDs, not raw coordinates.
Report per-stratum metrics, not just global aggregates. A model that performs well on average but collapses in dense urban cells or remote rural tiles is not production-ready.
Gate on spatial CV score at deployment. Reject model versions where the ratio of spatial CV score to random CV score falls below a defined threshold (0.85 is a reasonable starting point).

Production-Ready Spatial Blocking Code

The function below creates a deterministic grid, assigns each observation to a block, and returns a GeoDataFrame ready for GroupKFold. It includes CRS validation, edge-case handling for observations outside the grid, and structured logging.

import logging
import geopandas as gpd
import numpy as np
from shapely.geometry import box
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

logger = logging.getLogger(__name__)


def assign_spatial_blocks(
    gdf: gpd.GeoDataFrame,
    block_size: float = 5000.0,
    id_col: str = "block_id",
) -> gpd.GeoDataFrame:
    """Assign a spatial grid block ID to every observation.

    Parameters
    ----------
    gdf:
        Input GeoDataFrame with point or polygon geometry.
        Must be in a projected (metric) CRS — raises ValueError otherwise.
    block_size:
        Grid cell dimension in CRS units (default 5000 m = 5 km).
        Set this to at least 2× the empirical spatial correlation range
        of your target variable.
    id_col:
        Name of the output block-ID column.

    Returns
    -------
    GeoDataFrame with ``id_col`` appended. Observations that fall
    outside the computed grid receive block_id = -1 and are logged.
    """
    if gdf.crs is None:
        raise ValueError("GeoDataFrame has no CRS. Set one before blocking.")
    if gdf.crs.is_geographic:
        raise ValueError(
            f"CRS {gdf.crs.to_string()} is geographic. "
            "Reproject to a metric CRS (e.g. EPSG:3857 or a local UTM) first."
        )

    x_min, y_min, x_max, y_max = gdf.total_bounds

    x_bins = np.arange(x_min, x_max + block_size, block_size)
    y_bins = np.arange(y_min, y_max + block_size, block_size)

    grid_cells = [
        box(x, y, x + block_size, y + block_size)
        for x in x_bins[:-1]
        for y in y_bins[:-1]
    ]
    grid_gdf = gpd.GeoDataFrame(
        {"_cell_id": range(len(grid_cells))},
        geometry=grid_cells,
        crs=gdf.crs,
    )

    joined = gpd.sjoin(gdf, grid_gdf, how="left", predicate="intersects")
    joined[id_col] = joined["_cell_id"].fillna(-1).astype(int)
    joined = joined.drop(columns=["index_right", "_cell_id"], errors="ignore")

    n_unassigned = (joined[id_col] == -1).sum()
    if n_unassigned > 0:
        logger.warning(
            "%d observation(s) fell outside the grid and received block_id=-1. "
            "Exclude or merge them with a neighbouring block before CV.",
            n_unassigned,
        )

    logger.info(
        "Assigned %d observations to %d spatial blocks (block_size=%.0f m).",
        len(joined),
        joined[id_col].nunique(),
        block_size,
    )
    return joined


def run_spatial_cv(
    gdf_blocked: gpd.GeoDataFrame,
    feature_cols: list[str],
    target_col: str,
    block_id_col: str = "block_id",
    n_splits: int = 5,
) -> dict[str, float]:
    """Run spatial GroupKFold CV and return mean ± std of R².

    Parameters
    ----------
    gdf_blocked:
        GeoDataFrame produced by ``assign_spatial_blocks``.
    feature_cols:
        List of numeric feature column names.
    target_col:
        Name of the regression target column.
    block_id_col:
        Column containing spatial block IDs (used as CV groups).
    n_splits:
        Number of CV folds.

    Returns
    -------
    dict with keys ``mean_r2`` and ``std_r2``.
    """
    valid = gdf_blocked[gdf_blocked[block_id_col] != -1].copy()
    X = valid[feature_cols].values
    y = valid[target_col].values
    groups = valid[block_id_col].values

    gkf = GroupKFold(n_splits=n_splits)
    model = RandomForestRegressor(n_estimators=200, random_state=42)

    scores = cross_val_score(
        model, X, y, cv=gkf.split(X, y, groups=groups), scoring="r2"
    )
    logger.info(
        "Spatial CV R²: %.3f ± %.3f over %d folds.",
        scores.mean(), scores.std(), n_splits,
    )
    return {"mean_r2": float(scores.mean()), "std_r2": float(scores.std())}

Step-by-Step Walkthrough

Work through these steps on your own dataset, substituting real file paths and column names.

Step 1 — Reproject to a metric CRS

gdf = gpd.read_file("observations.geojson")
# Always inspect the CRS before any spatial operation
print(gdf.crs)

# Reproject to Web Mercator (global) or a local UTM zone for better accuracy
gdf_m = gdf.to_crs(epsg=3857)

Confirm reprojection succeeded:

assert gdf_m.crs.is_projected, "CRS must be projected before blocking"
assert not gdf_m.crs.is_geographic

Step 2 — Determine block size from the correlation range

Fit a variogram or use a quick Moran’s I sweep across lag distances using the Local Moran’s I workflow to find where spatial dependence drops off:

import libpysal
from esda.moran import Moran

# Build a distance-band weights matrix at candidate ranges
for radius_m in [1000, 2000, 5000, 10000]:
    w = libpysal.weights.DistanceBand.from_dataframe(
        gdf_m, threshold=radius_m, silence_warnings=True
    )
    mi = Moran(gdf_m["target"].values, w)
    print(f"Range {radius_m} m → Moran's I = {mi.I:.3f}  p = {mi.p_sim:.4f}")

Pick the smallest radius where Moran’s I approaches zero and set block_size to 2× that value. If the range is 3 000 m, use block_size=6000.

Step 3 — Assign blocks and inspect coverage

gdf_blocked = assign_spatial_blocks(gdf_m, block_size=6000)

# Sanity check: how many blocks have enough observations for CV?
block_counts = gdf_blocked["block_id"].value_counts()
print(block_counts.describe())

# Drop or merge sparse blocks (fewer than 5 observations) to avoid tiny folds
min_obs = 5
valid_blocks = block_counts[block_counts >= min_obs].index
gdf_blocked = gdf_blocked[gdf_blocked["block_id"].isin(valid_blocks)]

Step 4 — Run spatial CV

feature_cols = ["elevation", "slope", "ndvi", "distance_to_road"]
result = run_spatial_cv(
    gdf_blocked,
    feature_cols=feature_cols,
    target_col="target",
    n_splits=5,
)
print(f"Spatial CV  R² = {result['mean_r2']:.3f} ± {result['std_r2']:.3f}")

For comparison, run a random split CV with the same model to quantify the leakage gap:

from sklearn.model_selection import cross_val_score, KFold

X_all = gdf_blocked[feature_cols].values
y_all = gdf_blocked["target"].values
random_scores = cross_val_score(
    RandomForestRegressor(n_estimators=200, random_state=42),
    X_all, y_all,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="r2",
)
print(f"Random CV   R² = {random_scores.mean():.3f} ± {random_scores.std():.3f}")

A meaningful gap (spatial CV materially below random CV) confirms that spatial leakage was inflating your earlier estimates.

Step 5 — Enforce a deployment gate

LEAKAGE_THRESHOLD = 0.85  # reject if spatial R² < 85% of random R²

ratio = result["mean_r2"] / random_scores.mean()
if ratio < LEAKAGE_THRESHOLD:
    raise RuntimeError(
        f"Spatial CV / random CV ratio = {ratio:.2f} < {LEAKAGE_THRESHOLD}. "
        "Model is over-reliant on spatial proximity. Review block size and features."
    )

Inline SVG: Spatial Blocking Fold Structure

The diagram below shows how a spatial grid partitions observations into five non-overlapping CV folds, with each fold’s validation set separated from the training set by entire blocks rather than random selection.

Verification

After running the full pipeline, confirm that leakage has been eliminated:

# 1. Verify no training observation is within block_size of any validation observation
from shapely.ops import unary_union

for fold_idx, (train_idx, val_idx) in enumerate(
    GroupKFold(n_splits=5).split(
        gdf_blocked, groups=gdf_blocked["block_id"].values
    )
):
    train_geom = unary_union(gdf_blocked.iloc[train_idx].geometry.buffer(0))
    val_pts = gdf_blocked.iloc[val_idx].geometry
    # No validation point should touch the unioned training geometry
    # (for strict buffer exclusion, substitute buffer(block_size/2) above)
    overlaps = val_pts.intersects(train_geom).sum()
    assert overlaps == 0, f"Fold {fold_idx}: {overlaps} validation pts overlap training geometry"
    print(f"Fold {fold_idx}: OK — no spatial overlap")

# 2. Log the spatial CV vs random CV gap
print(f"Leakage gap: {random_scores.mean() - result['mean_r2']:.3f} R² units")

A gap above 0.10 R² is a strong signal that spatial autocorrelation was driving random-CV performance. A gap below 0.03 suggests either the data has genuinely low spatial dependence or the block size is too large and is discarding valid training signal.

FAQ

How do I choose the right block size for spatial CV?

Fit a semivariogram to your target variable and read off the range — the distance beyond which semivariance plateaus. Set your block size to at least 2× that range. If the range is 3 km, blocks should be at least 6 km wide. When semivariogram fitting is impractical, run CV at several block sizes and pick the smallest size where spatial CV R² stabilises — further increases in block size should not materially change the score.

Does spatial blocking always lower my CV score?

Yes, almost always — and that is the point. The drop reveals the true generalisation gap that random splits hide. A model scoring 0.91 R² under random CV but 0.64 under spatial CV is not a worse model; it is an honestly evaluated one. Treat the spatial CV score as the realistic production estimate and tune hyperparameters against it, not the random-CV score.

Can I use spatial blocking with raster datasets?

Yes. Assign each pixel to a grid tile based on its centroid coordinates, then treat tile IDs as the group variable in GroupKFold. For large rasters, use a coarser tiling than your pixel resolution — tile dimensions of 10–50× the pixel size are typical. The same 2× correlation-range rule applies; fit a variogram on a sampled subset of pixels if the full raster is too large.

Part of: Handling Spatial Autocorrelation — Training Geospatial Predictive Models in Python