Spatial Cross-Validation Strategies

Implement spatial cross-validation in Python: prevent spatial leakage, build geographically stratified folds, and evaluate geospatial ML models on unseen regions.

Standard k-fold cross-validation assumes independent and identically distributed (i.i.d.) samples — an assumption that fundamentally breaks down in geospatial machine learning. Geographic observations exhibit spatial autocorrelation: nearby points share environmental, topographic, or socioeconomic characteristics. When you train predictive models on raster or vector datasets without accounting for this structure, validation folds inevitably leak spatial information into training sets. The result is inflated performance metrics that collapse the moment the model encounters an unseen region in production.

This is part of Training Geospatial Predictive Models in Python, which covers the full lifecycle from handling spatial autocorrelation through inference automation. This guide covers the complete workflow for spatial partitioning: estimating correlation range, assigning block IDs, choosing a partition strategy, wiring everything into scikit-learn, and interpreting the resulting metrics.

Random k-fold vs Spatial Block Cross-Validation Left panel shows random split where similar colours (same fold) appear adjacent, illustrating leakage. Right panel shows spatially separated blocks with a buffer exclusion band between training and validation folds. Random k-fold (leakage risk) Same-fold neighbours leak autocorrelated signal Spatial block CV (leakage prevented) Fold A buffer Fold B buffer Fold C Contiguous blocks + buffer exclusion eliminate leakage

Prerequisites & Environment Setup

Before building spatial folds, your environment and data must meet strict baseline requirements. Pipelines fail silently when CRS assumptions or library versions are violated.

# Pinned requirements for spatial cross-validation
geopandas==0.14.4
scikit-learn==1.5.1
rasterio==1.3.10
numpy==1.26.4
shapely==2.0.5
pyproj==3.6.1
scipy==1.13.1        # semivariogram estimation

Install with:

pip install "geopandas==0.14.4" "scikit-learn==1.5.1" "rasterio==1.3.10" \
            "numpy==1.26.4" "shapely==2.0.5" "pyproj==3.6.1" "scipy==1.13.1"

GDAL/PROJ system dependencies (Ubuntu/Debian):

sudo apt-get install -y gdal-bin libgdal-dev libproj-dev

CRS alignment is a hard prerequisite

All spatial layers must share a projected coordinate system (for example EPSG:32633 or EPSG:3857) rather than geographic coordinates (latitude/longitude). Distance-based blocking and buffer generation rely on linear units such as metres. Calculating Euclidean distances in decimal degrees introduces severe distortion at higher latitudes, making spatial splits mathematically invalid.

Always validate CRS consistency at pipeline ingestion:

assert gdf.crs is not None, "GeoDataFrame has no CRS — cannot compute distances"
assert gdf.crs.is_projected, f"Expected projected CRS, got {gdf.crs.to_string()}"

Full guidance on reprojection patterns is in CRS Alignment and Projection Handling.

Understand your spatial autocorrelation range

If your dataset exhibits strong Moran’s I values or a long-range variogram sill, standard random splits guarantee data leakage. Review Handling Spatial Autocorrelation to learn how to estimate the empirical range and choose an appropriate blocking radius before you write a single line of fold logic.

Step-by-Step Implementation

Step 1 — Estimate the spatial correlation range

The variogram range tells you the minimum safe block size. Any block smaller than this range allows autocorrelated signal to bleed across the fold boundary.

import numpy as np
import geopandas as gpd
from scipy.spatial.distance import cdist
from scipy.optimize import curve_fit

def spherical_variogram(h: np.ndarray, nugget: float, sill: float, rang: float) -> np.ndarray:
    """Spherical variogram model used to estimate the empirical range."""
    gamma = np.where(
        h < rang,
        nugget + (sill - nugget) * (1.5 * (h / rang) - 0.5 * (h / rang) ** 3),
        sill
    )
    return gamma

def estimate_variogram_range(
    gdf: gpd.GeoDataFrame,
    target_col: str,
    n_pairs: int = 5000,
    n_lags: int = 15,
) -> float:
    """Return the fitted spherical variogram range in CRS units.

    Args:
        gdf: GeoDataFrame in a projected metric CRS.
        target_col: Column name of the target variable.
        n_pairs: Random sample of point pairs to use (avoids O(n^2) cost).
        n_lags: Number of distance bins.
    Returns:
        Estimated range in CRS units (e.g. metres for UTM).
    """
    coords = np.column_stack([gdf.geometry.x, gdf.geometry.y])
    z = gdf[target_col].values

    rng = np.random.default_rng(42)
    idx = rng.choice(len(coords), size=min(n_pairs, len(coords)), replace=False)
    coords_s, z_s = coords[idx], z[idx]

    dists = cdist(coords_s, coords_s)
    sq_diffs = (z_s[:, None] - z_s[None, :]) ** 2

    max_dist = np.percentile(dists[dists > 0], 50)
    lag_edges = np.linspace(0, max_dist, n_lags + 1)
    lag_centres, gamma_vals = [], []

    for lo, hi in zip(lag_edges[:-1], lag_edges[1:]):
        mask = (dists > lo) & (dists <= hi)
        if mask.sum() > 10:
            lag_centres.append((lo + hi) / 2)
            gamma_vals.append(sq_diffs[mask].mean() / 2)

    lag_centres = np.array(lag_centres)
    gamma_vals = np.array(gamma_vals)

    p0 = [gamma_vals[0], gamma_vals[-1], lag_centres[len(lag_centres) // 2]]
    try:
        popt, _ = curve_fit(spherical_variogram, lag_centres, gamma_vals, p0=p0, maxfev=5000)
        return float(popt[2])   # fitted range
    except RuntimeError:
        # Fall back to the lag where variance first stabilises
        diffs = np.diff(gamma_vals)
        stable_idx = np.argmax(diffs < diffs.max() * 0.1)
        return float(lag_centres[stable_idx])
gdf = gpd.read_file("training_data.geojson").to_crs("EPSG:32633")
range_m = estimate_variogram_range(gdf, target_col="target")
block_size = max(range_m * 2.0, 1000)   # at least 2× range, minimum 1 km
print(f"Variogram range: {range_m:.0f} m  →  block size: {block_size:.0f} m")

Step 2 — Assign spatial block IDs

Convert raw coordinates into discrete spatial units. A regular grid is the most reproducible approach; its resolution must exceed the correlation range from Step 1.

import geopandas as gpd
import numpy as np
from shapely.geometry import box

def create_spatial_blocks(
    gdf: gpd.GeoDataFrame,
    block_size: float,
) -> gpd.GeoDataFrame:
    """Assign a grid block ID to each observation via a spatial join.

    Args:
        gdf: Input GeoDataFrame in a projected metric CRS.
        block_size: Grid cell side length in CRS units (metres for UTM).
    Returns:
        GeoDataFrame with a 'block_id' integer column appended.
    """
    x_min, y_min, x_max, y_max = gdf.total_bounds
    x_bins = np.arange(x_min, x_max + block_size, block_size)
    y_bins = np.arange(y_min, y_max + block_size, block_size)

    grid_cells = [
        box(x, y, x + block_size, y + block_size)
        for x in x_bins[:-1] for y in y_bins[:-1]
    ]
    grid_gdf = gpd.GeoDataFrame(
        {"block_id": range(len(grid_cells))},
        geometry=grid_cells,
        crs=gdf.crs,
    )
    joined = gpd.sjoin(gdf, grid_gdf[["block_id", "geometry"]], how="left", predicate="within")
    # Points that fall on grid edges may get multiple matches; keep the first
    joined = joined[~joined.index.duplicated(keep="first")]
    joined["block_id"] = joined["block_id"].fillna(-1).astype(int)
    assert (joined["block_id"] == -1).sum() == 0, \
        "Some points fell outside all grid cells — check CRS or grid extent"
    return joined
gdf_blocked = create_spatial_blocks(gdf, block_size=block_size)
n_blocks = gdf_blocked["block_id"].nunique()
print(f"Assigned {len(gdf_blocked)} observations to {n_blocks} spatial blocks")

Inline validation: a correct assignment returns zero unmatched points and at least as many unique blocks as the intended number of folds.

assert n_blocks >= 5, f"Too few blocks ({n_blocks}) to form 5 folds — reduce block_size"

Alternative blocking schemes:

  • H3 hexagonal indexing — uniform-area cells with consistent edge length, documented at the H3 Geospatial Indexing System. Preferred when data is unevenly distributed across a large extent.
  • Administrative boundaries — use existing shapefiles of provinces or watersheds as the blocking unit. This directly models leave-one-region-out transfer.
  • Environmental strata — define blocks by elevation band, land cover class, or climatic zone to test generalisation across environmental gradients rather than just distance.

Step 3 — Select a partition strategy

Match the split method to your deployment scope. The choice changes what the validation score actually measures.

Strategy Best for Trade-off
Spatial k-Fold Regional generalisation over a well-sampled area Balanced folds; still assumes the region is broadly similar
Leave-One-Region-Out (LORO) Domain transfer to unseen administrative zones High variance; each fold trains on the full remaining area
Buffer exclusion Dense datasets with strong local clustering Removes the most informative near-boundary training points
Environmental block CV Models predicting across climate or land-cover gradients Requires auxiliary covariate layers for stratification

For most production deployments that need to generalise to new geographic extents, leave-one-region-out gives the most pessimistic — and therefore most honest — error estimate. Spatial k-fold is appropriate when you care about average performance across the existing study area rather than pure transfer.

Step 4 — Run cross-validation with GroupKFold

cross_val_score does not accept a pre-computed fold-assignment array as its cv argument. Pass a splitter object such as GroupKFold and supply block IDs via the groups parameter.

import numpy as np
import geopandas as gpd
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

# Load, project, block
gdf = gpd.read_file("training_data.geojson").to_crs("EPSG:32633")
range_m = estimate_variogram_range(gdf, target_col="target")
block_size = max(range_m * 2.0, 1000)
gdf_blocked = create_spatial_blocks(gdf, block_size=block_size)

# Feature matrix
feature_cols = [c for c in gdf_blocked.columns if c not in
                ("target", "geometry", "block_id", "index_right")]
X = gdf_blocked[feature_cols].values
y = gdf_blocked["target"].values
groups = gdf_blocked["block_id"].values

# Estimator and spatial CV
estimator = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=42)
gkf = GroupKFold(n_splits=5)

scores = cross_val_score(
    estimator, X, y,
    cv=gkf,
    groups=groups,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
print(f"Spatial CV RMSE: {-scores.mean():.3f} ± {scores.std():.3f}")
print(f"Per-fold RMSE:   {(-scores).round(3).tolist()}")

For guidance on building feature-aware pipelines that pass geometry metadata through transformers without dropping the geometry column, see Training with Scikit-Learn-Geo.

Step 5 — Buffer exclusion for vector data

When training points are densely clustered (for example, sensor networks along road corridors), a minimum separation zone between training and validation geometries is mandatory. Spatial k-fold alone will not eliminate leakage at block edges.

import geopandas as gpd
from shapely.geometry import MultiPolygon
import numpy as np
from sklearn.model_selection import BaseCrossValidator

class BufferedSpatialKFold(BaseCrossValidator):
    """GroupKFold that excludes training points within `buffer_dist` of the
    validation set boundary, preventing edge-case spatial leakage.

    Args:
        n_splits: Number of folds.
        buffer_dist: Exclusion radius in CRS units (metres for projected CRS).
    """

    def __init__(self, n_splits: int = 5, buffer_dist: float = 5000.0) -> None:
        self.n_splits = n_splits
        self.buffer_dist = buffer_dist

    def _iter_test_masks(self, X=None, y=None, groups=None):
        unique_groups = np.unique(groups)
        fold_size = len(unique_groups) // self.n_splits
        for i in range(self.n_splits):
            val_groups = unique_groups[i * fold_size: (i + 1) * fold_size]
            yield np.isin(groups, val_groups)

    def split(self, X, y=None, groups=None):
        if groups is None:
            raise ValueError("groups must be provided (block_id array)")
        gdf: gpd.GeoDataFrame = X  # expects GeoDataFrame passed as X
        unique_groups = np.unique(groups)
        fold_size = max(1, len(unique_groups) // self.n_splits)

        for i in range(self.n_splits):
            val_groups = unique_groups[i * fold_size: (i + 1) * fold_size]
            val_mask = np.isin(groups, val_groups)
            val_geom = gdf[val_mask].geometry.union_all()
            exclusion_zone = val_geom.buffer(self.buffer_dist)

            train_mask = ~val_mask & ~gdf.geometry.intersects(exclusion_zone).values
            val_idx = np.where(val_mask)[0]
            train_idx = np.where(train_mask)[0]
            yield train_idx, val_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
# Usage — pass GeoDataFrame as X so the splitter can access geometries
bskf = BufferedSpatialKFold(n_splits=5, buffer_dist=block_size)
results = []
for train_idx, val_idx in bskf.split(gdf_blocked, groups=groups):
    X_train = gdf_blocked.iloc[train_idx][feature_cols].values
    X_val   = gdf_blocked.iloc[val_idx][feature_cols].values
    y_train = y[train_idx]
    y_val   = y[val_idx]
    estimator.fit(X_train, y_train)
    rmse = np.sqrt(np.mean((estimator.predict(X_val) - y_val) ** 2))
    results.append(rmse)
print(f"Buffered spatial CV RMSE: {np.mean(results):.3f} ± {np.std(results):.3f}")

Verification & Testing

After running spatial CV, verify that fold assignments are genuinely spatially separated and that your scores are not an artefact of a too-small block size.

import numpy as np
from sklearn.model_selection import GroupKFold

def verify_spatial_folds(groups: np.ndarray, n_splits: int = 5) -> None:
    """Assert that no group ID appears in both training and validation for any fold."""
    gkf = GroupKFold(n_splits=n_splits)
    for fold, (train_idx, val_idx) in enumerate(gkf.split(groups, groups=groups)):
        train_groups = set(groups[train_idx])
        val_groups   = set(groups[val_idx])
        overlap = train_groups & val_groups
        assert len(overlap) == 0, \
            f"Fold {fold}: group IDs {overlap} appear in both train and validation sets"
    print(f"All {n_splits} folds verified — no group ID overlap")

verify_spatial_folds(groups, n_splits=5)

Visual sanity check — plot fold assignments to confirm spatial contiguity:

import matplotlib.pyplot as plt

fold_labels = np.full(len(gdf_blocked), -1, dtype=int)
gkf = GroupKFold(n_splits=5)
for fold, (_, val_idx) in enumerate(gkf.split(X, groups=groups)):
    fold_labels[val_idx] = fold

gdf_blocked["fold"] = fold_labels
fig, ax = plt.subplots(figsize=(8, 6))
gdf_blocked.plot(column="fold", ax=ax, categorical=True,
                 legend=True, cmap="Set1", markersize=4)
ax.set_title("Spatial CV fold assignments")
plt.tight_layout()
plt.savefig("spatial_folds.png", dpi=150)

A correctly blocked fold map shows contiguous coloured regions. If you see a salt-and-pepper pattern, your block size is smaller than the grid resolution and some points fell into the wrong cell — revisit Step 1.

Numerical sanity check — compare spatial CV RMSE against random k-fold RMSE. If they are within 5 %, either your data has very low spatial autocorrelation or your blocks are still too small:

from sklearn.model_selection import KFold, cross_val_score

random_scores = cross_val_score(
    GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, random_state=42),
    X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_root_mean_squared_error",
)
spatial_rmse = -scores.mean()
random_rmse  = -random_scores.mean()
inflation = (random_rmse - spatial_rmse) / spatial_rmse * 100
print(f"Random CV RMSE:  {random_rmse:.3f}")
print(f"Spatial CV RMSE: {spatial_rmse:.3f}")
print(f"Optimism:        {inflation:.1f}% — {'significant leakage in random CV' if inflation > 10 else 'low autocorrelation or oversized blocks'}")

For per-region evaluation methodology and spatially disaggregated metrics, see Evaluating Model Performance with Spatial Metrics.

Troubleshooting & Common Errors

ValueError: The groups parameter contains fewer unique values than n_splits

Cause: The block grid produced fewer occupied cells than the requested number of folds.

Fix: Reduce block_size so more cells are populated, or reduce n_splits. Always assert n_blocks >= n_splits after blocking.

Silent leakage despite using GroupKFold

Cause: Block size is smaller than the variogram range. Observations in adjacent blocks share autocorrelated signal.

Fix: Re-run estimate_variogram_range and ensure block_size >= range_m * 2. Re-verify with the salt-and-pepper test above.

AttributeError: 'GeoDataFrame' object has no attribute 'union_all'

Cause: union_all() was added in Shapely 2.0. Older environments use unary_union.

Fix: Upgrade to shapely>=2.0 or replace union_all() with unary_union.

Fold imbalance — one fold contains 80 % of the data

Cause: Sampling density is geographically skewed (for example, dense urban coverage, sparse rural coverage).

Fix: Use stratified spatial blocking: compute block-level sample counts, sort blocks by count, and assign them to folds in a round-robin fashion to balance fold sizes.

import numpy as np

def balanced_spatial_folds(groups: np.ndarray, n_splits: int = 5) -> np.ndarray:
    """Return a fold-label array where blocks are distributed round-robin
    by descending group size to balance validation set cardinality."""
    unique, counts = np.unique(groups, return_counts=True)
    sorted_idx = np.argsort(-counts)        # descending by count
    sorted_groups = unique[sorted_idx]

    fold_map = {}
    for i, g in enumerate(sorted_groups):
        fold_map[g] = i % n_splits

    return np.array([fold_map[g] for g in groups])

CRSError: Input is not a CRS during spatial join in block assignment

Cause: The grid GeoDataFrame was created without an explicit CRS, or the input file contains a non-standard EPSG code.

Fix:

grid_gdf = grid_gdf.set_crs(gdf.crs)   # not to_crs — set only when grid has no CRS
assert grid_gdf.crs == gdf.crs, "CRS mismatch between grid and input data"

Memory exhaustion during spatial join on large datasets

Cause: gpd.sjoin loads all geometries into memory. For millions of points this exhausts RAM.

Fix: Assign block IDs arithmetically using numpy floor division rather than a spatial join:

def fast_block_assignment(
    gdf: gpd.GeoDataFrame,
    block_size: float,
) -> np.ndarray:
    """Assign block IDs using integer arithmetic — O(n) and memory-efficient."""
    x_min, y_min = gdf.total_bounds[0], gdf.total_bounds[1]
    col_idx = ((gdf.geometry.x - x_min) / block_size).astype(int)
    row_idx = ((gdf.geometry.y - y_min) / block_size).astype(int)
    n_cols = col_idx.max() + 1
    return (row_idx * n_cols + col_idx).values

Performance Optimisation

Parallelise fold scoring with n_jobs=-1

cross_val_score accepts n_jobs=-1 to distribute folds across all CPU cores. For expensive estimators such as gradient boosting or graph neural networks, this reduces wall time by a factor proportional to the number of folds.

Pre-compute spatial indices before fold generation

When using gpd.sjoin in the blocking step, build an STR-tree index explicitly to avoid re-indexing on every call:

from shapely.strtree import STRtree

tree = STRtree(grid_gdf.geometry.values)
# query returns indices into grid_gdf for each observation geometry
idx_pairs = tree.query(gdf.geometry.values, predicate="within")

This cuts sjoin time by roughly 3-5× on datasets with >100 k points.

Use Dask-GeoPandas for continental-scale blocking

For datasets that exceed RAM (continental rasters or global point clouds), replace gpd.sjoin with dask_geopandas.sjoin. Partition the GeoDataFrame on spatial tiles and compute block IDs lazily:

import dask_geopandas as dgpd

dgdf = dgpd.from_geopandas(gdf, npartitions=8)
dblocked = dgpd.sjoin(dgdf, grid_gdf[["block_id", "geometry"]], how="left", predicate="within")
block_ids = dblocked["block_id"].compute().fillna(-1).astype(int).values

For broader guidance on vector proximity and buffer generation at scale — including parallelised distance matrices that feed into spatial CV pipelines — see that linked section.

Cache block assignments between runs

Block assignments are deterministic for a given block_size and input dataset. Persist them to avoid recomputing:

import hashlib, json, numpy as np
from pathlib import Path

def cached_block_ids(gdf, block_size, cache_dir=".cache"):
    key = hashlib.md5(
        json.dumps({"bbox": list(gdf.total_bounds), "n": len(gdf), "bs": block_size}).encode()
    ).hexdigest()[:12]
    path = Path(cache_dir) / f"blocks_{key}.npy"
    if path.exists():
        return np.load(path)
    ids = fast_block_assignment(gdf, block_size)
    path.parent.mkdir(exist_ok=True)
    np.save(path, ids)
    return ids

FAQ

Why does spatial CV give much higher RMSE than random CV on the same model?

Spatial CV removes the benefit of memorising local autocorrelated signal from the training set. If your model achieves R² = 0.90 under random CV but only 0.65 under spatial CV, the gap represents the fraction of explained variance that comes from spatial proximity rather than real predictive features. The spatial CV number is the honest estimate of how the model performs when deployed to a new region.

How many folds should I use?

Five folds is the standard starting point. Use fewer (three) when data is sparse or the study area is small and you cannot form five geographically independent groups with enough samples each. Use more (ten) when the study area contains diverse sub-regions and you want stable variance estimates. Avoid leave-one-out on spatial data — it creates thousands of single-sample validation sets that defeat the purpose of geographic separation.

Can I use spatial CV with time-series geodata?

Yes, but you need to block in both space and time simultaneously. Assign block IDs by combining a spatial grid ID with a temporal bin (for example, year or month). This prevents leakage both from spatial autocorrelation and from temporal autocorrelation between adjacent sampling periods. For aggregating temporal windows before splitting, see Temporal Aggregation for Time-Series Geodata.

Does spatial CV work with raster data?

Yes. Instead of assigning point block IDs, partition the raster into non-overlapping tile windows using rasterio.windows. Assign each tile a block ID, mask validation tiles to no-data during training, and apply a pixel dilation kernel (for example scipy.ndimage.binary_dilation) to create a no-data buffer around each validation tile before scoring.


Part of: Training Geospatial Predictive Models in Python