TL;DR Answer
Subclass sklearn.model_selection.BaseCrossValidator and use scipy.cluster.vq.kmeans2 to group observations into n_splits spatially contiguous folds. Each call to split() yields training indices from all folds except one, with the remaining fold used for validation — identical to standard k-fold but with geography-aware partitioning instead of random shuffling. This is part of the broader Spatial Cross-Validation Strategies needed to get honest model evaluations on geospatial data.
from sklearn.model_selection import BaseCrossValidator
from scipy.cluster.vq import kmeans2
import numpy as np
class SpatialKFold(BaseCrossValidator):
def __init__(self, n_splits=5, random_state=None):
self.n_splits = n_splits
self.random_state = random_state
def split(self, X, y=None, groups=None):
rng = np.random.default_rng(self.random_state)
coords = X[:, -2:]
_, labels = kmeans2(coords, self.n_splits, minit='++',
seed=int(rng.integers(0, 2**31)))
for fold in range(self.n_splits):
yield np.where(labels != fold)[0], np.where(labels == fold)[0]
def get_n_splits(self, X=None, y=None, groups=None):
return self.n_splitsPart of: Spatial Cross-Validation Strategies
Why Standard KFold Fails in Geospatial ML Pipelines
Standard KFold treats every observation as independent. In geospatial data, that assumption is almost always false. Tobler’s First Law of Geography guarantees that nearby locations share environmental gradients — soil moisture, canopy cover, elevation, land surface temperature — that the model uses as features. When a randomly shuffled fold places a training point 50 metres from a validation point, both observations encode nearly the same underlying signal. The model appears to generalize, but it is actually interpolating between memorized locations.
The corruption is silent. cross_val_score with random KFold will report R² values 0.10–0.30 higher than you will see at inference time on genuinely new geographic extents. Hyperparameter searches tuned against these inflated scores will overfit to spatial proximity rather than causal feature relationships. When the model is deployed to a different watershed, province, or growing season, performance collapses with no obvious diagnostic trail.
The downstream steps most severely affected are:
- Feature importance scores: Features that happen to correlate with spatial position (elevation, aspect) appear more predictive than they are.
- Threshold selection for classifiers: Decision thresholds calibrated on leaky validation sets produce recall/precision trade-offs that do not hold in production.
- Drift detection baselines: MLOps systems that use cross-validation RMSE as a retraining trigger will fire too late because the baseline was optimistic from the start.
Without addressing this, the handling-spatial-autocorrelation work you do during feature engineering has no payoff — you correct the features but then evaluate them against a leaky split.
Core Principles for a Sound Spatial Splitter
- Project before you cluster. K-means minimises Euclidean distance. Raw lat/lon stretches distance in the longitude direction at higher latitudes. Always convert to a projected metric CRS (UTM or EPSG:3857) before passing coordinates to the splitter.
- Validate array shape early. The
split()method receivesX— the full feature matrix. Fail fast if the coordinate columns are missing or ifn_samples < n_splits. - Seed deterministically.
kmeans2accepts an integerseed. Derive it from anumpy.random.Generatorso the full pipeline can be reproduced from a single top-levelrandom_state. - Emit standard (train_idx, test_idx) pairs. Any
BaseCrossValidatorsubclass works transparently withcross_val_score,GridSearchCV, andPipelinewithout extra glue code. - Audit fold sizes. Uneven point density will produce imbalanced folds. Log the size of each fold before running expensive hyperparameter sweeps.
- Keep coordinates as the last two columns. Fixing coordinate position at
X[:, -2:]is the simplest convention. Document it clearly so downstream engineers know the expected array layout.
Production-Ready Code
The implementation below adds input validation, logging, and typed signatures. It also handles the edge case where a kmeans2 run produces an empty cluster (which can happen when n_splits approaches the number of distinct coordinate positions).
import logging
import numpy as np
from sklearn.model_selection import BaseCrossValidator
from scipy.cluster.vq import kmeans2
logger = logging.getLogger(__name__)
class SpatialKFold(BaseCrossValidator):
"""Spatial cross-validator that clusters coordinates to prevent geographic leakage.
Assumes the last two columns of X contain projected metric coordinates
(easting, northing). Convert lat/lon to a metric CRS before use.
Parameters
----------
n_splits : int, default=5
Number of spatially contiguous folds.
random_state : int or None, default=None
Seed for reproducibility. Pass an integer for deterministic folds.
max_iter : int, default=100
Maximum K-means iterations passed to scipy.cluster.vq.kmeans2.
"""
def __init__(
self,
n_splits: int = 5,
random_state: int | None = None,
max_iter: int = 100,
) -> None:
self.n_splits = n_splits
self.random_state = random_state
self.max_iter = max_iter
def split(
self,
X: np.ndarray,
y: np.ndarray | None = None,
groups: np.ndarray | None = None,
):
"""Generate (train_idx, test_idx) index pairs for each spatial fold.
Parameters
----------
X : np.ndarray of shape (n_samples, n_features)
Feature matrix. Last two columns must be projected metric coordinates.
y : ignored
groups : ignored
Yields
------
train_idx, test_idx : np.ndarray
"""
if X.ndim != 2 or X.shape[1] < 2:
raise ValueError(
"X must be a 2-D array with at least two columns. "
"Last two columns must be projected coordinates."
)
n_samples = X.shape[0]
if n_samples < self.n_splits:
raise ValueError(
f"n_samples ({n_samples}) must be >= n_splits ({self.n_splits})."
)
rng = np.random.default_rng(self.random_state)
coords = X[:, -2:].astype(float)
_, labels = kmeans2(
coords,
self.n_splits,
minit="++",
iter=self.max_iter,
seed=int(rng.integers(0, 2**31)),
)
for fold in range(self.n_splits):
test_idx = np.where(labels == fold)[0]
train_idx = np.where(labels != fold)[0]
if test_idx.size == 0:
logger.warning(
"Fold %d is empty — consider reducing n_splits or "
"inspecting coordinate density.", fold
)
continue
logger.debug(
"Fold %d: train=%d, test=%d", fold, train_idx.size, test_idx.size
)
yield train_idx, test_idx
def get_n_splits(
self,
X: np.ndarray | None = None,
y: np.ndarray | None = None,
groups: np.ndarray | None = None,
) -> int:
return self.n_splitsStep-by-Step Walkthrough
1. Prepare your feature matrix with projected coordinates in the last two columns
import numpy as np
import geopandas as gpd
# Load points already in a metric CRS (UTM zone 33N)
gdf = gpd.read_file("soil_samples.geojson").to_crs("EPSG:32633")
# Extract feature columns (drop geometry and target)
feature_cols = ["ndvi", "elevation", "slope", "aspect", "clay_pct"]
features = gdf[feature_cols].to_numpy()
# Append easting and northing as the last two columns
coords = np.column_stack([gdf.geometry.x, gdf.geometry.y])
X = np.column_stack([features, coords])
y = gdf["organic_carbon"].to_numpy()
print(f"X shape: {X.shape}") # (n_samples, n_features + 2)
print(f"CRS: {gdf.crs}") # EPSG:32633 — projected metric, safe for K-means2. Instantiate the splitter and audit fold sizes
from spatial_kfold import SpatialKFold
cv = SpatialKFold(n_splits=5, random_state=42)
for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}")Expected output (sizes vary with data density):
Fold 0: train=842, test=198
Fold 1: train=851, test=189
Fold 2: train=813, test=227
Fold 3: train=867, test=173
Fold 4: train=826, test=214
Unbalanced folds (one fold with < 50 samples while others have > 200) indicate spatially clustered data. Increase n_splits or switch to grid-based blocking as described in Spatial Cross-Validation Strategies.
3. Run cross-validation with cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
rf = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
scores = cross_val_score(rf, X, y, cv=cv, scoring="neg_root_mean_squared_error")
print(f"Spatial CV RMSE: {-scores.mean():.3f} ± {scores.std():.3f}")4. Integrate with GridSearchCV for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
param_grid = {
"n_estimators": [100, 200],
"max_depth": [5, 10, None],
"min_samples_split": [2, 5],
}
grid = GridSearchCV(
rf,
param_grid,
cv=SpatialKFold(n_splits=5, random_state=42),
scoring="neg_root_mean_squared_error",
n_jobs=-1,
verbose=1,
)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best spatial CV RMSE: {-grid.best_score_:.3f}")5. Pre-compute and cache fold labels for large datasets
For datasets exceeding 100k points, running K-means inside every GridSearchCV call wastes CPU time because the same spatial partition is recomputed repeatedly.
from sklearn.model_selection import PredefinedSplit
from scipy.cluster.vq import kmeans2
import numpy as np
# Compute cluster labels once
rng = np.random.default_rng(42)
coords = X[:, -2:]
_, fold_labels = kmeans2(
coords, 5, minit="++", seed=int(rng.integers(0, 2**31))
)
# Convert to sklearn's PredefinedSplit: label = test fold index, -1 = always train
# Here every sample is used as test exactly once
ps = PredefinedSplit(test_fold=fold_labels)
grid_cached = GridSearchCV(
rf, param_grid, cv=ps, scoring="neg_root_mean_squared_error", n_jobs=-1
)
grid_cached.fit(X, y)Visualising the Spatial Folds
The SVG below shows how K-means partitions a point cloud into five spatially contiguous folds. Each region is evaluated as the held-out test set in turn while the other four form the training set.
Verification
After running cross-validation, confirm that geographic leakage is genuinely absent by checking two things: the physical distance between nearest training and test points, and whether residuals show spatial autocorrelation.
from scipy.spatial import cKDTree
def min_train_test_distance(X_train: np.ndarray, X_test: np.ndarray) -> float:
"""Return the minimum Euclidean distance between any training and test point.
Coordinates must be in a projected metric CRS (metres).
"""
tree = cKDTree(X_train[:, -2:])
distances, _ = tree.query(X_test[:, -2:], k=1)
return float(distances.min())
# Check the first fold
for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
min_dist = min_train_test_distance(X[train_idx], X[test_idx])
print(f"Fold {fold}: min train–test distance = {min_dist:.1f} m")
breakA minimum distance of several kilometres confirms the fold boundary is spatially meaningful. If the value is near zero, your n_splits is too high relative to your dataset’s spatial extent, or the data has extreme coordinate clustering.
For a full residual autocorrelation audit, pair this with the techniques in Evaluating Model Performance with Spatial Metrics, which covers computing Moran’s I on cross-validation residuals.
FAQ
Why can’t I just use sklearn’s built-in KFold for geospatial data?
KFold shuffles observations randomly without considering their coordinates. Geographically adjacent training and test points share unmodeled environmental gradients, leaking spatial information across the split boundary. This silently inflates R², RMSE, and accuracy metrics. SpatialKFold groups observations into coordinate-based clusters so each fold is geographically contiguous and well separated from the others.
Should I use lat/lon or projected coordinates for K-means clustering?
Always use projected metric coordinates (UTM, EPSG:3857, or a local equal-area CRS). Euclidean distance on raw latitude/longitude is geometrically distorted — one degree of longitude near the equator spans ~111 km but near the poles it spans nearly zero. For guidance on projecting data correctly, see CRS Alignment and Projection Handling.
How many splits should I use?
Five is a practical default. Fewer splits reduce computation but increase fold imbalance; more splits shrink the training set per fold and may leave some folds too small to fit the model reliably. If your dataset has strong density clustering, increase n_splits to 10 and audit fold sizes before running the full pipeline. Compare per-fold RMSE values — high variance across folds signals spatial sampling bias rather than model instability.
Related
- Evaluating Model Performance with Spatial Metrics
- Reducing Spatial Leakage in Model Training
- Handling Spatial Autocorrelation
- CRS Alignment and Projection Handling
Part of: Spatial Cross-Validation Strategies — Training Geospatial Predictive Models in Python