How many splits (n_splits) should I use?

Five is a practical default for most geospatial datasets. Fewer splits reduce computation but increase fold imbalance; more splits shrink training set size per fold and may leave some folds with too few samples to fit the model reliably. If your dataset has strong density clustering, increase n_splits to 10 and audit fold sizes before running the full pipeline.

Implementing SpatialKFold in Python

Q: Why can't I just use sklearn's built-in KFold for geospatial data?

sklearn's KFold shuffles observations randomly without regard to their coordinates. Geographically adjacent points share unmodeled environmental gradients, so adjacent training and test samples leak information across the split boundary. This silently inflates R², RMSE, and accuracy metrics. SpatialKFold groups observations by coordinate-based clusters so each fold is geographically contiguous and separated from the others.

TL;DR Answer

Subclass sklearn.model_selection.BaseCrossValidator and use scipy.cluster.vq.kmeans2 to group observations into n_splits spatially contiguous folds. Each call to split() yields training indices from all folds except one, with the remaining fold used for validation — identical to standard k-fold but with geography-aware partitioning instead of random shuffling. This is part of the broader Spatial Cross-Validation Strategies needed to get honest model evaluations on geospatial data.

from sklearn.model_selection import BaseCrossValidator
from scipy.cluster.vq import kmeans2
import numpy as np

class SpatialKFold(BaseCrossValidator):
    def __init__(self, n_splits=5, random_state=None):
        self.n_splits = n_splits
        self.random_state = random_state

    def split(self, X, y=None, groups=None):
        rng = np.random.default_rng(self.random_state)
        coords = X[:, -2:]
        _, labels = kmeans2(coords, self.n_splits, minit='++',
                            seed=int(rng.integers(0, 2**31)))
        for fold in range(self.n_splits):
            yield np.where(labels != fold)[0], np.where(labels == fold)[0]

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

Part of: Spatial Cross-Validation Strategies

Why Standard KFold Fails in Geospatial ML Pipelines

Standard KFold treats every observation as independent. In geospatial data, that assumption is almost always false. Tobler’s First Law of Geography guarantees that nearby locations share environmental gradients — soil moisture, canopy cover, elevation, land surface temperature — that the model uses as features. When a randomly shuffled fold places a training point 50 metres from a validation point, both observations encode nearly the same underlying signal. The model appears to generalize, but it is actually interpolating between memorized locations.

The corruption is silent. cross_val_score with random KFold will report R² values 0.10–0.30 higher than you will see at inference time on genuinely new geographic extents. Hyperparameter searches tuned against these inflated scores will overfit to spatial proximity rather than causal feature relationships. When the model is deployed to a different watershed, province, or growing season, performance collapses with no obvious diagnostic trail.

The downstream steps most severely affected are:

Feature importance scores: Features that happen to correlate with spatial position (elevation, aspect) appear more predictive than they are.
Threshold selection for classifiers: Decision thresholds calibrated on leaky validation sets produce recall/precision trade-offs that do not hold in production.
Drift detection baselines: MLOps systems that use cross-validation RMSE as a retraining trigger will fire too late because the baseline was optimistic from the start.

Without addressing this, the handling-spatial-autocorrelation work you do during feature engineering has no payoff — you correct the features but then evaluate them against a leaky split.

Core Principles for a Sound Spatial Splitter

Project before you cluster. K-means minimises Euclidean distance. Raw lat/lon stretches distance in the longitude direction at higher latitudes. Always convert to a projected metric CRS (UTM or EPSG:3857) before passing coordinates to the splitter.
Validate array shape early. The split() method receives X — the full feature matrix. Fail fast if the coordinate columns are missing or if n_samples < n_splits.
Seed deterministically. kmeans2 accepts an integer seed. Derive it from a numpy.random.Generator so the full pipeline can be reproduced from a single top-level random_state.
Emit standard (train_idx, test_idx) pairs. Any BaseCrossValidator subclass works transparently with cross_val_score, GridSearchCV, and Pipeline without extra glue code.
Audit fold sizes. Uneven point density will produce imbalanced folds. Log the size of each fold before running expensive hyperparameter sweeps.
Keep coordinates as the last two columns. Fixing coordinate position at X[:, -2:] is the simplest convention. Document it clearly so downstream engineers know the expected array layout.

Production-Ready Code

The implementation below adds input validation, logging, and typed signatures. It also handles the edge case where a kmeans2 run produces an empty cluster (which can happen when n_splits approaches the number of distinct coordinate positions).

import logging
import numpy as np
from sklearn.model_selection import BaseCrossValidator
from scipy.cluster.vq import kmeans2

logger = logging.getLogger(__name__)


class SpatialKFold(BaseCrossValidator):
    """Spatial cross-validator that clusters coordinates to prevent geographic leakage.

    Assumes the last two columns of X contain projected metric coordinates
    (easting, northing). Convert lat/lon to a metric CRS before use.

    Parameters
    ----------
    n_splits : int, default=5
        Number of spatially contiguous folds.
    random_state : int or None, default=None
        Seed for reproducibility. Pass an integer for deterministic folds.
    max_iter : int, default=100
        Maximum K-means iterations passed to scipy.cluster.vq.kmeans2.
    """

    def __init__(
        self,
        n_splits: int = 5,
        random_state: int | None = None,
        max_iter: int = 100,
    ) -> None:
        self.n_splits = n_splits
        self.random_state = random_state
        self.max_iter = max_iter

    def split(
        self,
        X: np.ndarray,
        y: np.ndarray | None = None,
        groups: np.ndarray | None = None,
    ):
        """Generate (train_idx, test_idx) index pairs for each spatial fold.

        Parameters
        ----------
        X : np.ndarray of shape (n_samples, n_features)
            Feature matrix. Last two columns must be projected metric coordinates.
        y : ignored
        groups : ignored

        Yields
        ------
        train_idx, test_idx : np.ndarray
        """
        if X.ndim != 2 or X.shape[1] < 2:
            raise ValueError(
                "X must be a 2-D array with at least two columns. "
                "Last two columns must be projected coordinates."
            )
        n_samples = X.shape[0]
        if n_samples < self.n_splits:
            raise ValueError(
                f"n_samples ({n_samples}) must be >= n_splits ({self.n_splits})."
            )

        rng = np.random.default_rng(self.random_state)
        coords = X[:, -2:].astype(float)

        _, labels = kmeans2(
            coords,
            self.n_splits,
            minit="++",
            iter=self.max_iter,
            seed=int(rng.integers(0, 2**31)),
        )

        for fold in range(self.n_splits):
            test_idx = np.where(labels == fold)[0]
            train_idx = np.where(labels != fold)[0]

            if test_idx.size == 0:
                logger.warning(
                    "Fold %d is empty — consider reducing n_splits or "
                    "inspecting coordinate density.", fold
                )
                continue

            logger.debug(
                "Fold %d: train=%d, test=%d", fold, train_idx.size, test_idx.size
            )
            yield train_idx, test_idx

    def get_n_splits(
        self,
        X: np.ndarray | None = None,
        y: np.ndarray | None = None,
        groups: np.ndarray | None = None,
    ) -> int:
        return self.n_splits

Step-by-Step Walkthrough

1. Prepare your feature matrix with projected coordinates in the last two columns

import numpy as np
import geopandas as gpd

# Load points already in a metric CRS (UTM zone 33N)
gdf = gpd.read_file("soil_samples.geojson").to_crs("EPSG:32633")

# Extract feature columns (drop geometry and target)
feature_cols = ["ndvi", "elevation", "slope", "aspect", "clay_pct"]
features = gdf[feature_cols].to_numpy()

# Append easting and northing as the last two columns
coords = np.column_stack([gdf.geometry.x, gdf.geometry.y])
X = np.column_stack([features, coords])
y = gdf["organic_carbon"].to_numpy()

print(f"X shape: {X.shape}")   # (n_samples, n_features + 2)
print(f"CRS: {gdf.crs}")       # EPSG:32633 — projected metric, safe for K-means

2. Instantiate the splitter and audit fold sizes

from spatial_kfold import SpatialKFold

cv = SpatialKFold(n_splits=5, random_state=42)

for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
    print(f"Fold {fold}: train={len(train_idx)}, test={len(test_idx)}")

Expected output (sizes vary with data density):

Fold 0: train=842, test=198
Fold 1: train=851, test=189
Fold 2: train=813, test=227
Fold 3: train=867, test=173
Fold 4: train=826, test=214

Unbalanced folds (one fold with < 50 samples while others have > 200) indicate spatially clustered data. Increase n_splits or switch to grid-based blocking as described in Spatial Cross-Validation Strategies.

3. Run cross-validation with `cross_val_score`

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rf = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
scores = cross_val_score(rf, X, y, cv=cv, scoring="neg_root_mean_squared_error")

print(f"Spatial CV RMSE: {-scores.mean():.3f} ± {scores.std():.3f}")

4. Integrate with `GridSearchCV` for hyperparameter tuning

from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5],
}

grid = GridSearchCV(
    rf,
    param_grid,
    cv=SpatialKFold(n_splits=5, random_state=42),
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
    verbose=1,
)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best spatial CV RMSE: {-grid.best_score_:.3f}")

5. Pre-compute and cache fold labels for large datasets

For datasets exceeding 100k points, running K-means inside every GridSearchCV call wastes CPU time because the same spatial partition is recomputed repeatedly.

from sklearn.model_selection import PredefinedSplit
from scipy.cluster.vq import kmeans2
import numpy as np

# Compute cluster labels once
rng = np.random.default_rng(42)
coords = X[:, -2:]
_, fold_labels = kmeans2(
    coords, 5, minit="++", seed=int(rng.integers(0, 2**31))
)

# Convert to sklearn's PredefinedSplit: label = test fold index, -1 = always train
# Here every sample is used as test exactly once
ps = PredefinedSplit(test_fold=fold_labels)

grid_cached = GridSearchCV(
    rf, param_grid, cv=ps, scoring="neg_root_mean_squared_error", n_jobs=-1
)
grid_cached.fit(X, y)

Visualising the Spatial Folds

The SVG below shows how K-means partitions a point cloud into five spatially contiguous folds. Each region is evaluated as the held-out test set in turn while the other four form the training set.

Verification

After running cross-validation, confirm that geographic leakage is genuinely absent by checking two things: the physical distance between nearest training and test points, and whether residuals show spatial autocorrelation.

from scipy.spatial import cKDTree

def min_train_test_distance(X_train: np.ndarray, X_test: np.ndarray) -> float:
    """Return the minimum Euclidean distance between any training and test point.

    Coordinates must be in a projected metric CRS (metres).
    """
    tree = cKDTree(X_train[:, -2:])
    distances, _ = tree.query(X_test[:, -2:], k=1)
    return float(distances.min())

# Check the first fold
for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
    min_dist = min_train_test_distance(X[train_idx], X[test_idx])
    print(f"Fold {fold}: min train–test distance = {min_dist:.1f} m")
    break

A minimum distance of several kilometres confirms the fold boundary is spatially meaningful. If the value is near zero, your n_splits is too high relative to your dataset’s spatial extent, or the data has extreme coordinate clustering.

For a full residual autocorrelation audit, pair this with the techniques in Evaluating Model Performance with Spatial Metrics, which covers computing Moran’s I on cross-validation residuals.

FAQ

Why can’t I just use sklearn’s built-in KFold for geospatial data?

KFold shuffles observations randomly without considering their coordinates. Geographically adjacent training and test points share unmodeled environmental gradients, leaking spatial information across the split boundary. This silently inflates R², RMSE, and accuracy metrics. SpatialKFold groups observations into coordinate-based clusters so each fold is geographically contiguous and well separated from the others.

Should I use lat/lon or projected coordinates for K-means clustering?

Always use projected metric coordinates (UTM, EPSG:3857, or a local equal-area CRS). Euclidean distance on raw latitude/longitude is geometrically distorted — one degree of longitude near the equator spans ~111 km but near the poles it spans nearly zero. For guidance on projecting data correctly, see CRS Alignment and Projection Handling.

How many splits should I use?

Five is a practical default. Fewer splits reduce computation but increase fold imbalance; more splits shrink the training set per fold and may leave some folds too small to fit the model reliably. If your dataset has strong density clustering, increase n_splits to 10 and audit fold sizes before running the full pipeline. Compare per-fold RMSE values — high variance across folds signals spatial sampling bias rather than model instability.

Part of: Spatial Cross-Validation Strategies — Training Geospatial Predictive Models in Python