Problem Framing
scikit-learn provides a robust, standardized API for predictive modeling, but it carries no native awareness of coordinate reference systems, spatial topology, or geographic autocorrelation. Naively feeding a GeoDataFrame into a Pipeline loses all spatial meaning the moment geometries are coerced to raw floats. Worse, standard random cross-validation applied to spatially autocorrelated data produces wildly optimistic accuracy estimates because neighbouring samples leak information across the train–test boundary.
Making scikit-learn spatially aware requires three coordinated changes: wrapping geospatial preprocessing in BaseEstimator-compatible transformers so spatial operations participate in the pipeline contract, replacing random splits with geography-aware validation, and hardening the serialized artifact against CRS drift and feature distribution shift at inference time. This page walks through each step in production-grade detail.
Part of: Training Geospatial Predictive Models in Python
Prerequisites & Environment Setup
Before building a spatial pipeline, verify your environment satisfies the following requirements:
- Python 3.9+ with an isolated virtual environment (
venvorconda) - Core libraries:
scikit-learn>=1.3,geopandas>=0.14,rasterio>=1.3,xarray>=2023.0,numpy>=1.24,pandas>=2.0 - Spatial indexing:
shapely>=2.0(GEOS-backed) for accelerated spatial joins and vector operations - Spatial statistics:
libpysal>=4.7,esda>=2.5for Moran’s I and spatial weight matrices - MLOps & monitoring:
joblib,mlflow, andevidentlyoralibi-detectfor drift tracking - CRS consistency: all input layers must share a projected CRS (e.g., EPSG:32633) before computing distances, buffers, or zonal statistics
Applying CRS Alignment and Projection Handling before the first pipeline step is mandatory — metric-space operations on geographic coordinates produce nonsensical distances that silently corrupt every downstream feature.
Install with explicit version pinning to avoid ABI conflicts in compiled geospatial libraries:
pip install "scikit-learn>=1.3" "geopandas>=0.14" "rasterio>=1.3" \
"xarray>=2023.0" "shapely>=2.0" "libpysal>=4.7" "esda>=2.5" \
"joblib" "mlflow" "evidently"Pipeline Architecture Overview
The diagram below shows how spatial data flows through the five stages covered in this guide — ingestion, transformer wrapping, pipeline assembly, geography-aware validation, and deployment.
Step 1: Spatial Data Ingestion & Geometric Alignment
Geospatial training data typically originates from heterogeneous sources: vector shapefiles or GeoPackages, raster imagery, and tabular telemetry. The ingestion phase aligns these sources into a unified feature matrix while preserving topological integrity.
Raster values are extracted at vector geometries using point sampling or zonal statistics. Vector layers undergo spatial joins to aggregate neighbourhood attributes, administrative boundaries, or proximity metrics. All operations must enforce strict CRS transformations to avoid metric distortion.
import geopandas as gpd
import rasterio
from rasterio.mask import mask as rasterio_mask
import numpy as np
# Load vector targets — enforce projected CRS immediately
vectors = gpd.read_file("training_zones.gpkg").to_crs("EPSG:32633")
def extract_zonal_mean(
vector_gdf: gpd.GeoDataFrame,
raster_path: str,
) -> np.ndarray:
"""Extract mean raster value per vector geometry using rasterio.mask.
Raises ValueError if the raster CRS does not match the vector CRS,
preventing silent metric distortion from misaligned projections.
"""
results = []
with rasterio.open(raster_path) as src:
raster_epsg = src.crs.to_epsg()
vector_epsg = vector_gdf.crs.to_epsg()
if raster_epsg != vector_epsg:
raise ValueError(
f"CRS mismatch: raster={raster_epsg}, vector={vector_epsg}. "
"Reproject before ingestion."
)
for geom in vector_gdf.geometry:
out_image, _ = rasterio_mask(src, [geom], crop=True, nodata=np.nan)
results.append(float(np.nanmean(out_image)))
return np.array(results)
# Validation assertion
zonal_values = extract_zonal_mean(vectors, "ndvi_mosaic.tif")
assert zonal_values.shape[0] == len(vectors), (
f"Row count mismatch: {zonal_values.shape[0]} zonal values "
f"for {len(vectors)} geometries"
)The correct entry point for geometry-masked raster extraction is rasterio.mask.mask — not rasterio.mask.extract, which does not exist. It returns a masked array and an updated affine transform. Always handle nodata explicitly to prevent silent NaN contamination that corrupts downstream raster band math and index calculation.
Step 2: Encapsulating Spatial Feature Engineering in Custom Transformers
Standard scikit-learn transformers operate on dense arrays. To maintain pipeline compatibility, spatial operations must be encapsulated in classes inheriting from sklearn.base.BaseEstimator and sklearn.base.TransformerMixin. This ensures spatial transformations can be chained, serialized, and deployed alongside tabular scalers or encoders without breaking the fit/transform contract.
Common spatial features include:
- Distance to infrastructure (nearest-neighbour queries via
scipy.spatial.cKDTree) - Spatial lag variables (first-order queen or rook contiguity matrices)
- Terrain derivatives (slope, aspect, curvature from DEMs)
- Vector proximity metrics and buffer-derived counts within dynamic radii
Below is a production-ready template for a custom spatial transformer that calculates Euclidean distances to a reference layer. Fitted state (the KD-tree) is stored as a trailing-underscore attribute following scikit-learn convention, which signals that the object has been fitted and keeps serialization predictable:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from scipy.spatial import cKDTree
class DistanceToReferenceTransformer(BaseEstimator, TransformerMixin):
"""Compute min-distance from each sample to a set of reference coordinates.
Parameters
----------
reference_coords : array-like of shape (n_refs, 2)
Reference point coordinates in the training projected CRS.
metric : str, default "euclidean"
Distance metric passed to cKDTree.query.
"""
def __init__(self, reference_coords, metric: str = "euclidean"):
self.reference_coords = reference_coords
self.metric = metric
def fit(self, X, y=None):
self.tree_ = cKDTree(np.asarray(self.reference_coords))
return self
def transform(self, X):
if not hasattr(self, "tree_"):
raise RuntimeError("Call fit() before transform().")
distances, _ = self.tree_.query(np.asarray(X), k=1)
return distances.reshape(-1, 1)
def get_feature_names_out(self, input_features=None):
return np.array(["dist_to_reference"])Implementing get_feature_names_out enables set_output(transform="pandas") on the outer pipeline, preserving column names through the full transform chain and making downstream debugging far easier.
Step 3: Pipeline Assembly & Geography-Aware Validation
Once spatial features are engineered, combine them with tabular attributes through a unified Pipeline. This guarantees preprocessing steps are applied identically during training and inference, eliminating data leakage caused by manual preprocessing outside the model scope.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
# Reference infrastructure coordinates in UTM EPSG:32633
ref_coords = np.array([[500000, 5000000], [501000, 5001000]])
# Spatial distance branch
spatial_branch = Pipeline([
("dist_to_ref", DistanceToReferenceTransformer(ref_coords)),
("scaler", StandardScaler()),
])
# Tabular branch (e.g., spectral indices already extracted as columns)
tabular_branch = Pipeline([
("scaler", StandardScaler()),
])
# Combine branches then pass to estimator
preprocessor = FeatureUnion([
("spatial", spatial_branch),
("tabular", tabular_branch),
])
spatial_pipeline = Pipeline([
("features", preprocessor),
("model", GradientBoostingRegressor(n_estimators=300, learning_rate=0.05)),
])Standard K-fold cross-validation fails in geospatial contexts because spatially proximate samples share underlying environmental processes, artificially inflating performance metrics. Implementing Spatial Cross-Validation Strategies ensures training and validation folds are geographically separated, yielding realistic generalisation estimates. Block-based, buffered, or Voronoi tessellation splits should replace random shuffling during hyperparameter tuning.
from sklearn.model_selection import cross_val_score
# spatial_blocks: array of integer block IDs aligned with X rows
# Use GroupKFold so no block straddles train and validation
from sklearn.model_selection import GroupKFold
cv = GroupKFold(n_splits=5)
scores = cross_val_score(
spatial_pipeline,
X_train_coords, # shape (n_samples, 2) — projected coordinates
y_train,
groups=spatial_blocks,
scoring="r2",
cv=cv,
)
print(f"Spatial CV R²: {scores.mean():.3f} ± {scores.std():.3f}")Step 4: Diagnosing & Mitigating Spatial Leakage
Spatial autocorrelation violates the independent and identically distributed assumption underlying most classical ML algorithms. When residuals exhibit geographic clustering, the model is capturing unmodelled spatial processes rather than true predictive signals. This directly impacts deployment reliability — models trained on autocorrelated data often degrade rapidly when applied to unseen regions.
Compute global and local Moran’s I on pipeline residuals post-training. The esda library integrates naturally with libpysal weight matrices:
import libpysal
import esda
import numpy as np
# Fit the pipeline and compute residuals on held-out fold
spatial_pipeline.fit(X_train_coords, y_train)
residuals = y_val - spatial_pipeline.predict(X_val_coords)
# Build queen-contiguity weights from validation geometry centroids
w = libpysal.weights.KNN.from_array(X_val_coords, k=8)
w.transform = "r" # row-standardise
mi = esda.Moran(residuals, w)
print(f"Moran's I = {mi.I:.4f}, p = {mi.p_sim:.4f}")
if mi.p_sim < 0.05:
print("Significant residual autocorrelation detected — add spatial features or lag terms.")If significant clustering persists, consider:
- Adding spatial lag features (mean of k-nearest-neighbour target values) via spatial lag and neighbourhood statistics
- Incorporating higher-order neighbourhood features
- Applying spatial filtering techniques (e.g., eigenvector spatial filtering) before model ingestion
Detailed methodologies for identifying and correcting these dependencies are covered in Handling Spatial Autocorrelation.
Step 5: Serialization, Inference & Drift Monitoring
Production deployment requires deterministic serialization of the entire pipeline, not just the trained estimator. joblib is the standard for scikit-learn artifacts due to its efficient handling of NumPy arrays and Python object graphs.
import joblib
# Serialize the complete pipeline including custom transformers
joblib.dump(spatial_pipeline, "spatial_model_v1.joblib")
# Load for inference — validate CRS before calling predict
def load_and_predict(
pipeline_path: str,
inference_coords: np.ndarray,
incoming_epsg: int,
expected_epsg: int = 32633,
) -> np.ndarray:
"""Load serialized pipeline and run inference with CRS validation gate."""
if incoming_epsg != expected_epsg:
raise ValueError(
f"CRS mismatch at inference: incoming={incoming_epsg}, "
f"expected={expected_epsg}. Reproject before calling predict."
)
pipe = joblib.load(pipeline_path)
return pipe.predict(inference_coords)Inference automation introduces two critical failure modes:
- CRS drift — incoming data arrives in WGS84 while the pipeline expects a projected CRS. The validation gate above catches this explicitly.
- Feature distribution shift — environmental baselines change over time (land cover conversion, sensor recalibration). Deploy
evidentlyto monitor input feature distributions and model residuals against a reference dataset. Trigger automated retraining when statistical tests (Kolmogorov-Smirnov, Population Stability Index) exceed predefined thresholds.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
drift_report = Report(metrics=[DataDriftPreset()])
drift_report.run(reference_data=reference_df, current_data=inference_df)
drift_report.save_html("drift_report.html")Verification & Testing
Confirm the pipeline behaves correctly before deployment with three complementary checks:
Numerical assertion — shape and range:
preds = spatial_pipeline.predict(X_val_coords)
assert preds.shape == (len(X_val_coords),), "Prediction count mismatch"
assert np.all(np.isfinite(preds)), "Non-finite values in predictions"Round-trip serialization check:
import tempfile, os
with tempfile.NamedTemporaryFile(suffix=".joblib", delete=False) as f:
tmp_path = f.name
try:
joblib.dump(spatial_pipeline, tmp_path)
reloaded = joblib.load(tmp_path)
np.testing.assert_array_almost_equal(
spatial_pipeline.predict(X_val_coords),
reloaded.predict(X_val_coords),
decimal=10,
err_msg="Serialization changed predictions — check custom transformer pickling",
)
print("Round-trip serialization: OK")
finally:
os.unlink(tmp_path)Spatial residual autocorrelation (Moran’s I p-value should exceed 0.05 in well-specified models):
After running the Moran’s I block from Step 4, a non-significant result (p_sim >= 0.05) confirms the model is not systematically missing spatially structured variance.
Troubleshooting & Common Errors
ValueError: Input contains NaN
Root cause: nodata pixels from raster extraction were not replaced before pipeline entry. Fix: call np.nan_to_num(X, nan=0.0) or impute with sklearn.impute.SimpleImputer as the first pipeline step.
NotFittedError: This DistanceToReferenceTransformer instance is not fitted yet
Root cause: transform() called before fit(), often because the pipeline was loaded from disk but fit_transform was called on a stale object. Fix: always call pipeline.fit(X, y) on the full pipeline object, not the sub-transformer.
CRSError: Input is not a CRS when building spatial weights
Root cause: libpysal receives a plain NumPy array without CRS metadata. Fix: pass coordinates directly as a 2-column NumPy array (KNN.from_array(coords, k=8)), not a GeoDataFrame.
AttributeError: 'Pipeline' object has no attribute 'tree_'
Root cause: the pipeline was pickled before fitting. The trailing-underscore convention marks fitted attributes — if tree_ is missing, the transformer was never fitted. Always pickle after fit().
Silent metric inflation in cross-validation
Root cause: random KFold used instead of spatial block splitting. Fix: switch to GroupKFold with a groups array encoding spatial blocks, or use the spatial-cross-validation approach described in Spatial Cross-Validation Strategies.
Prediction output is all-zero or constant
Root cause: the StandardScaler inside the pipeline was fitted on a column with zero variance (e.g., a constant nodata fill). Fix: add sklearn.feature_selection.VarianceThreshold(threshold=0.0) before the scaler to drop zero-variance columns.
Performance Optimisation
Spatial index caching — build cKDTree from reference coordinates once during fit() and reuse it for all transform() calls. Never rebuild the tree inside transform().
Chunked batch inference — for raster-heavy workloads, process data in tiles aligned with the training CRS. Use dask.array or xarray to parallelise pipeline.predict() across spatial partitions, then reassemble into GeoTIFFs:
import dask.array as da
# Load large feature raster as dask array, predict in 256x256 tiles
X_dask = da.from_zarr("features.zarr")
preds = X_dask.map_blocks(
lambda block: spatial_pipeline.predict(block.reshape(-1, block.shape[-1])).reshape(block.shape[:2]),
dtype=np.float32,
)
preds.to_zarr("predictions.zarr")Memory budgeting — for large vector datasets, consider optimizing memory usage for large vector datasets before feeding data into the pipeline. Downcasting coordinate arrays from float64 to float32 halves KD-tree memory consumption with negligible precision loss for metre-scale projected CRS.
Joblib parallel inference — for real-time vector inference, precompute reference coordinates and cache spatial indexes in memory. Batch inference can leverage joblib.Parallel to distribute predictions across spatially contiguous data subsets, minimising disk I/O per worker.
FAQ
Why does standard K-fold cross-validation overestimate accuracy for geospatial models?
Spatially proximate samples share underlying environmental processes. When random folds place neighbours in both train and test sets, the model sees near-identical patterns during validation that it would never encounter in a true out-of-sample geographic region. Geography-aware splits that buffer training and validation regions eliminate this optimism bias. The gap between random-fold and spatial-fold scores is itself a diagnostic — a large gap signals strong autocorrelation that the model has not fully resolved.
Can I use joblib to serialize a pipeline containing GeoDataFrame-dependent transformers?
Yes, but only if the transformer stores plain NumPy arrays as fitted attributes rather than GeoDataFrame objects. GeoDataFrame carries Fiona and GDAL C extensions whose ABI can differ between the training and inference environments, causing unpickling failures. Store only projected coordinate arrays (shape (n_refs, 2)) inside your transformer.
How do I detect CRS drift in production inference?
Store the training EPSG code alongside the serialized pipeline artifact (in a sidecar JSON file or as an attribute on a wrapper class). At inference time, read the incoming data’s CRS, compare it to the stored EPSG, and either raise a descriptive error or auto-reproject before calling transform(). Log CRS mismatches as structured events so you can track their frequency and identify upstream data pipeline issues.
What is the difference between spatial leakage and feature leakage?
Feature leakage occurs when future or target-correlated information enters the training feature set. Spatial leakage is a geographic form: autocorrelation between training and validation observations inflates apparent performance without any direct information leak. Both must be prevented, but spatial leakage requires geography-aware cross-validation rather than standard data-splitting rules.
Related
- Spatial Cross-Validation Strategies — geography-aware fold construction and block splitting
- Handling Spatial Autocorrelation — diagnosing and correcting autocorrelated residuals
- Gradient Boosting for Raster Data — XGBoost and LightGBM tuned for raster feature matrices
- Spatial Lag and Neighbourhood Statistics — building lag features and contiguity matrices
- CRS Alignment and Projection Handling — ensuring metric consistency across all pipeline inputs