Spatial Feature Engineering for Machine Learning

Spatial feature engineering transforms raw geographic data into structured, model-ready representations that capture spatial relationships, topological constraints, and environmental context. Unlike conventional tabular datasets, geospatial inputs carry inherent coordinate dependencies, scale variations, multi-dimensional structures, and non-stationary distributions — every one of which can quietly corrupt a model if ignored. The gap between loading a GeoTIFF and producing a correctly-aligned, numerically stable feature matrix is wide enough that most production geospatial ML failures trace back to it.

This page covers the full technical scope of that gap: from CRS alignment and projection handling through raster band math and spectral index calculation, vector proximity and buffer generation, spatial lag and neighborhood statistics, feature scaling for geospatial inputs, and temporal aggregation for time-series geodata. For each area, it explains why the problem is non-trivial in production, demonstrates the canonical Python patterns, and flags the failure modes that appear only after a pipeline has been running for weeks.

The intended audience is practitioners building repeatable pipelines — not proof-of-concept notebooks. Every technique below has been chosen because it transfers directly to production inference workflows, not just offline experimentation.

Foundational Prerequisites

Every technique on this page relies on a consistent, version-pinned Python environment. Mismatched GDAL and PyProj versions alone account for a disproportionate share of hard-to-reproduce CRS bugs. Pin the following before anything else:

# requirements.txt — tested stack for spatial feature engineering
geopandas==1.0.1
rasterio==1.4.3
pyproj==3.7.1
shapely==2.0.6
scikit-learn==1.6.1
xarray==2025.3.1
numpy==2.2.4
pandas==2.2.3
scipy==1.15.2
libpysal==4.12.1
dask[distributed]==2025.4.1
zarr==3.0.8

System dependencies must also be pinned. On Ubuntu/Debian:

sudo apt-get install -y \
    gdal-bin=3.8.* \
    libgdal-dev=3.8.* \
    libproj-dev=9.4.*

Verify the install:

import rasterio, geopandas, pyproj, xarray
print(rasterio.__version__)   # 1.4.3
print(geopandas.__version__)  # 1.0.1
print(pyproj.__version__)     # 3.7.1
print(xarray.__version__)     # 2025.3.1

A version mismatch between pyproj and the system PROJ library surfaces as CRSError: Input is not a CRS — catch it here rather than three pipeline stages downstream.

Core Techniques

CRS Alignment and Projection Handling

Every spatial join, distance calculation, and rasterisation operation assumes that all inputs share a common coordinate reference system. When they do not, the errors are rarely obvious: geometries appear to overlap when they do not, distance features measure degrees instead of metres, and spatial joins return empty results with no exception raised. CRS alignment and projection handling is therefore the first gate every ingestion pipeline must enforce.

The canonical pattern is to define a single target CRS at pipeline initialisation, validate every incoming dataset against it, and reproject immediately on load rather than deferring:

import geopandas as gpd
import rasterio
from pyproj import CRS

TARGET_CRS = CRS.from_epsg(32633)  # UTM Zone 33N — metric, local accuracy

def load_and_align_vector(path: str) -> gpd.GeoDataFrame:
    gdf = gpd.read_file(path)
    if gdf.crs is None:
        raise ValueError(f"No CRS in {path} — cannot safely reproject")
    if not gdf.crs.equals(TARGET_CRS):
        gdf = gdf.to_crs(TARGET_CRS)
    assert gdf.crs.equals(TARGET_CRS), "Reprojection failed"
    return gdf

For raster inputs, align the transform and grid extent at load time using rasterio.warp.reproject. Pipelines that skip this step and instead rely on coordinate-level numeric coincidence will produce silently wrong features that only reveal themselves when model predictions diverge geographically from ground truth.

See fixing projection mismatches in Pandas GeoDataFrames for a step-by-step walkthrough of the most common failure patterns.

Raster Band Math and Index Calculation

Raw satellite digital numbers carry little direct predictive signal. They must be converted to surface reflectance, atmospherically corrected, and then transformed into physically interpretable indices before entering a model. Raster band math and index calculation covers this entire chain: reading multi-band rasters with rasterio, performing vectorised numpy band operations, and deriving spectral indices that compress multi-spectral information into single-channel predictors.

The most commonly used indices in geospatial ML are:

NDVI (Normalised Difference Vegetation Index): (NIR - Red) / (NIR + Red) — crop yield, biomass, land cover classification
EVI (Enhanced Vegetation Index): corrects for atmospheric and canopy background effects in dense vegetation
NDWI (Normalised Difference Water Index): open water detection, flood risk
NDBI (Normalised Difference Built-Up Index): urban extent mapping

import numpy as np
import rasterio

def compute_ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Compute NDVI; returns float32 array in [-1, 1]."""
    red = red.astype(np.float32)
    nir = nir.astype(np.float32)
    denom = nir + red
    ndvi = np.where(denom == 0, np.nan, (nir - red) / denom)
    return ndvi.astype(np.float32)

with rasterio.open("sentinel2_scene.tif") as src:
    red = src.read(4)   # Band 4 — Red (Sentinel-2)
    nir  = src.read(8)  # Band 8 — NIR
    ndvi = compute_ndvi(red, nir)

Beyond spectral indices, texture features from gray-level co-occurrence matrices (GLCM) — contrast, correlation, energy — capture spatial heterogeneity that single-pixel indices miss. For large-scale processing, use rasterio’s windowed reading to keep memory flat regardless of scene size. See how to calculate NDVI and EVI with rasterio for the full implementation.

Vector Proximity and Buffer Generation

Vector layers encode discrete spatial entities — parcels, road segments, administrative zones, point-of-interest records — whose predictive value lies in their spatial relationships to observations rather than their attribute values alone. Vector proximity and buffer generation operationalises these relationships as numeric features: nearest-neighbour distances, multi-ring buffer aggregations, and catchment-weighted attribute sums.

import geopandas as gpd
import numpy as np

def nearest_neighbour_distance(
    targets: gpd.GeoDataFrame,
    sources: gpd.GeoDataFrame
) -> np.ndarray:
    """
    Return the distance from each target geometry to its nearest source geometry.
    Both GeoDataFrames must share the same projected CRS.
    """
    assert targets.crs == sources.crs, "CRS mismatch before distance calc"
    # geopandas 1.x sjoin_nearest returns distance column
    joined = gpd.sjoin_nearest(targets, sources, distance_col="_dist")
    return joined["_dist"].values

Topological validity is a hard prerequisite: shapely’s is_valid predicate must return True for every geometry before it enters a spatial join or buffer operation. Invalid geometries — self-intersections, unclosed rings, sliver polygons — cause silent errors or return geometrically nonsensical results. Apply buffer(0) as a snapping heuristic for minor invalidity, and flag anything that remains invalid for manual inspection rather than silently dropping it.

For distance matrices across large observation sets, see creating distance matrices for spatial features for R-tree-accelerated implementations that scale to millions of geometries.

Spatial Lag and Neighborhood Statistics

Geographic phenomena rarely distribute independently. Tobler’s First Law — nearby things are more related than distant things — means that a model trained without spatial context features will systematically under-fit local patterns. Spatial lag and neighborhood statistics operationalise this principle as engineering steps: computing spatially lagged values, local Moran’s I, Getis-Ord Gi* hotspot scores, and moving-window aggregations that feed directly into feature matrices.

import libpysal
from esda import Moran_Local
import geopandas as gpd

def local_morans_i_features(
    gdf: gpd.GeoDataFrame,
    column: str,
    k: int = 8
) -> gpd.GeoDataFrame:
    """Append Local Moran's I and cluster labels to the GeoDataFrame."""
    w = libpysal.weights.KNN.from_dataframe(gdf, k=k)
    w.transform = "r"  # row-standardise
    lm = Moran_Local(gdf[column].values, w)
    gdf = gdf.copy()
    gdf["local_moran_i"] = lm.Is
    gdf["local_moran_q"]  = lm.q   # 1=HH, 2=LH, 3=LL, 4=HL
    return gdf

Spatial weights matrix construction requires deliberate choices: queen/rook contiguity for regular grids, k-nearest neighbour for point clouds, or inverse-distance weighting for continuous surfaces. Row-standardisation (w.transform = "r") is essential for comparability across neighbourhood sizes. For a full walkthrough, see computing local Moran’s I for feature engineering.

Critically, spatial autocorrelation in the feature matrix requires spatial cross-validation strategies at evaluation time — random train/test splits will leak autocorrelated signal and produce inflated performance metrics.

Feature Scaling for Geospatial Inputs

Geospatial feature matrices are almost always heteroscedastic: elevation spans thousands of metres, spectral indices are bounded in [-1, 1], and population density can range from near-zero to tens of thousands per square kilometre. Without explicit normalisation, gradient-based optimisers converge slowly, distance-weighted algorithms are dominated by high-magnitude variables, and tree models that otherwise tolerate scale differences still suffer from poor split thresholds on skewed distributions.

Feature scaling for geospatial inputs covers the full pipeline: choosing between StandardScaler, RobustScaler, and QuantileTransformer based on the distribution of each feature type, fitting scalers exclusively on training folds to prevent leakage, and serialising scaler state alongside model artefacts for identical inference-time transforms.

from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
import numpy as np

# RobustScaler is preferred for features with heavy-tailed distributions
# (elevation, proximity distances, population density)
scaler = RobustScaler(quantile_range=(5, 95))

# Fit only on training data — NEVER on the full dataset
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled   = scaler.transform(X_val)   # apply fitted params to val

Serialise the fitted scaler with joblib.dump alongside the model weight file. Pipelines that refit scalers at inference time on individual tiles produce features that differ systematically from training-time distributions, causing degraded predictions on edge tiles or scenes with unusual coverage.

See optimizing memory usage for large vector datasets for scaling workflows that process datasets too large to fit in RAM.

Temporal Aggregation for Time-Series Geodata

Geospatial data is inherently dynamic. Seasonal cycles, diurnal patterns, and event-driven changes require temporal feature engineering to capture evolution rather than static snapshots. A single Sentinel-2 image of a crop field carries far less predictive signal than a composited time series that encodes phenological curves, growing-season trajectories, and anomaly departures from multi-year baselines.

Temporal aggregation for time-series geodata covers monthly compositing, rolling-window statistics, harmonic regression for seasonal decomposition, and the timestamp-aware train/test splits that prevent temporal leakage.

import xarray as xr
import numpy as np

def monthly_median_composite(
    da: xr.DataArray,
    time_dim: str = "time"
) -> xr.DataArray:
    """
    Collapse a daily or multi-date DataArray to monthly medians.
    Preserves spatial dimensions; handles NaN (cloud mask) via skipna.
    """
    return da.resample({time_dim: "1ME"}).median(skipna=True)

# Usage: da has dims (time, y, x)
monthly = monthly_median_composite(da)
assert "time" in monthly.dims
assert monthly.sizes["time"] == 12  # one composite per month for one year

Temporal alignment is the hidden complexity: irregular satellite revisit intervals, sensor drift between missions, and variable cloud cover all introduce gaps that require explicit imputation strategies. See aggregating daily satellite data to monthly features for a complete pipeline including gap-filling and quality flag propagation.

MLOps and Pipeline Integration

Feature engineering steps only deliver production value when they are automated, versioned, and reproducible at inference time. The engineering patterns above — CRS alignment, index calculation, spatial lag, temporal compositing — must be encapsulated as deterministic pipeline stages with explicit contracts between them.

Orchestration

Orchestration frameworks — Apache Airflow, Prefect, or Dagster — manage DAG execution across rasterisation, vector aggregation, and feature concatenation. Each stage should be idempotent: given the same inputs and parameters, it must produce bit-identical outputs. This requires pinning spatial index versions, fixing numpy random seeds for any stochastic steps, and enforcing deterministic sort order on spatial joins (which vary by execution environment without explicit sort_keys parameters).

Data Versioning

Spatial datasets are large and mutable. Satellite archives are updated retroactively as re-processing algorithms improve; administrative boundaries are reissued annually; elevation models are corrected as survey data accumulates. Data versioning tools such as DVC or LakeFS track raw inputs and intermediate feature stores alongside code, ensuring that model retraining uses snapshots that are reproducible months later.

The most critical versioning point is the serialised feature transformer state: the fitted StandardScaler or spatial weights matrix W that was active when the model was trained must be stored and loaded at inference time. A model retrained with new data but applied with old scaler parameters will produce systematically wrong features without raising an exception.

Containerised Inference

Inference services must bundle the full feature engineering stack — GDAL, PROJ, PyProj, rasterio, geopandas, and all fitted transformer artefacts — in a single container image. Spatial tiling handles large-area prediction requests: incoming bounding boxes are split into overlapping tiles (typically 512 × 512 pixels with a 64-pixel overlap), each tile is processed through the feature pipeline, predictions are generated, and outputs are merged using rasterio.merge with a blending function that eliminates the tile boundary artefacts that otherwise appear in seamless mosaics.

Performance and Scalability

Chunked I/O

Never load an entire raster scene into memory. Use rasterio’s windowed reading to process one block at a time, or xarray with Dask-backed chunks:

import dask.array as da
import xarray as xr

# Open with Dask chunking — lazy, memory-flat
ds = xr.open_dataset(
    "large_scene.nc",
    chunks={"time": 1, "y": 512, "x": 512},
    engine="netcdf4"
)
# Computation is deferred until .compute() or .load()
ndvi = (ds["nir"] - ds["red"]) / (ds["nir"] + ds["red"])

Chunk size selection matters: too small and task-graph overhead dominates; too large and worker memory limits are breached. For Sentinel-2 at 10 m resolution, {"y": 512, "x": 512} typically balances well across a 32 GB worker.

Cloud-Optimised Formats

Use Cloud-Optimised GeoTIFF (COG) for raster data and GeoParquet for vector data in production pipelines. Both formats support range requests, enabling lazy loading of spatial subsets without full-file downloads. Zarr v3 is the preferred format for large xarray time-series stacks: it supports parallel read/write across Dask workers and compresses efficiently with Blosc or Zstd.

Avoid writing or reading Shapefile in any production code path. Shapefile splits geometry, attributes, and index across three or more files, lacks native support for UTF-8 strings, and truncates column names at 10 characters — all of which produce silent data loss.

Memory Budgets

A working rule for spatial feature pipelines: allocate 3× the raw raster tile size as working memory per worker. A 512 × 512 × 12-band float32 tile is ~12 MB raw; allow ~36 MB per worker thread for intermediate arrays. For vector pipelines, monitor GeoDataFrame memory with gdf.memory_usage(deep=True) and cast geometry to float32 coordinates where sub-metre precision is not required to reduce footprint by 40–50%.

Common Failure Modes

1. Silent CRS mismatch on spatial join. A gpd.sjoin() called on two GeoDataFrames with different CRS values — one geographic (EPSG:4326), one projected (EPSG:32633) — does not raise an error in geopandas 0.x. It returns an empty or near-empty result, or matches on numerically close but geographically nonsensical coordinate pairs. Fix: assert left.crs == right.crs before every join.

2. Geometry invalidity silently propagating. shapely.is_valid returning False causes gdf.buffer(radius) and gdf.intersection(other) to produce empty or degenerate geometries downstream. Fix: validate at ingestion with gdf[~gdf.is_valid] and apply gdf.geometry = gdf.geometry.buffer(0) as a snapping heuristic. Log anything that remains invalid rather than dropping it silently.

3. Scaler refit at inference time. Refitting a StandardScaler on each inference tile produces tile-specific mean and variance estimates that differ from training-time parameters. Model predictions shift systematically between tiles, appearing as seams in continuous raster outputs. Fix: serialise the fitted scaler with joblib.dump at training time and load it with joblib.load at inference time.

4. Temporal leakage from random train/test splits. A random 80/20 split across time-series observations includes future observations in the training fold. Lag features and rolling statistics computed from the full sequence then contain future-derived signal that the model exploits at training time but cannot access at deployment. Fix: use strictly temporal splits — all training observations must precede all validation observations in calendar time.

5. Spatial autocorrelation leakage. Random spatial splits allow nearby observations to appear in both train and test sets. Because spatially autocorrelated features are nearly identical for close neighbours, the model memorises local spatial patterns rather than learning generalisable relationships. Fix: use block spatial cross-validation with minimum block sizes tuned to the spatial autocorrelation range of the target variable.

6. Band index offset errors. rasterio.read(1) returns band 1 (one-indexed). NumPy array indexing is zero-indexed. Confusing the two causes NDVI to be computed from the wrong bands — the error is numerically plausible (values in [-1, 1]) and will not trigger an assertion unless you validate the expected spectral range of each band. Fix: define named constants for band indices and assert reflectance ranges after loading.

Best Practices Checklist

Pin all spatial library versions (geopandas, rasterio, pyproj, shapely, GDAL system lib) in requirements.txt and lock with pip-compile or conda env export
Enforce a single target CRS at pipeline initialisation; reproject every input immediately on load, not lazily during joins
Validate geometry validity at ingestion; log and quarantine invalid geometries rather than silently dropping or patching them
Fit all feature scalers exclusively on training folds; serialise fitted state with joblib.dump and load at inference time
Use temporal splits (not random splits) for time-series features; use block spatial cross-validation for spatially autocorrelated targets
Process rasters in chunks via rasterio windowed reads or xarray + Dask; never load a full scene into memory
Store intermediate feature stores in GeoParquet (vector) or COG/Zarr (raster); avoid Shapefile in any production code path
Define named constants for raster band indices and assert reflectance ranges after loading each band
Version raw input snapshots with DVC or LakeFS alongside model artefacts and transformer state files
Monitor inference feature distributions against training baselines with Kolmogorov-Smirnov or Wasserstein distance; trigger retraining when drift thresholds are exceeded

FAQ

Why do spatial joins return empty results even when geometries visually overlap?

The most common cause is a CRS mismatch. When one GeoDataFrame is in EPSG:4326 (degrees) and the other in a projected CRS (metres), coordinate values are numerically incommensurable — no geometries appear to intersect. Older geopandas versions do not raise an error in this case; they silently return an empty join result. Always assert left.crs == right.crs before calling sjoin(). A secondary cause is geometry invalidity: self-intersecting polygons do not overlap in the topological sense even when they visually appear to.

Should I use StandardScaler or RobustScaler for geospatial features?

Use RobustScaler for features derived from physical measurements with heavy-tailed distributions: elevation, slope, proximity distances, and population density all have outliers that StandardScaler propagates into the scaled feature space, pulling the mean and variance away from the bulk of the data. Use StandardScaler for bounded spectral indices (NDVI, EVI, NDWI) that are already approximately normally distributed after cloud masking. Use QuantileTransformer for features with strongly multi-modal distributions before applying dimensionality reduction.

How do I prevent spatial leakage in cross-validation?

Replace sklearn.model_selection.KFold with a block-based spatial split: assign each observation to a spatial block (by H3 index, regular grid, or clustering), then hold out entire blocks as validation folds. The block size must exceed the spatial autocorrelation range of the target variable — if crop yield is autocorrelated within a 5 km radius, blocks smaller than 5 km will still leak signal between train and validation sets. libpysal and scikit-learn’s GroupKFold can implement this pattern with block assignments as group labels.

Why do my raster predictions show visible seams at tile boundaries?

Seams appear when tiles are processed independently and features at tile edges are computed from incomplete neighbourhoods — a 3 × 3 moving window centred on an edge pixel has missing neighbours outside the tile. Fix: use overlapping tiles with a buffer equal to the maximum neighbourhood radius of any windowed feature. After prediction, discard the buffer zone and merge only the central regions. rasterio.merge with method="first" or a distance-weighted blending function eliminates remaining boundary artefacts.

Can I use Dask with geopandas for large vector datasets?

Yes, via the dask-geopandas library, which wraps a GeoDataFrame in Dask partitions. Spatial operations like sjoin, buffer, and dissolve are parallelised across partitions. However, operations that require global spatial context — R-tree index construction, spatial weights matrix computation, spatial joins across partition boundaries — still require collecting results to a single process. Partition your data by spatial tiles (e.g., UTM grid squares) to minimise cross-partition communication during joins.