Gradient Boosting for Raster Geospatial Data

Train gradient boosting models on raster data with XGBoost and LightGBM: spatial cross-validation, memory-safe raster sampling, chunked inference, and production MLOps integration.

Gradient boosting has become the standard for high-accuracy spatial prediction — land cover classification, environmental variable interpolation, flood-risk scoring — because tree ensembles deliver robust performance on structured geospatial features without requiring massive labelled datasets or GPU infrastructure. The real challenge is not the algorithm itself: it is the chain of spatial preprocessing, validation, and inference steps that must be correct before model accuracy means anything.

This page covers the full production workflow, from raw raster ingestion to drift-aware monitoring. It sits under the Training Geospatial Predictive Models in Python pillar and connects directly to the sibling topics you will need: spatially aware validation, autocorrelation handling, and hyperparameter search.


Prerequisites and Environment Setup

The pipeline requires a Python 3.9+ environment. Pin these versions to avoid breaking API changes:

rasterio==1.3.10
rioxarray==0.15.5
xarray==2024.6.0
geopandas==0.14.4
shapely==2.0.4
xgboost==2.0.3
lightgbm==4.3.0
scikit-learn==1.5.0
numpy==1.26.4
dask==2024.6.2

Install with:

pip install rasterio rioxarray xarray geopandas shapely xgboost lightgbm scikit-learn numpy dask

System-level dependencies: GDAL 3.6+, PROJ 9.2+, libgeos 3.11+. On Ubuntu/Debian:

sudo apt-get install libgdal-dev libproj-dev libgeos-dev

You also need:

  • A spatially indexed training dataset (points or polygons) with verified ground-truth labels
  • Co-registered raster layers sharing identical CRS, resolution, and temporal windows — CRS alignment and projection handling must be applied before any step below
  • Minimum 32 GB RAM for in-memory sampling; Dask chunking is mandatory for datasets exceeding 16 GB

Step-by-Step Implementation

Step 1: Raster Ingestion and Spatial Alignment

Raster misalignment is the most common source of silent model degradation. Before feature extraction, every input layer must share identical extent, resolution, CRS, and nodata convention. Misaligned pixels introduce spatial bias that the booster will learn as real signal, inflating training metrics while destroying field performance.

import rioxarray
import xarray as xr
from pathlib import Path
import numpy as np


def align_rasters(
    raster_paths: list[Path],
    target_crs: str = "EPSG:4326",
    res: float = 0.0001,
) -> xr.Dataset:
    """Load, reproject, and stack raster layers into a single xarray Dataset.

    The first path defines the spatial reference grid. All subsequent
    layers are reprojected to match it exactly via reproject_match.
    """
    ref = (
        rioxarray.open_rasterio(raster_paths[0], masked=True)
        .rio.write_crs(target_crs)
        .rio.reproject(target_crs, resolution=(res, res))
    )

    aligned = [ref]
    for path in raster_paths[1:]:
        ds = rioxarray.open_rasterio(path, masked=True).rio.write_crs(target_crs)
        # reproject_match handles both geometric transform and resampling
        aligned.append(ds.rio.reproject_match(ref))

    merged = xr.merge(aligned, compat="override")
    return merged

reproject_match handles geometric transformation and resampling in one call. Always inspect nodata propagation after alignment — mismatched masks corrupt downstream feature matrices silently.

For continuous bands (spectral reflectance, temperature) use bilinear resampling. For categorical layers (land-use class, soil type) use nearest-neighbour. Pass resampling=rasterio.enums.Resampling.nearest or .bilinear to reproject_match explicitly rather than relying on the default.

Step 2: Spatial Feature Extraction

Once aligned, raster values must be sampled at training locations. For point data, direct coordinate extraction suffices. For polygons, compute zonal statistics (mean, median, standard deviation) per polygon before assembling the feature matrix. The function below handles point sampling with strict nodata filtering:

import geopandas as gpd
import pandas as pd


def extract_training_features(
    aligned_ds: xr.Dataset,
    labels_gdf: gpd.GeoDataFrame,
    feature_names: list[str],
    label_col: str = "label",
) -> tuple[np.ndarray, np.ndarray]:
    """Sample raster values at labelled point locations.

    Returns (X, y) as float32 arrays with nodata rows removed.
    """
    labels_gdf = labels_gdf.to_crs(aligned_ds.rio.crs)

    xs = xr.DataArray(labels_gdf.geometry.x.values, dims="points")
    ys = xr.DataArray(labels_gdf.geometry.y.values, dims="points")
    sampled = aligned_ds.sel(x=xs, y=ys, method="nearest")

    df = sampled.to_dataframe().reset_index()
    df.columns = feature_names + [label_col]
    clean_df = df.dropna()

    X = clean_df[feature_names].values.astype("float32")
    y = clean_df[label_col].values
    return X, y

Memory efficiency is critical at scale. For millions of training points, replace xarray.sel() with rasterio.sample:

import rasterio

def extract_with_rasterio(path: Path, coords: list[tuple]) -> np.ndarray:
    """Streaming point extraction — avoids loading the full raster into memory."""
    with rasterio.open(path) as src:
        values = list(src.sample(coords))
    return np.array(values, dtype="float32")

Applying raster band math and index calculation (NDVI, EVI, NDWI) before extraction expands the feature space considerably without additional labelled data.

Step 3: Model Training and Hyperparameter Optimisation

With a clean (X, y) matrix, gradient boosting training proceeds through the scikit-learn API. The configuration below applies to XGBoost 2.x — note that use_label_encoder was removed in version 2.0 and must be omitted entirely:

from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


def train_booster(
    X: np.ndarray,
    y: np.ndarray,
    test_size: float = 0.2,
    n_estimators: int = 1000,
) -> XGBClassifier:
    """Train an XGBoost classifier with early stopping on a holdout split."""
    le = LabelEncoder()
    y_encoded = le.fit_transform(y)

    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y_encoded, test_size=test_size, random_state=42
    )

    model = XGBClassifier(
        n_estimators=n_estimators,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        eval_metric="mlogloss",
        early_stopping_rounds=20,
        tree_method="hist",   # faster on CPU; switch to "gpu_hist" for GPU
    )

    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    model.label_encoder_ = le
    return model

For production-grade performance, systematic hyperparameter optimisation is non-negotiable. Random search or Bayesian optimisation consistently outperforms grid search on geospatial feature spaces — see Hyperparameter Tuning for XGBoost on Geodata for domain-specific search spaces and early-stopping configurations tailored to multi-band raster inputs.

Step 4: Spatial Validation and Autocorrelation Mitigation

Standard k-fold cross-validation assumes independent and identically distributed samples. Geospatial data violates this assumption: nearby pixels share similar environmental conditions, so a random split leaks spatial information into the validation fold and inflates reported accuracy.

Block cross-validation partitions the study area into non-overlapping geographic tiles. Training and validation folds are separated in space, not just randomly shuffled. This is the correct evaluation strategy for any raster-based model.

The diagram below illustrates the difference between a random split (left) and a block split (right) over a 6×6 pixel grid:

Random vs Block Cross-Validation Left panel: random train/validation split showing interleaved pixels. Right panel: block spatial split showing geographically separated train and validation zones. Random split Block spatial split spatial leakage Train Validate Train pixels Validation pixels

Spatial cross-validation strategies covers block CV implementation, spatial buffering, and leave-one-region-out patterns. When autocorrelation persists despite spatial partitioning, model it explicitly — see handling spatial autocorrelation for Moran’s I diagnostics and eigenvector filtering.

Production validation should report both standard metrics (F1, RMSE) and spatial metrics (Moran’s I on residuals, spatially stratified confusion matrices) to catch models that perform well only in densely sampled regions.

Step 5: Chunked Inference and Automated Raster Generation

Predicting pixel-by-pixel is computationally prohibitive; loading entire scenes into memory fails on standard infrastructure. The rasterio window API streams predictions and writes output tiles incrementally:

import rasterio
from rasterio.windows import Window


def predict_raster_chunked(
    model,
    raster_paths: list[Path],
    output_path: Path,
    nodata_value: int = -9999,
) -> None:
    """Chunked inference using rasterio block windows.

    Reads each input band in the same window, predicts, and writes
    compressed output tile-by-tile to avoid memory exhaustion.
    """
    with rasterio.open(raster_paths[0]) as ref:
        meta = ref.meta.copy()
        meta.update(count=1, dtype="int32", compress="lzw", nodata=nodata_value)

        with rasterio.open(output_path, "w", **meta) as dst:
            for _, window in ref.block_windows(1):
                bands = []
                for p in raster_paths:
                    with rasterio.open(p) as src:
                        bands.append(src.read(1, window=window).astype("float32"))

                stack = np.stack(bands, axis=0)       # (n_bands, h, w)
                n_bands, h, w = stack.shape
                flat = stack.reshape(n_bands, -1).T    # (n_pixels, n_bands)

                nodata_mask = np.any(np.isnan(flat), axis=1)
                preds = np.full(flat.shape[0], nodata_value, dtype="int32")

                if (~nodata_mask).any():
                    preds[~nodata_mask] = model.predict(flat[~nodata_mask])

                dst.write(preds.reshape(h, w), 1, window=window)

This scales linearly with disk I/O rather than RAM. For multi-GPU or distributed environments, wrap the chunking logic with dask.array and joblib to parallelise across tiles.

Step 6: Drift Detection and Continuous Monitoring

Geospatial models degrade due to sensor recalibration, seasonal phenology, land cover change, and atmospheric variability. Production pipelines must monitor both covariate drift (input feature distribution shifts) and concept drift (the relationship between features and labels changes independently of the inputs).

Implement lightweight statistical tests on rolling feature windows:

from scipy import stats
import numpy as np


def compute_ks_drift(
    reference: np.ndarray,
    current: np.ndarray,
    threshold: float = 0.05,
) -> dict[str, float | bool]:
    """Kolmogorov-Smirnov drift test for a single continuous feature band.

    Returns statistic, p-value, and a drift flag.
    """
    stat, p_value = stats.ks_2samp(reference, current)
    return {
        "ks_statistic": float(stat),
        "p_value": float(p_value),
        "drift_detected": p_value < threshold,
    }


def population_stability_index(
    expected: np.ndarray,
    actual: np.ndarray,
    bins: int = 10,
) -> float:
    """PSI for categorical or binned raster class distributions."""
    exp_pct = np.histogram(expected, bins=bins)[0] / len(expected)
    act_pct = np.histogram(actual, bins=bins)[0] / len(actual)
    # Avoid log(0)
    exp_pct = np.where(exp_pct == 0, 1e-6, exp_pct)
    act_pct = np.where(act_pct == 0, 1e-6, act_pct)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

When drift thresholds are breached, trigger automated retraining pipelines that pull the latest labelled samples, re-run spatial validation, and register updated model versions. Integrate with MLflow for lineage tracking — every raster prediction should be reproducible and auditable.


Verification and Testing

After completing Steps 1–5, run these checks before committing predictions to production:

import rasterio
import numpy as np


def verify_output_raster(
    output_path: Path,
    input_path: Path,
    nodata_value: int = -9999,
) -> None:
    """Assert output raster is spatially consistent with the input."""
    with rasterio.open(input_path) as src_ref, rasterio.open(output_path) as src_out:
        assert src_out.crs == src_ref.crs, f"CRS mismatch: {src_out.crs} vs {src_ref.crs}"
        assert src_out.transform == src_ref.transform, "Transform mismatch"
        assert src_out.shape == src_ref.shape, f"Shape mismatch: {src_out.shape} vs {src_ref.shape}"

        data = src_out.read(1)
        valid_pixels = data[data != nodata_value]
        assert len(valid_pixels) > 0, "Output contains only nodata — check input alignment"

        unique_classes = np.unique(valid_pixels)
        print(f"Valid pixels: {len(valid_pixels):,}")
        print(f"Predicted classes: {unique_classes}")
        print(f"Nodata fraction: {(data == nodata_value).mean():.3f}")

Visual sanity check — plot the output alongside an input band in the same extent and verify spatial patterns are coherent with the input imagery. Misaligned outputs show up immediately as class boundaries that do not match visible land features.


Troubleshooting and Common Errors

rasterio.errors.CRSError: Input is not a CRS

Cause: The layer was opened without calling rio.write_crs() first, or the source file has a corrupt or missing projection file (.prj).
Fix: Inspect the file with rasterio.open(path).crs and call .rio.write_crs("EPSG:XXXX") with the known CRS before any spatial operation.

ValueError: operands could not be broadcast together with shapes (H1,W1) (H2,W2)

Cause: Two raster layers have different pixel grids despite appearing to cover the same area. Even a single-pixel offset breaks stacking.
Fix: Run align_rasters() using reproject_match against a single reference before extracting features. Never assume layers from different sources are already aligned.

XGBoostError: Invalid label ... use integer labels

Cause: String class labels passed directly to XGBClassifier.fit().
Fix: Encode labels with sklearn.preprocessing.LabelEncoder before calling fit(). Store the encoder alongside the model for inverse-transforming predictions.

Silent all-nodata output raster

Cause: The nodata mask covers all pixels, usually because the input bands have incompatible nodata conventions (NaN vs. -9999 vs. 0) that did not get harmonised during alignment.
Fix: Explicitly convert all nodata representations to np.nan immediately after opening each layer: ds.where(ds != src_nodata).

MemoryError during xarray.sel() point extraction

Cause: Loading the full aligned dataset into memory before sampling.
Fix: Switch to rasterio.sample() for point extraction or use dask-backed xarray arrays with lazy evaluation.

Accuracy score drops sharply in a specific geographic quadrant

Cause: Training data was spatially clustered, leaving entire landscape types underrepresented. Standard metrics averaged this away.
Fix: Compute spatially stratified confusion matrices per geographic fold. Use spatial cross-validation strategies with explicit holdout regions that cover all landscape types.


Performance Optimisation

Chunked I/O over block windows: The rasterio block window pattern in Step 5 is the correct baseline. For maximum throughput on NVMe storage, align chunk sizes to the raster’s internal tile structure (src.block_shapes).

Dask parallelism for large arrays: Convert the aligned xr.Dataset to a Dask array before extraction with ds.chunk({"x": 1024, "y": 1024}). This enables parallelised .sel() and apply_ufunc calls across CPU cores without loading the full dataset.

Feature scaling before boosting: While gradient boosting is invariant to monotonic feature transformations, explicitly scaling geospatial inputs (log-transform skewed spectral bands, standardise elevation) can accelerate convergence and reduce sensitivity to outliers in corrupted pixels.

Spatial indexing for training point lookups: If training labels are stored as a GeoDataFrame with millions of rows, build an STRtree index with geopandas.sindex before any spatial join. This reduces nearest-neighbour lookup from O(n) to O(log n).

Cloud-Optimised GeoTIFF for output: Write output predictions as COG (Cloud-Optimised GeoTIFF) using rasterio’s TILED=YES and COMPRESS=LZW creation options. This enables HTTP range-request access from object storage without downloading the full file.


FAQ

Why does standard k-fold cross-validation overestimate accuracy for raster-based models?

Spatial autocorrelation means nearby pixels share highly similar environmental conditions. When training and validation pixels sit next to each other, the model effectively “sees” its validation set indirectly through the training data. Block cross-validation separates folds geographically so evaluation happens on truly unseen spatial regions. Without this, reported accuracy can be 15–30 percentage points higher than actual field performance.

Should I use XGBoost or LightGBM for multi-band satellite imagery classification?

Both achieve comparable accuracy on tabular raster features. LightGBM trains faster on high-cardinality feature sets and uses less memory. XGBoost has broader ecosystem support and more mature monotonic-constraint handling. Run a short hyperparameter search on a held-out spatial fold to decide for your specific sensor stack — the difference in tuned performance is usually small.

How do I handle nodata pixels during chunked inference?

Create a boolean mask from np.isnan on the flattened feature array. Apply the model only to unmasked rows, then write a sentinel value (e.g., -9999) into masked positions in the output array. Always set nodata in the output raster metadata to match this sentinel so downstream tools interpret masked pixels correctly.

What causes silent accuracy degradation in production geospatial boosting models?

The four leading causes are: (1) covariate drift from sensor recalibration or atmospheric correction changes; (2) seasonal phenology shifts that move feature distributions outside the training range; (3) CRS or resolution mismatches introduced by a pipeline update; (4) nodata mask changes that alter which pixels reach the model. Track PSI and KS statistics on each input band after every inference run.


Part of: Training Geospatial Predictive Models in Python