Hyperparameter Tuning for XGBoost on Geodata

Q: How do I generate a spatial_block column for GroupKFold?

Create a regular grid over your study extent using geopandas with a shapely grid generator, then assign each training point to a grid cell via geopandas.sjoin. The resulting integer cell ID becomes the group label for GroupKFold. Alternatively, use administrative boundaries (watershed, county, ecoregion) if they provide sufficient geographic stratification for your domain.

Use GroupKFold with a spatial block column and Optuna’s Bayesian sampler to tune XGBoost on raster or vector geodata. Replace standard K-fold with geographically disjoint folds, constrain max_depth to 3–6 to prevent memorization of local terrain artifacts, and search learning_rate, subsample, and regularization parameters jointly. Validate tuned models against a completely disjoint holdout region before registering in your MLOps pipeline.

Part of: Gradient Boosting for Raster Data

Why Standard Tuning Fails on Spatial Data

Geospatial datasets fundamentally violate the independent and identically distributed (i.i.d.) assumption that underpins standard hyperparameter search. When nearby pixels or adjacent polygons share the same environmental gradients, soil types, or spectral reflectance values, their feature vectors become highly correlated. Random K-fold splits place spatially adjacent training and validation samples in the same fold, allowing XGBoost to learn local geographic patterns that happen to appear in the validation fold — not because the model is generalising, but because it has already seen those patterns from immediate neighbours.

The consequence is silent metric inflation. RMSE or AUC during tuning looks excellent; field performance collapses when the model encounters an unseen landscape. Worse, the tuned hyperparameters reinforce the leakage: small max_depth limits that would constrain local memorization get penalised because they underfit on the artificially easy validation folds. You end up with a model that is precisely tuned to the wrong objective.

The fix is a two-pronged approach: partition training data into non-overlapping spatial blocks before any search begins, and constrain the parameter search space to values that force the model toward broader spatial trends rather than local noise. As detailed in spatial cross-validation strategies, geographic partitioning is not optional — it is a prerequisite for meaningful validation on any spatially autocorrelated dataset.

Spatial Tuning Flow

The diagram below shows how a standard random-search pipeline (left) compares with a spatially blocked Optuna search (right). The critical difference is where spatial block assignment happens — it must precede the search loop, not be generated inside it.

Geospatial Parameter Constraints

Default XGBoost search ranges overfit to raster artifacts. The table below lists recommended bounds and the geospatial reasoning behind each constraint. These are not arbitrary — they reflect the structure of spatially autocorrelated raster and vector feature spaces.

Parameter	Recommended Range	Geospatial Rationale
`max_depth`	3–6	Limits tree complexity to avoid memorizing micro-topography, sensor noise, or polygon edge artifacts. Deeper trees learn local spatial fingerprints rather than regional gradients.
`learning_rate`	0.01–0.1	Slower convergence paired with `n_estimators` 300–1500 stabilizes learning across spatial gradients. Fast rates plus shallow trees tend to partition on spatially correlated noise.
`subsample` / `colsample_bytree`	0.6–0.9	Forces feature diversity across correlated spectral bands, DEM derivatives, and proximity metrics produced by vector proximity and buffer generation.
`min_child_weight`	3–10	Prevents splits driven by isolated sensor errors, cloud shadows, or single-pixel outliers. Spatially sparse label distributions need a higher minimum.
`reg_alpha` / `reg_lambda`	1e-3 – 1.0 (log scale)	L1/L2 regularization is critical when neighboring pixels share near-identical feature vectors — without it, the model weight matrix becomes degenerate.
`gamma`	0.0–5.0	Raises the minimum loss reduction required for a split, filtering noise-driven partitions from cloud contamination or atmospheric correction artifacts.

When your feature space includes spectral indices such as NDVI or EVI — generated by raster band math and index calculation — these indices are often highly correlated across adjacent pixels. Higher colsample_bytree dropout (0.5–0.7) may be warranted to prevent the booster from relying on a single dominant index.

Core Principles for Spatial Hyperparameter Search

Assign spatial blocks before any data splitting. The block column must be derived from the raw dataset geometry, not from the feature matrix. Generate it once, persist it, and use the same blocks for every Optuna trial.
Use GroupKFold, not StratifiedKFold or plain KFold. The groups argument in GroupKFold.split() guarantees that no block appears in both train and validation within any fold.
Constrain search bounds, do not widen them. Wider bounds for max_depth or n_estimators produce faster apparent convergence in leaky folds but degrade on disjoint geography. Narrow bounds force the search to find parameters that generalise.
Minimise MSE on spatially blocked folds, not on a random holdout. The Optuna objective function should return the cross-validated score from GroupKFold, not from train_test_split.
Use a disjoint geographic holdout for final evaluation only. Reserve a geographic region from the study area before tuning begins. Report final metrics on this region, not on any fold used during search.
Store block definitions as versioned artifacts. The grid or administrative boundaries used to generate blocks must be stored alongside model weights. Reproducing a spatial CV split requires the exact same boundaries.

Production-Ready Code

The function below implements a complete spatially blocked Optuna search. It expects X as a pd.DataFrame, y as a pd.Series, and groups as a 1-D NumPy array of integer block IDs — one entry per training sample, matching the row order of X.

import logging
from typing import Optional

import numpy as np
import optuna
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold

optuna.logging.set_verbosity(optuna.logging.WARNING)
logger = logging.getLogger(__name__)


def _spatial_objective(
    trial: optuna.Trial,
    X: pd.DataFrame,
    y: pd.Series,
    groups: np.ndarray,
    n_splits: int,
) -> float:
    """Optuna objective: spatially blocked cross-validation for XGBoost regression.

    Returns mean MSE across geo-disjoint folds. Lower is better.
    """
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 6),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 300, 1500),
        "subsample": trial.suggest_float("subsample", 0.6, 0.9),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 0.9),
        "min_child_weight": trial.suggest_int("min_child_weight", 3, 10),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-3, 1.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-3, 1.0, log=True),
        "gamma": trial.suggest_float("gamma", 0.0, 5.0),
        "objective": "reg:squarederror",
        "tree_method": "hist",
        "verbosity": 0,
    }

    gkf = GroupKFold(n_splits=n_splits)
    fold_scores: list[float] = []

    for train_idx, val_idx in gkf.split(X, y, groups):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        model = xgb.XGBRegressor(**params)
        model.fit(X_train, y_train, verbose=False)
        preds = model.predict(X_val)
        fold_scores.append(mean_squared_error(y_val, preds))

    return float(np.mean(fold_scores))


def tune_xgboost_spatial(
    X: pd.DataFrame,
    y: pd.Series,
    groups: np.ndarray,
    n_trials: int = 60,
    n_splits: int = 5,
    random_state: Optional[int] = 42,
) -> dict:
    """Run spatially blocked Optuna search for XGBoost on geodata.

    Args:
        X: Feature DataFrame (no geometry column).
        y: Target Series, same index as X.
        groups: Integer spatial block IDs per row. Same length as X.
        n_trials: Number of Optuna trials. 50–100 is sufficient for this
                  search space given the bounded ranges.
        n_splits: Number of GroupKFold folds. Must not exceed the number
                  of unique block IDs in groups.
        random_state: Seed for reproducibility.

    Returns:
        Dictionary of best hyperparameters.

    Raises:
        ValueError: If groups contains fewer unique IDs than n_splits.
    """
    if len(X) != len(y) or len(X) != len(groups):
        raise ValueError("X, y, and groups must have the same length.")

    n_unique_blocks = len(np.unique(groups))
    if n_unique_blocks < n_splits:
        raise ValueError(
            f"GroupKFold requires at least {n_splits} unique spatial blocks; "
            f"got {n_unique_blocks}. Reduce n_splits or generate finer blocks."
        )

    sampler = optuna.samplers.TPESampler(seed=random_state)
    study = optuna.create_study(direction="minimize", sampler=sampler)

    study.optimize(
        lambda trial: _spatial_objective(trial, X, y, groups, n_splits),
        n_trials=n_trials,
        show_progress_bar=False,
    )

    best = study.best_params
    logger.info(
        "Optuna search complete. Best MSE=%.4f with params=%s",
        study.best_value,
        best,
    )
    return best

Step-by-Step Walkthrough

Step 1: Generate spatial block IDs from your GeoDataFrame

Before running any search, assign block IDs to your training points. The example below creates a 5 km regular grid over the study extent and joins training points to grid cells.

import geopandas as gpd
import numpy as np
from shapely.geometry import box

# Load training points (must already have verified CRS)
gdf = gpd.read_parquet("training_points.parquet")

# Build a 5 km grid over the bounding box (assumes projected CRS in metres)
minx, miny, maxx, maxy = gdf.total_bounds
cell_size = 5_000  # metres
cols = np.arange(minx, maxx, cell_size)
rows = np.arange(miny, maxy, cell_size)

cells = []
for i, x0 in enumerate(cols):
    for j, y0 in enumerate(rows):
        cells.append({"block_id": i * len(rows) + j,
                      "geometry": box(x0, y0, x0 + cell_size, y0 + cell_size)})

grid_gdf = gpd.GeoDataFrame(cells, crs=gdf.crs)

# Spatial join: assign each point to its grid cell
joined = gpd.sjoin(gdf, grid_gdf[["block_id", "geometry"]], how="left", predicate="within")
joined["block_id"] = joined["block_id"].fillna(-1).astype(int)

# Drop any points that fell outside the grid (should be zero if grid covers extent)
assert (joined["block_id"] == -1).sum() == 0, "Some points have no block assignment"

Applying CRS alignment and projection handling before this step is mandatory — grid distances in degrees produce non-uniform cell sizes and meaningless geographic groupings.

Step 2: Assemble the feature matrix and groups array

feature_cols = [c for c in joined.columns
                if c not in ("geometry", "target", "block_id")]

X = joined[feature_cols].reset_index(drop=True)
y = joined["target"].reset_index(drop=True)
groups = joined["block_id"].values

# Verify shapes
assert len(X) == len(y) == len(groups)
print(f"Training samples: {len(X):,}")
print(f"Unique spatial blocks: {len(np.unique(groups))}")

If your features include spatial lag and neighborhood statistics — for example, local Moran’s I features or kernel density estimates — include them in feature_cols. These features encode explicit neighborhood structure, which means the constraint on max_depth becomes more important, not less.

Step 3: Run the spatially blocked search

best_params = tune_xgboost_spatial(
    X=X,
    y=y,
    groups=groups,
    n_trials=60,
    n_splits=5,
    random_state=42,
)
print("Best params:", best_params)

A 60-trial search with 5-fold GroupKFold runs in approximately 10–25 minutes on a 16-core CPU for a dataset of 100,000 training points, depending on n_estimators range and feature count.

Step 4: Retrain on full training set with best parameters

final_model = xgb.XGBRegressor(
    **best_params,
    objective="reg:squarederror",
    tree_method="hist",
    verbosity=0,
)
final_model.fit(X, y)

Do not retrain on the disjoint holdout region. The holdout exists only for final evaluation.

Step 5: Evaluate on a geographically disjoint holdout

# Load holdout — a region excluded from training and search entirely
holdout = gpd.read_parquet("holdout_region.parquet")
X_holdout = holdout[feature_cols].reset_index(drop=True)
y_holdout = holdout["target"].reset_index(drop=True)

preds_holdout = final_model.predict(X_holdout)
mse_holdout = mean_squared_error(y_holdout, preds_holdout)
rmse_holdout = mse_holdout ** 0.5

print(f"Holdout RMSE: {rmse_holdout:.4f}")

If holdout RMSE is substantially higher than the best cross-validated MSE from Optuna, the model is still overfitting to spatial structure. Review block granularity — cells may be too large, allowing within-cell autocorrelation to persist across folds.

Verification

After tuning, confirm that the spatial blocking actually enforced geographic separation:

from sklearn.model_selection import GroupKFold
import numpy as np

gkf = GroupKFold(n_splits=5)
for fold_idx, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups)):
    train_blocks = set(groups[train_idx])
    val_blocks = set(groups[val_idx])
    overlap = train_blocks & val_blocks
    assert len(overlap) == 0, (
        f"Fold {fold_idx}: blocks {overlap} appear in both train and validation. "
        "Check block assignment — points may be duplicated across blocks."
    )
    print(f"Fold {fold_idx}: train_blocks={len(train_blocks)}, val_blocks={len(val_blocks)}, overlap=0 ✓")

Expected output confirms zero block overlap across all five folds. Any overlap indicates a data-joining error in block assignment.

As a secondary check, inspect feature importance from the final model. If a small set of spectral bands or proximity features dominates importance, cross-reference with feature scaling for geospatial inputs — unscaled raster bands with large absolute value ranges can distort XGBoost’s split-gain calculations even though gradient boosting is theoretically scale-invariant.

FAQ

Why does standard random search overestimate XGBoost performance on raster data?

Standard random search uses random K-fold splits, which place spatially adjacent training and validation pixels in the same fold. Because nearby pixels share highly correlated spectral, elevation, or land-cover features, the model effectively “peeks” at validation data through its nearest training neighbours. Replacing random K-fold with spatial GroupKFold or block cross-validation forces evaluation on geographically disjoint regions, producing realistic performance estimates.

How do I generate a `spatial_block` column for `GroupKFold`?

Create a regular grid over your study extent using geopandas with a shapely grid generator, then assign each training point to a grid cell via geopandas.sjoin. The resulting integer cell ID becomes the group label for GroupKFold. Alternatively, use administrative boundaries (watershed, county, ecoregion) if they provide sufficient geographic stratification for your domain.

Should I tune `n_estimators` with Optuna or use early stopping?

Use early stopping inside the Optuna objective function rather than treating n_estimators as a search parameter. Pass eval_set to XGBRegressor.fit() and set early_stopping_rounds=30 or 50. Optuna then searches learning_rate, max_depth, and regularization parameters, while n_estimators is determined automatically per trial. This significantly reduces the search space and prevents wasting trials on poorly converged models.

Gradient Boosting for Raster Data — full production workflow covering raster ingestion, spatial feature extraction, chunked inference, and drift monitoring
Spatial Cross-Validation Strategies — block CV, leave-one-region-out, and buffered split patterns
Reducing Spatial Leakage in Model Training — diagnosing and eliminating leakage at the data preparation stage
Implementing SpatialKFold in Python — library-level implementation details for spatial fold generation

Part of: Gradient Boosting for Raster Data · Training Geospatial Predictive Models in Python