Feature Scaling for Geospatial Inputs

Scale geospatial features for ML: normalize raster bands, handle mixed spatial units, apply heterogeneous ColumnTransformer pipelines to GeoDataFrames, and prevent inference drift.

Feature scaling is a foundational step in every machine learning pipeline, but geospatial inputs introduce constraints that standard tabular workflows rarely encounter. Raster scenes deliver spectral reflectance values, elevation integers, and thermal readings in the same dataset. Vector layers add proximity distances measured in kilometres, land-cover codes, and population densities spanning orders of magnitude. Without a deliberate, spatially aware normalization strategy, gradient-based models converge slowly, distance-based algorithms are dominated by high-magnitude features, and production inference pipelines accumulate silent distributional drift as seasons change or sensors degrade.

This page is part of Spatial Feature Engineering for Machine Learning. It covers the complete production workflow for scaling heterogeneous geospatial features: CRS validation, spatial missing-value handling, scaler selection by feature type, ColumnTransformer pipeline construction, serialization, and spatial drift monitoring.


Geospatial Feature Scaling Pipeline Five sequential stages: raw spatial inputs, CRS validation and extraction, missing-value handling, heterogeneous scaling via ColumnTransformer, and serialized preprocessor output. Raw Spatial Inputs raster · vector · indices CRS Validate & Extract project → 2-D matrix Mask & Impute sentinel → NaN SimpleImputer ColumnTransformer Standard · MinMax Robust · Quantile Serialized Preprocessor joblib · versioned

Prerequisites and Environment Setup

pip install "scikit-learn>=1.3" "geopandas>=0.14" "rasterio>=1.3" \
            "xarray>=2023.12" "joblib>=1.3" "numpy>=1.24" "pyproj>=3.6"

System dependencies (GDAL and PROJ must be present before the Python packages):

# Ubuntu / Debian
sudo apt-get install -y gdal-bin libgdal-dev libproj-dev

# macOS (Homebrew)
brew install gdal proj

Baseline data requirements before scaling begins:

  • CRS consistency: All inputs must share a projected CRS with consistent linear units — e.g., EPSG:32633 (UTM Zone 33N). CRS Alignment and Projection Handling covers the full reprojection workflow; mismatched CRS is the single most common root cause of subtly wrong scaled features.
  • Completed feature engineering: Scaling should follow spatial transformations. Run Raster Band Math and Index Calculation before normalizing band ratios, and generate proximity buffers via Vector Proximity and Buffer Generation before scaling distance columns.
  • Defined imputation policy: Rasters contain cloud masks, sensor dropouts, and edge padding. Vector attributes may have None entries. Decide on median imputation, spatial interpolation, or row-dropping before touching a scaler.

Step-by-Step Implementation

Step 1 — Ingest and Validate CRS

Geospatial scaling fails silently when spatial units are inconsistent. A 10-metre pixel value and a 10-degree lat/lon arc differ by orders of magnitude in real-world extent; scaling them on the same axis corrupts every distance-sensitive model. Always verify CRS before extracting arrays.

import rasterio
import geopandas as gpd

# Validate raster CRS
with rasterio.open("multispectral.tif") as src:
    crs = src.crs
    assert crs.is_projected, (
        f"Raster CRS {crs} is geographic (degrees). "
        "Reproject to a linear CRS before scaling."
    )
    print(f"Raster CRS: {crs}, linear units: {crs.linear_units}")

# Validate vector CRS and reproject if needed
gdf = gpd.read_file("parcels.gpkg")
TARGET_CRS = "EPSG:32633"
if gdf.crs.to_epsg() != 32633:
    gdf = gdf.to_crs(TARGET_CRS)
    print(f"Reprojected vector layer to {TARGET_CRS}")

assert gdf.crs.to_epsg() == 32633, "Vector CRS mismatch after reprojection."

If mismatches exist, use rasterio.warp.reproject for raster data and gdf.to_crs() for vector data. Never attempt to scale features whose underlying CRS differs — the fix belongs upstream, not in the scaler.

Step 2 — Extract Feature Arrays

Convert spatial objects to numerical matrices. Stack raster bands into a 2-D numpy.ndarray with shape (n_pixels, n_bands), and extract vector attributes into a pandas.DataFrame. For regional-scale datasets that cannot fit in RAM, see Optimizing Memory Usage for Large Vector Datasets for chunk-based extraction strategies that avoid MemoryError.

import numpy as np
import rasterio
import pandas as pd
import geopandas as gpd

# --- Raster extraction ---
with rasterio.open("multispectral.tif") as src:
    raster_arr = src.read()           # shape: (bands, height, width)
    nodata = src.nodata               # e.g. -9999.0 or None

# Move band axis to last position, then flatten to 2-D
raster_arr = np.moveaxis(raster_arr, 0, -1)   # (height, width, bands)
n_bands = raster_arr.shape[-1]
feature_matrix = raster_arr.reshape(-1, n_bands)  # (n_pixels, bands)

# Replace nodata sentinel with NaN before any further processing
if nodata is not None:
    feature_matrix = np.where(feature_matrix == nodata, np.nan, feature_matrix)

# --- Vector extraction ---
gdf = gpd.read_file("parcels.gpkg")
vector_df = gdf[["dist_to_road_km", "dist_to_water_km", "ndvi", "evi",
                  "land_cover_code"]].copy()

print(f"Raster matrix shape: {feature_matrix.shape}")
print(f"Vector DataFrame shape: {vector_df.shape}")

Step 3 — Handle Spatial Missing Values

Scikit-learn scalers raise ValueError when NaN values are present. Raster tiles routinely contain cloud cover, sensor dropouts, and edge padding, while vector datasets may carry incomplete attribute records from spatial joins. Apply masking and imputation before any scaler is fitted.

from sklearn.impute import SimpleImputer

# --- Raster: assert no unexpected NaN proportion ---
nan_fraction = np.isnan(feature_matrix).mean()
print(f"NaN fraction in raster matrix: {nan_fraction:.3%}")
if nan_fraction > 0.30:
    raise ValueError(
        f"NaN fraction {nan_fraction:.1%} exceeds 30% threshold. "
        "Check cloud mask or nodata sentinel configuration."
    )

# For moderate NaN: use numpy masked arrays for spatial operations,
# then impute with band medians before passing to sklearn
imputer_raster = SimpleImputer(strategy="median")
feature_matrix_clean = imputer_raster.fit_transform(feature_matrix)

# Confirm no NaN remains
assert not np.isnan(feature_matrix_clean).any(), \
    "NaN values remain after imputation — check imputer configuration."

# --- Vector: impute numeric columns; leave categoricals separate ---
numeric_cols = ["dist_to_road_km", "dist_to_water_km", "ndvi", "evi"]
imputer_vector = SimpleImputer(strategy="median")
vector_df[numeric_cols] = imputer_vector.fit_transform(vector_df[numeric_cols])

print("Missing value handling complete — all numeric columns NaN-free.")

Key principles:

  • Replace sentinel values (-9999, 32767, etc.) with np.nan immediately on ingestion — never let sentinels propagate into a scaler’s statistics.
  • Use numpy.ma.masked_invalid() for spatial operations that must respect the mask (e.g., per-band statistics before imputation).
  • Forward-fill or spatial interpolation is appropriate for time-series raster stacks; median imputation suits cross-sectional feature matrices.
  • Row-dropping is acceptable only when missingness is confirmed random and affects fewer than 5% of samples.

Step 4 — Select Scaler by Feature Type

Geospatial features follow distinct statistical distributions. Applying the same scaler to every column produces systematically biased inputs.

Feature type Typical distribution Recommended scaler
Spectral bands (raw DN or reflectance) Near-normal per band StandardScaler
Elevation / DEM Often right-skewed StandardScaler or QuantileTransformer
Normalized indices (NDVI, EVI, NDWI) Bounded [-1, 1] or [0, 1] MinMaxScaler
Proximity distances (road, water, urban edge) Heavy right-skewed, outliers RobustScaler
Density features (population, point counts) Highly skewed, zero-inflated QuantileTransformer
Categorical codes (land cover, soil type) Nominal OneHotEncoder — never scale

StandardScaler centres features to zero mean and unit variance, making it the correct default for spectral data whose physical values are roughly Gaussian within a homogeneous land-cover class. MinMaxScaler preserves the semantic bounds of normalized indices computed via Raster Band Math and Index Calculation. RobustScaler uses median and interquartile range, so it is resistant to the extreme values found in urban heat-island pixels, flood inundation extents, or anomalous soil moisture readings. QuantileTransformer maps arbitrary distributions to uniform or Gaussian output, which is especially useful for the skewed distance features generated by Vector Proximity and Buffer Generation. Spatial lag features from Spatial Lag and Neighborhood Statistics often benefit from RobustScaler because local spatial averages inherit the outlier structure of their source variable.

Step 5 — Build a Heterogeneous ColumnTransformer Pipeline

Never scale the entire feature matrix with a single transformer. Geospatial pipelines require different scalers applied to specific column subsets, wrapped in a Pipeline that guarantees identical preprocessing at training and inference time.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    QuantileTransformer, OneHotEncoder
)
from sklearn.impute import SimpleImputer

# Define feature groups by column name
spectral_bands    = ["band_1", "band_2", "band_3", "band_4"]
spectral_indices  = ["ndvi", "evi"]
proximity_cols    = ["dist_to_road_km", "dist_to_water_km"]
density_cols      = ["population_density", "building_count_500m"]
categorical_cols  = ["land_cover_code"]

preprocessor = ColumnTransformer(
    transformers=[
        ("spectral", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale",  StandardScaler()),
        ]), spectral_bands),

        ("indices", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale",  MinMaxScaler(feature_range=(0, 1))),
        ]), spectral_indices),

        ("proximity", Pipeline([
            ("impute", SimpleImputer(strategy="constant", fill_value=0)),
            ("scale",  RobustScaler()),
        ]), proximity_cols),

        ("density", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale",  QuantileTransformer(output_distribution="normal",
                                           random_state=42)),
        ]), density_cols),

        ("categorical", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
        ]), categorical_cols),
    ],
    remainder="drop",
    verbose_feature_names_out=True,
)

# Fit on training split only
X_scaled = preprocessor.fit_transform(X_train_df)

# Confirm shape and dtype
print(f"Scaled output shape: {X_scaled.shape}")
assert X_scaled.dtype == float, "Output matrix should be float64."

Wrap preprocessor in a full Pipeline alongside the estimator:

from sklearn.ensemble import GradientBoostingClassifier

full_pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model",      GradientBoostingClassifier(n_estimators=200, random_state=42)),
])

full_pipeline.fit(X_train_df, y_train)

Verification and Testing

After fitting, validate that scaled feature statistics match expectations before promoting to production.

import numpy as np

# Extract the fitted StandardScaler for spectral bands
spectral_scaler = preprocessor.named_transformers_["spectral"]["scale"]

# Mean should be ~0, std ~1 for training data
scaled_spectral = spectral_scaler.transform(
    X_train_df[spectral_bands].fillna(X_train_df[spectral_bands].median())
)
assert np.allclose(scaled_spectral.mean(axis=0), 0, atol=1e-6), \
    "StandardScaler means are not zero — check for data leakage."
assert np.allclose(scaled_spectral.std(axis=0),  1, atol=1e-6), \
    "StandardScaler std devs are not one — check for zero-variance columns."

# NDVI scaled output must remain in [0, 1]
index_scaler = preprocessor.named_transformers_["indices"]["scale"]
scaled_indices = index_scaler.transform(
    X_train_df[spectral_indices].fillna(X_train_df[spectral_indices].median())
)
assert scaled_indices.min() >= 0 and scaled_indices.max() <= 1, \
    "MinMaxScaler output outside [0, 1] — check for out-of-range NDVI values."

print("Scaling verification passed.")

Unit test pattern using pytest:

import pytest
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def test_standard_scaler_zero_mean():
    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(loc=500, scale=100, size=(1000, 4)),
                     columns=["b1", "b2", "b3", "b4"])
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    assert np.allclose(X_scaled.mean(axis=0), 0, atol=1e-6)
    assert np.allclose(X_scaled.std(axis=0),  1, atol=1e-6)

Troubleshooting and Common Errors

ValueError: Input X contains NaN

Root cause: A NaN value reached a scaler that does not handle missing data. Scalers operate after imputation in a Pipeline; if imputation was bypassed or a new NaN source was introduced (e.g., a log transform of zero-valued pixels), the scaler raises this error.

Fix: Add SimpleImputer as the first step in every sub-pipeline, and add an assertion assert not np.isnan(X).any() at the pipeline boundary.

ValueError: could not convert string to float

Root cause: A categorical or mixed-type column (e.g., "land_cover_code" containing strings) was routed to a numeric scaler via ColumnTransformer. This happens when remainder="passthrough" is used without explicitly listing all non-numeric columns.

Fix: Enumerate every categorical column explicitly and route them through OneHotEncoder or OrdinalEncoder. Use remainder="drop" to prevent unexpected columns from reaching numeric transformers.

Scaled NDVI values outside [0, 1] at inference

Root cause: The MinMaxScaler was fitted on a training set whose NDVI range did not include all values present at inference time (e.g., the training set lacked dense forest pixels, which appear at inference). The scaler extrapolates beyond its training bounds.

Fix: Fit the scaler on a comprehensive, stratified sample that covers the full expected feature range. Add a clip=True argument to MinMaxScaler (sklearn>=1.1) to hard-clamp inference outputs.

Silent distributional shift — model accuracy degrades over months

Root cause: Seasonal vegetation changes, sensor recalibration, or CRS misalignment in new data deliveries shift feature distributions. The serialized scaler was fitted on summer imagery; winter imagery arrives with markedly different band statistics.

Fix: Monitor Population Stability Index (PSI) per feature at each inference batch. Trigger a scaler refit when PSI exceeds 0.2. Never call fit_transform() during inference — only transform().

sklearn.exceptions.NotFittedError: This ColumnTransformer is not fitted yet

Root cause: The serialized preprocessor artifact was loaded but fit() was never called on it — or the deserialized object is a different class instance than expected (e.g., version mismatch).

Fix: Confirm the artifact was serialized after fit_transform() completes. Validate the deserialized object with sklearn.utils.validation.check_is_fitted(preprocessor) before calling transform().

CRSError: Input is not a CRS during per-tile inference

Root cause: Raster tiles passed at inference have no embedded CRS, or the tile’s CRS changed between dataset versions. CRS metadata is stripped by some cloud storage pipelines.

Fix: Enforce CRS validation at the API boundary before feature extraction. Log and reject tiles with absent or mismatched CRS rather than silently processing them.


Performance Optimisation

Chunked processing for large rasters: Rather than loading an entire scene into memory before reshaping, process tiles in a loop and apply only transform() (never fit_transform()) per tile:

import joblib
import numpy as np
import rasterio
from rasterio.windows import Window

preprocessor = joblib.load("scaler_v1.joblib")
TILE_SIZE = 512

with rasterio.open("large_scene.tif") as src:
    for row_off in range(0, src.height, TILE_SIZE):
        for col_off in range(0, src.width, TILE_SIZE):
            window = Window(col_off, row_off,
                            min(TILE_SIZE, src.width - col_off),
                            min(TILE_SIZE, src.height - row_off))
            tile = src.read(window=window)           # (bands, h, w)
            tile = np.moveaxis(tile, 0, -1)
            h, w, b = tile.shape
            flat = tile.reshape(-1, b).astype(float)
            flat[flat == src.nodata] = np.nan
            scaled_flat = preprocessor.transform(flat)
            # write or accumulate scaled_flat here

Parallel column transformation: ColumnTransformer accepts n_jobs=-1 to fit independent transformers in parallel — useful when fitting on a large training dataset with many feature groups:

preprocessor = ColumnTransformer(
    transformers=[...],
    remainder="drop",
    n_jobs=-1,   # parallel fit across transformer groups
)

Avoid redundant re-fitting: Serialise the preprocessor once with joblib.dump(preprocessor, "scaler_v1.joblib", compress=3) and load it for every downstream inference job. Never re-fit inside a containerised inference service — it resets the scaler’s statistics and silently corrupts predictions.

Memory-mapped arrays for out-of-core datasets: For datasets that exceed available RAM, use numpy.memmap or xarray backed by Zarr to stream feature arrays through the fitted scaler without materializing the full matrix. See Optimizing Memory Usage for Large Vector Datasets for the vector equivalent.


Production Deployment and MLOps Integration

Scaling parameters — means, standard deviations, percentile mappings — must be version-controlled alongside model weights. A model retrained on new data requires a new scaler artifact to match; mismatched scaler-model pairs are a leading cause of hard-to-diagnose accuracy regressions.

  1. Serialize immediately after training: joblib.dump(preprocessor, "scaler_v1.joblib"). Store the artifact in your model registry with the same version tag as the accompanying model file.
  2. Enforce schema validation: Validate incoming inference payloads against the column schema the scaler expects — column count, column names, and dtypes. Reject requests with unexpected column orders or missing bands before they reach the preprocessor.
  3. Containerize the preprocessing step: Package the scaler and schema validation logic in the same inference container as the model. Pin exact dependency versions — even patch-level scikit-learn upgrades can change floating-point rounding in transformers.
  4. Use transform() exclusively at inference: The fit_transform() method must never be called in production. Load the serialized artifact and call .transform() only.
  5. Spatial drift monitoring: Track PSI, mean, and variance per scaled feature at each inference batch. For geospatial data, Moran’s I computed on scaled residuals can reveal CRS drift or sensor recalibration before PSI metrics trigger — a sudden loss of spatial autocorrelation in an otherwise stable variable is a reliable early warning signal.
  6. Shadow deployments for scaler upgrades: When a scaler refit is warranted, route live traffic to a shadow instance running the new scaler. Compare prediction distributions and latency over 24–48 hours before promoting to primary.

FAQ

Why does StandardScaler produce different results when applied to individual raster tiles?

Each tile contains a different mix of land-cover types, so fitting the scaler independently per tile produces different means and standard deviations. Fit the scaler on the full training set (or a representative stratified sample spanning all expected cover types), serialize it, and apply .transform() to every tile at inference. Never fit per-tile.

Can I scale raw latitude and longitude coordinates as model features?

Scaling decimal degrees is mathematically valid but semantically misleading: one degree of longitude covers roughly 111 km at the equator but only 55 km at 60° N latitude. Apply CRS Alignment and Projection Handling first to convert coordinates to a linear CRS (e.g., UTM easting/northing in metres), then scale the projected values. Alternatively, derive distance-to-centroid or spatial lag features as geometry-agnostic proxies.

What PSI threshold should trigger a scaler refit for geospatial data?

A Population Stability Index above 0.2 on any scaled feature indicates a material distribution shift. For seasonal land-cover datasets, expect natural PSI fluctuation up to 0.1 between summer and winter imagery; set an alert at 0.2 and a hard refit trigger at 0.25. Consider computing PSI on the unscaled features as well — a shift in the raw data that the scaler’s fixed parameters can no longer compensate for is more informative than PSI on the scaled output alone.

Should I scale categorical land-cover codes?

No. Ordinal encoding followed by scaling imposes a false numeric ordering on nominal categories. Apply OneHotEncoder (tree-based models tolerate either) and exclude categorical columns from all numeric scalers using ColumnTransformer’s explicit column routing. Use remainder="drop" to prevent categorical columns from leaking into numeric transformers.


Part of: Spatial Feature Engineering for Machine Learning