Dimensionality Reduction for Spatial Data

Reduce geospatial feature dimensions with PCA, UMAP, and autoencoders: compress raster bands, preserve spatial structure, and improve model generalization.

High-dimensional geospatial datasets routinely overwhelm predictive modeling pipelines. Multispectral satellite imagery, LiDAR point clouds, and enriched vector layers often contain dozens to hundreds of correlated bands or attributes. Without compression these feature spaces introduce severe computational bottlenecks, amplify overfitting, and degrade model generalization across geographic extents. Dimensionality reduction for spatial data addresses these challenges by projecting correlated spatial features into a compact, information-dense latent space while preserving the topological and spectral relationships required for downstream inference.

This topic sits within the broader Training Geospatial Predictive Models in Python workflow, where it acts as a critical preprocessing gate between raw feature extraction and model training. Done correctly, spatial compression standardizes heterogeneous inputs from sources such as raster band math and index calculation and feature scaling for geospatial inputs, accelerates gradient-based optimization, and enables robust deployment of raster and vector models at scale.

Why Geospatial Feature Spaces Require Specialized Compression

Traditional machine learning assumes independent and identically distributed observations, but spatial data inherently violates this assumption through spatial autocorrelation and scale-dependent heterogeneity. Multispectral sensors capture overlapping electromagnetic responses—red-edge and near-infrared bands are particularly collinear—while terrain derivatives such as slope, aspect, and curvature share mathematical dependencies. Applying naive feature selection or unmodified linear transforms often strips away subtle geographic gradients or artificially groups spatially proximate observations.

Effective spatial compression must balance three competing objectives: variance retention (capturing dominant spectral or structural signals), topological preservation (maintaining neighborhood relationships across coordinate space), and computational tractability (enabling out-of-core or distributed processing for continental-scale rasters). Ignoring these constraints produces models that look good on training folds but fail when deployed to unseen watersheds, administrative boundaries, or temporal windows.

Spatial Dimensionality Reduction Pipeline Flow diagram showing raw spatial features (raster bands, vector attributes) passing through standardization, then branching to PCA for linear compression or UMAP for non-linear topology preservation, converging into a compact latent space that feeds downstream model training. Raw Spatial Features Raster bands Terrain indices Vector attributes Standardize StandardScaler Mask + impute PCA Linear, invertible 95% var threshold UMAP Non-linear topology OOS transform Compact Latent Space → Model training

Prerequisites and Environment Setup

Before implementing spatial compression pipelines, confirm the following baseline requirements are in place:

  • Python 3.10+ with isolated virtual environments (conda or venv)
  • Core geospatial libraries: rasterio>=1.3, geopandas>=0.14, xarray>=2023.6, shapely>=2.0, pyproj>=3.6
  • ML stack: scikit-learn>=1.4, umap-learn>=0.5, joblib>=1.3, mlflow>=2.12
  • Optional: dask[array]>=2024.1 for out-of-core chunked transforms
  • System dependencies: GDAL 3.8+, PROJ 9+
pip install "rasterio>=1.3" "geopandas>=0.14" "xarray>=2023.6" \
    "scikit-learn>=1.4" "umap-learn>=0.5" "joblib>=1.3" \
    "mlflow>=2.12" "dask[array]>=2024.1"

Pin all versions in requirements.txt or environment.yml to guarantee reproducible transformer behavior across staging and production environments.

Step-by-Step Implementation

Step 1: Data Ingestion and Spatial Alignment

Load raster bands and vector attributes into a unified spatial grid using xarray or geopandas. Every layer must share identical CRS, resolution, and spatial extent before any mathematical operations occur. Misaligned grids introduce silent artifacts that distort covariance matrices. Apply CRS alignment and projection handling rigorously at this stage; a single mismatched band corrupts the entire factorization.

import rasterio
import numpy as np
from rasterio.warp import reproject, Resampling

def load_aligned_bands(band_paths: list[str], reference_path: str) -> np.ndarray:
    """
    Load and align multiple raster bands to a reference grid.
    Returns array of shape (n_pixels, n_bands) with nodata masked to NaN.
    """
    with rasterio.open(reference_path) as ref:
        ref_crs = ref.crs
        ref_transform = ref.transform
        ref_shape = (ref.height, ref.width)

    bands = []
    for path in band_paths:
        with rasterio.open(path) as src:
            data = np.empty(ref_shape, dtype=np.float32)
            reproject(
                source=rasterio.band(src, 1),
                destination=data,
                dst_transform=ref_transform,
                dst_crs=ref_crs,
                resampling=Resampling.bilinear,
            )
            nodata = src.nodata
        if nodata is not None:
            data[data == nodata] = np.nan
        bands.append(data.ravel())

    X = np.stack(bands, axis=1)   # shape: (n_pixels, n_bands)
    assert X.shape[1] == len(band_paths), "Band count mismatch after alignment"
    return X

Step 2: Masking, Null Handling, and Standardization

Apply valid-data masks (cloud-free pixels, water bodies, urban exclusion zones) early. Missing values must be imputed or excluded before scaling; NaN propagation silently corrupts matrix factorization. Spatial features frequently exhibit heterogeneous variance across bands, so apply z-score normalization per feature using StandardScaler. Unstandardized inputs disproportionately weight high-variance bands such as thermal infrared while suppressing subtle signals like narrow-band vegetation indices computed through raster band math.

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np

def prepare_feature_matrix(X_raw: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """
    Remove all-NaN rows, impute remaining NaNs with column median,
    and z-score each feature. Returns (X_scaled, valid_mask).
    """
    valid_mask = ~np.all(np.isnan(X_raw), axis=1)
    X_valid = X_raw[valid_mask]

    imputer = SimpleImputer(strategy="median")
    X_imputed = imputer.fit_transform(X_valid)

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_imputed)

    assert not np.isnan(X_scaled).any(), "NaN survived scaling — check imputation"
    return X_scaled, valid_mask

Step 3: Fit the Reduction Transform

Fit a linear or non-linear transformer to the standardized matrix. PCA remains the baseline for spectral compression due to its deterministic output and invertible transformation matrix. For non-linear spatial relationships, UMAP consistently outperforms t-SNE in geospatial contexts by preserving both local neighborhoods and global structure while supporting out-of-sample transforms needed for inference. When feeding reduced features into graph neural networks for spatial data, UMAP’s topological guarantees are particularly valuable.

from sklearn.decomposition import PCA
from umap import UMAP
import joblib

def fit_spatial_reducer(
    X_scaled: np.ndarray,
    method: str = "pca",
    n_components: int | float = 0.95,
    random_state: int = 42,
) -> object:
    """
    Fit a spatial dimensionality reducer. method must be 'pca' or 'umap'.
    For PCA, n_components can be a float (variance fraction) or int.
    """
    if method == "pca":
        reducer = PCA(n_components=n_components, svd_solver="full", random_state=random_state)
    elif method == "umap":
        n_comp = n_components if isinstance(n_components, int) else 3
        reducer = UMAP(
            n_components=n_comp,
            n_neighbors=15,
            min_dist=0.1,
            metric="euclidean",
            random_state=random_state,
        )
    else:
        raise ValueError(f"Unknown method: {method!r}. Choose 'pca' or 'umap'.")

    reducer.fit(X_scaled)

    if method == "pca":
        retained = float(np.sum(reducer.explained_variance_ratio_))
        assert retained >= 0.90, f"Only {retained:.1%} variance retained — consider more components"

    return reducer

# Example usage
pca = fit_spatial_reducer(X_scaled, method="pca", n_components=0.95)
X_pca = pca.transform(X_scaled)

umap_reducer = fit_spatial_reducer(X_scaled, method="umap", n_components=5)
X_umap = umap_reducer.transform(X_scaled)

# Persist fitted transformers
joblib.dump({"pca": pca, "umap": umap_reducer}, "spatial_reducers.joblib")

Step 4: Spatial Validation and Leakage Prevention

Verify that reduced components retain meaningful geographic gradients. Reconstruct the spatial grid from reduced components and visually confirm that known land-cover boundaries align with component discontinuities. Then evaluate against ground-truth labels using spatially blocked cross-validation rather than random k-fold splits. Without spatial cross-validation strategies, proximate training and validation samples share autocorrelated residuals that inflate performance metrics. Additionally, handling spatial autocorrelation in residual analysis reveals whether the reduced feature space has removed or merely displaced autocorrelation structure.

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import numpy as np

def validate_reducer(
    X_reduced: np.ndarray,
    y: np.ndarray,
    spatial_cv,         # e.g. SpatialKFold from spatial_cv_strategies
    n_jobs: int = -1,
) -> dict:
    """
    Evaluate reduced features using spatially blocked CV.
    Returns mean and std of accuracy across folds.
    """
    clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=n_jobs)
    scores = cross_val_score(clf, X_reduced, y, cv=spatial_cv, scoring="accuracy")
    result = {"mean_accuracy": float(scores.mean()), "std_accuracy": float(scores.std())}
    print(f"Spatial CV accuracy: {result['mean_accuracy']:.3f} ± {result['std_accuracy']:.3f}")
    return result

Step 5: Serialization and MLOps Deployment

Bundle the scaler, reducer, and spatial metadata into a single versioned artifact for reproducible inference. Register the pipeline in MLflow with spatial metadata (CRS, extent, band names, scaling parameters) and attach drift monitoring thresholds to component variance ratios and neighborhood density metrics.

import mlflow
import mlflow.sklearn
import joblib
from pathlib import Path

def register_spatial_pipeline(
    scaler,
    reducer,
    metadata: dict,
    experiment_name: str = "spatial_dim_reduction",
) -> str:
    """
    Log scaler, reducer, and spatial metadata to MLflow.
    Returns the run ID for downstream lineage tracking.
    """
    mlflow.set_experiment(experiment_name)

    with mlflow.start_run() as run:
        # Log spatial metadata as params
        for key, val in metadata.items():
            mlflow.log_param(key, str(val))

        if hasattr(reducer, "explained_variance_ratio_"):
            mlflow.log_metric(
                "cumulative_variance_retained",
                float(reducer.explained_variance_ratio_.sum()),
            )

        # Serialize artifact bundle
        artifact_path = Path("pipeline_artifact.joblib")
        joblib.dump({"scaler": scaler, "reducer": reducer, "metadata": metadata}, artifact_path)
        mlflow.log_artifact(str(artifact_path))

    print(f"Registered pipeline run: {run.info.run_id}")
    return run.info.run_id

Verification and Testing

Confirm the implementation is correct at three levels:

Numerical assertion. Check that variance retention meets the domain threshold and that no NaN values survive the pipeline.

import numpy as np

def assert_pipeline_health(X_reduced: np.ndarray, pca, variance_threshold: float = 0.90) -> None:
    assert not np.isnan(X_reduced).any(), "Reduced matrix contains NaN"
    assert not np.isinf(X_reduced).any(), "Reduced matrix contains Inf"
    if hasattr(pca, "explained_variance_ratio_"):
        retained = pca.explained_variance_ratio_.sum()
        assert retained >= variance_threshold, (
            f"Variance retained {retained:.2%} below threshold {variance_threshold:.0%}"
        )
    print("Pipeline health check passed.")

Spatial sanity check. Reshape the first PCA component back to the raster grid and plot it. Clean spectral gradients that align with known land-cover boundaries indicate the transform is behaving correctly. Random noise or uniform gray output signals a corrupted covariance matrix — usually from un-standardized bands or unmasked nodata pixels.

Unit test pattern. Use a deterministic synthetic raster (e.g., a linearly graded 50×50 array across 8 bands) to verify that PCA recovers the known dominant gradient in the first component and that a round-trip (transform then inverse_transform) restores pixel values within floating-point tolerance.

Troubleshooting and Common Errors

LinAlgError: SVD did not converge

Root cause: The feature matrix contains all-zero or near-constant columns (e.g., a cloud-masked band where every pixel is nodata). Fix: Drop constant columns with sklearn.feature_selection.VarianceThreshold(threshold=1e-6) before PCA, and enforce nodata masking before feeding the matrix.

ValueError: Input contains NaN from PCA or StandardScaler

Root cause: Nulls survived the imputation step, usually because an entire row is NaN (an all-cloud pixel). Fix: Apply the valid_mask from prepare_feature_matrix to exclude all-NaN rows before fitting, as shown in Step 2.

UMAP MemoryError on large rasters

Root cause: UMAP constructs an approximate k-nearest-neighbor graph over the entire dataset; at 10M+ pixels this exhausts RAM. Fix: Fit UMAP on a stratified spatial sample (100k–500k pixels), then apply umap_reducer.transform() block-by-block to the full raster.

Silent spatial leakage in CV scores

Root cause: Random k-fold splits place spatially adjacent pixels in both training and validation sets, inflating accuracy by 15–40% in typical raster classification tasks. Fix: Replace KFold with SpatialKFold or block-based folds that enforce a minimum buffer distance between training and validation regions.

Component sign flips between runs

Root cause: PCA eigenvectors have arbitrary sign; re-fitting on a different random sample can invert a component. Fix: Use sklearn.utils.extmath.svd_flip or deterministically orient components by their maximum absolute coefficient. Never compare component values across runs without sign alignment.

UMAP: transform requested but model was not fit with transform_mode=‘graph’``

Root cause: Calling transform() on a UMAP fitted with transform_mode='embedding' raises this error in some umap-learn versions. Fix: Explicitly set transform_mode='graph' at fit time when out-of-sample transforms are required.

Performance Optimisation

Incremental PCA for large rasters. sklearn.decomposition.IncrementalPCA fits in mini-batches loaded via rasterio windowed reads, keeping memory usage proportional to batch size rather than the full dataset.

from sklearn.decomposition import IncrementalPCA
import rasterio
import numpy as np

def fit_incremental_pca(
    raster_path: str,
    n_components: int,
    batch_pixels: int = 50_000,
) -> IncrementalPCA:
    ipca = IncrementalPCA(n_components=n_components)
    with rasterio.open(raster_path) as src:
        n_bands = src.count
        for _, window in src.block_windows(1):
            batch = src.read(window=window)           # shape: (bands, h, w)
            X_batch = batch.reshape(n_bands, -1).T    # shape: (pixels, bands)
            valid = ~np.isnan(X_batch).any(axis=1)
            if valid.sum() >= n_components:
                ipca.partial_fit(X_batch[valid])
    return ipca

Stratified spatial sampling for UMAP. Divide the raster into a regular grid and sample a fixed number of pixels per cell to ensure geographic representativeness. This prevents UMAP from over-representing high-density urban or agricultural areas in the embedding.

Algorithm selection by workload. The following table summarizes trade-offs for common geospatial scenarios:

Method Best use case Spatial note Approximate cost
PCA Multispectral compression, gradient boosting inputs Preserves global variance; sensitive to outliers Low
IncrementalPCA Continental-scale rasters, limited RAM Same as PCA; processes block-by-block Low
UMAP Neighborhood preservation, graph network inputs Maintains local topology; OOS transform supported Medium
t-SNE Exploratory visualization only No OOS transform; distorts global scale High
Autoencoder Multi-modal fusion, custom latent priors Requires spatial regularization; GPU recommended High

FAQ

How many PCA components should I retain for multispectral imagery?

Retain enough components to explain 95–99% of cumulative variance, then validate with a scree plot. For Sentinel-2 (12 bands), this typically yields 4–6 components. Always confirm that spatially meaningful gradients survive by plotting each component back onto the raster grid; if component 5 shows only noise, the 95% threshold may be set too high for your specific scene.

Does UMAP support out-of-sample transforms for inference?

Yes. Calling umap_reducer.transform(X_new) after fitting applies the learned embedding to unseen data. However, large distributional shifts between training and inference geographies can degrade embedding quality; monitor component density at inference time and consider retraining the reducer when land-cover composition changes significantly.

Can I apply dimensionality reduction on chunked rasters without loading them into RAM?

Yes. Fit the scaler and IncrementalPCA on a stratified spatial sample using rasterio windowed reads, then apply the fitted transforms block-by-block. dask.array integrates cleanly with this pattern for continental-scale workloads where even the sample does not fit into memory.

Will standard k-fold cross-validation give reliable results after dimensionality reduction?

No. Spatially proximate samples share autocorrelated residuals, inflating accuracy estimates by 15–40% in typical land-cover classification tasks. Replace KFold with spatially blocked folds so that training and validation regions are geographically separated. See spatial cross-validation strategies for implementation details.


Part of: Training Geospatial Predictive Models in Python