High-dimensional geospatial datasets routinely overwhelm predictive modeling pipelines. Multispectral satellite imagery, LiDAR point clouds, and enriched vector layers often contain dozens to hundreds of correlated bands or attributes. Without compression these feature spaces introduce severe computational bottlenecks, amplify overfitting, and degrade model generalization across geographic extents. Dimensionality reduction for spatial data addresses these challenges by projecting correlated spatial features into a compact, information-dense latent space while preserving the topological and spectral relationships required for downstream inference.
This topic sits within the broader Training Geospatial Predictive Models in Python workflow, where it acts as a critical preprocessing gate between raw feature extraction and model training. Done correctly, spatial compression standardizes heterogeneous inputs from sources such as raster band math and index calculation and feature scaling for geospatial inputs, accelerates gradient-based optimization, and enables robust deployment of raster and vector models at scale.
Why Geospatial Feature Spaces Require Specialized Compression
Traditional machine learning assumes independent and identically distributed observations, but spatial data inherently violates this assumption through spatial autocorrelation and scale-dependent heterogeneity. Multispectral sensors capture overlapping electromagnetic responses—red-edge and near-infrared bands are particularly collinear—while terrain derivatives such as slope, aspect, and curvature share mathematical dependencies. Applying naive feature selection or unmodified linear transforms often strips away subtle geographic gradients or artificially groups spatially proximate observations.
Effective spatial compression must balance three competing objectives: variance retention (capturing dominant spectral or structural signals), topological preservation (maintaining neighborhood relationships across coordinate space), and computational tractability (enabling out-of-core or distributed processing for continental-scale rasters). Ignoring these constraints produces models that look good on training folds but fail when deployed to unseen watersheds, administrative boundaries, or temporal windows.
Prerequisites and Environment Setup
Before implementing spatial compression pipelines, confirm the following baseline requirements are in place:
- Python 3.10+ with isolated virtual environments (
condaorvenv) - Core geospatial libraries:
rasterio>=1.3,geopandas>=0.14,xarray>=2023.6,shapely>=2.0,pyproj>=3.6 - ML stack:
scikit-learn>=1.4,umap-learn>=0.5,joblib>=1.3,mlflow>=2.12 - Optional:
dask[array]>=2024.1for out-of-core chunked transforms - System dependencies: GDAL 3.8+, PROJ 9+
pip install "rasterio>=1.3" "geopandas>=0.14" "xarray>=2023.6" \
"scikit-learn>=1.4" "umap-learn>=0.5" "joblib>=1.3" \
"mlflow>=2.12" "dask[array]>=2024.1"Pin all versions in requirements.txt or environment.yml to guarantee reproducible transformer behavior across staging and production environments.
Step-by-Step Implementation
Step 1: Data Ingestion and Spatial Alignment
Load raster bands and vector attributes into a unified spatial grid using xarray or geopandas. Every layer must share identical CRS, resolution, and spatial extent before any mathematical operations occur. Misaligned grids introduce silent artifacts that distort covariance matrices. Apply CRS alignment and projection handling rigorously at this stage; a single mismatched band corrupts the entire factorization.
import rasterio
import numpy as np
from rasterio.warp import reproject, Resampling
def load_aligned_bands(band_paths: list[str], reference_path: str) -> np.ndarray:
"""
Load and align multiple raster bands to a reference grid.
Returns array of shape (n_pixels, n_bands) with nodata masked to NaN.
"""
with rasterio.open(reference_path) as ref:
ref_crs = ref.crs
ref_transform = ref.transform
ref_shape = (ref.height, ref.width)
bands = []
for path in band_paths:
with rasterio.open(path) as src:
data = np.empty(ref_shape, dtype=np.float32)
reproject(
source=rasterio.band(src, 1),
destination=data,
dst_transform=ref_transform,
dst_crs=ref_crs,
resampling=Resampling.bilinear,
)
nodata = src.nodata
if nodata is not None:
data[data == nodata] = np.nan
bands.append(data.ravel())
X = np.stack(bands, axis=1) # shape: (n_pixels, n_bands)
assert X.shape[1] == len(band_paths), "Band count mismatch after alignment"
return XStep 2: Masking, Null Handling, and Standardization
Apply valid-data masks (cloud-free pixels, water bodies, urban exclusion zones) early. Missing values must be imputed or excluded before scaling; NaN propagation silently corrupts matrix factorization. Spatial features frequently exhibit heterogeneous variance across bands, so apply z-score normalization per feature using StandardScaler. Unstandardized inputs disproportionately weight high-variance bands such as thermal infrared while suppressing subtle signals like narrow-band vegetation indices computed through raster band math.
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np
def prepare_feature_matrix(X_raw: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
"""
Remove all-NaN rows, impute remaining NaNs with column median,
and z-score each feature. Returns (X_scaled, valid_mask).
"""
valid_mask = ~np.all(np.isnan(X_raw), axis=1)
X_valid = X_raw[valid_mask]
imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X_valid)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
assert not np.isnan(X_scaled).any(), "NaN survived scaling — check imputation"
return X_scaled, valid_maskStep 3: Fit the Reduction Transform
Fit a linear or non-linear transformer to the standardized matrix. PCA remains the baseline for spectral compression due to its deterministic output and invertible transformation matrix. For non-linear spatial relationships, UMAP consistently outperforms t-SNE in geospatial contexts by preserving both local neighborhoods and global structure while supporting out-of-sample transforms needed for inference. When feeding reduced features into graph neural networks for spatial data, UMAP’s topological guarantees are particularly valuable.
from sklearn.decomposition import PCA
from umap import UMAP
import joblib
def fit_spatial_reducer(
X_scaled: np.ndarray,
method: str = "pca",
n_components: int | float = 0.95,
random_state: int = 42,
) -> object:
"""
Fit a spatial dimensionality reducer. method must be 'pca' or 'umap'.
For PCA, n_components can be a float (variance fraction) or int.
"""
if method == "pca":
reducer = PCA(n_components=n_components, svd_solver="full", random_state=random_state)
elif method == "umap":
n_comp = n_components if isinstance(n_components, int) else 3
reducer = UMAP(
n_components=n_comp,
n_neighbors=15,
min_dist=0.1,
metric="euclidean",
random_state=random_state,
)
else:
raise ValueError(f"Unknown method: {method!r}. Choose 'pca' or 'umap'.")
reducer.fit(X_scaled)
if method == "pca":
retained = float(np.sum(reducer.explained_variance_ratio_))
assert retained >= 0.90, f"Only {retained:.1%} variance retained — consider more components"
return reducer
# Example usage
pca = fit_spatial_reducer(X_scaled, method="pca", n_components=0.95)
X_pca = pca.transform(X_scaled)
umap_reducer = fit_spatial_reducer(X_scaled, method="umap", n_components=5)
X_umap = umap_reducer.transform(X_scaled)
# Persist fitted transformers
joblib.dump({"pca": pca, "umap": umap_reducer}, "spatial_reducers.joblib")Step 4: Spatial Validation and Leakage Prevention
Verify that reduced components retain meaningful geographic gradients. Reconstruct the spatial grid from reduced components and visually confirm that known land-cover boundaries align with component discontinuities. Then evaluate against ground-truth labels using spatially blocked cross-validation rather than random k-fold splits. Without spatial cross-validation strategies, proximate training and validation samples share autocorrelated residuals that inflate performance metrics. Additionally, handling spatial autocorrelation in residual analysis reveals whether the reduced feature space has removed or merely displaced autocorrelation structure.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import numpy as np
def validate_reducer(
X_reduced: np.ndarray,
y: np.ndarray,
spatial_cv, # e.g. SpatialKFold from spatial_cv_strategies
n_jobs: int = -1,
) -> dict:
"""
Evaluate reduced features using spatially blocked CV.
Returns mean and std of accuracy across folds.
"""
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=n_jobs)
scores = cross_val_score(clf, X_reduced, y, cv=spatial_cv, scoring="accuracy")
result = {"mean_accuracy": float(scores.mean()), "std_accuracy": float(scores.std())}
print(f"Spatial CV accuracy: {result['mean_accuracy']:.3f} ± {result['std_accuracy']:.3f}")
return resultStep 5: Serialization and MLOps Deployment
Bundle the scaler, reducer, and spatial metadata into a single versioned artifact for reproducible inference. Register the pipeline in MLflow with spatial metadata (CRS, extent, band names, scaling parameters) and attach drift monitoring thresholds to component variance ratios and neighborhood density metrics.
import mlflow
import mlflow.sklearn
import joblib
from pathlib import Path
def register_spatial_pipeline(
scaler,
reducer,
metadata: dict,
experiment_name: str = "spatial_dim_reduction",
) -> str:
"""
Log scaler, reducer, and spatial metadata to MLflow.
Returns the run ID for downstream lineage tracking.
"""
mlflow.set_experiment(experiment_name)
with mlflow.start_run() as run:
# Log spatial metadata as params
for key, val in metadata.items():
mlflow.log_param(key, str(val))
if hasattr(reducer, "explained_variance_ratio_"):
mlflow.log_metric(
"cumulative_variance_retained",
float(reducer.explained_variance_ratio_.sum()),
)
# Serialize artifact bundle
artifact_path = Path("pipeline_artifact.joblib")
joblib.dump({"scaler": scaler, "reducer": reducer, "metadata": metadata}, artifact_path)
mlflow.log_artifact(str(artifact_path))
print(f"Registered pipeline run: {run.info.run_id}")
return run.info.run_idVerification and Testing
Confirm the implementation is correct at three levels:
Numerical assertion. Check that variance retention meets the domain threshold and that no NaN values survive the pipeline.
import numpy as np
def assert_pipeline_health(X_reduced: np.ndarray, pca, variance_threshold: float = 0.90) -> None:
assert not np.isnan(X_reduced).any(), "Reduced matrix contains NaN"
assert not np.isinf(X_reduced).any(), "Reduced matrix contains Inf"
if hasattr(pca, "explained_variance_ratio_"):
retained = pca.explained_variance_ratio_.sum()
assert retained >= variance_threshold, (
f"Variance retained {retained:.2%} below threshold {variance_threshold:.0%}"
)
print("Pipeline health check passed.")Spatial sanity check. Reshape the first PCA component back to the raster grid and plot it. Clean spectral gradients that align with known land-cover boundaries indicate the transform is behaving correctly. Random noise or uniform gray output signals a corrupted covariance matrix — usually from un-standardized bands or unmasked nodata pixels.
Unit test pattern. Use a deterministic synthetic raster (e.g., a linearly graded 50×50 array across 8 bands) to verify that PCA recovers the known dominant gradient in the first component and that a round-trip (transform then inverse_transform) restores pixel values within floating-point tolerance.
Troubleshooting and Common Errors
LinAlgError: SVD did not converge
Root cause: The feature matrix contains all-zero or near-constant columns (e.g., a cloud-masked band where every pixel is nodata). Fix: Drop constant columns with sklearn.feature_selection.VarianceThreshold(threshold=1e-6) before PCA, and enforce nodata masking before feeding the matrix.
ValueError: Input contains NaN from PCA or StandardScaler
Root cause: Nulls survived the imputation step, usually because an entire row is NaN (an all-cloud pixel). Fix: Apply the valid_mask from prepare_feature_matrix to exclude all-NaN rows before fitting, as shown in Step 2.
UMAP MemoryError on large rasters
Root cause: UMAP constructs an approximate k-nearest-neighbor graph over the entire dataset; at 10M+ pixels this exhausts RAM. Fix: Fit UMAP on a stratified spatial sample (100k–500k pixels), then apply umap_reducer.transform() block-by-block to the full raster.
Silent spatial leakage in CV scores
Root cause: Random k-fold splits place spatially adjacent pixels in both training and validation sets, inflating accuracy by 15–40% in typical raster classification tasks. Fix: Replace KFold with SpatialKFold or block-based folds that enforce a minimum buffer distance between training and validation regions.
Component sign flips between runs
Root cause: PCA eigenvectors have arbitrary sign; re-fitting on a different random sample can invert a component. Fix: Use sklearn.utils.extmath.svd_flip or deterministically orient components by their maximum absolute coefficient. Never compare component values across runs without sign alignment.
UMAP: transform requested but model was not fit with transform_mode=‘graph’``
Root cause: Calling transform() on a UMAP fitted with transform_mode='embedding' raises this error in some umap-learn versions. Fix: Explicitly set transform_mode='graph' at fit time when out-of-sample transforms are required.
Performance Optimisation
Incremental PCA for large rasters. sklearn.decomposition.IncrementalPCA fits in mini-batches loaded via rasterio windowed reads, keeping memory usage proportional to batch size rather than the full dataset.
from sklearn.decomposition import IncrementalPCA
import rasterio
import numpy as np
def fit_incremental_pca(
raster_path: str,
n_components: int,
batch_pixels: int = 50_000,
) -> IncrementalPCA:
ipca = IncrementalPCA(n_components=n_components)
with rasterio.open(raster_path) as src:
n_bands = src.count
for _, window in src.block_windows(1):
batch = src.read(window=window) # shape: (bands, h, w)
X_batch = batch.reshape(n_bands, -1).T # shape: (pixels, bands)
valid = ~np.isnan(X_batch).any(axis=1)
if valid.sum() >= n_components:
ipca.partial_fit(X_batch[valid])
return ipcaStratified spatial sampling for UMAP. Divide the raster into a regular grid and sample a fixed number of pixels per cell to ensure geographic representativeness. This prevents UMAP from over-representing high-density urban or agricultural areas in the embedding.
Algorithm selection by workload. The following table summarizes trade-offs for common geospatial scenarios:
| Method | Best use case | Spatial note | Approximate cost |
|---|---|---|---|
| PCA | Multispectral compression, gradient boosting inputs | Preserves global variance; sensitive to outliers | Low |
| IncrementalPCA | Continental-scale rasters, limited RAM | Same as PCA; processes block-by-block | Low |
| UMAP | Neighborhood preservation, graph network inputs | Maintains local topology; OOS transform supported | Medium |
| t-SNE | Exploratory visualization only | No OOS transform; distorts global scale | High |
| Autoencoder | Multi-modal fusion, custom latent priors | Requires spatial regularization; GPU recommended | High |
FAQ
How many PCA components should I retain for multispectral imagery?
Retain enough components to explain 95–99% of cumulative variance, then validate with a scree plot. For Sentinel-2 (12 bands), this typically yields 4–6 components. Always confirm that spatially meaningful gradients survive by plotting each component back onto the raster grid; if component 5 shows only noise, the 95% threshold may be set too high for your specific scene.
Does UMAP support out-of-sample transforms for inference?
Yes. Calling umap_reducer.transform(X_new) after fitting applies the learned embedding to unseen data. However, large distributional shifts between training and inference geographies can degrade embedding quality; monitor component density at inference time and consider retraining the reducer when land-cover composition changes significantly.
Can I apply dimensionality reduction on chunked rasters without loading them into RAM?
Yes. Fit the scaler and IncrementalPCA on a stratified spatial sample using rasterio windowed reads, then apply the fitted transforms block-by-block. dask.array integrates cleanly with this pattern for continental-scale workloads where even the sample does not fit into memory.
Will standard k-fold cross-validation give reliable results after dimensionality reduction?
No. Spatially proximate samples share autocorrelated residuals, inflating accuracy estimates by 15–40% in typical land-cover classification tasks. Replace KFold with spatially blocked folds so that training and validation regions are geographically separated. See spatial cross-validation strategies for implementation details.
Related
- Spatial Cross-Validation Strategies — spatially blocked fold generation to prevent autocorrelation leakage
- Handling Spatial Autocorrelation — diagnosing and correcting autocorrelated residuals before and after compression
- Feature Scaling for Geospatial Inputs — normalizing heterogeneous raster and vector features before matrix factorization
- Raster Band Math and Index Calculation — computing the derived spectral indices that feed into the reduction pipeline
- Gradient Boosting for Raster Data — using PCA-reduced raster features as inputs to XGBoost and LightGBM