Spectral index calculation is one of the highest-leverage steps in a geospatial ML feature pipeline. By applying algebraic operations across co-registered raster bands you convert raw digital numbers or surface reflectance values into physically meaningful signals — vegetation vigor, water content, urban heat, soil composition — that carry far more predictive power than individual bands alone. The challenge is not the formula itself; it is executing band math correctly at scale, across heterogeneous sensors and tile boundaries, in a way that is fully reproducible between training and inference runs.
This guide covers the complete production workflow: from metadata validation and radiometric scaling through vectorized computation, chunked I/O, and pipeline integration. It is part of the broader Spatial Feature Engineering for Machine Learning discipline, where raster-derived indices combine with vector proximity features, temporal aggregations, and neighborhood statistics to form the full feature matrix fed to your models.
Prerequisites and Environment Setup
The following dependencies are required. Pin exact versions in your requirements.txt to ensure reproducibility across environments:
# requirements.txt
rasterio==1.3.10
numpy==1.26.4
xarray==2024.2.0
rioxarray==0.15.3
dask[array]==2024.4.1
scikit-learn==1.4.2
pyproj==3.6.1Install with:
pip install rasterio==1.3.10 numpy==1.26.4 xarray==2024.2.0 \
rioxarray==0.15.3 "dask[array]==2024.4.1" \
scikit-learn==1.4.2 pyproj==3.6.1System-level dependencies: GDAL 3.6+ and PROJ 9.2+ must be installed before rasterio. On Ubuntu/Debian:
sudo apt-get install gdal-bin libgdal-dev proj-bin libproj-devVerify your environment before processing any imagery:
import rasterio, numpy, pyproj
print(rasterio.__version__) # expect 1.3.x
print(numpy.__version__) # expect 1.26.x
print(pyproj.__version__) # expect 3.6.xInput data requirements: multi-band GeoTIFFs (Sentinel-2 L2A or Landsat 8/9 Collection 2) with documented spectral band ordering and sensor-specific radiometric scaling factors available in the file metadata or a sidecar JSON.
Step-by-Step Implementation
Step 1 — Validate Metadata and Align Grids
Before any computation, confirm that all input bands share an identical coordinate reference system, spatial resolution, and affine transform. Mismatched projections or pixel grids produce silently wrong indices — the math succeeds but each pixel samples a different ground location on each band.
import rasterio
from rasterio.enums import Resampling
def validate_and_align(band_paths: list[str], reference_idx: int = 0) -> list[str]:
"""
Confirm all bands share CRS, resolution, and transform.
Returns paths unchanged if consistent; raises on mismatch.
"""
with rasterio.open(band_paths[reference_idx]) as ref:
ref_crs = ref.crs
ref_transform = ref.transform
ref_shape = (ref.height, ref.width)
for path in band_paths[1:]:
with rasterio.open(path) as src:
if src.crs != ref_crs:
raise ValueError(
f"CRS mismatch: {path} has {src.crs}, expected {ref_crs}. "
"Reproject with rasterio.warp.reproject before band math."
)
if (src.height, src.width) != ref_shape:
raise ValueError(
f"Grid mismatch: {path} is {src.height}x{src.width}, "
f"expected {ref_shape[0]}x{ref_shape[1]}."
)
if not all(
abs(src.transform[i] - ref_transform[i]) < 1e-8 for i in range(6)
):
raise ValueError(f"Affine transform mismatch in {path}.")
return band_paths
# Usage
paths = ["red_band.tif", "nir_band.tif"]
validate_and_align(paths)
assert len(paths) == 2If CRS Alignment and Projection Handling has not been applied upstream, fix projection mismatches before proceeding — attempting band math across different projections is a common source of subtly incorrect feature arrays.
Step 2 — Apply Radiometric Scaling
Raw digital numbers must be converted to surface reflectance before computing indices. Sentinel-2 L2A products use a fixed scale factor of 10,000. Landsat Collection 2 Level-2 products embed per-band gain and additive offset values in the MTL metadata file.
import numpy as np
import rasterio
SENTINEL2_SCALE = 10_000.0
SENTINEL2_NO_DATA = 0
def load_reflectance_band(
path: str,
scale: float = SENTINEL2_SCALE,
no_data_val: int = SENTINEL2_NO_DATA,
) -> np.ndarray:
"""
Load a single-band GeoTIFF and convert DN to [0, 1] reflectance.
No-data pixels are returned as NaN.
"""
with rasterio.open(path) as src:
data = src.read(1).astype(np.float32)
# Mask no-data before scaling
data = np.where(data == no_data_val, np.nan, data)
reflectance = data / scale
# Clip to physically valid range — values outside [0, 1] are sensor artefacts
reflectance = np.clip(reflectance, 0.0, 1.0)
return reflectance
red = load_reflectance_band("B04.tif")
nir = load_reflectance_band("B08.tif")
assert red.dtype == np.float32
assert np.nanmax(red) <= 1.0, "Red band contains values > 1.0 after scaling"Step 3 — Mask Clouds and Sensor Artifacts
Atmospheric interference, cloud cover, and sensor saturation introduce high-variance noise that skews index distributions. Set masked pixels to NaN rather than zero — zero NDVI implies bare soil, so cloud-masked pixels will cluster with valid bare-soil samples if you use zero as a fill value.
import numpy as np
import rasterio
# Sentinel-2 Scene Classification Layer (SCL) values to exclude:
# 0=no data, 1=saturated/defective, 3=cloud shadows,
# 8=cloud medium prob, 9=cloud high prob, 10=thin cirrus, 11=snow
INVALID_SCL_CLASSES = {0, 1, 3, 8, 9, 10, 11}
def build_valid_mask(scl_path: str) -> np.ndarray:
"""
Return a boolean array: True where pixels are valid for computation.
"""
with rasterio.open(scl_path) as src:
scl = src.read(1)
invalid = np.zeros_like(scl, dtype=bool)
for cls in INVALID_SCL_CLASSES:
invalid |= (scl == cls)
return ~invalid # True = valid
def apply_mask(array: np.ndarray, valid_mask: np.ndarray) -> np.ndarray:
"""Set invalid pixels to NaN in-place."""
out = array.copy()
out[~valid_mask] = np.nan
return out
valid = build_valid_mask("SCL.tif")
red_masked = apply_mask(red, valid)
nir_masked = apply_mask(nir, valid)
valid_fraction = valid.mean()
assert valid_fraction > 0.05, f"Too few valid pixels ({valid_fraction:.1%}) — check scene cloud cover."
print(f"Valid pixel fraction: {valid_fraction:.1%}")Step 4 — Execute Vectorized Band Math
Compute index formulas using NumPy broadcasting. Avoid pixel-level Python loops; leverage contiguous memory arrays and CPU vectorization. The epsilon term prevents division-by-zero errors over optically uniform surfaces (deep water, dense shadow).
import numpy as np
EPSILON = 1e-10 # prevents division by zero on uniform surfaces
def ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
"""
Normalized Difference Vegetation Index.
Output range: [-1, 1]. High values indicate dense green vegetation.
"""
return (nir - red) / (nir + red + EPSILON)
def evi(blue: np.ndarray, red: np.ndarray, nir: np.ndarray) -> np.ndarray:
"""
Enhanced Vegetation Index — less prone to atmospheric and soil background noise than NDVI.
Standard coefficients: G=2.5, C1=6, C2=7.5, L=1.
"""
return 2.5 * (nir - red) / (nir + 6 * red - 7.5 * blue + 1 + EPSILON)
def ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
"""
Normalized Difference Water Index (McFeeters 1996).
Positive values indicate open water surfaces.
"""
return (green - nir) / (green + nir + EPSILON)
# Compute all indices in a single pass
ndvi_arr = ndvi(red_masked, nir_masked)
evi_arr = evi(blue_masked, red_masked, nir_masked)
ndwi_arr = ndwi(green_masked, nir_masked)
# Sanity checks on output ranges
assert np.nanmin(ndvi_arr) >= -1.0 and np.nanmax(ndvi_arr) <= 1.0, "NDVI out of range"
assert np.nanmin(evi_arr) >= -1.0 and np.nanmax(evi_arr) <= 1.0, "EVI out of range"For a complete implementation guide covering EVI parameter tuning and batch processing across multiple scenes, see How to Calculate NDVI and EVI with Rasterio.
Step 5 — Chunked Processing for Large Scenes
Full Sentinel-2 tiles (10 m resolution, 100 km x 100 km) occupy ~3 GB per band uncompressed. Process scenes in fixed-size windows using rasterio.windows to stay within memory budgets. For distributed workloads, rioxarray with dask is the cleaner abstraction.
import rasterio
from rasterio.windows import Window
import numpy as np
CHUNK = 1024 # pixels per side; adjust to fit available RAM
def compute_ndvi_chunked(
red_path: str,
nir_path: str,
output_path: str,
chunk_size: int = CHUNK,
) -> None:
"""
Compute NDVI tile-by-tile and write to a compressed GeoTIFF.
No full-scene array is held in memory.
"""
with rasterio.open(red_path) as red_src, \
rasterio.open(nir_path) as nir_src:
profile = red_src.profile.copy()
profile.update(dtype="float32", count=1, compress="zstd", nodata=np.nan)
with rasterio.open(output_path, "w", **profile) as dst:
for row_off in range(0, red_src.height, chunk_size):
for col_off in range(0, red_src.width, chunk_size):
win = Window(
col_off, row_off,
min(chunk_size, red_src.width - col_off),
min(chunk_size, red_src.height - row_off),
)
red_chunk = red_src.read(1, window=win).astype(np.float32) / 10_000.0
nir_chunk = nir_src.read(1, window=win).astype(np.float32) / 10_000.0
ndvi_chunk = (nir_chunk - red_chunk) / (nir_chunk + red_chunk + EPSILON)
dst.write(ndvi_chunk, 1, window=win)
compute_ndvi_chunked("B04.tif", "B08.tif", "ndvi_output.tif")
print("Chunked NDVI complete.")Step 6 — Scale for ML Readiness
Raw index values can still contain outliers from sensor noise or suboptimal masking. Clip to physically plausible bounds, then apply standardization. Serialise the scaler alongside model weights so that training-time and inference-time transformations are identical. This is consistent with the normalization approach described in Feature Scaling for Geospatial Inputs.
import numpy as np
import pickle
from sklearn.preprocessing import RobustScaler
def prepare_index_for_training(
index_array: np.ndarray,
clip_min: float = -1.0,
clip_max: float = 1.0,
scaler_path: str | None = None,
) -> tuple[np.ndarray, RobustScaler]:
"""
Clip outliers, flatten valid pixels, and fit a RobustScaler.
Returns the scaled flat array and the fitted scaler.
"""
clipped = np.clip(index_array, clip_min, clip_max)
flat = clipped[~np.isnan(clipped)].reshape(-1, 1)
scaler = RobustScaler()
scaled = scaler.fit_transform(flat)
if scaler_path:
with open(scaler_path, "wb") as f:
pickle.dump(scaler, f)
return scaled.ravel(), scaler
ndvi_scaled, scaler = prepare_index_for_training(
ndvi_arr, scaler_path="ndvi_scaler.pkl"
)
assert abs(np.median(ndvi_scaled)) < 0.1, "Median should be near zero after RobustScaler"Verification and Testing
After completing all steps, run a suite of numerical and visual checks before feeding indices into training:
import numpy as np
import rasterio
def verify_index_output(index_path: str, expected_min: float, expected_max: float) -> None:
"""Confirm output file has valid dtype, range, and non-trivial valid pixel count."""
with rasterio.open(index_path) as src:
data = src.read(1).astype(np.float32)
profile = src.profile
assert profile["dtype"] == "float32", "Output must be float32"
valid = data[~np.isnan(data)]
assert len(valid) > 0, "No valid (non-NaN) pixels in output"
assert valid.min() >= expected_min, f"Values below {expected_min}"
assert valid.max() <= expected_max, f"Values above {expected_max}"
print(
f"OK — {len(valid):,} valid pixels | "
f"mean={valid.mean():.4f} | std={valid.std():.4f}"
)
verify_index_output("ndvi_output.tif", expected_min=-1.0, expected_max=1.0)A visual sanity check matters too: render a quicklook PNG from the NDVI GeoTIFF and confirm that water bodies are negative, dense forest is near +0.6 to +0.8, and bare soil or urban areas cluster near 0 to +0.2. Unexpected spatial patterns (e.g., index values following tile boundaries) indicate a grid alignment or scaling problem.
When combining raster indices with Spatial Lag and Neighborhood Statistics features downstream, also verify that both feature arrays cover the same spatial extent and share the same NaN mask.
Troubleshooting and Common Errors
NDVI contains values above 1.0 or below -1.0
Root cause: radiometric scaling was skipped. Raw Sentinel-2 DN values can exceed 10,000 in bright targets, producing ratios outside [-1, 1]. Fix: apply / 10_000.0 (or the appropriate sensor scale factor) before band math.
All valid pixels become NaN after masking
Root cause: the SCL band resolution does not match the reflectance band resolution. Sentinel-2 SCL is delivered at 20 m while bands B02/B03/B04/B08 are at 10 m. Resample the SCL to 10 m before masking: rasterio.warp.reproject(scl, dst_shape=reflectance.shape, ...).
rasterio.errors.CRSError: Input is not a CRS
Root cause: one of the input files has no embedded CRS (common with raw DN products or improperly exported files). Fix: assign the correct CRS using rasterio.open(path, 'r+') as dst: dst.crs = CRS.from_epsg(32632) — but first verify with rasterio info that the file genuinely lacks a projection rather than having a wrong one.
Memory overflow on large scenes
Root cause: loading the full band array into RAM before processing. Fix: use the chunked window loop in Step 5, or switch to rioxarray with chunks={"x": 1024, "y": 1024}.
Index output has stripe artifacts at tile boundaries
Root cause: inconsistent radiometric normalization across adjacent tiles (different acquisition dates, sun angles, or atmospheric correction versions). Fix: apply a radiometric normalization step (e.g., histogram matching or BRDF correction) before mosaicking, or compute indices per-tile before mosaicking rather than after.
EVI produces physically implausible spikes over water
Root cause: EVI is sensitive to negative reflectance values that can occur in atmospherically corrected blue bands over dark water surfaces. Fix: apply the water mask from NDWI before using EVI, or substitute NDVI for water-adjacent pixels. The Vector Proximity and Buffer Generation techniques can supply a reference water-body mask from vector data.
Performance Optimisation
Chunked I/O: Process scenes in 1024x1024 or 2048x2048 pixel windows. Larger chunks improve I/O efficiency but increase peak RAM; benchmark your hardware to find the optimal size.
Parallel tile processing: Use concurrent.futures.ProcessPoolExecutor to process multiple tiles simultaneously. Each tile is stateless and independent, making this embarrassingly parallel.
from concurrent.futures import ProcessPoolExecutor
tile_paths = [("B04_tile1.tif", "B08_tile1.tif", "ndvi_tile1.tif"),
("B04_tile2.tif", "B08_tile2.tif", "ndvi_tile2.tif")]
with ProcessPoolExecutor(max_workers=4) as pool:
pool.map(lambda args: compute_ndvi_chunked(*args), tile_paths)Dask for distributed computation: Open rasters with rioxarray.open_rasterio(path, chunks={"x": 1024, "y": 1024}) and apply NumPy arithmetic directly. Dask defers execution and can distribute across a local cluster or cloud workers.
Output compression: Write results with compress="zstd" and predictor=2 (floating-point predictor). ZSTD at level 1 achieves roughly 3x compression with negligible CPU overhead compared to uncompressed float32.
Lazy loading at inference time: Store outputs as Cloud Optimized GeoTIFFs (COGs) with overviews. Inference services can then read only the required spatial window without fetching the entire file — critical when your Temporal Aggregation for Time-Series Geodata pipeline needs to read hundreds of date slices per inference request.
FAQ
Why does my NDVI contain values outside [-1, 1]?
Values outside the valid range almost always indicate that radiometric scaling was skipped. Sentinel-2 raw DN values for bright surfaces can exceed 10,000, producing ratios well above 1.0 after band math. Apply / 10_000.0 (or your sensor’s specific scale factor) before computing any spectral index.
Should I use zero or NaN for masked (cloudy) pixels?
Always use NaN. Filling with zero creates a false spectral signal — zero NDVI corresponds to bare soil, so cloud-masked pixels will silently cluster with valid bare-soil samples during training, degrading model calibration on agricultural or vegetated targets.
How do I avoid tile-boundary artifacts in chunked processing?
For purely algebraic indices (NDVI, EVI, NDWI), each output pixel depends only on co-located input pixels, so there are no boundary artifacts from chunking. Boundary artifacts only arise when the computation has a spatial neighbourhood (e.g., focal statistics, Gaussian blur). If you layer band indices with neighbourhood aggregations from Spatial Lag and Neighborhood Statistics, add overlap at that stage, not during index computation.
Can I compute spectral indices directly on a Dask array?
Yes. Use rioxarray.open_rasterio(path, chunks={"x": 1024, "y": 1024}) to get a Dask-backed array, then apply standard NumPy arithmetic (nir - red) / (nir + red + 1e-10). Dask defers execution until .compute() is called and can distribute work across cores or a multi-node compute environment.