Should I use zero or NaN for masked pixels?

Always use NaN. Filling masked pixels with zero creates a false signal — zero NDVI implies bare soil, so cloudy pixels will cluster with valid bare-soil samples during training and corrupt model calibration.

How do I handle chunked processing without boundary artifacts?

Add an overlap (e.g., 64 px) between adjacent rasterio windows before computing neighborhood-sensitive operations, then trim the overlap from the output tile before writing. Pure algebraic indices like NDVI are pixel-local and require no overlap.

Can I compute band indices directly on a Dask array?

Yes. Open the raster with rioxarray and chunk it, then apply standard NumPy arithmetic. Dask defers execution and distributes computation across workers, making it practical for continental-scale mosaics without loading the full array into RAM.

Raster Band Math and Index Calculation

Q: Why does my NDVI contain values outside [-1, 1]?

Values outside the valid range almost always mean you skipped radiometric scaling. Raw digital numbers from Sentinel-2 must be divided by 10,000 before band math; Landsat Collection 2 requires its own gain/offset factors. Apply scaling before computing any index.

Spectral index calculation is one of the highest-leverage steps in a geospatial ML feature pipeline. By applying algebraic operations across co-registered raster bands you convert raw digital numbers or surface reflectance values into physically meaningful signals — vegetation vigor, water content, urban heat, soil composition — that carry far more predictive power than individual bands alone. The challenge is not the formula itself; it is executing band math correctly at scale, across heterogeneous sensors and tile boundaries, in a way that is fully reproducible between training and inference runs.

This guide covers the complete production workflow: from metadata validation and radiometric scaling through vectorized computation, chunked I/O, and pipeline integration. It is part of the broader Spatial Feature Engineering for Machine Learning discipline, where raster-derived indices combine with vector proximity features, temporal aggregations, and neighborhood statistics to form the full feature matrix fed to your models.

Prerequisites and Environment Setup

The following dependencies are required. Pin exact versions in your requirements.txt to ensure reproducibility across environments:

# requirements.txt
rasterio==1.3.10
numpy==1.26.4
xarray==2024.2.0
rioxarray==0.15.3
dask[array]==2024.4.1
scikit-learn==1.4.2
pyproj==3.6.1

Install with:

pip install rasterio==1.3.10 numpy==1.26.4 xarray==2024.2.0 \
    rioxarray==0.15.3 "dask[array]==2024.4.1" \
    scikit-learn==1.4.2 pyproj==3.6.1

System-level dependencies: GDAL 3.6+ and PROJ 9.2+ must be installed before rasterio. On Ubuntu/Debian:

sudo apt-get install gdal-bin libgdal-dev proj-bin libproj-dev

Verify your environment before processing any imagery:

import rasterio, numpy, pyproj
print(rasterio.__version__)   # expect 1.3.x
print(numpy.__version__)      # expect 1.26.x
print(pyproj.__version__)     # expect 3.6.x

Input data requirements: multi-band GeoTIFFs (Sentinel-2 L2A or Landsat 8/9 Collection 2) with documented spectral band ordering and sensor-specific radiometric scaling factors available in the file metadata or a sidecar JSON.

Step-by-Step Implementation

Step 1 — Validate Metadata and Align Grids

Before any computation, confirm that all input bands share an identical coordinate reference system, spatial resolution, and affine transform. Mismatched projections or pixel grids produce silently wrong indices — the math succeeds but each pixel samples a different ground location on each band.

import rasterio
from rasterio.enums import Resampling

def validate_and_align(band_paths: list[str], reference_idx: int = 0) -> list[str]:
    """
    Confirm all bands share CRS, resolution, and transform.
    Returns paths unchanged if consistent; raises on mismatch.
    """
    with rasterio.open(band_paths[reference_idx]) as ref:
        ref_crs = ref.crs
        ref_transform = ref.transform
        ref_shape = (ref.height, ref.width)

    for path in band_paths[1:]:
        with rasterio.open(path) as src:
            if src.crs != ref_crs:
                raise ValueError(
                    f"CRS mismatch: {path} has {src.crs}, expected {ref_crs}. "
                    "Reproject with rasterio.warp.reproject before band math."
                )
            if (src.height, src.width) != ref_shape:
                raise ValueError(
                    f"Grid mismatch: {path} is {src.height}x{src.width}, "
                    f"expected {ref_shape[0]}x{ref_shape[1]}."
                )
            if not all(
                abs(src.transform[i] - ref_transform[i]) < 1e-8 for i in range(6)
            ):
                raise ValueError(f"Affine transform mismatch in {path}.")
    return band_paths

# Usage
paths = ["red_band.tif", "nir_band.tif"]
validate_and_align(paths)
assert len(paths) == 2

If CRS Alignment and Projection Handling has not been applied upstream, fix projection mismatches before proceeding — attempting band math across different projections is a common source of subtly incorrect feature arrays.

Step 2 — Apply Radiometric Scaling

Raw digital numbers must be converted to surface reflectance before computing indices. Sentinel-2 L2A products use a fixed scale factor of 10,000. Landsat Collection 2 Level-2 products embed per-band gain and additive offset values in the MTL metadata file.

import numpy as np
import rasterio

SENTINEL2_SCALE = 10_000.0
SENTINEL2_NO_DATA = 0

def load_reflectance_band(
    path: str,
    scale: float = SENTINEL2_SCALE,
    no_data_val: int = SENTINEL2_NO_DATA,
) -> np.ndarray:
    """
    Load a single-band GeoTIFF and convert DN to [0, 1] reflectance.
    No-data pixels are returned as NaN.
    """
    with rasterio.open(path) as src:
        data = src.read(1).astype(np.float32)
    # Mask no-data before scaling
    data = np.where(data == no_data_val, np.nan, data)
    reflectance = data / scale
    # Clip to physically valid range — values outside [0, 1] are sensor artefacts
    reflectance = np.clip(reflectance, 0.0, 1.0)
    return reflectance

red = load_reflectance_band("B04.tif")
nir = load_reflectance_band("B08.tif")

assert red.dtype == np.float32
assert np.nanmax(red) <= 1.0, "Red band contains values > 1.0 after scaling"

Step 3 — Mask Clouds and Sensor Artifacts

Atmospheric interference, cloud cover, and sensor saturation introduce high-variance noise that skews index distributions. Set masked pixels to NaN rather than zero — zero NDVI implies bare soil, so cloud-masked pixels will cluster with valid bare-soil samples if you use zero as a fill value.

import numpy as np
import rasterio

# Sentinel-2 Scene Classification Layer (SCL) values to exclude:
# 0=no data, 1=saturated/defective, 3=cloud shadows,
# 8=cloud medium prob, 9=cloud high prob, 10=thin cirrus, 11=snow
INVALID_SCL_CLASSES = {0, 1, 3, 8, 9, 10, 11}

def build_valid_mask(scl_path: str) -> np.ndarray:
    """
    Return a boolean array: True where pixels are valid for computation.
    """
    with rasterio.open(scl_path) as src:
        scl = src.read(1)
    invalid = np.zeros_like(scl, dtype=bool)
    for cls in INVALID_SCL_CLASSES:
        invalid |= (scl == cls)
    return ~invalid   # True = valid

def apply_mask(array: np.ndarray, valid_mask: np.ndarray) -> np.ndarray:
    """Set invalid pixels to NaN in-place."""
    out = array.copy()
    out[~valid_mask] = np.nan
    return out

valid = build_valid_mask("SCL.tif")
red_masked = apply_mask(red, valid)
nir_masked = apply_mask(nir, valid)

valid_fraction = valid.mean()
assert valid_fraction > 0.05, f"Too few valid pixels ({valid_fraction:.1%}) — check scene cloud cover."
print(f"Valid pixel fraction: {valid_fraction:.1%}")

Step 4 — Execute Vectorized Band Math

Compute index formulas using NumPy broadcasting. Avoid pixel-level Python loops; leverage contiguous memory arrays and CPU vectorization. The epsilon term prevents division-by-zero errors over optically uniform surfaces (deep water, dense shadow).

import numpy as np

EPSILON = 1e-10  # prevents division by zero on uniform surfaces

def ndvi(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """
    Normalized Difference Vegetation Index.
    Output range: [-1, 1]. High values indicate dense green vegetation.
    """
    return (nir - red) / (nir + red + EPSILON)

def evi(blue: np.ndarray, red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """
    Enhanced Vegetation Index — less prone to atmospheric and soil background noise than NDVI.
    Standard coefficients: G=2.5, C1=6, C2=7.5, L=1.
    """
    return 2.5 * (nir - red) / (nir + 6 * red - 7.5 * blue + 1 + EPSILON)

def ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """
    Normalized Difference Water Index (McFeeters 1996).
    Positive values indicate open water surfaces.
    """
    return (green - nir) / (green + nir + EPSILON)

# Compute all indices in a single pass
ndvi_arr  = ndvi(red_masked, nir_masked)
evi_arr   = evi(blue_masked, red_masked, nir_masked)
ndwi_arr  = ndwi(green_masked, nir_masked)

# Sanity checks on output ranges
assert np.nanmin(ndvi_arr) >= -1.0 and np.nanmax(ndvi_arr) <= 1.0, "NDVI out of range"
assert np.nanmin(evi_arr)  >= -1.0 and np.nanmax(evi_arr)  <= 1.0, "EVI out of range"

For a complete implementation guide covering EVI parameter tuning and batch processing across multiple scenes, see How to Calculate NDVI and EVI with Rasterio.

Step 5 — Chunked Processing for Large Scenes

Full Sentinel-2 tiles (10 m resolution, 100 km x 100 km) occupy ~3 GB per band uncompressed. Process scenes in fixed-size windows using rasterio.windows to stay within memory budgets. For distributed workloads, rioxarray with dask is the cleaner abstraction.

import rasterio
from rasterio.windows import Window
import numpy as np

CHUNK = 1024  # pixels per side; adjust to fit available RAM

def compute_ndvi_chunked(
    red_path: str,
    nir_path: str,
    output_path: str,
    chunk_size: int = CHUNK,
) -> None:
    """
    Compute NDVI tile-by-tile and write to a compressed GeoTIFF.
    No full-scene array is held in memory.
    """
    with rasterio.open(red_path) as red_src, \
         rasterio.open(nir_path) as nir_src:

        profile = red_src.profile.copy()
        profile.update(dtype="float32", count=1, compress="zstd", nodata=np.nan)

        with rasterio.open(output_path, "w", **profile) as dst:
            for row_off in range(0, red_src.height, chunk_size):
                for col_off in range(0, red_src.width, chunk_size):
                    win = Window(
                        col_off, row_off,
                        min(chunk_size, red_src.width  - col_off),
                        min(chunk_size, red_src.height - row_off),
                    )
                    red_chunk = red_src.read(1, window=win).astype(np.float32) / 10_000.0
                    nir_chunk = nir_src.read(1, window=win).astype(np.float32) / 10_000.0
                    ndvi_chunk = (nir_chunk - red_chunk) / (nir_chunk + red_chunk + EPSILON)
                    dst.write(ndvi_chunk, 1, window=win)

compute_ndvi_chunked("B04.tif", "B08.tif", "ndvi_output.tif")
print("Chunked NDVI complete.")

Step 6 — Scale for ML Readiness

Raw index values can still contain outliers from sensor noise or suboptimal masking. Clip to physically plausible bounds, then apply standardization. Serialise the scaler alongside model weights so that training-time and inference-time transformations are identical. This is consistent with the normalization approach described in Feature Scaling for Geospatial Inputs.

import numpy as np
import pickle
from sklearn.preprocessing import RobustScaler

def prepare_index_for_training(
    index_array: np.ndarray,
    clip_min: float = -1.0,
    clip_max: float = 1.0,
    scaler_path: str | None = None,
) -> tuple[np.ndarray, RobustScaler]:
    """
    Clip outliers, flatten valid pixels, and fit a RobustScaler.
    Returns the scaled flat array and the fitted scaler.
    """
    clipped = np.clip(index_array, clip_min, clip_max)
    flat = clipped[~np.isnan(clipped)].reshape(-1, 1)
    scaler = RobustScaler()
    scaled = scaler.fit_transform(flat)
    if scaler_path:
        with open(scaler_path, "wb") as f:
            pickle.dump(scaler, f)
    return scaled.ravel(), scaler

ndvi_scaled, scaler = prepare_index_for_training(
    ndvi_arr, scaler_path="ndvi_scaler.pkl"
)
assert abs(np.median(ndvi_scaled)) < 0.1, "Median should be near zero after RobustScaler"

Verification and Testing

After completing all steps, run a suite of numerical and visual checks before feeding indices into training:

import numpy as np
import rasterio

def verify_index_output(index_path: str, expected_min: float, expected_max: float) -> None:
    """Confirm output file has valid dtype, range, and non-trivial valid pixel count."""
    with rasterio.open(index_path) as src:
        data = src.read(1).astype(np.float32)
        profile = src.profile

    assert profile["dtype"] == "float32", "Output must be float32"
    valid = data[~np.isnan(data)]
    assert len(valid) > 0, "No valid (non-NaN) pixels in output"
    assert valid.min() >= expected_min, f"Values below {expected_min}"
    assert valid.max() <= expected_max, f"Values above {expected_max}"
    print(
        f"OK — {len(valid):,} valid pixels | "
        f"mean={valid.mean():.4f} | std={valid.std():.4f}"
    )

verify_index_output("ndvi_output.tif", expected_min=-1.0, expected_max=1.0)

A visual sanity check matters too: render a quicklook PNG from the NDVI GeoTIFF and confirm that water bodies are negative, dense forest is near +0.6 to +0.8, and bare soil or urban areas cluster near 0 to +0.2. Unexpected spatial patterns (e.g., index values following tile boundaries) indicate a grid alignment or scaling problem.

When combining raster indices with Spatial Lag and Neighborhood Statistics features downstream, also verify that both feature arrays cover the same spatial extent and share the same NaN mask.

Troubleshooting and Common Errors

NDVI contains values above 1.0 or below -1.0

Root cause: radiometric scaling was skipped. Raw Sentinel-2 DN values can exceed 10,000 in bright targets, producing ratios outside [-1, 1]. Fix: apply / 10_000.0 (or the appropriate sensor scale factor) before band math.

All valid pixels become NaN after masking

Root cause: the SCL band resolution does not match the reflectance band resolution. Sentinel-2 SCL is delivered at 20 m while bands B02/B03/B04/B08 are at 10 m. Resample the SCL to 10 m before masking: rasterio.warp.reproject(scl, dst_shape=reflectance.shape, ...).

`rasterio.errors.CRSError: Input is not a CRS`

Root cause: one of the input files has no embedded CRS (common with raw DN products or improperly exported files). Fix: assign the correct CRS using rasterio.open(path, 'r+') as dst: dst.crs = CRS.from_epsg(32632) — but first verify with rasterio info that the file genuinely lacks a projection rather than having a wrong one.

Memory overflow on large scenes

Root cause: loading the full band array into RAM before processing. Fix: use the chunked window loop in Step 5, or switch to rioxarray with chunks={"x": 1024, "y": 1024}.

Index output has stripe artifacts at tile boundaries

Root cause: inconsistent radiometric normalization across adjacent tiles (different acquisition dates, sun angles, or atmospheric correction versions). Fix: apply a radiometric normalization step (e.g., histogram matching or BRDF correction) before mosaicking, or compute indices per-tile before mosaicking rather than after.

EVI produces physically implausible spikes over water

Root cause: EVI is sensitive to negative reflectance values that can occur in atmospherically corrected blue bands over dark water surfaces. Fix: apply the water mask from NDWI before using EVI, or substitute NDVI for water-adjacent pixels. The Vector Proximity and Buffer Generation techniques can supply a reference water-body mask from vector data.

Performance Optimisation

Chunked I/O: Process scenes in 1024x1024 or 2048x2048 pixel windows. Larger chunks improve I/O efficiency but increase peak RAM; benchmark your hardware to find the optimal size.

Parallel tile processing: Use concurrent.futures.ProcessPoolExecutor to process multiple tiles simultaneously. Each tile is stateless and independent, making this embarrassingly parallel.

from concurrent.futures import ProcessPoolExecutor

tile_paths = [("B04_tile1.tif", "B08_tile1.tif", "ndvi_tile1.tif"),
              ("B04_tile2.tif", "B08_tile2.tif", "ndvi_tile2.tif")]

with ProcessPoolExecutor(max_workers=4) as pool:
    pool.map(lambda args: compute_ndvi_chunked(*args), tile_paths)

Dask for distributed computation: Open rasters with rioxarray.open_rasterio(path, chunks={"x": 1024, "y": 1024}) and apply NumPy arithmetic directly. Dask defers execution and can distribute across a local cluster or cloud workers.

Output compression: Write results with compress="zstd" and predictor=2 (floating-point predictor). ZSTD at level 1 achieves roughly 3x compression with negligible CPU overhead compared to uncompressed float32.

Lazy loading at inference time: Store outputs as Cloud Optimized GeoTIFFs (COGs) with overviews. Inference services can then read only the required spatial window without fetching the entire file — critical when your Temporal Aggregation for Time-Series Geodata pipeline needs to read hundreds of date slices per inference request.

FAQ

Why does my NDVI contain values outside [-1, 1]?

Values outside the valid range almost always indicate that radiometric scaling was skipped. Sentinel-2 raw DN values for bright surfaces can exceed 10,000, producing ratios well above 1.0 after band math. Apply / 10_000.0 (or your sensor’s specific scale factor) before computing any spectral index.

Should I use zero or NaN for masked (cloudy) pixels?

Always use NaN. Filling with zero creates a false spectral signal — zero NDVI corresponds to bare soil, so cloud-masked pixels will silently cluster with valid bare-soil samples during training, degrading model calibration on agricultural or vegetated targets.

How do I avoid tile-boundary artifacts in chunked processing?

For purely algebraic indices (NDVI, EVI, NDWI), each output pixel depends only on co-located input pixels, so there are no boundary artifacts from chunking. Boundary artifacts only arise when the computation has a spatial neighbourhood (e.g., focal statistics, Gaussian blur). If you layer band indices with neighbourhood aggregations from Spatial Lag and Neighborhood Statistics, add overlap at that stage, not during index computation.

Can I compute spectral indices directly on a Dask array?

Yes. Use rioxarray.open_rasterio(path, chunks={"x": 1024, "y": 1024}) to get a Dask-backed array, then apply standard NumPy arithmetic (nir - red) / (nir + red + 1e-10). Dask defers execution until .compute() is called and can distribute work across cores or a multi-node compute environment.

Part of: Spatial Feature Engineering for Machine Learning