How to Calculate NDVI and EVI with Rasterio

Calculate NDVI and EVI with Rasterio and NumPy: step-by-step tutorial for computing vegetation indices from multi-spectral satellite imagery for geospatial machine learning workflows.

To calculate NDVI and EVI with Rasterio, read the required spectral bands into NumPy arrays, apply the standard vegetation index formulas using vectorized arithmetic, mask invalid pixels, and write the results to new GeoTIFFs while preserving the original geospatial metadata. Rasterio’s windowed I/O combined with NumPy broadcasting enables memory-efficient processing suitable for batch inference, automated feature pipelines, and large-scale spatial ML workloads.

Mathematical Foundation & ML Context

Vegetation indices convert raw surface reflectance into bounded, physically interpretable features that mitigate sensor-specific radiometric drift and atmospheric interference. In machine learning workflows, these indices act as primary predictors for crop yield regression, canopy cover estimation, and ecological anomaly detection.

NDVI = (NIR - Red) / (NIR + Red)
EVI  = 2.5 * ((NIR - Red) / (NIR + 6*Red - 7.5*Blue + 1))

EVI corrects for atmospheric scattering and soil background noise by incorporating the blue band alongside gain/offset coefficients, making it significantly more robust in high-biomass regions where NDVI saturates near 0.8. When constructing training datasets for Spatial Feature Engineering for Machine Learning, index calculation must occur before tiling, augmentation, or vector masking to preserve spatial alignment and prevent data leakage. For deeper coverage of band combinations and normalization strategies, see the Raster Band Math and Index Calculation guide.

Production-Ready Implementation

The following function processes large rasters using native block windows, enforces float32 precision, applies safe division masking, and writes outputs with standardized LZW compression and explicit nodata handling.

import rasterio
import numpy as np
from rasterio.windows import Window

def calculate_vegetation_indices(
    input_path: str,
    ndvi_out: str,
    evi_out: str,
    band_indices: tuple = (1, 2, 3),  # (Blue, Red, NIR)
    scale_factor: float = 1.0,
    nodata: float = -9999.0
) -> None:
    """Calculate NDVI and EVI from a multi-band raster using windowed I/O."""
    blue_idx, red_idx, nir_idx = band_indices

    with rasterio.open(input_path) as src:
        if src.count < max(band_indices):
            raise ValueError(f"Input must contain at least {max(band_indices)} bands.")

        meta = src.meta.copy()
        meta.update(dtype="float32", count=1, compress="lzw", nodata=nodata)

        with rasterio.open(ndvi_out, "w", **meta) as dst_ndvi, \
             rasterio.open(evi_out, "w", **meta) as dst_evi:

            for ji, window in src.block_windows():
                # Read bands as float32 and apply radiometric scaling
                blue = src.read(blue_idx, window=window).astype("float32") * scale_factor
                red  = src.read(red_idx, window=window).astype("float32") * scale_factor
                nir  = src.read(nir_idx, window=window).astype("float32") * scale_factor

                # Mask invalid, zero, or negative reflectance values
                valid = (blue > 0) & (red > 0) & (nir > 0)

                ndvi = np.full_like(nir, nodata, dtype="float32")
                evi  = np.full_like(nir, nodata, dtype="float32")

                # Safe division using boolean indexing to avoid RuntimeWarnings
                ndvi[valid] = (nir[valid] - red[valid]) / (nir[valid] + red[valid])

                evi_denom = nir[valid] + 6.0 * red[valid] - 7.5 * blue[valid] + 1.0
                evi[valid] = 2.5 * (nir[valid] - red[valid]) / evi_denom

                dst_ndvi.write(ndvi, 1, window=window)
                dst_evi.write(evi, 1, window=window)

Key Implementation Considerations

  • Windowed I/O vs. Arbitrary Chunking: Iterating over src.block_windows() aligns reads/writes with the underlying GeoTIFF tile structure, minimizing disk I/O overhead. Arbitrary chunk_size slicing often forces partial tile decompression and increases latency. See Rasterio’s windowed read/write documentation for performance benchmarks.
  • Radiometric Scaling: Sentinel-2 and Landsat Level-2 products often store reflectance as scaled integers (e.g., DN * 0.0001). Apply scale_factor immediately after reading to preserve numerical precision during division.
  • Division Safety: Boolean masking (valid) prevents ZeroDivisionError and NaN propagation. Avoid np.where or np.divide with out parameters when working with large windows, as they allocate temporary arrays that can trigger OOM errors on constrained inference nodes.
  • Output Precision & Compression: float32 balances dynamic range with storage efficiency. LZW compression is lossless and CPU-light, making it ideal for intermediate feature layers. For archival, consider ZSTD or DEFLATE with predictor=2.

Pipeline Integration & MLOps

Vegetation indices should be computed as a deterministic preprocessing step before spatial joins, patch extraction, or model training. This guarantees that every pixel in the training tensor corresponds to the exact same ground location across temporal stacks.

When deploying to production:

  1. Validate Band Order: Never hardcode band positions. Parse metadata or use a configuration registry to map Blue, Red, and NIR to dataset-specific indices.
  2. Handle Edge Cases: Cloud masks, shadow layers, and water bodies frequently return near-zero reflectance. The valid mask above excludes them, but you may need to intersect with a separate quality mask for strict MLOps compliance.
  3. Automate Drift Detection: Store index statistics (mean, std, 95th percentile) per tile during calculation. Sudden shifts in EVI distributions often indicate sensor degradation, atmospheric anomalies, or land-cover change. For standardized index derivation and atmospheric correction references, consult the NASA LP DAAC MODIS Vegetation Indices guide.

By embedding index calculation directly into your ingestion pipeline, you eliminate redundant compute during training, ensure reproducible feature generation, and maintain strict spatial alignment across heterogeneous raster sources.