How to Calculate NDVI and EVI with Rasterio

Calculate NDVI and EVI with Rasterio and NumPy: compute vegetation indices from multi-spectral satellite imagery for geospatial machine learning feature engineering.

To calculate NDVI and EVI with Rasterio, open the multi-band source, read the Blue, Red, and NIR bands as float32 arrays, apply the index formulas with boolean-masked safe division, and write the results to new GeoTIFFs with preserved geospatial metadata. Using src.block_windows() keeps memory usage flat even on scenes that are too large to load whole.

Part of: Raster Band Math and Index Calculation — see also the parent Spatial Feature Engineering for Machine Learning guide.


Why Skipping Proper Index Calculation Corrupts ML Pipelines

Vegetation index errors are silent. A mismatched band order or a missing radiometric scale factor does not raise an exception — it simply shifts every pixel value into the wrong numerical range. When those values feed a regression model trained on correctly scaled data, predictions look plausible but carry a systematic bias that only surfaces at validation time or, worse, in production.

Three failure modes occur repeatedly in real pipelines:

  • Band-order inversion: Swapping NIR and Red inverts NDVI sign, producing negative values for dense vegetation and positive values for bare soil. A gradient-boosted regressor will learn the inverted signal if the training and inference sets share the same mistake, making the error invisible during cross-validation but catastrophic when a corrected data source is introduced.
  • Missing radiometric scale factor: Sentinel-2 Level-2A stores surface reflectance as 16-bit integers scaled by 10 000. Skipping the * 0.0001 multiplication produces NDVI values clustered near zero because both numerator and denominator are dominated by large, nearly equal integers.
  • Computing after reprojection: Reprojection interpolates raw DN or reflectance values across pixel grids. NDVI ratios computed from interpolated bands mix spectral information from adjacent land-cover types, blurring class boundaries and inflating spatial autocorrelation. When you also rely on Temporal Aggregation for Time-Series Geodata to stack seasonal composites, index-then-reproject order is mandatory to preserve clean temporal signals.

Principles for Correct Vegetation Index Engineering

  • Radiometric scaling first, everything else second. Apply the sensor-specific scale factor immediately after src.read(), before any arithmetic.
  • Validate band order from metadata, not assumptions. Parse the file’s band descriptions or maintain a configuration registry; never rely on positional convention.
  • Use native block windows. Aligning reads and writes to the GeoTIFF tile structure eliminates partial-tile decompression overhead. Arbitrary chunk slicing forces extra I/O.
  • Mask at the source, not the destination. Build a valid boolean array from raw reflectance conditions before any division. Do not use np.divide(out=...) with a fill — it still allocates the full denominator array and suppresses RuntimeWarning without actually preventing NaN propagation in edge cases.
  • Preserve nodata semantics end-to-end. Set nodata in the output meta and write the same sentinel value for invalid pixels so downstream tools — including Feature Scaling for Geospatial Inputs normalizers — can honour the mask.
  • Compute indices before any spatial join or patch extraction. This guarantees every sample in the training tensor corresponds to the exact same ground location across temporal and sensor stacks.

Index Formulas

NDVI = (NIR - Red) / (NIR + Red)
EVI  = 2.5 × (NIR - Red) / (NIR + 6×Red − 7.5×Blue + 1)

EVI incorporates the Blue band to correct for atmospheric scattering and soil background reflectance. In high-biomass regions NDVI saturates near 0.8 and loses discriminative power; EVI continues to respond linearly, making it a stronger predictor for dense-canopy classification tasks. As part of Raster Band Math and Index Calculation, both indices serve as primary inputs to downstream models and as targets for radiometric quality checks.


NDVI and EVI Computation Pipeline A flow diagram showing five stages: read spectral bands, apply radiometric scale, build valid-pixel mask, compute NDVI and EVI in parallel, and write compressed GeoTIFFs. Read Bands Blue / Red / NIR block_windows() Scale × 0.0001 → float32 Valid Mask blue > 0 red > 0, nir > 0 NDVI (NIR−Red)/ (NIR+Red) EVI 2.5×(NIR−Red)/ (denom) Write GeoTIFF LZW + nodata

Production-Ready Code

The function below uses native block-window iteration, enforces radiometric scaling, applies boolean-masked safe division, and writes both outputs simultaneously to avoid redundant band reads.

import logging
from typing import Tuple

import numpy as np
import rasterio

logger = logging.getLogger(__name__)


def calculate_vegetation_indices(
    input_path: str,
    ndvi_out: str,
    evi_out: str,
    band_indices: Tuple[int, int, int] = (1, 2, 3),  # (Blue, Red, NIR)
    scale_factor: float = 1.0,
    nodata: float = -9999.0,
) -> None:
    """Write NDVI and EVI GeoTIFFs from a multi-band surface reflectance raster.

    Args:
        input_path:   Path to the source multi-band GeoTIFF (e.g. Sentinel-2 L2A).
        ndvi_out:     Destination path for the NDVI single-band output.
        evi_out:      Destination path for the EVI single-band output.
        band_indices: 1-based (Blue, Red, NIR) band positions; parse from metadata
                      rather than assuming a fixed order.
        scale_factor: Radiometric scale applied to raw DN values (0.0001 for S2 L2A).
        nodata:       Sentinel value written for invalid and masked pixels.
    """
    blue_idx, red_idx, nir_idx = band_indices

    with rasterio.open(input_path) as src:
        required_bands = max(band_indices)
        if src.count < required_bands:
            raise ValueError(
                f"Source has {src.count} band(s); band index {required_bands} requested."
            )

        meta = src.meta.copy()
        meta.update(dtype="float32", count=1, compress="lzw", nodata=nodata)
        logger.info(
            "Processing %s: %d×%d pixels, %d band(s)",
            input_path, src.width, src.height, src.count,
        )

        with rasterio.open(ndvi_out, "w", **meta) as dst_ndvi, \
             rasterio.open(evi_out,  "w", **meta) as dst_evi:

            for _ji, window in src.block_windows():
                # Read and scale bands in one pass per window
                blue = src.read(blue_idx, window=window).astype("float32") * scale_factor
                red  = src.read(red_idx,  window=window).astype("float32") * scale_factor
                nir  = src.read(nir_idx,  window=window).astype("float32") * scale_factor

                # Strict positivity check excludes clouds, shadows, and water
                valid = (blue > 0) & (red > 0) & (nir > 0)

                ndvi = np.full_like(nir, nodata, dtype="float32")
                evi  = np.full_like(nir, nodata, dtype="float32")

                # NDVI: safe division — denominator is always > 0 inside `valid`
                ndvi[valid] = (
                    (nir[valid] - red[valid])
                    / (nir[valid] + red[valid])
                )

                # EVI: atmospheric and soil-background correction via Blue band
                evi_denom = nir[valid] + 6.0 * red[valid] - 7.5 * blue[valid] + 1.0
                evi[valid] = 2.5 * (nir[valid] - red[valid]) / evi_denom

                dst_ndvi.write(ndvi, 1, window=window)
                dst_evi.write(evi,  1, window=window)

        logger.info("Written: %s  %s", ndvi_out, evi_out)

Step-by-Step Walkthrough

1. Identify band positions from file metadata

Never hardcode band order. Read the file’s descriptions or maintain a sensor registry:

with rasterio.open("sentinel2_l2a.tif") as src:
    print(src.descriptions)
    # ('B02 - Blue', 'B04 - Red', 'B08 - NIR', ...)

# Map description strings to 1-based indices
desc = [d.lower() if d else "" for d in src.descriptions]
blue_idx = next(i + 1 for i, d in enumerate(desc) if "blue" in d or "b02" in d)
red_idx  = next(i + 1 for i, d in enumerate(desc) if "red"  in d or "b04" in d)
nir_idx  = next(i + 1 for i, d in enumerate(desc) if "nir"  in d or "b08" in d)

2. Apply radiometric scaling

Sentinel-2 Level-2A encodes surface reflectance as uint16 with a scale of 10 000 and an offset of -1000 (since Collection 1). Landsat Collection 2 uses a scale of 0.0000275 and an offset of -0.2. Pass the correct scale_factor to avoid the near-zero clustering bug described in the failure-modes section above.

# Sentinel-2 Level-2A (SAFE / COG format, Collection 0/1 compatibility)
scale_factor = 0.0001

calculate_vegetation_indices(
    "sentinel2_l2a.tif",
    ndvi_out="ndvi.tif",
    evi_out="evi.tif",
    band_indices=(blue_idx, red_idx, nir_idx),
    scale_factor=scale_factor,
)

3. Intersect the valid mask with an external quality layer

Sentinel-2 Scene Classification Layer (SCL) and Landsat QA_PIXEL band flag clouds and cloud shadows with near-zero reflectance that can slip through a simple positivity check. Widen the mask before passing it to the formula:

import numpy as np
import rasterio

def get_quality_mask(scl_path: str, window) -> np.ndarray:
    """Return True where SCL class is vegetated, non-vegetated, or bare soil (4,5,6)."""
    with rasterio.open(scl_path) as scl:
        scl_band = scl.read(1, window=window)
    return np.isin(scl_band, [4, 5, 6])

Pass this array via an optional external_mask parameter and AND it with valid inside the window loop for strict cloud-free outputs.

4. Verify numerical range post-computation

NDVI must fall within [-1, 1] and EVI is typically bounded to [-1, 2] for surface vegetation:

with rasterio.open("ndvi.tif") as src:
    data = src.read(1, masked=True)

assert data.min() >= -1.0, f"NDVI underflow: {data.min():.4f}"
assert data.max() <=  1.0, f"NDVI overflow: {data.max():.4f}"
print(f"NDVI range: {data.min():.3f}{data.max():.3f}  mean: {data.mean():.3f}")

5. Register outputs in your feature pipeline

Once the index rasters are validated, register them as deterministic feature layers before Temporal Aggregation for Time-Series Geodata stacks them into monthly composites, or before patch extraction feeds them to a CNN. This sequencing prevents recomputation and guarantees reproducible feature hashes across training runs.


Verification

Run this block after calculate_vegetation_indices to confirm spatial integrity, nodata handling, and value range in one pass:

import rasterio
import numpy as np

def verify_index_output(path: str, index_name: str = "index") -> None:
    with rasterio.open(path) as src:
        data = src.read(1, masked=True)
        nodata_count = data.mask.sum()
        valid_count  = (~data.mask).sum()

        print(f"--- {index_name} verification ---")
        print(f"  CRS:       {src.crs}")
        print(f"  Transform: {src.transform}")
        print(f"  Shape:     {src.height} × {src.width}")
        print(f"  Nodata px: {nodata_count:,}  Valid px: {valid_count:,}")
        print(f"  Range:     [{data.min():.4f}, {data.max():.4f}]")
        print(f"  Mean±SD:   {data.mean():.4f} ± {data.std():.4f}")

        # Hard assertions
        assert src.crs is not None, "CRS missing from output"
        assert data.min() >= -1.5, f"Unexpected underflow: {data.min()}"
        assert data.max() <=  2.5, f"Unexpected overflow: {data.max()}"
        print("  PASS")

verify_index_output("ndvi.tif", "NDVI")
verify_index_output("evi.tif",  "EVI")

Expected output for a healthy Sentinel-2 summer scene over mixed farmland:

--- NDVI verification ---
  CRS:       EPSG:32632
  Transform: | 10.00, 0.00, 300000.00 |  ...
  Shape:     10980 × 10980
  Nodata px: 124,832  Valid px: 19,875,168
  Range:     [-0.1823, 0.8791]
  Mean±SD:   0.4312 ± 0.1987
  PASS

FAQ

Why does my NDVI output contain NaN values even after masking invalid pixels?

NaN values usually appear when the denominator (NIR + Red) evaluates to exactly zero after radiometric scaling. Scale Sentinel-2 integers by 0.0001 before the validity check; then ensure your valid mask requires both nir > 0 and red > 0 so the denominator is always strictly positive before division. Avoid np.divide with where=valid — it still evaluates the division for all pixels, producing NaN in masked positions that can propagate if the mask is later dropped.

Should I compute NDVI before or after spatial reprojection?

Compute vegetation indices before reprojecting. Reprojection resamples raw pixel values through bilinear or cubic interpolation, which mixes adjacent-pixel spectral values and corrupts the tight ratios that NDVI and EVI depend on. Mismatched CRS between your index rasters and your training geometries is better resolved with CRS Alignment and Projection Handling applied to the single-band index outputs, not to the multi-band source.

When should I prefer EVI over NDVI as a machine learning feature?

Prefer EVI in training datasets covering high-biomass areas — dense forests, tropical crops, or multi-layer canopies — where NDVI saturates near 0.8 and loses discriminative power. EVI’s atmospheric correction and soil-background coefficient also make index values more consistent across sensor generations, reducing covariate shift when combining Landsat and Sentinel-2 scenes in the same training set.



Part of: Raster Band Math and Index Calculation · Spatial Feature Engineering for Machine Learning