TL;DR: Use xarray.open_mfdataset with chunks={"time": 15} to load a daily satellite stack lazily, apply a bitwise cloud mask with .where(), then call .resample(time="1MS").mean() (plus .median(), .std(), and .quantile()) to produce calendar-aligned monthly statistics. Track valid observation counts and mask months that fall below your minimum threshold before writing to Zarr. Part of Temporal Aggregation for Time-Series Geodata.
Why This Fails in Geospatial ML Pipelines
Daily satellite archives contain a high fraction of unusable observations. For Sentinel-2 at mid-latitudes, cloud cover renders 40–70 % of acquisitions partially or fully invalid depending on the season. If you resample raw reflectance without first removing cloudy pixels, the monthly mean is dominated by artificially bright or dark cloud-contaminated values. The failure is silent: xarray produces a numerical result, the training pipeline proceeds, and the model learns a spurious cloud-cover signal as if it were a land-cover feature.
The second failure mode is structural: aggregating across a month boundary without calendar alignment. A naive 30-day rolling window shifts the aggregation window differently every month — February receives 28 days of data while July receives 30. Downstream models trained on rolling-window features and evaluated on calendar-month targets show systematic residual patterns that track month length rather than the underlying process.
A third, often overlooked failure is partial-coverage poisoning. Months at the start or end of a data archive may have only 1–2 valid observations. A mean of two pixels is not statistically equivalent to a mean of 20, yet both are labelled with the same monthly timestamp. Without an explicit valid_obs_count variable, a gradient-boosted model trained on these features will treat sparse and dense months as interchangeable, degrading accuracy on the sparse months at inference time. This directly affects the reliability of raster band math features such as NDVI and EVI, which are highly sensitive to cloud-contaminated reflectance.
Core Principles for Reliable Monthly Aggregation
- Mask before you reduce. Apply cloud and shadow flags bitwise against the QA band before any
.resample()call. Never apply the mask after aggregation — averaging first and masking second produces biased statistics because cloudy pixels contribute to the mean. - Use calendar-aligned windows. The
1MS(month start) frequency in xarray handles leap years, variable month lengths, and timezone edge cases deterministically. Avoid rolling windows for calendar features. - Track observation counts as a first-class variable. Compute
valid_obs = valid_mask.resample(time="1MS").sum()and store it alongside every statistical feature. Downstream quality gates, drift monitors, and imputation strategies all depend on knowing how many valid observations backed each estimate. - Compute multiple statistics in the same pass. Mean, median, percentiles, and standard deviation each tell a different story about phenological dynamics. Computing them together avoids re-reading the same chunked data multiple times.
- Set a minimum observation threshold and expose it. Hard-code nothing. Pass
min_valid_obsas a parameter so different downstream tasks (dense crop mapping vs. sparse change detection) can apply appropriate thresholds without re-running the aggregation. - Export to consolidated Zarr. Write the feature cube with
consolidated=Trueso distributed training workers can read metadata from a single JSON file rather than scanning thousands of chunk files.
Pipeline Architecture
The diagram below shows how daily files flow through masking, resampling, and quality gating before being written to a cloud-native feature store.
Production-Ready Code
The function below handles a full year of daily Sentinel-2 NetCDF or Zarr files. It validates inputs, applies a bitwise QA cloud mask, computes four monthly statistics plus observation counts, applies a quality threshold, and exports a consolidated Zarr feature cube.
import logging
from pathlib import Path
import numpy as np
import xarray as xr
logger = logging.getLogger(__name__)
def aggregate_daily_to_monthly(
daily_dir: Path,
output_path: Path,
cloud_mask_band: str = "QA60",
feature_bands: list[str] | None = None,
cloud_bit: int = 10,
min_valid_obs: int = 3,
spatial_chunks: int = 1024,
) -> xr.Dataset:
"""Aggregate daily satellite reflectance to monthly statistical features.
Args:
daily_dir: Directory containing daily NetCDF or Zarr files.
output_path: Destination path for the consolidated Zarr output.
cloud_mask_band: Name of the QA integer band (e.g. 'QA60' for Sentinel-2).
feature_bands: Reflectance bands to aggregate. Defaults to Sentinel-2 VNIR.
cloud_bit: Bit position of the cloud flag in cloud_mask_band.
min_valid_obs: Minimum valid observations required to keep a monthly pixel.
spatial_chunks: Chunk size along y and x dimensions.
Returns:
xr.Dataset with monthly statistics and valid_obs_count.
Raises:
FileNotFoundError: If daily_dir does not exist or contains no matching files.
ValueError: If cloud_mask_band or any feature_band is absent from the dataset.
"""
if feature_bands is None:
feature_bands = ["B02", "B03", "B04", "B08"]
# --- Input validation ---
if not daily_dir.exists():
raise FileNotFoundError(f"Input directory not found: {daily_dir}")
daily_files = sorted(daily_dir.glob("*.nc")) + sorted(daily_dir.glob("*.zarr"))
if not daily_files:
raise FileNotFoundError(f"No .nc or .zarr files found in {daily_dir}")
logger.info("Found %d daily files in %s", len(daily_files), daily_dir)
# --- Stage 1: Lazy load with explicit chunking ---
# Keep time chunks small (15 days) to fit resampling windows in worker RAM.
# Maximise spatial chunks to align with satellite tile boundaries.
ds = xr.open_mfdataset(
[str(f) for f in daily_files],
combine="by_coords",
chunks={"time": 15, "y": spatial_chunks, "x": spatial_chunks},
engine="netcdf4" if daily_files[0].suffix == ".nc" else "zarr",
parallel=True,
)
missing = [b for b in [cloud_mask_band] + feature_bands if b not in ds]
if missing:
raise ValueError(f"Bands not found in dataset: {missing}")
# --- Stage 2: Bitwise cloud masking ---
# Isolate the cloud flag bit, then invert to get a True=valid boolean mask.
# Cast to bool explicitly; the QA band is typically uint16 and .where() needs bool.
cloud_flag = (ds[cloud_mask_band] & (1 << cloud_bit)) != 0
valid_mask = (~cloud_flag).astype(bool)
# Apply mask: invalid (cloudy) pixels become NaN.
valid_data = ds[feature_bands].where(valid_mask)
# --- Stage 3: Calendar-aligned monthly resampling ---
# "1MS" = Month Start; handles leap years and variable month lengths.
# Compute statistics in separate passes so dask builds independent task graphs.
logger.info("Computing monthly statistics for bands: %s", feature_bands)
monthly_mean = valid_data.resample(time="1MS").mean(dim="time")
monthly_median = valid_data.resample(time="1MS").median(dim="time")
monthly_p90 = valid_data.resample(time="1MS").quantile(q=0.9, dim="time")
monthly_std = valid_data.resample(time="1MS").std(dim="time")
# --- Stage 4: Observation count quality gate ---
valid_obs = valid_mask.resample(time="1MS").sum(dim="time").rename("valid_obs_count")
quality_mask = valid_obs >= min_valid_obs
logger.info("Masking months with fewer than %d valid observations", min_valid_obs)
# Merge all statistics into a single dataset and apply the quality mask.
monthly_features = xr.merge([
monthly_mean.rename({b: f"{b}_mean" for b in feature_bands}),
monthly_median.rename({b: f"{b}_median" for b in feature_bands}),
monthly_p90.rename({b: f"{b}_p90" for b in feature_bands}),
monthly_std.rename({b: f"{b}_std" for b in feature_bands}),
valid_obs,
])
monthly_features = monthly_features.where(quality_mask)
# Attach provenance attributes so the feature store knows the origin.
monthly_features.attrs.update({
"source_dir": str(daily_dir),
"cloud_mask_band": cloud_mask_band,
"cloud_bit": cloud_bit,
"min_valid_obs": min_valid_obs,
"feature_bands": feature_bands,
})
# --- Stage 5: Export consolidated Zarr ---
# consolidated=True writes a single .zmetadata file — critical for
# distributed training workers that read metadata before opening chunks.
logger.info("Writing monthly feature cube to %s", output_path)
monthly_features.to_zarr(output_path, mode="w", consolidated=True)
return monthly_featuresStep-by-Step Walkthrough
Step 1 — Prepare the daily file directory
Organise daily NetCDF outputs with a consistent naming convention so open_mfdataset can sort them chronologically. The combine="by_coords" strategy reads the time coordinate from each file’s metadata, so the physical filename order does not need to be perfect as long as timestamps are encoded in the files.
from pathlib import Path
daily_dir = Path("/data/sentinel2/2023/daily")
output_path = Path("/data/sentinel2/2023/monthly_features.zarr")
# Sanity check: confirm daily files contain expected bands
import xarray as xr
probe = xr.open_dataset(str(sorted(daily_dir.glob("*.nc"))[0]))
print(probe.data_vars) # Confirm QA60, B02, B03, B04, B08 are present
probe.close()Step 2 — Run the aggregation
Call aggregate_daily_to_monthly with your site-specific parameters. The function is lazy until .to_zarr() triggers computation, so the call returns quickly with the task graph built.
from aggregate import aggregate_daily_to_monthly
ds_monthly = aggregate_daily_to_monthly(
daily_dir=daily_dir,
output_path=output_path,
cloud_mask_band="QA60",
feature_bands=["B02", "B03", "B04", "B08"],
cloud_bit=10,
min_valid_obs=3,
)Step 3 — Inspect observation counts
Before handing the feature cube to a model, verify the valid_obs_count distribution. Months with pervasive cloud cover in tropical regions can fall entirely below threshold and appear as full-NaN slices.
import numpy as np
obs = ds_monthly["valid_obs_count"]
print("Monthly valid obs stats:")
print(f" min: {float(obs.min()):.0f}")
print(f" median: {float(obs.median()):.0f}")
print(f" max: {float(obs.max()):.0f}")
# Flag months where more than 20 % of pixels are below threshold
nan_frac = np.isnan(ds_monthly["B04_mean"]).mean(dim=["y", "x"])
sparse_months = nan_frac.where(nan_frac > 0.2, drop=True)
print(f"\nMonths with >20 % NaN pixels: {len(sparse_months)}")
print(sparse_months["time"].values)Step 4 — Validate spatial consistency
Confirm the aggregated cube retains the same spatial extent and CRS as the daily inputs. Incorrect CRS alignment introduced at the open or export stage will silently corrupt downstream spatial joins.
original = xr.open_dataset(str(sorted(daily_dir.glob("*.nc"))[0]))
aggregated = xr.open_zarr(output_path)
assert original["x"].shape == aggregated["x"].shape, "x dimension mismatch"
assert original["y"].shape == aggregated["y"].shape, "y dimension mismatch"
assert float(original["x"].min()) == float(aggregated["x"].min()), "x origin shifted"
print("Spatial consistency: PASS")Step 5 — Register in the feature store
Once the Zarr cube passes validation, register it in your feature store with the provenance metadata. The valid_obs_count variable should be indexed separately so quality-aware queries can filter feature vectors without loading reflectance bands.
# Example: log provenance to MLflow or a custom feature registry
import mlflow
with mlflow.start_run(run_name="monthly_s2_features_2023"):
mlflow.log_param("min_valid_obs", 3)
mlflow.log_param("cloud_bit", 10)
mlflow.log_param("feature_bands", ["B02", "B03", "B04", "B08"])
mlflow.log_artifact(str(output_path / ".zmetadata"), artifact_path="zarr_meta")
mlflow.log_metric("total_months", int(aggregated.sizes["time"]))
mlflow.log_metric("sparse_month_count", len(sparse_months))Verification
After running the pipeline, confirm correctness with three checks:
import numpy as np
import xarray as xr
ds = xr.open_zarr("/data/sentinel2/2023/monthly_features.zarr", consolidated=True)
# 1. All time labels are month-start dates (day == 1)
assert all(t.item().day == 1 for t in ds["time"]), "Non-month-start timestamps found"
# 2. No pixel has a valid_obs_count below the threshold (only NaN or >= threshold)
obs = ds["valid_obs_count"].values
valid_obs = obs[~np.isnan(obs)]
assert (valid_obs >= 3).all(), "Pixels below min_valid_obs threshold survived masking"
# 3. Feature band values are within physically plausible reflectance range (0–1 for SR)
b04_mean = ds["B04_mean"].values
b04_valid = b04_mean[~np.isnan(b04_mean)]
assert b04_valid.min() >= 0.0, "Negative reflectance values found"
assert b04_valid.max() <= 1.0, "Reflectance values exceed 1.0 — check scale factor"
print("All verification checks passed.")If check 3 fails with values in the range 0–10 000, your daily files store reflectance as integer DN values scaled by 10 000. Divide by 10_000.0 during or after loading, before the cloud mask step, to normalise to [0, 1]. This is a common issue when mixing raster band math steps that assume normalised reflectance with raw Sentinel-2 Level-2A downloads.
FAQ
Why does xarray resample produce NaN months even with valid daily data?
The most common cause is a dtype mismatch between the cloud mask and the feature bands. If valid_mask is uint16 rather than bool, the .where() call interprets non-zero integers as True for some pixels and silently passes cloudy observations through, which corrupts the monthly mean. Always cast explicitly: valid_mask = (~cloud_flag).astype(bool). A secondary cause is incomplete time coordinate coverage: if daily files use non-UTC timestamps and open_mfdataset cannot align them, resample produces empty bins without raising an error.
Should I use 1MS (month start) or 1ME (month end) for resampling?
Use 1MS for feature engineering. Month-start labels place the aggregated value at the first day of the window, which aligns naturally with forecasting targets (predictions reference a start date). Month-end labels shift timestamps by up to 30 days and introduce subtle temporal leakage when monthly features are joined to event datasets indexed by month-start. The mismatch is especially damaging when features feed into spatial cross-validation folds that split on calendar months.
How many valid observations per month are sufficient?
Three is a widely used floor for single-band statistics (mean, median). For percentile features such as p90 or p10, use at least 5–8 observations because extreme quantiles are statistically unreliable with very small samples. Expose min_valid_obs as a configurable parameter rather than a hard-coded constant, and store it in the Zarr attributes and your feature store metadata so downstream users can apply tighter filters without re-running the aggregation.
Related
- Temporal Aggregation for Time-Series Geodata — parent topic covering weekly, seasonal, and multi-year aggregation strategies
- Raster Band Math and Index Calculation — computing NDVI, EVI, and custom indices before or after monthly aggregation
- How to Calculate NDVI and EVI with Rasterio — band-ratio preprocessing that pairs with monthly cloud-masked composites
- Feature Scaling for Geospatial Inputs — normalising monthly reflectance statistics before model training
- Spatial Cross-Validation Strategies — how to structure train/test splits across monthly feature cubes without temporal leakage
Part of: Temporal Aggregation for Time-Series Geodata · Spatial Feature Engineering for Machine Learning