Computing Local Moran’s I for feature engineering transforms raw geospatial observations into spatially contextualized predictors by quantifying local spatial autocorrelation. Unlike global metrics that summarize an entire dataset, Local Indicators of Spatial Association (LISA) evaluate how each observation relates to its immediate neighbors. This produces statistically significant hotspot (HH), coldspot (LL), and spatial outlier (HL/LH) classifications that serve as high-signal inputs for tree-based models, neural networks, and spatial regression pipelines. When integrated into a broader Spatial Feature Engineering for Machine Learning workflow, these metrics replace heuristic distance counts with mathematically rigorous neighborhood context, improving model generalization and enabling automated spatial drift detection in production.
Core Mechanics & Statistical Output
Local Moran’s I measures the covariance between a target value and the spatially lagged average of its neighbors. The estimator returns four critical outputs per observation:
- Local I statistic (
Is): Direction and magnitude of local autocorrelation. Positive values indicate clustering (similar values nearby); negative values indicate dispersion (dissimilar values nearby). - Pseudo p-value (
p_sim): Significance derived from conditional permutation tests. Standard practice uses 999 permutations to approximate the null distribution of spatial randomness. - Z-score (
z_sim): Standardized deviation from the expected mean under the null hypothesis. Useful for continuous scaling in neural networks. - Cluster classification (
q): Categorical label mapping to spatial regimes:1=HH(high surrounded by high),2=LH(low surrounded by high),3=LL(low surrounded by low),4=HL(high surrounded by low).
These outputs directly feed into Spatial Lag and Neighborhood Statistics pipelines, where they act as engineered features that capture localized spatial dependence without requiring manual spatial joins or window aggregations.
Production-Ready Implementation
The following pipeline computes Local Moran’s I using libpysal and esda, returning a clean DataFrame aligned to the original index. It handles missing values, row-standardizes weights for stable ML scaling, and maps cluster codes to human-readable labels.
import geopandas as gpd
import libpysal
import esda
import numpy as np
import pandas as pd
def compute_local_moran_features(
gdf: gpd.GeoDataFrame,
value_col: str,
w_type: str = "queen",
k_neighbors: int = 4,
permutations: int = 999,
random_state: int = 42
) -> pd.DataFrame:
"""
Computes Local Moran's I features for ML pipelines.
Returns DataFrame with I, p-value, z-score, and cluster type.
"""
# 1. Build spatial weights
if w_type == "queen":
w = libpysal.weights.Queen.from_dataframe(gdf)
elif w_type == "knn":
w = libpysal.weights.KNN.from_dataframe(gdf, k=k_neighbors)
else:
raise ValueError("Unsupported weights type. Use 'queen' or 'knn'.")
# Row-standardize for stable ML scaling
w.transform = "r"
# 2. Extract values and handle missing data
y = gdf[value_col].values.astype(float)
mask = ~np.isnan(y)
y_clean = y[mask]
# Subset weights matrix to valid observations
w_sub = w[mask, mask]
# 3. Compute Local Moran's I (ref: https://pysal.org/esda/stable/source/generated/esda.Moran_Local.html)
lm = esda.Moran_Local(
y_clean, w_sub, permutations=permutations,
n_jobs=-1, random_state=random_state
)
# 4. Map cluster codes to labels (1=HH, 2=LH, 3=LL, 4=HL)
cluster_map = {1: "HH", 2: "LH", 3: "LL", 4: "HL", 0: "NS"}
clusters = pd.Series(lm.q).map(cluster_map).fillna("NS")
# 5. Reconstruct full DataFrame aligned to original index
results = pd.DataFrame({
"local_moran_i": np.nan,
"p_sim": np.nan,
"z_sim": np.nan,
"cluster_type": "NS"
}, index=gdf.index)
valid_idx = gdf.index[mask]
results.loc[valid_idx, "local_moran_i"] = lm.Is
results.loc[valid_idx, "p_sim"] = lm.p_sim
results.loc[valid_idx, "z_sim"] = lm.z_sim
results.loc[valid_idx, "cluster_type"] = clusters.values
return resultsStatistical Validation & Feature Encoding
Raw LISA outputs require careful preprocessing before ingestion into predictive models:
- Significance filtering: Mask
local_moran_iandz_simwherep_sim > 0.05(or your domain threshold). Non-significant observations introduce noise that degrades gradient-based optimization. - Target encoding for clusters: Convert
cluster_typeto numeric representations. For tree ensembles, one-hot encoding preserves categorical boundaries. For linear or neural architectures, use target encoding or ordinal mapping (HH=1, LH=2, LL=3, HL=4, NS=0) with explicit handling of the non-significant class. - Multicollinearity checks: Local Moran’s I correlates strongly with spatial lag and distance-weighted averages. Compute VIF scores post-merge; drop or regularize redundant features to prevent coefficient instability.
- Distribution alignment: Apply
PowerTransformerorQuantileTransformertolocal_moran_iandz_simbefore feeding them into deep learning pipelines. Spatial statistics often exhibit heavy tails that disrupt learning rate schedules.
MLOps Integration & Inference Architecture
Deploying spatial autocorrelation features requires deterministic neighbor resolution across training and inference:
- Weight matrix serialization: Cache the spatial weights object (
w) usingjobliborpickle. Recomputing contiguity or KNN graphs at scale introduces unacceptable latency. - Inference neighbor resolution: For new observations, map them to existing neighbors using a spatial index (
scipy.spatial.cKDTreeorgeopandas.sjoin_nearest). Apply the cached weights structure to maintain consistent lag calculations. - Drift monitoring: Track the proportion of significant clusters (
HH/LL) and the medianp_simacross inference batches. A systemic shift toward non-significant (NS) classifications often signals spatial regime changes, sensor degradation, or sampling bias. - Pipeline orchestration: Wrap the LISA computation in a scikit-learn-compatible transformer. Implement
fit()to build and serialize weights, andtransform()to apply the precomputed matrix to incoming data. This guarantees reproducibility across CI/CD deployments.
Scaling, Optimization & Pitfalls
- Weights selection: Queen contiguity excels for areal data (census tracts, parcels). KNN adapts to irregular point distributions. Avoid fixed distance-band weights unless your domain enforces a strict search radius.
- Boundary effects: Edge observations have fewer neighbors, which can artificially deflate local statistics. Apply toroidal correction, buffer boundaries, or explicitly model edge effects in spatial regression.
- Permutation trade-offs: 999 permutations balance statistical rigor and compute time. For real-time inference, precompute significance thresholds or switch to analytical approximations where the spatial process meets stationarity assumptions.
- Memory constraints: Large GeoDataFrames trigger OOM errors during weight matrix construction. Chunk spatially contiguous blocks, compute LISA independently, and concatenate results. Alternatively, use sparse matrix representations (
libpysal.weights.W.sparse) to reduce footprint. - Version compatibility: Recent
libpysalreleases streamline geometry handling. If you encounter deprecation warnings, passgdf.geometryexplicitly or upgrade to the latest PySAL release. Consult the libpysal spatial weights documentation for migration guidance.
Conclusion
Computing Local Moran’s I for feature engineering bridges spatial statistics and machine learning by converting neighborhood relationships into structured, model-ready signals. When deployed alongside robust MLOps practices, these features improve predictive accuracy, enable automated spatial drift detection, and eliminate fragile manual joins. Validate significance thresholds, serialize neighbor structures, and monitor cluster distributions in production to maintain model reliability across heterogeneous landscapes.