Computing Local Moran's I for Feature Engineering

Q: Can I use Local Moran's I with point data rather than polygons?

Yes. Use KNN weights (`libpysal.weights.KNN.from_dataframe(gdf, k=5)`) instead of queen contiguity. KNN weights are topology-independent and work directly with point GeoDataFrames, making them the correct choice whenever you have sensor networks, GPS traces, or geolocated records rather than areal units.

Q: How do I handle non-significant LISA values before feeding them into a model?

Mask `local_moran_i` and `z_sim` to NaN where `p_sim > 0.05`, then impute with zero or the column mean depending on your model's sensitivity to zeros. Non-significant observations provide no autocorrelation signal and will degrade gradient-based optimisers if included raw.

Use esda.Moran_Local from PySAL on a row-standardized weights matrix built with libpysal. The function returns four per-observation outputs — Is, p_sim, z_sim, and q — that you append directly to your feature matrix after masking non-significant rows. This technique lives inside Spatial Lag and Neighborhood Statistics, which covers the full range of neighborhood-encoding strategies for production ML pipelines.

Why Local Autocorrelation Features Fail Silently in ML Pipelines

Global spatial statistics summarize an entire dataset with a single number. That number can appear healthy while masking severe local pockets of spatial structure — hotspots, coldspots, and spatial outliers — that carry strong predictive signal in real-world geospatial data. Feeding raw attribute values into a gradient boosting model without encoding this local structure leaves the model blind to neighborhood context, forcing it to re-learn spatial patterns from coordinate proxies that generalize poorly across regions.

The failure mode is subtle. A property value model trained on parcel-level attributes may achieve reasonable holdout RMSE on uniformly sampled splits while consistently over-predicting in gentrifying corridors (high surrounded by low, HL) and under-predicting in established high-value districts (high surrounded by high, HH). Without spatial cross-validation strategies and explicit LISA features, this spatially structured residual never surfaces in standard evaluation metrics.

There is also a data engineering trap: libpysal weights objects do not support boolean array indexing. If your GeoDataFrame contains NaN values in the target column and you attempt to build a weights matrix on the full frame, then subset the LISA results with a boolean mask, the index alignment breaks silently. Rows shift, producing features that are misaligned with labels — a catastrophic and invisible label leak.

Core LISA Principles for Feature Engineering

Build weights on clean subsets only. Filter to non-NaN rows, build the weights matrix, compute LISA, then re-align to the original index using .loc. Never filter after building weights.
Row-standardize immediately. Set w.transform = "r" before calling Moran_Local. This converts raw neighbor counts to proportional weights so lag values are interpretable local averages regardless of neighborhood size.
Mask by significance before modelling. Non-significant observations (p_sim > 0.05) carry no autocorrelation signal. Including them as raw LISA values injects noise. Mask to NaN or zero post-computation.
Serialize the weights matrix, not just the model. The neighbor graph is a function of training data topology. Rebuilding it at inference produces different neighbor assignments and breaks feature comparability.
Check for islands before computing. Isolated geometries with zero neighbors produce degenerate lag values. Identify them with w.islands and either impute neighbors or flag those rows as uncomputable.
Account for edge effects. Boundary observations have fewer neighbors than interior ones, which deflates their local statistics. Apply toroidal correction or buffer your study area and clip results after computation.

LISA Cluster Taxonomy

The diagram below shows the four spatial regimes that Moran_Local assigns via the q attribute, mapped against the Moran scatter plot axes. Understanding the quadrant geometry clarifies what every quadrant code means in feature space.

Production-Ready Code

The function below computes all four LISA outputs, handles NaN values correctly (filtering before building weights, re-aligning after), and returns a DataFrame that can be merged directly onto the original GeoDataFrame’s index.

import geopandas as gpd
import libpysal
import esda
import numpy as np
import pandas as pd
import logging

logger = logging.getLogger(__name__)


def compute_local_moran_features(
    gdf: gpd.GeoDataFrame,
    value_col: str,
    w_type: str = "queen",
    k_neighbors: int = 4,
    permutations: int = 999,
    significance_threshold: float = 0.05,
    random_state: int = 42,
) -> pd.DataFrame:
    """
    Compute Local Moran's I features aligned to the original GeoDataFrame index.

    Parameters
    ----------
    gdf : GeoDataFrame with valid geometries and a CRS set.
    value_col : Column to analyse for spatial autocorrelation.
    w_type : 'queen' for polygon contiguity; 'knn' for point or irregular data.
    k_neighbors : Neighbour count when w_type='knn'.
    permutations : Conditional randomisation count for pseudo p-values.
    significance_threshold : Rows with p_sim above this are masked to NaN/NS.
    random_state : Seed for reproducible permutation tests.

    Returns
    -------
    DataFrame indexed like gdf with columns:
        local_moran_i  — raw I statistic (NaN where p_sim > threshold or input NaN)
        p_sim          — pseudo p-value from conditional permutations
        z_sim          — standardised deviation under spatial randomness null
        cluster_type   — 'HH', 'LH', 'LL', 'HL', or 'NS' (not significant)
    """
    # Initialise output aligned to the full index so callers can merge cleanly
    results = pd.DataFrame(
        {"local_moran_i": np.nan, "p_sim": np.nan, "z_sim": np.nan, "cluster_type": "NS"},
        index=gdf.index,
    )

    # --- Step 1: filter to valid rows BEFORE building weights ---
    # libpysal W objects index positionally; boolean masking after construction
    # silently misaligns rows. Always rebuild weights on the clean subset.
    y = gdf[value_col].to_numpy(dtype=float)
    valid_mask = ~np.isnan(y)
    n_valid = valid_mask.sum()

    if n_valid < 4:
        logger.warning("Too few valid observations (%d) for LISA; returning empty.", n_valid)
        return results

    gdf_clean = gdf.loc[valid_mask].copy()
    y_clean = y[valid_mask]
    logger.info("Computing LISA on %d / %d observations.", n_valid, len(gdf))

    # --- Step 2: build spatial weights on the clean subset ---
    if w_type == "queen":
        w = libpysal.weights.Queen.from_dataframe(gdf_clean)
    elif w_type == "knn":
        w = libpysal.weights.KNN.from_dataframe(gdf_clean, k=k_neighbors)
    else:
        raise ValueError(f"Unsupported w_type '{w_type}'. Choose 'queen' or 'knn'.")

    # Row-standardize: each row sums to 1.0 → lag values are local averages
    w.transform = "r"

    if w.islands:
        logger.warning("%d island(s) detected — their LISA will be unreliable.", len(w.islands))

    # --- Step 3: compute Local Moran's I ---
    lm = esda.Moran_Local(
        y_clean, w,
        permutations=permutations,
        n_jobs=-1,
        random_state=random_state,
    )

    # --- Step 4: map quadrant codes to readable labels ---
    # q codes: 1=HH, 2=LH, 3=LL, 4=HL; 0 means non-significant
    code_to_label = {1: "HH", 2: "LH", 3: "LL", 4: "HL", 0: "NS"}
    raw_clusters = pd.Series(lm.q, index=gdf_clean.index).map(code_to_label).fillna("NS")

    # --- Step 5: apply significance mask ---
    insignificant = lm.p_sim > significance_threshold
    masked_i = np.where(insignificant, np.nan, lm.Is)
    masked_z = np.where(insignificant, np.nan, lm.z_sim)
    masked_clusters = raw_clusters.copy()
    masked_clusters.loc[gdf_clean.index[insignificant]] = "NS"

    # --- Step 6: write back to the full-index result frame ---
    results.loc[gdf_clean.index, "local_moran_i"] = masked_i
    results.loc[gdf_clean.index, "p_sim"] = lm.p_sim
    results.loc[gdf_clean.index, "z_sim"] = masked_z
    results.loc[gdf_clean.index, "cluster_type"] = masked_clusters.values

    return results

Step-by-Step Walkthrough

The walkthrough uses a synthetic parcel GeoDataFrame with a property_value column and queen contiguity weights.

Step 1: Prepare the GeoDataFrame

Your input must have a projected CRS and valid polygon geometries. CRS mismatches before weight construction are a leading cause of incorrect neighborhood assignments; apply CRS alignment and projection handling before computing any spatial statistics.

import geopandas as gpd
from shapely.validation import make_valid

gdf = gpd.read_file("parcels.gpkg")
# Validate geometries before any spatial weight construction
gdf["geometry"] = gdf["geometry"].apply(lambda g: make_valid(g) if g is not None else None)
gdf = gdf.dropna(subset=["geometry"])
gdf = gdf.to_crs("EPSG:32633")  # Projected CRS, units in metres

Step 2: Run the LISA computation

lisa_features = compute_local_moran_features(
    gdf,
    value_col="property_value",
    w_type="queen",
    permutations=999,
    significance_threshold=0.05,
    random_state=42,
)

The returned DataFrame shares gdf’s index. Rows with NaN property_value inputs carry NaN in the LISA columns and "NS" in cluster_type.

Step 3: Merge features and encode cluster labels

gdf = gdf.join(lisa_features)

# One-hot encode cluster_type for tree ensembles
cluster_dummies = pd.get_dummies(gdf["cluster_type"], prefix="cluster")
gdf = pd.concat([gdf, cluster_dummies], axis=1)

For linear models or neural architectures, consider target encoding cluster_type against the response variable rather than one-hot encoding, to avoid introducing five collinear binary columns.

Step 4: Check multicollinearity before training

LISA features correlate with the spatial lag computed in the parent Spatial Lag and Neighborhood Statistics pipeline. Compute variance inflation factors before combining them:

from statsmodels.stats.outliers_influence import variance_inflation_factor

feature_cols = ["property_value", "lag_property_value", "local_moran_i", "z_sim"]
X_check = gdf[feature_cols].dropna().values
vif_scores = [variance_inflation_factor(X_check, i) for i in range(X_check.shape[1])]
for col, vif in zip(feature_cols, vif_scores):
    print(f"{col}: VIF={vif:.2f}")

Drop or regularize columns with VIF above 10. z_sim and local_moran_i often exhibit high collinearity; prefer z_sim for continuous architectures (better-scaled) and drop local_moran_i to reduce redundancy.

Step 5: Scale before training

Spatial statistics frequently exhibit heavy tails. Apply a PowerTransformer to the numeric LISA columns before training, especially for deep learning pipelines where learning rate schedules are sensitive to outlier gradients. Coordinate this step with your broader feature scaling for geospatial inputs strategy to ensure consistent treatment across all numeric feature groups.

from sklearn.preprocessing import PowerTransformer
import numpy as np

pt = PowerTransformer(method="yeo-johnson")
lisa_numeric = ["local_moran_i", "z_sim"]
valid_rows = gdf[lisa_numeric].notna().all(axis=1)
gdf.loc[valid_rows, lisa_numeric] = pt.fit_transform(gdf.loc[valid_rows, lisa_numeric])

Verification

After merging LISA features, run these three checks before proceeding to model training:

import numpy as np

# 1. Index alignment: no rows should be missing from the merge
assert len(gdf) == len(lisa_features), "Row count mismatch after merge"

# 2. Significance consistency: where p_sim > 0.05, local_moran_i must be NaN
insig = gdf["p_sim"] > 0.05
assert gdf.loc[insig, "local_moran_i"].isna().all(), \
    "Non-significant rows contain non-NaN Moran values"

# 3. Cluster distribution: HH+LL should dominate in spatially autocorrelated data
dist = gdf["cluster_type"].value_counts(normalize=True)
print(dist)
# Expect HH and LL to account for the majority of significant observations
# A near-uniform distribution across all four types suggests weak autocorrelation
# or an incorrect weights specification

# 4. Island check: islands produce unreliable LISA values
import libpysal
w_check = libpysal.weights.Queen.from_dataframe(gdf.dropna(subset=["property_value"]))
if w_check.islands:
    print(f"Warning: {len(w_check.islands)} island(s) detected")

For a visual sanity check, plot local_moran_i against the spatial lag of property_value — the relationship should be approximately linear with slope equal to the global Moran’s I. Large deviations indicate either data quality issues or a poorly specified weights matrix.

FAQ

Can I use Local Moran’s I with point data rather than polygons?

Yes. Use KNN weights (libpysal.weights.KNN.from_dataframe(gdf, k=5)) instead of queen contiguity. KNN weights are topology-independent and work directly with point GeoDataFrames, making them the correct choice whenever you have sensor networks, GPS traces, or geolocated records rather than areal units.

How do I handle non-significant LISA values before feeding them into a model?

Mask local_moran_i and z_sim to NaN where p_sim > 0.05, then impute with zero or the column mean depending on your model’s sensitivity to zeros. Non-significant observations provide no autocorrelation signal and will degrade gradient-based optimisers if included raw.

Why does recomputing weights at inference time break reproducibility?

Spatial weights objects encode the exact neighbor graph from training data. Recomputing them at inference on a different observation set changes neighbor assignments, producing lag values that are not comparable to training features. Serialize the weights matrix with joblib after fitting and load it deterministically at inference. Map new observations to existing neighbors via a spatial index rather than rebuilding the full graph.

Related

Spatial Lag and Neighborhood Statistics — parent guide covering the full range of neighborhood feature strategies
Creating Distance Matrices for Spatial Features — complementary proximity features that combine well with LISA outputs
Reducing Spatial Leakage in Model Training — preventing autocorrelation from inflating evaluation metrics
Feature Scaling for Geospatial Inputs — scaling strategies for the heavy-tailed distributions that LISA produces

Part of: Spatial Lag and Neighborhood Statistics · Spatial Feature Engineering for Machine Learning