Spatial Lag and Neighborhood Statistics

Compute spatial lag and neighborhood statistics with PySAL: encode local spatial autocorrelation as high-signal features for geospatial machine learning models.

Spatial dependence violates the i.i.d. assumption that underpins most classical machine learning algorithms. When observations are geographically proximate their attributes tend to correlate — a principle formalized as Tobler’s First Law of Geography. Capturing this structure requires spatial lag and neighborhood statistics: transformations that convert raw coordinates and attributes into context-aware features encoding local spatial relationships. Within a Spatial Feature Engineering for Machine Learning pipeline these statistics serve as critical inputs for gradient boosting, random forests, and neural architectures, enabling them to learn spatially structured patterns without naive coordinate injection.

Computing reliable spatial lags in production is non-trivial. Neighborhood topology depends on geometry validity and a consistent CRS alignment and projection handling baseline; weights matrices must be row-standardized and stored as sparse structures to remain tractable on large datasets; and the resulting lag columns need drift monitoring at inference time because any geometry update silently invalidates the cached topology.


Spatial Lag Computation Workflow Five-stage pipeline diagram showing how raw geometries are transformed into spatial lag features: topology validation, CRS standardization, spatial weights construction, row standardization, and lag computation via matrix-vector multiplication. Topology Validation CRS Standardisation Weights Matrix W (sparse) Row Standardisation Lag Features W · y make_valid() EPSG:32633 KNN / Queen / Distance w.transform = "r" ML feature columns Spatial Lag Computation Pipeline Deterministic, reproducible, sparse-matrix-safe

Prerequisites and Environment Setup

Spatial lag calculations require validated geometries, a projected CRS, and sparse-aware numerical libraries. Pin these exact versions for reproducible CI/CD environments:

geopandas>=0.14
libpysal>=4.9
scipy>=1.11
numpy>=1.24
pandas>=2.0
shapely>=2.0
scikit-learn>=1.3
joblib>=1.3

Install with:

pip install geopandas libpysal scipy numpy pandas shapely scikit-learn joblib

System dependencies required before the pip install:

# Ubuntu / Debian
sudo apt-get install -y gdal-bin libgdal-dev libspatialindex-dev proj-bin

# macOS (Homebrew)
brew install gdal proj spatialindex

Confirm the installation:

import geopandas, libpysal, scipy, numpy
print(geopandas.__version__, libpysal.__version__, scipy.__version__)
# Expected: 0.14.x  4.9.x  1.11.x (or later)

Step-by-Step Implementation

Step 1: Topology Validation and CRS Standardization

Spatial weights break silently when geometries self-intersect, overlap, or live in mismatched projections. Always validate and reproject before constructing any neighborhood structure. Applying CRS alignment and projection handling at the very first step prevents distance distortions from propagating through every downstream lag feature.

import geopandas as gpd
from shapely.validation import make_valid

gdf = gpd.read_file("input_parcels.gpkg")

# Repair any self-intersecting or degenerate geometries
gdf["geometry"] = gdf["geometry"].apply(
    lambda geom: make_valid(geom) if geom is not None else None
)
gdf = gdf.dropna(subset=["geometry"])

# Reproject to a metric CRS — real-world metres for distance-based weights
gdf = gdf.to_crs("EPSG:32633")

# Inline validation
assert gdf.crs.is_projected, "CRS must be projected for distance-based weights"
assert gdf["geometry"].is_valid.all(), "Geometries still invalid after make_valid()"
print(f"Loaded {len(gdf)} valid geometries in {gdf.crs.to_epsg()}")

Step 2: Spatial Weights Matrix Construction

The weights matrix W encodes neighborhood relationships. Three strategies dominate in production:

Strategy Best for libpysal class
Queen contiguity Regular administrative polygons with shared borders weights.Queen.from_dataframe
K-nearest neighbors Point datasets, irregular polygons weights.KNN.from_dataframe
Distance band Sensor networks with known radius thresholds weights.DistanceBand.from_dataframe

For irregular administrative boundaries or point-based sensor networks, KNN or Delaunay triangulation typically outperforms contiguity-based approaches. Choose k based on the minimum number of neighbors that keeps the w.islands list empty:

import libpysal

# KNN with k=4 — adjust k until w_knn.islands is empty
w_knn = libpysal.weights.KNN.from_dataframe(gdf, k=4, use_index=True)
w_knn.transform = "r"  # Row-standardize immediately

# Sanity check: no isolated observations
assert len(w_knn.islands) == 0, (
    f"Found {len(w_knn.islands)} islands — increase k or switch weight type"
)
print(f"W matrix: {w_knn.n} observations, {w_knn.n_components} component(s)")

Step 3: Row-Standardization and Sparsity Enforcement

Row-standardization converts raw neighbor counts into proportional weights that sum to 1.0 per row, making lag values interpretable as local weighted averages. Even when initialized with transform='r', downstream operations can silently convert sparse matrices to dense formats. Cast to scipy.sparse.csr_matrix before any lag computation:

from scipy.sparse import csr_matrix

W_sparse = csr_matrix(w_knn.sparse)

# Verify sparsity — should be >0.90 for any realistic dataset
sparsity = 1.0 - (W_sparse.nnz / (W_sparse.shape[0] ** 2))
print(f"Sparsity: {sparsity:.4f}")
assert sparsity > 0.90, "Matrix unexpectedly dense — check weight type and k"

# Confirm row sums == 1.0 for row-standardized weights
import numpy as np
row_sums = np.asarray(W_sparse.sum(axis=1)).ravel()
np.testing.assert_allclose(row_sums, 1.0, atol=1e-6,
    err_msg="Row sums deviate from 1.0 — re-apply row standardisation")

Sparse CSR storage reduces memory overhead by 80–95% compared to a dense equivalent and accelerates the W @ y product significantly for datasets with thousands of observations.

Step 4: First-Order Spatial Lag Computation

For vector data, spatial lag is a straightforward matrix-vector product. The result for each observation is the row-standardized weighted mean of its neighbors’ attribute values:

# First-order spatial lag: Wy
y = gdf["property_value"].values.astype(np.float64)
lag_1 = W_sparse @ y
gdf["lag1_property_value"] = lag_1

# Assert output is finite and same length as input
assert np.isfinite(lag_1).all(), "NaN/Inf in lag — check for islands or NaN inputs"
assert len(lag_1) == len(gdf)
print(f"Lag range: [{lag_1.min():.2f}, {lag_1.max():.2f}], mean={lag_1.mean():.2f}")

For raster data, spatial lag uses focal operations (moving-window convolution) rather than explicit topology matrices. When combining raster-derived features with vector pipelines, consult Raster Band Math and Index Calculation for alignment strategies and memory-efficient tiling approaches.

Step 5: Higher-Order Lags and Local Variance Features

First-order lag captures the immediate neighborhood; second-order lag extends to neighbors-of-neighbors and often reveals broader spatial trends. Combining both orders and adding local variance gives the model access to multi-scale spatial context:

# Second-order lag: W^2 y  (neighbors-of-neighbors)
W2_sparse = W_sparse @ W_sparse
lag_2 = W2_sparse @ y
gdf["lag2_property_value"] = lag_2

# Local variance: mean of squared deviations from the local mean
# Var_i = W_i · (y - Wy_i)^2
deviations_sq = (y - lag_1) ** 2
local_var = W_sparse @ deviations_sq
gdf["local_var_property_value"] = local_var

# Spatial z-score: how far each obs is from its local mean in local-std units
local_std = np.sqrt(local_var)
gdf["spatial_zscore_property_value"] = np.where(
    local_std > 0, (y - lag_1) / local_std, 0.0
)

print(gdf[["property_value", "lag1_property_value",
           "lag2_property_value", "spatial_zscore_property_value"]].describe())

When combining distance-to-amenity buffers produced by Vector Proximity and Buffer Generation with lag features, monitor for multicollinearity — both capture local context from different angles and may inflate variance inflation factor (VIF) scores.

Step 6: Temporal Lag Features for Time-Series Geodata

When your dataset includes repeated observations over time, you can compute spatial lags per time slice and then aggregate — or compute temporal lags of the spatial lag itself. This is especially common for satellite-derived features. See Temporal Aggregation for Time-Series Geodata for the date-bucketing and xarray integration patterns that feed smoothly into the spatial lag step shown here.

# Example: compute spatial lag independently per monthly time slice
results = []
for month, subset in gdf.groupby("year_month"):
    w_m = libpysal.weights.KNN.from_dataframe(subset.reset_index(), k=4)
    w_m.transform = "r"
    W_m = csr_matrix(w_m.sparse)
    y_m = subset["ndvi"].values.astype(np.float64)
    subset = subset.copy()
    subset["lag_ndvi"] = W_m @ y_m
    results.append(subset)

gdf_lagged = gpd.GeoDataFrame(pd.concat(results, ignore_index=True))

Step 7: Feature Scaling and Pipeline Serialization

Spatial lags and local variance features often exhibit different scales and skewness compared with raw attributes. Integrate them into a scikit-learn pipeline alongside the other feature scaling for geospatial inputs transformations to guarantee consistent normalization during training and inference:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import joblib

feature_cols = [
    "area_sqm",
    "distance_to_transit",
    "lag1_property_value",
    "lag2_property_value",
    "local_var_property_value",
    "spatial_zscore_property_value",
]
X = gdf[feature_cols].values

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    # ("model", GradientBoostingRegressor())  # add your estimator here
])

pipeline.fit(X)
joblib.dump(pipeline, "spatial_lag_pipeline_v1.joblib")
print("Pipeline serialized — include W_sparse in the same versioned artifact.")

Verification and Testing

Confirm the implementation is numerically correct before promoting to production.

Moran’s I sanity check — a positive lag feature on a positively autocorrelated attribute should yield Moran’s I > 0 with p < 0.05:

from libpysal.weights import lag_spatial
from esda.moran import Moran

mi = Moran(y, w_knn)
print(f"Moran's I: {mi.I:.4f}  p-value: {mi.p_norm:.4f}")
assert mi.I > 0, "Expected positive autocorrelation for property values"
assert mi.p_norm < 0.05, "Moran's I not significant — check topology or sample size"

Unit test pattern — use a known synthetic grid where ground truth lags are calculable:

import numpy as np
from scipy.sparse import csr_matrix

# 3-obs line: [1, 2, 3] with equal-weight connections 1↔2, 2↔3
# Row-standardized W:
#   obs 0: only neighbour is obs 1  → lag = 2.0
#   obs 1: neighbours are obs 0 and 2 → lag = (1+3)/2 = 2.0
#   obs 2: only neighbour is obs 1  → lag = 2.0
W_test = csr_matrix([[0, 1, 0],
                     [0.5, 0, 0.5],
                     [0, 1, 0]], dtype=np.float64)
y_test = np.array([1.0, 2.0, 3.0])
expected = np.array([2.0, 2.0, 2.0])
np.testing.assert_allclose(W_test @ y_test, expected, atol=1e-10)
print("Unit test passed.")

Visual sanity check — plot the raw attribute and its spatial lag side-by-side with gdf.plot(column=...) and confirm that high-value clusters in the raw map correspond to high-lag areas.


Troubleshooting and Common Errors

Error / Symptom Root Cause Fix
NaN in lag column Islands (zero-neighbor observations) produce undefined weighted sums Run w.islands; increase k or switch to DistanceBand; fill residual NaN with global mean
MemoryError during W @ y Dense matrix conversion triggered by a numpy broadcast or toarray() call Enforce csr_matrix everywhere; never call .toarray() on large W; process in spatial tiles
libpysal.weights.util.WSP.sparse returns wrong shape Index mismatch after reset_index() or dropna() Always pass use_index=True and reset the GeoDataFrame index before passing to KNN.from_dataframe
AssertionError: CRS must be projected Data loaded in EPSG:4326 (geographic degrees) Call gdf.to_crs("EPSG:32633") (or the local UTM zone) before building weights
Row sums deviate from 1.0 w.transform applied before sparse conversion, then re-converted Apply transform='r' before calling .sparse; never re-transform after casting to csr_matrix
Model accuracy drops after boundary update Cached W topology no longer matches updated geometries Rebuild W when geometry delta exceeds a threshold (e.g., >5% parcel changes); version W alongside data snapshots

Performance Optimisation

Chunked spatial tiling. For datasets exceeding available RAM, partition by spatial tile using gdf.sindex.query_bulk() or H3 hex grids. Compute weights and lags per tile, then concatenate:

import numpy as np

TILE_SIZE = 50_000  # adjust to fit comfortably in RAM
n = len(gdf)
lag_out = np.empty(n, dtype=np.float64)

for start in range(0, n, TILE_SIZE):
    end = min(start + TILE_SIZE, n)
    subset = gdf.iloc[start:end].reset_index(drop=True)
    w_tile = libpysal.weights.KNN.from_dataframe(subset, k=4)
    w_tile.transform = "r"
    W_tile = csr_matrix(w_tile.sparse)
    y_tile = subset["property_value"].values.astype(np.float64)
    lag_out[start:end] = W_tile @ y_tile

gdf["lag1_property_value"] = lag_out

Note: tile-boundary observations will have truncated neighborhoods. Use a buffer overlap (e.g., 10% tile overlap) and discard the buffer zone after computation for precise edge handling.

Parallel weight construction with joblib. Building KNN weights is CPU-bound and embarrassingly parallel across independent subsets:

from joblib import Parallel, delayed

def build_lag_for_group(subset_df, k=4, col="property_value"):
    w = libpysal.weights.KNN.from_dataframe(subset_df.reset_index(drop=True), k=k)
    w.transform = "r"
    W = csr_matrix(w.sparse)
    return W @ subset_df[col].values.astype(np.float64)

groups = [gdf[gdf["region"] == r].copy() for r in gdf["region"].unique()]
lags = Parallel(n_jobs=-1)(delayed(build_lag_for_group)(g) for g in groups)

Drift monitoring at inference time. Spatial lag features are inherently dynamic — adding new observations or updating boundaries alters neighborhood definitions even when raw attributes are stable. Cache the training W matrix alongside your model artifact and monitor lag-column PSI (Population Stability Index) at inference time. Rebuild W only when geometry delta exceeds your configured threshold.


FAQ

Why does my spatial lag produce NaN values for some observations?

Isolated geometries (islands) have zero neighbors under the chosen weight scheme. Their weighted sum is undefined, so the W @ y product yields NaN. Diagnose with w.islands — if the list is non-empty, either increase k (for KNN) or widen the distance band. For the rare case where genuine isolation is a valid data characteristic, fill the NaN with the global attribute mean and add a boolean is_island indicator column so the model can learn the distinction.

Should I use queen contiguity or KNN weights for polygon data?

Queen contiguity respects the physical topology of polygons — two polygons are neighbors if they share even a single vertex. This is appropriate for regular administrative units (census tracts, municipalities) where adjacency carries substantive meaning. KNN is preferable for irregular polygons with highly variable sizes, or when you need a guaranteed fixed number of neighbors per observation. For point data, KNN or distance-band weights are the only applicable options.

How do I prevent the spatial weights matrix from rebuilding on every inference call?

Serialize the W matrix as a scipy.sparse .npz file alongside the fitted sklearn pipeline:

from scipy.sparse import save_npz, load_npz

save_npz("W_training_v1.npz", W_sparse)
# At inference time:
W_inference = load_npz("W_training_v1.npz")
lag_inference = W_inference @ X_new["property_value"].values

Treat the W artifact as versioned infrastructure: track it in DVC or MLflow alongside the data snapshot it was built from, and bump the version whenever geometry boundaries change beyond your configured delta threshold.

How does spatial lag relate to Local Moran’s I?

Spatial lag Wy is the raw weighted neighborhood mean; Computing Local Moran’s I for Feature Engineering uses the lag as an intermediate quantity to compute a standardized local autocorrelation statistic that classifies each observation as a hotspot (HH), coldspot (LL), or spatial outlier (HL/LH). For feature engineering, both contribute complementary signal: the lag gives the model a continuous context value, while LISA classifications add a categorical spatial-regime feature.


Part of: Spatial Feature Engineering for Machine Learning