Creating Distance Matrices for Spatial Features

Create spatial distance matrices for ML in Python: compute pairwise proximity metrics from vector geometries using scipy and GeoPandas in production pipelines.

TL;DR: Project your GeoDataFrame to a metric CRS, extract centroid coordinates with gdf.geometry.centroid, and call scipy.spatial.distance.pdist + squareform to build a symmetric pairwise distance matrix. For datasets larger than ~5 000 points, convert to scipy.sparse.csr_matrix by zeroing entries beyond your interaction radius. The parent technique is covered in Vector Proximity and Buffer Generation.


Distance matrix pipeline Five stages shown as left-to-right boxes connected by arrows: GeoDataFrame → CRS / validity check → centroid extraction → pdist / squareform → sparse conversion → distance matrix output. GeoDataFrame (any geometry) CRS + validity → metric CRS drop invalids centroid extraction pdist / squareform N×N dense sparse CSR (optional) → ML-ready distance matrix

Why This Fails in Geospatial ML Pipelines

Distance matrix computation looks trivial until the first production incident. Three categories of silent failure are responsible for most problems:

Scale-corrupted distances from geographic coordinates. When a GeoDataFrame arrives in EPSG:4326 — the default for most geospatial data sources — its x values are degrees of longitude and its y values are degrees of latitude. Calling pdist directly on these produces angle-differences, not distances in metres. The resulting matrix has correct topology (closer features produce smaller numbers) but wrong magnitudes. Distance-sensitive models such as k-nearest-neighbour regressors or kernel SVMs silently learn the wrong neighbourhood radii; gradient boosting tree splits on distance thresholds produce splits that mean nothing physically.

Invalid geometries propagating NaN into the matrix. Self-intersecting polygons and empty geometries yield NaN or inf from .centroid. Those propagate through pdist into an entire row and column of the matrix. scikit-learn transformers do not raise an error — they pass NaN silently into the model, corrupting every tree split that touches the affected row.

Index misalignment between the filtered GeoDataFrame and downstream feature arrays. After dropping invalid geometries, the surviving row indices no longer match the original integer sequence. If the resulting matrix is joined to other features by positional index rather than by the GeoDataFrame’s preserved index, features from one geometry end up matched to the wrong sample.

Core Principles

  • Validate-first, fail-fast. Check gdf.crs is None and gdf.geometry.is_valid before any computation. Raise, do not silently drop.
  • Enforce metric CRS before extraction. Call gdf.to_crs(target_crs) and warn when a conversion fires; never rely on the caller having projected already.
  • Centroid extraction is appropriate for most tasks. For point datasets it is a no-op. For polygon datasets it is correct when polygon centroids represent the relevant spatial entity; use nearest-point distances only when boundary proximity matters.
  • Use scipy.pdist for Euclidean; sklearn.pairwise_distances for everything else. pdist calls vectorised C routines and avoids the redundant lower-triangle computation; scikit-learn adds parallel execution and pluggable metrics.
  • Return index-aligned outputs. Attach the surviving GeoDataFrame index to the matrix so downstream joins are safe regardless of how many rows were dropped.
  • Sparsify at a meaningful physical radius. Choose sparse_threshold based on domain knowledge (e.g., the service radius in metres of the spatial entity), not as an arbitrary percentile of matrix values.

Production-Ready Code

import warnings
from typing import Union

import numpy as np
import geopandas as gpd
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances


def build_spatial_distance_matrix(
    gdf: gpd.GeoDataFrame,
    metric: str = "euclidean",
    target_crs: str = "EPSG:3857",
    sparse_threshold: float = 0.0,
) -> tuple[Union[np.ndarray, csr_matrix], np.ndarray]:
    """
    Compute a pairwise distance matrix from vector geometries.

    Parameters
    ----------
    gdf : GeoDataFrame
        Input data with any geometry type (points, polygons, lines).
        Must carry an assigned CRS.
    metric : str
        Distance metric compatible with scipy/sklearn, e.g. 'euclidean',
        'manhattan', 'chebyshev'.
    target_crs : str
        EPSG code for the metric CRS to project into when the GDF is
        geographic. EPSG:3857 is a safe global default; prefer a local UTM
        zone (e.g. 'EPSG:32633') for higher accuracy within a region.
    sparse_threshold : float
        Distances strictly greater than this value are zeroed and the
        matrix is returned as a CSR sparse matrix. Set to 0.0 (default)
        to return a dense numpy array.

    Returns
    -------
    matrix : np.ndarray or scipy.sparse.csr_matrix
        Symmetric (N x N) distance matrix where N is the number of valid
        geometries surviving the validity filter.
    valid_index : np.ndarray
        The original GeoDataFrame integer index values corresponding to
        the rows/columns of *matrix*, enabling safe re-join to other
        feature arrays.

    Raises
    ------
    ValueError
        If *gdf* is empty or has no CRS assigned.
    """
    if gdf.empty:
        raise ValueError("Input GeoDataFrame is empty.")
    if gdf.crs is None:
        raise ValueError(
            "GeoDataFrame has no CRS. Assign one with gdf.set_crs() before "
            "computing distances."
        )

    # --- geometry validation ---
    valid_mask = gdf.geometry.is_valid & gdf.geometry.notna()
    if not valid_mask.all():
        n_dropped = int((~valid_mask).sum())
        warnings.warn(
            f"Dropping {n_dropped} invalid or null geometr{'y' if n_dropped == 1 else 'ies'} "
            "before distance computation."
        )
        gdf = gdf.loc[valid_mask].copy()

    if gdf.empty:
        raise ValueError("No valid geometries remain after filtering.")

    # --- CRS enforcement ---
    if gdf.crs.is_geographic:
        warnings.warn(
            f"GeoDataFrame is in geographic CRS {gdf.crs.to_epsg()}. "
            f"Reprojecting to {target_crs} for metric distances."
        )
        gdf = gdf.to_crs(target_crs)

    # --- coordinate extraction (centroid for all geometry types) ---
    centroids = gdf.geometry.centroid
    coords = np.column_stack([centroids.x.to_numpy(), centroids.y.to_numpy()])
    valid_index = gdf.index.to_numpy()

    # --- distance computation ---
    n = coords.shape[0]
    if metric == "euclidean" and n > 1:
        # scipy.pdist computes only the upper triangle, then squareform mirrors it;
        # this is 2× faster than pairwise_distances for pure Euclidean.
        matrix: np.ndarray = squareform(pdist(coords, metric="euclidean"))
    else:
        matrix = pairwise_distances(coords, metric=metric)

    # --- optional sparsification ---
    if sparse_threshold > 0.0:
        matrix[matrix > sparse_threshold] = 0.0
        return csr_matrix(matrix), valid_index

    return matrix, valid_index

Step-by-Step Walkthrough

Step 1 — Load and inspect your GeoDataFrame

import geopandas as gpd

gdf = gpd.read_file("air_quality_stations.gpkg")
print(gdf.crs)          # e.g. EPSG:4326
print(gdf.geometry.is_valid.value_counts())
print(f"{len(gdf)} features loaded")

Confirm the CRS is set and check how many invalid geometries exist before calling the function.

Step 2 — Call build_spatial_distance_matrix

dist_matrix, idx = build_spatial_distance_matrix(
    gdf,
    metric="euclidean",
    target_crs="EPSG:32632",   # UTM zone 32N for central Europe
    sparse_threshold=0.0,       # dense output
)

print(dist_matrix.shape)   # (N, N) where N = valid feature count
print(dist_matrix.dtype)   # float64
assert (dist_matrix == dist_matrix.T).all(), "Matrix must be symmetric"
assert (dist_matrix.diagonal() == 0).all(), "Self-distances must be zero"

Step 3 — Inspect distance distribution

Before feeding the matrix into a model, sanity-check the value range. Distances that cluster near zero indicate a poorly chosen unit (e.g., still in degrees) or duplicate geometries.

import numpy as np

upper_tri = dist_matrix[np.triu_indices_from(dist_matrix, k=1)]
print(f"Min: {upper_tri.min():.1f} m")
print(f"Median: {np.median(upper_tri):.1f} m")
print(f"95th pct: {np.percentile(upper_tri, 95):.1f} m")

If your UTM projection is correct, values should be in the hundreds to hundreds-of-thousands of metres for a regional dataset.

Step 4 — Sparsify for large datasets

For datasets with more than 5 000 features, convert to sparse CSR using a physically meaningful threshold — for example, 50 km for a regional station network.

INTERACTION_RADIUS_M = 50_000   # 50 km

sparse_matrix, idx = build_spatial_distance_matrix(
    gdf,
    target_crs="EPSG:32632",
    sparse_threshold=INTERACTION_RADIUS_M,
)

density = sparse_matrix.nnz / (sparse_matrix.shape[0] ** 2)
print(f"Matrix density: {density:.2%}")   # typically 5–20 % for regional networks

Step 5 — Attach to a feature DataFrame for model training

The returned idx array lets you re-join the matrix back to the original dataframe safely, even after geometry rows were dropped.

import pandas as pd

# Convert sparse matrix to a DataFrame aligned with the valid subset
dist_df = pd.DataFrame(
    sparse_matrix.toarray(),
    index=idx,
    columns=idx,
)

# Merge with other spatial features
features = gdf.loc[idx, ["elevation_m", "land_cover_code"]].copy()
features = features.join(dist_df.add_prefix("dist_to_"))

print(features.shape)   # (N, 2 + N) — two base features plus N distance columns

For feature scaling for geospatial inputs to work correctly, apply StandardScaler or MinMaxScaler to the distance columns separately from categorical columns such as land cover codes.

Verification

Three checks confirm the matrix is correct before it enters any training pipeline:

import numpy as np

# 1. Symmetry — essential property of any distance metric
assert np.allclose(dist_matrix, dist_matrix.T, atol=1e-8), \
    "Distance matrix is not symmetric."

# 2. Zero diagonal — distance from a point to itself must be zero
assert np.allclose(dist_matrix.diagonal(), 0, atol=1e-10), \
    "Diagonal contains non-zero self-distances."

# 3. Triangle inequality spot-check on a random triple
rng = np.random.default_rng(42)
i, j, k = rng.integers(0, dist_matrix.shape[0], size=3)
assert dist_matrix[i, j] <= dist_matrix[i, k] + dist_matrix[k, j] + 1e-8, \
    "Triangle inequality violated — check CRS or metric choice."

print("All distance matrix assertions passed.")

For end-to-end validation within an MLflow experiment, log upper_tri.mean() and upper_tri.std() as run metrics. A sudden shift in these statistics between training and inference runs signals a change in spatial sampling density — the kind of distribution shift that also surfaces in handling spatial autocorrelation analyses.


FAQ

Why can’t I use EPSG:4326 coordinates directly in a distance matrix?

EPSG:4326 stores coordinates as decimal degrees. One degree of longitude shrinks from about 111 km at the equator to nearly zero at the poles. Running pdist() on raw degree-pairs produces angle-differences, not metric distances. The resulting matrix has plausible topology but wrong magnitudes, causing distance-sensitive models to learn neighbourhood radii that have no physical meaning. Always reproject to a metric CRS — a local UTM zone for regional data, or EPSG:3857 as a global fallback — before computing any distance.

How do I handle polygon or line geometries, not just points?

Extract centroid coordinates with gdf.geometry.centroid before passing to pdist. For polygons this is usually the correct representation because the centroid stands in for the whole feature. Where boundary proximity matters more than centroid proximity — for example, computing clearance between infrastructure corridors — calculate nearest-point distances using shapely.ops.nearest_points per feature pair instead of building a global matrix.

When should I switch to a sparse distance matrix?

A dense float64 matrix for 10 000 points consumes roughly 800 MB; for 50 000 points it exceeds 20 GB. Switch to scipy.sparse.csr_matrix when N exceeds about 5 000, or when your model only needs neighbourhood connections within a fixed radius. Set sparse_threshold to a physically motivated cutoff (e.g., your service-area radius in metres), zero all larger values, and convert. Graph convolutional networks and most spatial lag computations only read the non-zero structure, so accuracy loss is minimal.


Part of: Vector Proximity and Buffer GenerationSpatial Feature Engineering for Machine Learning