TL;DR: Project your GeoDataFrame to a metric CRS, extract centroid coordinates with gdf.geometry.centroid, and call scipy.spatial.distance.pdist + squareform to build a symmetric pairwise distance matrix. For datasets larger than ~5 000 points, convert to scipy.sparse.csr_matrix by zeroing entries beyond your interaction radius. The parent technique is covered in Vector Proximity and Buffer Generation.
Why This Fails in Geospatial ML Pipelines
Distance matrix computation looks trivial until the first production incident. Three categories of silent failure are responsible for most problems:
Scale-corrupted distances from geographic coordinates. When a GeoDataFrame arrives in EPSG:4326 — the default for most geospatial data sources — its x values are degrees of longitude and its y values are degrees of latitude. Calling pdist directly on these produces angle-differences, not distances in metres. The resulting matrix has correct topology (closer features produce smaller numbers) but wrong magnitudes. Distance-sensitive models such as k-nearest-neighbour regressors or kernel SVMs silently learn the wrong neighbourhood radii; gradient boosting tree splits on distance thresholds produce splits that mean nothing physically.
Invalid geometries propagating NaN into the matrix. Self-intersecting polygons and empty geometries yield NaN or inf from .centroid. Those propagate through pdist into an entire row and column of the matrix. scikit-learn transformers do not raise an error — they pass NaN silently into the model, corrupting every tree split that touches the affected row.
Index misalignment between the filtered GeoDataFrame and downstream feature arrays. After dropping invalid geometries, the surviving row indices no longer match the original integer sequence. If the resulting matrix is joined to other features by positional index rather than by the GeoDataFrame’s preserved index, features from one geometry end up matched to the wrong sample.
Core Principles
- Validate-first, fail-fast. Check
gdf.crs is Noneandgdf.geometry.is_validbefore any computation. Raise, do not silently drop. - Enforce metric CRS before extraction. Call
gdf.to_crs(target_crs)and warn when a conversion fires; never rely on the caller having projected already. - Centroid extraction is appropriate for most tasks. For point datasets it is a no-op. For polygon datasets it is correct when polygon centroids represent the relevant spatial entity; use nearest-point distances only when boundary proximity matters.
- Use
scipy.pdistfor Euclidean;sklearn.pairwise_distancesfor everything else.pdistcalls vectorised C routines and avoids the redundant lower-triangle computation; scikit-learn adds parallel execution and pluggable metrics. - Return index-aligned outputs. Attach the surviving GeoDataFrame index to the matrix so downstream joins are safe regardless of how many rows were dropped.
- Sparsify at a meaningful physical radius. Choose
sparse_thresholdbased on domain knowledge (e.g., the service radius in metres of the spatial entity), not as an arbitrary percentile of matrix values.
Production-Ready Code
import warnings
from typing import Union
import numpy as np
import geopandas as gpd
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from sklearn.metrics import pairwise_distances
def build_spatial_distance_matrix(
gdf: gpd.GeoDataFrame,
metric: str = "euclidean",
target_crs: str = "EPSG:3857",
sparse_threshold: float = 0.0,
) -> tuple[Union[np.ndarray, csr_matrix], np.ndarray]:
"""
Compute a pairwise distance matrix from vector geometries.
Parameters
----------
gdf : GeoDataFrame
Input data with any geometry type (points, polygons, lines).
Must carry an assigned CRS.
metric : str
Distance metric compatible with scipy/sklearn, e.g. 'euclidean',
'manhattan', 'chebyshev'.
target_crs : str
EPSG code for the metric CRS to project into when the GDF is
geographic. EPSG:3857 is a safe global default; prefer a local UTM
zone (e.g. 'EPSG:32633') for higher accuracy within a region.
sparse_threshold : float
Distances strictly greater than this value are zeroed and the
matrix is returned as a CSR sparse matrix. Set to 0.0 (default)
to return a dense numpy array.
Returns
-------
matrix : np.ndarray or scipy.sparse.csr_matrix
Symmetric (N x N) distance matrix where N is the number of valid
geometries surviving the validity filter.
valid_index : np.ndarray
The original GeoDataFrame integer index values corresponding to
the rows/columns of *matrix*, enabling safe re-join to other
feature arrays.
Raises
------
ValueError
If *gdf* is empty or has no CRS assigned.
"""
if gdf.empty:
raise ValueError("Input GeoDataFrame is empty.")
if gdf.crs is None:
raise ValueError(
"GeoDataFrame has no CRS. Assign one with gdf.set_crs() before "
"computing distances."
)
# --- geometry validation ---
valid_mask = gdf.geometry.is_valid & gdf.geometry.notna()
if not valid_mask.all():
n_dropped = int((~valid_mask).sum())
warnings.warn(
f"Dropping {n_dropped} invalid or null geometr{'y' if n_dropped == 1 else 'ies'} "
"before distance computation."
)
gdf = gdf.loc[valid_mask].copy()
if gdf.empty:
raise ValueError("No valid geometries remain after filtering.")
# --- CRS enforcement ---
if gdf.crs.is_geographic:
warnings.warn(
f"GeoDataFrame is in geographic CRS {gdf.crs.to_epsg()}. "
f"Reprojecting to {target_crs} for metric distances."
)
gdf = gdf.to_crs(target_crs)
# --- coordinate extraction (centroid for all geometry types) ---
centroids = gdf.geometry.centroid
coords = np.column_stack([centroids.x.to_numpy(), centroids.y.to_numpy()])
valid_index = gdf.index.to_numpy()
# --- distance computation ---
n = coords.shape[0]
if metric == "euclidean" and n > 1:
# scipy.pdist computes only the upper triangle, then squareform mirrors it;
# this is 2× faster than pairwise_distances for pure Euclidean.
matrix: np.ndarray = squareform(pdist(coords, metric="euclidean"))
else:
matrix = pairwise_distances(coords, metric=metric)
# --- optional sparsification ---
if sparse_threshold > 0.0:
matrix[matrix > sparse_threshold] = 0.0
return csr_matrix(matrix), valid_index
return matrix, valid_indexStep-by-Step Walkthrough
Step 1 — Load and inspect your GeoDataFrame
import geopandas as gpd
gdf = gpd.read_file("air_quality_stations.gpkg")
print(gdf.crs) # e.g. EPSG:4326
print(gdf.geometry.is_valid.value_counts())
print(f"{len(gdf)} features loaded")Confirm the CRS is set and check how many invalid geometries exist before calling the function.
Step 2 — Call build_spatial_distance_matrix
dist_matrix, idx = build_spatial_distance_matrix(
gdf,
metric="euclidean",
target_crs="EPSG:32632", # UTM zone 32N for central Europe
sparse_threshold=0.0, # dense output
)
print(dist_matrix.shape) # (N, N) where N = valid feature count
print(dist_matrix.dtype) # float64
assert (dist_matrix == dist_matrix.T).all(), "Matrix must be symmetric"
assert (dist_matrix.diagonal() == 0).all(), "Self-distances must be zero"Step 3 — Inspect distance distribution
Before feeding the matrix into a model, sanity-check the value range. Distances that cluster near zero indicate a poorly chosen unit (e.g., still in degrees) or duplicate geometries.
import numpy as np
upper_tri = dist_matrix[np.triu_indices_from(dist_matrix, k=1)]
print(f"Min: {upper_tri.min():.1f} m")
print(f"Median: {np.median(upper_tri):.1f} m")
print(f"95th pct: {np.percentile(upper_tri, 95):.1f} m")If your UTM projection is correct, values should be in the hundreds to hundreds-of-thousands of metres for a regional dataset.
Step 4 — Sparsify for large datasets
For datasets with more than 5 000 features, convert to sparse CSR using a physically meaningful threshold — for example, 50 km for a regional station network.
INTERACTION_RADIUS_M = 50_000 # 50 km
sparse_matrix, idx = build_spatial_distance_matrix(
gdf,
target_crs="EPSG:32632",
sparse_threshold=INTERACTION_RADIUS_M,
)
density = sparse_matrix.nnz / (sparse_matrix.shape[0] ** 2)
print(f"Matrix density: {density:.2%}") # typically 5–20 % for regional networksStep 5 — Attach to a feature DataFrame for model training
The returned idx array lets you re-join the matrix back to the original dataframe safely, even after geometry rows were dropped.
import pandas as pd
# Convert sparse matrix to a DataFrame aligned with the valid subset
dist_df = pd.DataFrame(
sparse_matrix.toarray(),
index=idx,
columns=idx,
)
# Merge with other spatial features
features = gdf.loc[idx, ["elevation_m", "land_cover_code"]].copy()
features = features.join(dist_df.add_prefix("dist_to_"))
print(features.shape) # (N, 2 + N) — two base features plus N distance columnsFor feature scaling for geospatial inputs to work correctly, apply StandardScaler or MinMaxScaler to the distance columns separately from categorical columns such as land cover codes.
Verification
Three checks confirm the matrix is correct before it enters any training pipeline:
import numpy as np
# 1. Symmetry — essential property of any distance metric
assert np.allclose(dist_matrix, dist_matrix.T, atol=1e-8), \
"Distance matrix is not symmetric."
# 2. Zero diagonal — distance from a point to itself must be zero
assert np.allclose(dist_matrix.diagonal(), 0, atol=1e-10), \
"Diagonal contains non-zero self-distances."
# 3. Triangle inequality spot-check on a random triple
rng = np.random.default_rng(42)
i, j, k = rng.integers(0, dist_matrix.shape[0], size=3)
assert dist_matrix[i, j] <= dist_matrix[i, k] + dist_matrix[k, j] + 1e-8, \
"Triangle inequality violated — check CRS or metric choice."
print("All distance matrix assertions passed.")For end-to-end validation within an MLflow experiment, log upper_tri.mean() and upper_tri.std() as run metrics. A sudden shift in these statistics between training and inference runs signals a change in spatial sampling density — the kind of distribution shift that also surfaces in handling spatial autocorrelation analyses.
FAQ
Why can’t I use EPSG:4326 coordinates directly in a distance matrix?
EPSG:4326 stores coordinates as decimal degrees. One degree of longitude shrinks from about 111 km at the equator to nearly zero at the poles. Running pdist() on raw degree-pairs produces angle-differences, not metric distances. The resulting matrix has plausible topology but wrong magnitudes, causing distance-sensitive models to learn neighbourhood radii that have no physical meaning. Always reproject to a metric CRS — a local UTM zone for regional data, or EPSG:3857 as a global fallback — before computing any distance.
How do I handle polygon or line geometries, not just points?
Extract centroid coordinates with gdf.geometry.centroid before passing to pdist. For polygons this is usually the correct representation because the centroid stands in for the whole feature. Where boundary proximity matters more than centroid proximity — for example, computing clearance between infrastructure corridors — calculate nearest-point distances using shapely.ops.nearest_points per feature pair instead of building a global matrix.
When should I switch to a sparse distance matrix?
A dense float64 matrix for 10 000 points consumes roughly 800 MB; for 50 000 points it exceeds 20 GB. Switch to scipy.sparse.csr_matrix when N exceeds about 5 000, or when your model only needs neighbourhood connections within a fixed radius. Set sparse_threshold to a physically motivated cutoff (e.g., your service-area radius in metres), zero all larger values, and convert. Graph convolutional networks and most spatial lag computations only read the non-zero structure, so accuracy loss is minimal.
Related
- Vector Proximity and Buffer Generation — the parent technique covering buffers, Voronoi tessellation, and proximity joins
- CRS Alignment and Projection Handling — choosing and enforcing the right coordinate reference system across your pipeline
- Feature Scaling for Geospatial Inputs — normalising distance columns alongside other heterogeneous spatial features
- Spatial Lag and Neighborhood Statistics — using weighted distance matrices to compute spatially lagged features
- Optimizing Memory Usage for Large Vector Datasets — memory budgets and chunking strategies relevant to large matrix workloads
Part of: Vector Proximity and Buffer Generation — Spatial Feature Engineering for Machine Learning