Call gdf.to_crs("EPSG:target") — not .set_crs() — on every GeoDataFrame that does not already match your pipeline’s target projection. Validate the source CRS first with gdf.crs and raise an exception if it is None. Centralise this logic in a single alignment function and call it at every ingestion boundary. The full implementation, walkthrough, and verification pattern are below.
This page covers the specific mechanics of fixing mismatches inside GeoDataFrames. For the broader pipeline architecture — raster alignment, datum grid selection, and cross-modal validation — see CRS Alignment and Projection Handling.
Why This Fails in Geospatial ML Pipelines
CRS mismatches between GeoDataFrames do not throw an error at the point of failure — they propagate silently through every downstream step. GeoPandas spatial joins, buffer operations, distance calculations, and raster sampling all assume that coordinate values are expressed in the same unit and reference system. When they are not, the results are numerically valid but geographically wrong.
The most destructive pattern is the EPSG:4326 / EPSG:3857 combination. EPSG:4326 stores coordinates as decimal degrees; EPSG:3857 stores them as metres relative to the Web Mercator pseudo-origin. A 1000-unit buffer in EPSG:4326 creates a circle roughly 1000 degrees in radius — a shape that spans the entire globe several times over. The sjoin() call completes without complaint, attribute columns are populated, and the feature table looks plausible. But every matched pair is wrong, and the model trained on those features will have learned relationships that do not exist in the real world.
The failure is equally dangerous when the unit mismatch is subtler. Mixing a GeoDataFrame in UTM Zone 33N (EPSG:32633, metres) with one in the German national grid (EPSG:31467, metres but different origin and datum) produces distance errors of 50–200 m for typical European datasets. For vector proximity and buffer generation features like nearest-road distance or isochrone buffers, those errors translate directly to corrupted training labels.
The corruption is invisible because the model still trains and produces predictions. Only when the model is deployed against data from a consistently aligned pipeline — or evaluated with spatial cross-validation strategies that respect geographic holdout boundaries — does the performance gap become apparent, usually framed as unexplained concept drift.
Core Alignment Principles
- Validate before every operation, not once at startup. Data arrives from heterogeneous sources across pipeline runs. Check
gdf.crsat the ingestion boundary of every function that performs spatial operations, not just the first function in the script. - Transform with
.to_crs(), declare with.set_crs()..to_crs()mathematically reprojects every coordinate..set_crs()only attaches metadata without touching coordinate values. Using.set_crs()to fix a mismatch assigns the wrong label to the wrong numbers and makes the corruption invisible to all downstream checks. - Fail fast on
NoneCRS. A GeoDataFrame withcrs = Nonehas unknown coordinates. It may be in degrees, metres, or an arbitrary local system. Assuming EPSG:4326 and continuing is almost always wrong. Raise an exception and route the dataset to a data engineering queue. - Pin a single target CRS in pipeline configuration. Hard-code the target EPSG code as a named constant in your pipeline config, not as a string literal repeated across alignment functions. Version-control the config alongside your model artifacts.
- Reproject vectors, not rasters, when the two disagree. Vector reprojection with
.to_crs()is computationally cheap and attribute-lossless. Raster reprojection requires pixel resampling, which changes spectral or elevation values. When aligning GeoDataFrames against raster backends, bring the vector layer to the raster’s CRS, not the other way around. - Assert after every transformation. A one-line
assert gdf.crs.to_epsg() == TARGET_EPSGafter each.to_crs()call catches silent failures caused by PROJ version mismatches or malformed CRS strings before they propagate through expensive feature computation.
Production-Ready Code
The function below enforces the validate-first, fail-fast pattern for one or more GeoDataFrames. It handles missing CRS metadata, logs all transformation decisions, and returns aligned frames in the same structure (single or list) as the input.
import geopandas as gpd
import pyproj
import logging
from typing import Union, List, Sequence
from pyproj.exceptions import CRSError
logger = logging.getLogger(__name__)
TARGET_EPSG = 32633 # UTM Zone 33N — pin this in pipeline config
def align_geodataframes(
gdfs: Union[gpd.GeoDataFrame, Sequence[gpd.GeoDataFrame]],
target_crs: Union[str, int, pyproj.CRS] = TARGET_EPSG,
strict: bool = True,
) -> Union[gpd.GeoDataFrame, List[gpd.GeoDataFrame]]:
"""
Validate and reproject one or more GeoDataFrames to a shared target CRS.
Parameters
----------
gdfs : GeoDataFrame or sequence of GeoDataFrames
Input layers to align.
target_crs : str, int, or pyproj.CRS
Target CRS as EPSG code, EPSG string, or pyproj.CRS object.
strict : bool
If True (default), raise ValueError when source CRS is None.
If False, log a warning and assume EPSG:4326 for unprojected frames.
Returns
-------
GeoDataFrame or list of GeoDataFrames aligned to target_crs.
"""
try:
target = pyproj.CRS.from_user_input(target_crs)
except CRSError as exc:
raise ValueError(f"Invalid target CRS '{target_crs}'.") from exc
is_sequence = isinstance(gdfs, (list, tuple))
inputs = list(gdfs) if is_sequence else [gdfs]
aligned: List[gpd.GeoDataFrame] = []
for i, gdf in enumerate(inputs):
if not isinstance(gdf, gpd.GeoDataFrame):
raise TypeError(f"Input {i} is {type(gdf).__name__}, expected GeoDataFrame.")
if gdf.crs is None:
if strict:
raise ValueError(
f"GeoDataFrame {i} has no CRS metadata. "
"Assign the correct source CRS with .set_crs() before calling align_geodataframes(), "
"or pass strict=False to fall back to EPSG:4326."
)
logger.warning("GeoDataFrame %d has no CRS. Assuming EPSG:4326 — verify this is correct.", i)
gdf = gdf.set_crs("EPSG:4326")
if gdf.crs.equals(target):
logger.debug("GeoDataFrame %d already in target CRS — no reprojection needed.", i)
aligned.append(gdf)
continue
source_label = gdf.crs.to_epsg() or gdf.crs.to_string()
target_label = target.to_epsg() or target.to_string()
logger.info("Reprojecting GeoDataFrame %d from %s to %s.", i, source_label, target_label)
aligned.append(gdf.to_crs(target))
return aligned if is_sequence else aligned[0]For pipelines that concatenate GeoDataFrames from multiple sources before alignment, use the companion utility below. It audits a list of frames and returns a clear report before any reprojection is attempted, which surfaces mismatches early in development.
def audit_crs(gdfs: Sequence[gpd.GeoDataFrame], names: Sequence[str] | None = None) -> dict:
"""
Return a CRS audit report for a collection of GeoDataFrames.
The report maps each frame's name to its EPSG code, unit, and whether
it matches the majority CRS. Use this to diagnose mismatches before
calling align_geodataframes().
"""
labels = list(names) if names else [f"frame_{i}" for i in range(len(gdfs))]
report = {}
for label, gdf in zip(labels, gdfs):
if gdf.crs is None:
report[label] = {"epsg": None, "unit": None, "datum": None}
else:
report[label] = {
"epsg": gdf.crs.to_epsg(),
"unit": gdf.crs.axis_info[0].unit_name if gdf.crs.axis_info else "unknown",
"datum": gdf.crs.datum.name if gdf.crs.datum else "unknown",
}
epsgs = [v["epsg"] for v in report.values() if v["epsg"] is not None]
majority = max(set(epsgs), key=epsgs.count) if epsgs else None
for label in report:
report[label]["matches_majority"] = report[label]["epsg"] == majority
return reportStep-by-Step Walkthrough
The following walkthrough uses a realistic scenario: a land-parcel GeoDataFrame (EPSG:4326, from a GeoJSON export) joined to a building footprints layer (EPSG:3857, from a web tile service). The goal is to compute the count of buildings within each parcel — a common spatial feature for property valuation and urban ML models.
Step 1: Load the data and inspect CRS
import geopandas as gpd
parcels = gpd.read_file("parcels.geojson") # EPSG:4326 — degrees
buildings = gpd.read_file("buildings.gpkg") # EPSG:3857 — Web Mercator metres
print(parcels.crs) # EPSG:4326
print(buildings.crs) # EPSG:3857Confirm the mismatch with the audit utility before proceeding:
report = audit_crs([parcels, buildings], names=["parcels", "buildings"])
# {'parcels': {'epsg': 4326, 'unit': 'degree', ...},
# 'buildings': {'epsg': 3857, 'unit': 'metre', ...}}Step 2: Choose the target CRS
Both datasets cover a regional study area in central Europe. UTM Zone 33N (EPSG:32633) gives metre-unit coordinates without the area distortion of Web Mercator, making it correct for distance and area features. Define the target once:
TARGET_EPSG = 32633Step 3: Align both layers
parcels_utm, buildings_utm = align_geodataframes(
[parcels, buildings],
target_crs=TARGET_EPSG,
)Both frames now share EPSG:32633. Coordinate values are in metres.
Step 4: Run the spatial join
joined = gpd.sjoin(
parcels_utm,
buildings_utm[["geometry", "building_id"]],
how="left",
predicate="contains",
)
building_counts = (
joined.groupby(joined.index)["building_id"]
.count()
.rename("building_count")
)
parcels_utm = parcels_utm.join(building_counts).fillna({"building_count": 0})Step 5: Compute distance features in metre-unit CRS
With the layers aligned to a metric CRS, distance calculations are in metres, not degrees. This is the prerequisite for feature scaling for geospatial inputs steps that normalise distance columns before model training.
from shapely.ops import nearest_points
# Centroid of each parcel
parcel_centroids = parcels_utm.copy()
parcel_centroids["geometry"] = parcels_utm.geometry.centroid
# Distance to nearest building centroid
building_centroids = buildings_utm.copy()
building_centroids["geometry"] = buildings_utm.geometry.centroid
parcel_centroids["nearest_building_m"] = parcel_centroids.geometry.apply(
lambda pt: building_centroids.geometry.distance(pt).min()
)Verification
After alignment, confirm correctness with three assertions before passing the data to any feature computation step.
# 1. Both frames in the same CRS
assert parcels_utm.crs.to_epsg() == TARGET_EPSG, "parcels not in target CRS"
assert buildings_utm.crs.to_epsg() == TARGET_EPSG, "buildings not in target CRS"
# 2. CRS is metric (units = metres)
assert parcels_utm.crs.axis_info[0].unit_name == "metre", (
"Target CRS is not metric — distance features will be in angular units."
)
# 3. Spatial overlap — total_bounds should share a common rectangle
px0, py0, px1, py1 = parcels_utm.total_bounds
bx0, by0, bx1, by1 = buildings_utm.total_bounds
assert px0 < bx1 and px1 > bx0 and py0 < by1 and py1 > by0, (
"No spatial overlap between parcels and buildings after alignment. "
"Check that both datasets cover the same geographic region."
)
print("All alignment assertions passed.")Log these assertion results alongside your experiment metadata. If a batch job fails the overlap check, it typically means a new data extract has drifted to a different tile or bounding box — a data engineering issue, not a code bug.
FAQ
What is the difference between .set_crs() and .to_crs() in GeoPandas?
.set_crs() only attaches CRS metadata to a GeoDataFrame without transforming any coordinates — use it only when the file has no embedded projection and you know what the correct CRS is. .to_crs() performs an actual mathematical reprojection, converting every coordinate from the source CRS to the target CRS. Confusing the two is one of the most common sources of silent spatial corruption in geospatial ML pipelines.
Why does a spatial join return no results even though the geometries look correct?
The most common cause is a CRS mismatch between the two GeoDataFrames. GeoPandas performs spatial joins in the coordinate space of the left frame. If the right frame is in a different CRS, its geometries fall in entirely different coordinate ranges and no intersections are found. Always call gdf.crs.equals(other.crs) before any spatial join and reproject the right frame to match the left.
Should I reproject my GeoDataFrame or my raster when they have different CRS values?
Reproject the vector layer to match the raster. Reprojecting a raster requires pixel resampling, which introduces interpolation artifacts and can change spectral or elevation values. Vector reprojection with .to_crs() is lossless for attribute data and only introduces sub-millimetre coordinate rounding errors, making it the preferred direction. See CRS Alignment and Projection Handling for the full raster alignment workflow.
Related
- CRS Alignment and Projection Handling — full pipeline: raster alignment, datum grids, cross-modal validation
- Vector Proximity and Buffer Generation — distance and buffer features that depend on a metric CRS
- Feature Scaling for Geospatial Inputs — normalising coordinate and distance columns after alignment
- Creating Distance Matrices for Spatial Features — pairwise distance computation in a correctly projected space
Part of: CRS Alignment and Projection Handling Part of: Spatial Feature Engineering for Machine Learning