Optimizing memory usage for large vector datasets requires abandoning monolithic in-memory loading in favor of chunked I/O, columnar storage, aggressive type downcasting, and out-of-core computation. By streaming features through pyogrio, persisting scaled arrays via numpy.memmap, and deferring spatial joins until necessary, you can process tens of millions of geometries without triggering OOM errors during training, inference, or drift detection. This workflow is foundational to modern Spatial Feature Engineering for Machine Learning practices, where memory footprint directly dictates pipeline scalability and deployment viability.
Core Memory Optimization Patterns
Vector datasets carry hidden overhead: float64 coordinate precision, unindexed geometry columns, and redundant attribute tables. Implement these four patterns to keep RAM consumption bounded:
- Migrate to Columnar Formats: Legacy formats like Shapefiles and GeoJSON force full deserialization and lack compression. Convert to GeoParquet, which stores geometries as WKB alongside compressed columns (ZSTD is a common codec choice). Disk reductions of 60–80% are often achievable, and columnar projection means you only deserialize the attributes your model requires.
- Implement Chunked I/O: Replace geopandas.read_file() with row-sliced pyogrio reads or lazy dask_geopandas execution. Dask constructs a task graph that materializes partitions on demand, enabling you to chain extraction, Feature Scaling for Geospatial Inputs, and prediction without holding the full dataset in RAM.
- Downcast Aggressively: Geospatial attributes rarely require 64-bit precision. Convert float64 → float32 and int64 → int32 where the value range allows, and encode repetitive strings as category. Coordinates in degrees can usually be rounded to 5 decimal places (~1 m) and, once extracted as arrays, stored as float32, halving their memory without degrading most spatial models.
- Persist with Memory-Mapped Arrays: Instead of accumulating scaled features in memory, write them to disk-backed arrays. numpy.memmap allows random access to transformed data without RAM duplication, enabling seamless handoff to PyTorch or TensorFlow data loaders. See the official NumPy memmap documentation for dtype and mode specifications.
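
As a minimal sketch of the last two patterns, the snippet below downcasts a small synthetic attribute table and writes its numeric columns to a disk-backed array. The column names, the downcast_frame helper, and the temporary file path are illustrative assumptions, not part of any particular dataset or library.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def downcast_frame(df: pd.DataFrame, category_cols: list[str]) -> pd.DataFrame:
    """Return a copy of df with narrower dtypes (hypothetical helper)."""
    out = df.copy()
    for col in out.select_dtypes(include=["float64"]).columns:
        out[col] = out[col].astype(np.float32)
    int32 = np.iinfo(np.int32)
    for col in out.select_dtypes(include=["int64"]).columns:
        # Narrow only when every value fits, otherwise values would wrap.
        if out[col].min() >= int32.min and out[col].max() <= int32.max:
            out[col] = out[col].astype(np.int32)
    for col in category_cols:
        out[col] = out[col].astype("category")
    return out


# Synthetic attribute table standing in for one chunk of a vector layer.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "elevation": rng.normal(500.0, 50.0, n),
    "population": rng.integers(0, 100_000, n),
    "land_use": rng.choice(["urban", "forest", "water"], n),
})

before = df.memory_usage(deep=True).sum()
small = downcast_frame(df, category_cols=["land_use"])
after = small.memory_usage(deep=True).sum()

# Persist the numeric columns to a disk-backed float32 array.
numeric = small[["elevation", "population"]].to_numpy(dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "features.dat")
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=numeric.shape)
mm[:] = numeric
mm.flush()
del mm

# A raw memmap file has no header, so dtype and shape must be supplied again.
ro = np.memmap(path, dtype=np.float32, mode="r", shape=numeric.shape)
```

Because the raw file carries no metadata, it is worth storing the dtype and shape next to it (for example in a small JSON sidecar) so that downstream loaders can reopen it correctly.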
Production-Ready Streaming Pipeline
The following pattern streams a large GeoPackage, downcasts in-place, fits a scaler incrementally, and writes transformed features to a memory-mapped array for downstream training. It assumes you are scaling numeric attributes and dropping geometry for the ML pipeline.
```python
import os
import warnings

import numpy as np
import pyogrio
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore", category=UserWarning)


def stream_and_scale_to_memmap(
    input_path: str,
    output_path: str,
    chunk_size: int = 50_000,
) -> np.memmap:
    """
    Streams vector data, downcasts numerics, incrementally scales,
    and writes to a memory-mapped array.
    """
    info = pyogrio.read_info(input_path)
    total_rows = info["features"]

    # 1. Schema inspection (one feature is enough to determine numeric columns)
    first_chunk = pyogrio.read_dataframe(
        input_path, skip_features=0, max_features=1, read_geometry=False
    )
    numeric_cols = first_chunk.select_dtypes(include=["number"]).columns.tolist()
    n_features = len(numeric_cols)

    def read_chunk(start: int) -> np.ndarray:
        # Read only the numeric attributes (no geometry), downcast to float32
        chunk = pyogrio.read_dataframe(
            input_path,
            columns=numeric_cols,
            read_geometry=False,
            skip_features=start,
            max_features=chunk_size,
        )
        return chunk[numeric_cols].to_numpy(dtype=np.float32)

    # 2. Pre-allocate memory-mapped array (float32 for scaled outputs)
    if os.path.exists(output_path):
        os.remove(output_path)
    memmap = np.memmap(
        output_path, dtype=np.float32, mode="w+", shape=(total_rows, n_features)
    )

    # 3. First pass: learn the global mean/variance incrementally.
    #    Transforming inside this loop would scale early chunks with
    #    statistics that have not yet seen the rest of the data.
    scaler = StandardScaler()
    for start in range(0, total_rows, chunk_size):
        scaler.partial_fit(read_chunk(start))

    # 4. Second pass: transform with the final statistics and persist
    for start in range(0, total_rows, chunk_size):
        X_chunk = read_chunk(start)
        memmap[start:start + len(X_chunk)] = scaler.transform(X_chunk)
        # Write modified pages to disk so the OS can reclaim them
        memmap.flush()
        del X_chunk

    return memmap
```
Key implementation notes:
- partial_fit enables true out-of-core scaling. The scaler learns global mean/variance across chunks without loading all data simultaneously.
- memmap.flush() writes modified pages to disk, which lets the OS evict them from the page cache instead of accumulating dirty pages during long-running jobs.
- Geometry is intentionally excluded from the memmap. If spatial coordinates are model inputs, append them as a separate float32 array after rounding.
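
The first note can be checked on synthetic data. The sketch below, which assumes a small in-memory array sorted by its first column to mimic spatially ordered rows, compares a scaler fitted chunk by chunk with one fitted on the full array, and with one that has only seen the first chunk.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=[10.0, -3.0, 250.0], scale=[2.0, 0.5, 40.0], size=(10_000, 3))
# Sort by the first column so that chunks are not random samples,
# as is common when features are stored in spatial order.
X = X[np.argsort(X[:, 0])]

# Reference: statistics from a single in-memory fit.
full = StandardScaler().fit(X)

# Out-of-core style: statistics accumulated over ten chunks.
incremental = StandardScaler()
for start in range(0, len(X), 1_000):
    incremental.partial_fit(X[start:start + 1_000])

# A scaler that has only seen the first chunk.
early = StandardScaler().partial_fit(X[:1_000])

means_match = np.allclose(full.mean_, incremental.mean_)
vars_match = np.allclose(full.var_, incremental.var_)
early_gap = abs(early.mean_[0] - full.mean_[0])
```

The incremental statistics agree with the full fit, while the first-chunk scaler is far off on the sorted column. This is why chunks should be transformed only once every chunk has contributed to partial_fit.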
MLOps Integration & Deployment Notes
Memory-optimized pipelines shift the bottleneck from RAM to I/O throughput. When deploying to cloud or edge environments, align your storage tier with access patterns:
- Training Phase: Use NVMe-backed scratch disks for memmap files. Sequential writes during chunk processing saturate bandwidth efficiently.
- Inference Phase: Load pre-scaled memmap arrays directly into batched data loaders. Avoid re-fitting the scaler at inference time; serialize the fitted StandardScaler via joblib and apply it to incoming requests.
- Drift Detection: Compute rolling statistics (mean, variance, null rates) per chunk during ingestion. Store lightweight summaries in a metadata table rather than materializing full feature distributions. This keeps monitoring overhead proportional to feature count, not row count.
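
One way to keep drift summaries proportional to feature count is to merge per-chunk statistics into a running summary. The sketch below is a minimal illustration using the standard pairwise mean/variance merge; the RunningSummary class is a hypothetical helper, not a library API, and it treats NaN as the null marker.

```python
import numpy as np


class RunningSummary:
    """Per-feature running mean, variance, and null rate (hypothetical helper)."""

    def __init__(self, n_features: int) -> None:
        self.rows = 0
        self.valid = np.zeros(n_features)  # non-null count per feature
        self.mean = np.zeros(n_features)
        self.m2 = np.zeros(n_features)     # sum of squared deviations

    def update(self, chunk: np.ndarray) -> None:
        """Merge one (rows, n_features) chunk into the summary."""
        self.rows += chunk.shape[0]
        mask = ~np.isnan(chunk)
        n_b = mask.sum(axis=0).astype(float)
        mean_b = np.where(mask, chunk, 0.0).sum(axis=0) / np.maximum(n_b, 1)
        m2_b = (np.where(mask, chunk - mean_b, 0.0) ** 2).sum(axis=0)

        # Pairwise merge of (count, mean, M2) from two partitions.
        n_a = self.valid
        total = n_a + n_b
        safe_total = np.maximum(total, 1)
        delta = mean_b - self.mean
        self.mean = self.mean + delta * n_b / safe_total
        self.m2 = self.m2 + m2_b + delta**2 * n_a * n_b / safe_total
        self.valid = total

    @property
    def variance(self) -> np.ndarray:
        return self.m2 / np.maximum(self.valid, 1)

    @property
    def null_rate(self) -> np.ndarray:
        return 1.0 - self.valid / max(self.rows, 1)


# Synthetic chunks with roughly 5% missing values.
rng = np.random.default_rng(1)
X = rng.normal(size=(5_000, 3))
X[rng.random(X.shape) < 0.05] = np.nan

summary = RunningSummary(n_features=3)
for start in range(0, len(X), 500):
    summary.update(X[start:start + 500])
```

Only four small vectors are retained regardless of how many rows pass through, so the summary can be written to a metadata table after each ingestion run and compared against the training baseline.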
Deferring expensive spatial operations until after scaling prevents intermediate DataFrame bloat. Use coarse spatial indexes (e.g., R-tree bounding-box lookups or H3 grid cells) for pre-filtering, then execute exact joins only on the reduced candidate set. This pattern aligns with production-grade geospatial MLOps, where deterministic memory ceilings are required for autoscaling and container orchestration.
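
The filter-then-refine idea can be illustrated without any spatial library. In the sketch below, a plain bounding-box comparison stands in for an R-tree or H3 lookup (neither is implemented here), and the exact distance test runs only on the surviving candidates. The function name and the planar coordinates are assumptions for illustration.

```python
import numpy as np


def points_within_radius(
    points: np.ndarray, center: tuple[float, float], radius: float
) -> tuple[np.ndarray, int]:
    """Return indices of points within radius of center, plus candidate count."""
    cx, cy = center
    # Coarse filter: the circle's axis-aligned bounding box.
    in_box = (
        (points[:, 0] >= cx - radius) & (points[:, 0] <= cx + radius)
        & (points[:, 1] >= cy - radius) & (points[:, 1] <= cy + radius)
    )
    candidates = np.flatnonzero(in_box)
    # Exact test: squared Euclidean distance, evaluated on candidates only.
    d2 = ((points[candidates] - np.array(center)) ** 2).sum(axis=1)
    return candidates[d2 <= radius**2], len(candidates)


rng = np.random.default_rng(7)
points = rng.uniform(0.0, 100.0, size=(100_000, 2))
center = (50.0, 50.0)
radius = 5.0

hits, n_candidates = points_within_radius(points, center, radius)

# Brute-force reference: exact test on every point.
brute = np.flatnonzero(((points - np.array(center)) ** 2).sum(axis=1) <= radius**2)
```

The exact predicate touches only a small fraction of the points, which is the same effect a real spatial index provides for polygon joins at a much larger scale.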
Common Pitfalls to Avoid
| Pitfall | Impact | Fix |
|---|---|---|
| geopandas.read_file() on >1M rows | Immediate OOM or swap thrashing | Switch to pyogrio.read_dataframe with skip_features/max_features |
| Scaling with .fit() on full dataset | RAM spike during transform | Use .partial_fit() + memmap for out-of-core learning |
| Retaining float64 geometry | 2× memory waste per coordinate | Round to 5 decimals, cast to float32 before ingestion |
| Uncompressed intermediate outputs | Disk I/O bottleneck during training | Write memmap to fast storage; compress final Parquet artifacts |
| Loading full attribute table | Unnecessary deserialization overhead | Use column projection (columns=["id", "value"]) in pyogrio |
By enforcing strict memory budgets at the ingestion layer, you eliminate unpredictable scaling failures and enable reproducible, horizontally scalable geospatial ML pipelines.