Optimizing Memory Usage for Large Vector Datasets

Optimize memory for large vector datasets in Python: chunked I/O with pyogrio, GeoParquet columnar storage, type downcasting, and numpy.memmap for out-of-core ML.

TL;DR: Replace geopandas.read_file() with chunked pyogrio.read_dataframe(), downcast numeric columns to float32, and persist scaled features to a numpy.memmap file. This keeps RAM consumption flat regardless of dataset size and feeds PyTorch or TensorFlow data loaders without duplication. The streaming approach is part of the Feature Scaling for Geospatial Inputs workflow described in its parent section.


Why This Fails in Geospatial ML Pipelines

Vector datasets carry silent overhead that triggers out-of-memory (OOM) errors in ways that are hard to predict from file size alone. A 2 GB GeoPackage of urban parcels with float64 coordinates and a wide attribute table can expand to 12–18 GB in RAM after deserialization, spatial index construction, and pandas object conversion — before a single scaling operation runs.

The failure is often silent: Linux OOM-killer terminates the training process with no Python traceback, or the job appears to hang while the OS swaps aggressively to disk. Downstream steps corrupted by this include:

  • Scaler fitting: calling StandardScaler.fit() on an in-memory GeoDataFrame produces a peak RAM spike equal to 2–3× the dataset size (original data + transformed copy).
  • Spatial joins: joining large attribute tables before filtering by extent loads all geometry into memory even when only a small spatial subset is needed.
  • Intermediate writes: converting large GeoDataFrames to uncompressed CSV or Shapefile during pipeline steps amplifies disk pressure and I/O latency.

The solution is to enforce a strict memory ceiling at the ingestion layer rather than hoping the dataset fits.


Core Memory Reduction Principles

Apply these principles in order — each compounds the savings of the previous one:

  • Migrate to columnar formats first. Legacy formats like Shapefile and GeoJSON force full deserialization and lack compression. Converting to GeoParquet (WKB geometry + ZSTD-compressed columns) typically reduces disk size by 60–80% and enables column projection: only the columns your model needs are read from disk.
  • Load in chunks, not monolithically. Use pyogrio.read_dataframe() with skip_features and max_features rather than geopandas.read_file(). For lazy execution across the whole dataset, dask_geopandas builds a task graph that materializes partitions on demand.
  • Downcast aggressively before any computation. Convert float64float32, int64int32, and encode repetitive string columns as category. Coordinate precision can safely round to 5–6 decimal places (~1 m accuracy), halving geometry memory without degrading model inputs.
  • Use partial_fit for out-of-core scaling. StandardScaler.partial_fit() accumulates mean and variance statistics across chunks. After streaming the full dataset, the scaler is equivalent to fitting on the complete data.
  • Persist scaled arrays to disk, not RAM. numpy.memmap provides random access to a disk-backed float array without loading it into RAM. PyTorch’s DataLoader and TensorFlow’s tf.data both support memmap-backed datasets natively.
  • Defer spatial operations until after filtering. Use R-tree or H3 grid pre-filtering to reduce candidate rows before running exact spatial joins. Joining against the full geometry column is the single largest source of intermediate DataFrame bloat during Spatial Feature Engineering for Machine Learning pipelines.

Production-Ready Code

The function below streams a large vector file, downcasts numerics in-place, fits a scaler incrementally, and writes the scaled output to a memmap file. It handles the full pipeline in a single pass.

import pyogrio
import numpy as np
from sklearn.preprocessing import StandardScaler
import os
import warnings
import logging

warnings.filterwarnings("ignore", category=UserWarning)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)


def stream_and_scale_to_memmap(
    input_path: str,
    output_path: str,
    chunk_size: int = 50_000,
    columns: list[str] | None = None,
) -> tuple[np.memmap, list[str], StandardScaler]:
    """
    Stream a vector dataset, downcast numerics, incrementally fit a
    StandardScaler, and write float32 scaled features to a memmap file.

    Args:
        input_path:  Path to GeoPackage, GeoParquet, or Shapefile.
        output_path: Path for the output .npy memmap file.
        chunk_size:  Rows per chunk (tune to available RAM; 50k is safe
                     for ~500 MB working sets).
        columns:     Explicit list of numeric columns to include. If None,
                     all numeric columns detected from the first chunk are used.

    Returns:
        Tuple of (memmap array, column list, fitted StandardScaler).
    """
    info = pyogrio.read_info(input_path)
    total_rows: int = info["features"]
    logger.info("Dataset: %d rows in %s", total_rows, input_path)

    # Schema inspection — read first chunk to detect numeric columns
    first_chunk = pyogrio.read_dataframe(
        input_path, skip_features=0, max_features=min(chunk_size, total_rows)
    )
    if columns is None:
        columns = first_chunk.select_dtypes(include=["number"]).columns.tolist()
    n_features = len(columns)
    logger.info("Scaling %d numeric columns: %s", n_features, columns)

    # Pre-allocate memory-mapped array (float32 halves RAM vs float64)
    if os.path.exists(output_path):
        os.remove(output_path)
    memmap = np.memmap(
        output_path, dtype=np.float32, mode="w+", shape=(total_rows, n_features)
    )

    scaler = StandardScaler()

    # Pass 1: partial_fit across all chunks to build global statistics
    for start in range(0, total_rows, chunk_size):
        chunk = pyogrio.read_dataframe(
            input_path, skip_features=start, max_features=chunk_size
        )
        X_chunk = chunk[columns].astype(np.float32).values
        scaler.partial_fit(X_chunk)
        del chunk, X_chunk
        logger.info("partial_fit: rows %d–%d", start, min(start + chunk_size, total_rows))

    # Pass 2: transform and write to memmap
    for start in range(0, total_rows, chunk_size):
        chunk = pyogrio.read_dataframe(
            input_path, skip_features=start, max_features=chunk_size
        )
        X_chunk = chunk[columns].astype(np.float32).values
        end_idx = min(start + chunk_size, total_rows)
        memmap[start:end_idx] = scaler.transform(X_chunk).astype(np.float32)
        # Flush forces OS page cache write, preventing bloat on long runs
        memmap.flush()
        del chunk, X_chunk

    logger.info("Wrote scaled array (%d × %d) to %s", total_rows, n_features, output_path)
    return memmap, columns, scaler

A two-pass design (one pass to partial_fit, one pass to transform) keeps each pass to a single sequential scan and prevents any intermediate state from accumulating in RAM. If your dataset is stored in GeoParquet with column pruning available, pass columns= explicitly to skip geometry deserialization entirely.

Connecting the memmap to a training loop

import joblib
from torch.utils.data import Dataset, DataLoader
import torch

# Persist the scaler for inference-time use
joblib.dump(scaler, "scaler.joblib")

class MemmapDataset(Dataset):
    """Wraps a numpy.memmap for use with PyTorch DataLoader."""

    def __init__(self, path: str, shape: tuple[int, int], labels_path: str | None = None):
        self.data = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
        self.labels = np.load(labels_path) if labels_path else None

    def __len__(self) -> int:
        return self.data.shape[0]

    def __getitem__(self, idx: int):
        x = torch.from_numpy(self.data[idx].copy())
        if self.labels is not None:
            return x, torch.tensor(self.labels[idx], dtype=torch.float32)
        return x

# Usage
dataset = MemmapDataset("features.npy", shape=(total_rows, n_features), labels_path="labels.npy")
loader = DataLoader(dataset, batch_size=1024, shuffle=True, num_workers=4)

The .copy() call in __getitem__ is required because memmap slices return non-writeable views; PyTorch tensors need writeable backing memory.


Step-by-Step Walkthrough

This example applies the pipeline to a national parcel dataset (5 million rows, 22 attribute columns, stored as a GeoPackage).

Step 1: Convert to GeoParquet

Before streaming, convert any legacy format to GeoParquet to enable column projection and compression:

import geopandas as gpd
import pyogrio

# Read the full file only once for format conversion
# (acceptable cost if done offline before pipeline runs)
gdf = pyogrio.read_dataframe("parcels_national.gpkg")
gdf.to_parquet(
    "parcels_national.parquet",
    compression="zstd",
    geometry_encoding="WKB",
)

After conversion, the 8.4 GB GeoPackage reduces to 2.1 GB on disk.

Step 2: Identify columns and inspect dtypes

import pyogrio

first = pyogrio.read_dataframe("parcels_national.parquet", max_features=1000)
print(first.dtypes)
# Select only the numeric columns relevant to your model
feature_cols = ["area_m2", "perimeter_m", "build_year", "floors", "lot_value", "dist_road_m"]

Step 3: Run the streaming pipeline

memmap_arr, col_names, fitted_scaler = stream_and_scale_to_memmap(
    input_path="parcels_national.parquet",
    output_path="parcels_scaled.npy",
    chunk_size=100_000,
    columns=feature_cols,
)

On a machine with 16 GB RAM, this processes 5 million rows without exceeding 3 GB peak usage. The memmap file is written to an NVMe-backed scratch disk for maximum sequential write throughput.

Step 4: Verify shape and statistics

assert memmap_arr.shape == (5_000_000, 6), f"Unexpected shape: {memmap_arr.shape}"

# Spot-check: scaled means should be near 0, stds near 1
sample = memmap_arr[:10_000]
print(f"Sample mean per column: {sample.mean(axis=0).round(3)}")
print(f"Sample std  per column: {sample.std(axis=0).round(3)}")
# Expected: means ≈ 0.0, stds ≈ 1.0

Step 5: Reload for training without re-scaling

import numpy as np

# Open read-only — no RAM copy, pure OS page-cache access
X_train = np.memmap("parcels_scaled.npy", dtype=np.float32, mode="r", shape=(5_000_000, 6))

Verification

Confirm correctness with three checks before starting any training run:

import numpy as np
import joblib

arr = np.memmap("parcels_scaled.npy", dtype=np.float32, mode="r", shape=(5_000_000, 6))
scaler = joblib.load("scaler.joblib")

# 1. No NaN or Inf in scaled output
assert not np.any(np.isnan(arr)), "NaN values in scaled array"
assert not np.any(np.isinf(arr)), "Inf values in scaled array"

# 2. Column-level mean and std (sample 100k rows for speed)
idx = np.random.choice(arr.shape[0], size=100_000, replace=False)
sample = arr[idx]
for i, col in enumerate(feature_cols):
    mean = sample[:, i].mean()
    std  = sample[:, i].std()
    assert abs(mean) < 0.05, f"{col}: mean={mean:.4f} far from 0"
    assert abs(std - 1.0) < 0.05, f"{col}: std={std:.4f} far from 1"

# 3. Inverse-transform round-trip on a small batch
original_chunk = pyogrio.read_dataframe("parcels_national.parquet", max_features=1000)
X_orig = original_chunk[feature_cols].astype(np.float32).values
X_scaled = scaler.transform(X_orig).astype(np.float32)
X_restored = scaler.inverse_transform(X_scaled)
np.testing.assert_allclose(X_orig, X_restored, rtol=1e-3)
print("All verification checks passed.")

MLOps Integration and Deployment Notes

Memory-optimised vector streaming pipeline Five stages connected by arrows: GeoParquet source, chunked pyogrio read, float32 downcast and partial_fit, memmap write, DataLoader and model training. GeoParquet (ZSTD + WKB) pyogrio chunk read float32 cast partial_fit StandardScaler memmap flush to disk DataLoader → model training RAM ceiling: one chunk at a time — dataset size irrelevant

Memory-optimized pipelines shift the bottleneck from RAM to I/O throughput. Align your storage tier with access patterns across training and inference phases:

Training: Use NVMe-backed scratch storage for memmap files. Sequential chunk writes during streaming saturate sequential bandwidth efficiently. Cloud equivalents: instance-store volumes on AWS, local SSD on GCP.

Inference: Load the pre-scaled memmap array directly into batched data loaders. Never re-scale at inference time from raw coordinates — this introduces latency and risks scaler mismatch. Serialize the fitted StandardScaler via joblib.dump() during training and load it at serving time. Apply it to incoming feature vectors before passing them to the model.

Drift detection: During ingestion, compute rolling statistics (mean, variance, null rate) per chunk and store lightweight summaries in a metadata table. This keeps monitoring overhead proportional to feature count, not row count — a 5-million-row ingestion adds microseconds of monitoring overhead per chunk rather than materializing full feature distributions. This connects directly to the Vector Proximity and Buffer Generation stage, where deferred spatial joins benefit from the same chunked pattern.

When using Spatial Lag and Neighborhood Statistics as model features, note that spatial lag computation requires global neighbor lookups — build the spatial weights matrix once against the full coordinate set (using an R-tree or STRtree index), serialize it, then apply it chunk-by-chunk during feature extraction rather than rebuilding it per chunk.


Common Pitfalls to Avoid

Pitfall Impact Fix
geopandas.read_file() on >1M rows OOM or swap thrashing before any ML code runs Use pyogrio.read_dataframe() with skip_features/max_features
StandardScaler.fit() on full in-memory array 2–3× RAM spike during transform() call Replace with partial_fit() across chunks, then transform() per chunk
Retaining float64 coordinates 2× memory cost per coordinate pair Round to 5 decimal places, cast to float32 before ingestion
Uncompressed intermediates (CSV, Shapefile) Disk I/O bottleneck; re-read slower than source Write intermediates as GeoParquet; write final arrays as memmap
Loading full attribute table when only 4 columns needed Unnecessary deserialization of all columns Pass columns=["id", "area", "value"] to pyogrio.read_dataframe()
Building spatial index inside chunk loop Quadratic R-tree construction cost Build the STRtree once on bounding box centroids, serialize, reuse

FAQ

Why does partial_fit() give different results than fit() on the same data?

It should not, up to floating-point accumulation order. partial_fit() accumulates the sum and sum-of-squares across chunks and computes mean and variance at the end. Because floating-point addition is not perfectly associative, you may see differences in the 6th–7th decimal place compared to fit() on the full array. This has no practical effect on float32 model inputs.

Can I use dask-geopandas instead of manual pyogrio chunks?

Yes. dask_geopandas.read_file() wraps pyogrio and returns a Dask GeoDataFrame with lazy evaluation. Use map_partitions() to apply downcast and partial_fit steps. The trade-off: Dask adds scheduling overhead (~5–15% slower on sequential pipelines) but simplifies parallel execution across cores. For single-machine sequential pipelines under 50 million rows, manual pyogrio chunking is faster and simpler.

Does the memmap file need to fit in RAM to be useful?

No. The OS maps pages from the file into virtual address space on demand and evicts them under memory pressure. A 40 GB memmap on a machine with 16 GB RAM works correctly — the OS loads only the pages currently being accessed. Performance degrades gracefully rather than crashing, unlike in-memory arrays that fail catastrophically at the allocation boundary.


Part of: Feature Scaling for Geospatial InputsSpatial Feature Engineering for Machine Learning