Memory-Aware Serialization (Lazy Loading & Quantization)

Overview

This plan details the implementation of "Memory-Aware" serialization in lodum. The goal is to allow Data Analysts to work with large, quantized datasets (e.g., ML weights, high-frequency sensor data) in memory-constrained environments like Pyodide/WASM without triggering Out-Of-Memory (OOM) errors.

Problem Statement

Standard Python serialization (and many modern alternatives) expands data upon loading. For example, 1GB of 4-bit quantized data expands eightfold to 8GB when converted to float32 NumPy arrays, and far more when boxed into standard Python floats. In environments like WASM, where a 4GB memory ceiling applies, this makes many data science tasks impossible.

Implementation Approach

1. Lazy Field & Partial Object Loading

Introduce a lazy=True option to field(). This provides two tiers of memory optimization:

  • Lazy Tensor/Array Access: For large arrays (NumPy/Tensors), the loader captures the byte-offset and length, returning a proxy. Materialization occurs only when slices are accessed.
  • Lazy Field Access (Complex Objects): For nested @lodum objects or large dictionaries, the field is skipped during initial parse. Accessing the attribute triggers a targeted "sub-parse" of only that section of the buffer.
from typing import Any, Dict

import numpy as np

from lodum import field, lodum  # assumed top-level exports

@lodum
class LargeModel:
    metadata: Dict[str, Any]
    weights: np.ndarray = field(lazy=True)      # tensor proxy: bounds recorded, data read on slice
    extra_data: ComplexInfo = field(lazy=True)  # object proxy: sub-parsed on attribute access
  • Generated Bytecode: The compiler generates a "skip-and-record" instruction that stores the buffer reference and bounds for the lazy field.
  • Seekable Requirements: This feature requires a seekable data source (e.g., BytesIO or mmap) to allow random access to lazy segments.
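The skip-and-record mechanism can be sketched independently of lodum's compiler. The following is a minimal illustration, not lodum's actual implementation: the `LazySlot` name is hypothetical, and `pickle` stands in for whatever decoder the generated bytecode would invoke. The key idea is that the initial parse stores only `(source, offset, length)` for the skipped field, and attribute access triggers a targeted sub-parse.

```python
import io
import pickle

class LazySlot:
    """Records where a skipped field lives in a seekable buffer.

    Hypothetical sketch: lodum's generated "skip-and-record" bytecode
    would store the same (source, offset, length) triple instead of
    decoding the field during the initial parse.
    """

    def __init__(self, source, offset, length):
        self._source = source   # seekable file-like object (BytesIO, mmap)
        self._offset = offset   # byte offset of the skipped segment
        self._length = length
        self._value = None
        self._loaded = False

    def materialize(self):
        # Targeted sub-parse: seek back and decode only this segment.
        if not self._loaded:
            self._source.seek(self._offset)
            payload = self._source.read(self._length)
            self._value = pickle.loads(payload)  # stand-in decoder
            self._loaded = True
        return self._value

# Build a buffer with an eager field followed by a large lazy field.
buf = io.BytesIO()
buf.write(pickle.dumps({"name": "model-a"}))
lazy_offset = buf.tell()
buf.write(pickle.dumps(list(range(1000))))
lazy_length = buf.tell() - lazy_offset

buf.seek(0)
metadata = pickle.load(buf)                      # parsed eagerly
slot = LazySlot(buf, lazy_offset, lazy_length)   # skipped; only bounds kept
```

This is also why a seekable source is mandatory: `materialize` must be able to jump back to an arbitrary offset, which a network socket cannot do.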

2. Quantization-Aware Handlers (lodum.ext.ml)

To maintain a lean core, advanced ML-specific logic (bit-packing, quantization scales) will reside in the lodum.ext.ml namespace. This will be an optional "extra" (pip install "lodum[ml]").

  • Bit-Packed Dtypes: Support for q4_0, q4_k, int8, etc.
  • AST Generation: The load_codegen engine will generate specialized bit-shifting loops ((val >> 4) & 0x0F) to unpack data directly into the target representation.
  • Metadata Coupling: Support for block-level quantization parameters (scales and zero-points) that are read and applied during access.
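The unpack-and-dequantize step can be sketched in plain NumPy. This is an illustrative stand-in for the loops load_codegen would emit, not lodum code; the q4 layout assumed here (two nibbles per byte, low nibble first, one scale and zero-point per block) is an assumption, and real formats like q4_k use more elaborate block structures.

```python
import numpy as np

def unpack_q4(packed: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Unpack two 4-bit values per byte, then apply scale/zero-point.

    Vectorized form of the bit-shifting loop ((val >> 4) & 0x0F) the
    codegen would specialize per dtype.
    """
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    # Interleave the nibbles back into their original element order.
    q = np.empty(packed.size * 2, dtype=np.uint8)
    q[0::2] = low
    q[1::2] = high
    return (q.astype(np.float32) - zero_point) * scale

packed = np.array([0x21, 0x43], dtype=np.uint8)   # nibbles 1, 2, 3, 4
values = unpack_q4(packed, scale=0.5, zero_point=0)
# values → [0.5, 1.0, 2.0 → no: 1.5, 2.0]
```

A generated loop would fuse this with the block-level metadata read, so scales and zero-points are applied during access rather than in a separate pass.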

3. Zero-Copy Architecture & Memory Mapping

Leverage Python's memoryview and mmap to avoid memory expansion.

  • Buffer Management: Use memoryview to slice the input buffer without copying.
  • mmap Integration: When loading from a file, lodum can automatically use mmap to map the file into address space, allowing the OS to handle paging and keeping the Python heap footprint minimal.
  • Interleaved Data: Support formats like GGUF and Safetensors, allowing the parser to jump between descriptors and large data blocks.
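The buffer-management path can be demonstrated with the standard library alone. This sketch (file layout and header are made up for illustration) shows the intended property: after mapping, the Python heap holds only view objects, and the OS pages weight data in on first touch.

```python
import mmap
import os
import tempfile

import numpy as np

# Write a file with a small descriptor followed by a float32 data block.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
data = np.arange(1024, dtype=np.float32)
with open(path, "wb") as f:
    f.write(b"HDR0")           # 4-byte stand-in descriptor
    f.write(data.tobytes())

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mm)      # slicing this never copies file contents
    # np.frombuffer wraps the mapped pages directly; no expansion occurs.
    weights = np.frombuffer(view[4:], dtype=np.float32)
```

The same pattern generalizes to interleaved formats: the parser reads descriptors eagerly, then builds zero-copy views over the large data blocks they point to.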

Use Cases

  • Browser-based LLMs: Managing model metadata and weights in Pyodide without crashing the tab.
  • Edge Computing: Analyzing high-frequency sensor streams on resource-constrained hardware.
  • Dequantize-on-Access: Providing a NumPy-compatible interface that only performs floating-point math on the specific slice being accessed.
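The dequantize-on-access use case amounts to a proxy whose `__getitem__` does the floating-point math lazily. A minimal sketch, assuming int8 storage with a single scale (the `QuantizedArray` name and interface are hypothetical, not lodum's API):

```python
import numpy as np

class QuantizedArray:
    """Dequantize-on-access proxy (illustrative sketch).

    Stores int8 data plus a scale, and converts to float32 only for
    the slice actually requested, so the full array is never expanded.
    """

    def __init__(self, quantized: np.ndarray, scale: float):
        self._q = quantized    # int8 storage, never expanded wholesale
        self._scale = scale

    def __getitem__(self, index) -> np.ndarray:
        # Float math happens only on the accessed slice.
        return self._q[index].astype(np.float32) * self._scale

    @property
    def shape(self):
        return self._q.shape

q = QuantizedArray(np.array([-4, 0, 10], dtype=np.int8), scale=0.25)
element = q[2]   # only this element is dequantized
```

A fuller NumPy-compatible interface would add `dtype`, `__array__`, and block-wise scale lookup, but the access pattern stays the same.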

Relationship to Streaming Support

While both this plan and the Streaming Support Plan address memory constraints, they solve different problems:

  • Streaming (load_stream): Focuses on horizontal scale (iterating over millions of objects). It works on sequential, non-seekable streams (like network sockets).
  • Memory-Aware (lazy=True): Focuses on vertical depth (handling massive fields within a single object). It requires a seekable source (like a local file or BytesIO) to allow the proxy to "jump back" and read data on-demand.

Synergy: When combined, lodum can stream a large list of objects where each individual object is also lazily loaded, providing maximum memory efficiency for complex data science pipelines.
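The combined mode can be sketched over a seekable file of length-prefixed records. The framing (`[len][header][len][payload]`) and the `load_stream_lazy` name are assumptions for illustration: the outer loop is the streaming axis (one record at a time), while each record's payload is skip-and-recorded rather than read, which is the lazy axis.

```python
import io
import struct

def load_stream_lazy(source):
    """Stream records from a seekable source, deferring each payload.

    Yields (header, (offset, length)) pairs; the payload bytes are
    skipped, not copied, and can be materialized later via seek/read.
    """
    while True:
        raw = source.read(4)
        if len(raw) < 4:
            return
        header = source.read(struct.unpack("<I", raw)[0]).decode()
        plen = struct.unpack("<I", source.read(4))[0]
        offset = source.tell()
        source.seek(plen, io.SEEK_CUR)    # skip-and-record, no copy
        yield header, (offset, plen)      # lazy bounds instead of bytes

# Build a two-record file in memory.
buf = io.BytesIO()
for name, payload in [("a", b"x" * 10), ("b", b"y" * 20)]:
    n = name.encode()
    buf.write(struct.pack("<I", len(n)) + n)
    buf.write(struct.pack("<I", len(payload)) + payload)
buf.seek(0)

records = list(load_stream_lazy(buf))
# Materialize one payload on demand via its recorded bounds.
offset, length = records[1][1]
buf.seek(offset)
payload = buf.read(length)
```

Note the asymmetry from the plan above carries through: the streaming loop only ever moves forward, but materializing a recorded payload requires seeking, so the combined mode inherits the seekable-source requirement of lazy=True.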

Milestones

  1. Phase 1: Prototype lazy field skipping in the AST compiler.
  2. Phase 2: Implement basic QuantizedArray proxy for 8-bit data.
  3. Phase 3: Add bit-packed (4-bit) AST generation logic.
  4. Phase 4: Full integration with NumPy/Polars for zero-copy views.