High-performance numerical computing on ragged and nested data
Grumpy is a high-performance Python library (Rust core) for numerical computing on ragged and nested data — Awkward-like layouts with strong typing, explicit nullability, mutable arrays, and Zarr-backed I/O.
Home¶
Grumpy is layout-first infrastructure for ragged and nested numerical data: protein tables, variable-length sequences, mixed annotation fields, and the batches that feed structural ML pipelines. Unlike ad-hoc Python lists, Grumpy keeps a typed layout tree in Rust so elementwise ops, reductions, neighbor search, and streaming I/O stay fast without hand-written loops.
This guide walks from installation through arrays, dataframes, Zarr streaming, and optional compile-time fusion. Each page builds on the previous one.
Installation¶
Grumpy requires Python ≥ 3.10 and a Rust toolchain when building from source.
PyPI¶
If a wheel is not yet available for your platform, build from source as below.
From source¶
git clone https://github.com/Imaginary-Biolabs/Grumpy.git
cd Grumpy
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -U pip maturin
maturin develop --release
pip install -e ".[dev]" # optional: pytest, mkdocs
Verify the install:
A tour of the main features¶
The snippets below are self-contained. Later pages explain each topic in depth.
Ragged arrays¶
Grumpy arrays mirror nested Python lists but carry a homogeneous dtype on every leaf and run kernels in Rust:
import grumpy as gr
x = gr.array([[1, 2, 3], [4, 5]], dtype=gr.int32)
print(x.to_list()) # [[1, 2, 3], [4, 5]]
print(x.mean(dim=1).to_list()) # [2.0, 4.5]
Union layouts (mixed scalar and list rows)¶
When one axis mixes singletons and lists — common for GO terms, isoform IDs, or mixture SMILES — use a union layout from the same constructor:
Dataframes with schema¶
Named columns share outer list structure; an optional schema names nesting levels for dot notation:
df = gr.dataframe(
{"id": ["a", "b"], "coords": [[1.0, 2.0], [3.0, 4.0, 5.0]]},
schema=["molecule", "atom"],
)
print(df.molecule.coords.to_list())
Save, stream, and transform batches¶
Persist to a Zarr directory, then iterate batches for training — with optional parallel apply:
gr.save(df, "data.gr", chunk_size=64)
for batch in gr.stream("data.gr", batch_size=32, workers=2):
batch = batch * 2.0
train_step(batch)
Compile fused transforms¶
When a batch function is simple enough, @gr.compile fuses it into one Rust plan (see Compilation):
@gr.compile
def scale(batch):
return batch * 2.0 + 1.0
st = gr.stream("data.gr", batch_size=32)
for batch in st.apply(scale, compile="auto", scheduler="auto"):
train_step(batch)
Performance¶
Representative public API timings on slightly ragged data (Grumpy, Awkward) vs rectangular NumPy with the same leaf count. Bar groups are Grumpy · NumPy · Awkward; lower is better. Charts are regenerated on each docs build.
Full benchmark suites live in benchmarks/ — see benchmarks/README.md for setup.
Next: Arrays — construct ragged arrays, run elementwise ops, and index into nested data.