Cumulus Labs — February 2026
Why we're rebuilding inference from scratch.
The industry runs AI on generic stacks built for last-gen GPUs. Same open-source engine, same rented H100s, wrapped in Kubernetes. The result: cold starts measured in minutes, fine-tuned models treated as afterthoughts, throughput capped by software that doesn't know the hardware.
We built Cumulus to fix this. From first principles.
Most platforms reload everything from scratch — container, weights, runtime. We snapshot your model's live execution state and restore from memory, not a cold boot.
Scale to zero. Pay nothing idle. Come back instantly.
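Mechanically, a restore is closer to a memcpy than a boot. The sketch below is a toy, not our snapshot engine (which also captures the CUDA context, allocator state, and in-flight runtime structures): it shows the core idea in plain CUDA C++, parking live device memory in pinned host RAM and bringing it back with a DMA copy instead of re-pulling a container and deserializing weights.

```cpp
// Toy sketch of snapshot/restore, assuming the model's live device state
// fits in a single buffer. A real snapshot also captures runtime state.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call) do { cudaError_t e = (call); \
  if (e != cudaSuccess) { fprintf(stderr, "%s\n", cudaGetErrorString(e)); exit(1); } } while (0)

int main() {
    const size_t bytes = 1ull << 30;                // stand-in for 1 GiB of live state

    void* device_state = nullptr;
    CHECK(cudaMalloc(&device_state, bytes));
    CHECK(cudaMemset(device_state, 0xAB, bytes));   // pretend: loaded weights, KV cache

    // Snapshot: copy live GPU state into pinned host memory, then release
    // the GPU entirely. This is "scale to zero".
    void* snapshot = nullptr;
    CHECK(cudaMallocHost(&snapshot, bytes));        // pinned => fast DMA path
    CHECK(cudaMemcpy(snapshot, device_state, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaFree(device_state));

    // Restore: one host-to-device copy. No container pull, no weight
    // deserialization, no warm-up work -- the state comes back as it was.
    CHECK(cudaMalloc(&device_state, bytes));
    CHECK(cudaMemcpy(device_state, snapshot, bytes, cudaMemcpyHostToDevice));

    CHECK(cudaFreeHost(snapshot));
    CHECK(cudaFree(device_state));
    printf("restored %zu bytes from a memory snapshot\n", bytes);
    return 0;
}
```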
You spend weeks training a LoRA. Then your serving platform treats it like an afterthought: slow adapter merging, no optimized kernels, cold starts that kill production traffic.
Cumulus serves fine-tuned models with the same cold-start guarantees and throughput as base models. One call. Production endpoint back.
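For the curious, merging a rank-r adapter into base weights is just W ← W + (α/r)·B·A. The naive CUDA sketch below shows that math and nothing else; a production path fuses it into a tuned GEMM, and every name and dimension here is illustrative.

```cpp
// Naive sketch: merge a LoRA adapter into base weights in place.
// B is [out_dim, r], A is [r, in_dim], W is [out_dim, in_dim].
#include <cuda_runtime.h>
#include <cstdio>

__global__ void merge_lora(float* W, const float* B, const float* A,
                           int out_dim, int in_dim, int r, float scale) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // output feature index
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // input feature index
    if (row >= out_dim || col >= in_dim) return;

    float delta = 0.f;
    for (int k = 0; k < r; ++k)                       // rank-r low-rank product
        delta += B[row * r + k] * A[k * in_dim + col];
    W[row * in_dim + col] += scale * delta;           // merged: no adapter overhead at serve time
}

int main() {
    const int out_dim = 1024, in_dim = 1024, r = 16;
    const float alpha = 32.f;

    float *W, *B, *A;
    cudaMalloc(&W, out_dim * in_dim * sizeof(float));
    cudaMalloc(&B, out_dim * r * sizeof(float));
    cudaMalloc(&A, r * in_dim * sizeof(float));
    // Weights would be loaded here; zero-init keeps the sketch short.
    cudaMemset(W, 0, out_dim * in_dim * sizeof(float));
    cudaMemset(B, 0, out_dim * r * sizeof(float));
    cudaMemset(A, 0, r * in_dim * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((in_dim + 15) / 16, (out_dim + 15) / 16);
    merge_lora<<<grid, block>>>(W, B, A, out_dim, in_dim, r, alpha / r);
    cudaDeviceSynchronize();
    printf("merged rank-%d adapter into %dx%d weights\n", r, out_dim, in_dim);

    cudaFree(W); cudaFree(B); cudaFree(A);
    return 0;
}
```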
ionattention is our custom C++ runtime built natively for NVIDIA Grace Hopper. GH200 isn't just a GPU; it's a system: 96 GB HBM3, 480 GB LPDDR5X, a 900 GB/s coherent CPU↔GPU link, 72 Arm cores. Most engines ignore all of that. We designed around it.
Coherent memory scheduling. Full SM utilization across all 132 streaming multiprocessors. Architecture-native kernels tuned for GH200's memory hierarchy — not a generic fallback.
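One kernel makes the coherence point concrete. On a platform with hardware address translation like GH200 (with heterogeneous memory management enabled in the driver), a GPU thread can dereference an ordinary malloc'd CPU pointer, pulling cache lines across NVLink-C2C. A minimal sketch, assuming such a coherent system; elsewhere you would fall back to cudaMallocManaged.

```cpp
// Minimal sketch of coherent CPU<->GPU memory on GH200: the kernel reads
// plain malloc'd host memory directly over the 900 GB/s NVLink-C2C link.
// Requires a coherent platform with HMM; not portable to ordinary PCIe boxes.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void sum_host_memory(const float* host_data, float* result, int n) {
    float acc = 0.f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += host_data[i];                 // direct load from CPU LPDDR5X
    atomicAdd(result, acc);
}

int main() {
    const int n = 1 << 20;
    float* data = (float*)malloc(n * sizeof(float));   // ordinary CPU allocation
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    float* result;
    cudaMallocManaged(&result, sizeof(float));
    *result = 0.f;

    sum_host_memory<<<1, 256>>>(data, result, n);      // GPU dereferences a CPU pointer
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %d)\n", *result, n);

    free(data);
    cudaFree(result);
    return 0;
}
```

This is the property that lets weights and KV cache spill into the 480 GB of CPU memory without a separate offload engine: the hardware moves the bytes, and the scheduler only decides what lives where.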
The stack should fit the silicon. If your engine doesn't change when the chip changes, you're leaving performance on the table.
You should pay only for what you use. Per-second billing. Scale to zero. No idle tax.
Fine-tuned models are the future. Every team will run their own weights. Infrastructure should make that trivial.