Cumulus Labs — February 2026
Why we're rebuilding inference from scratch.
The industry runs AI on generic stacks built for last-gen GPUs. Same open-source engine, same rented H100s, wrapped in Kubernetes. The result: cold starts measured in minutes, fine-tuned models treated as afterthoughts, throughput capped by software that doesn't know the hardware.
We built Cumulus to fix this. From first principles.
Most platforms reload everything from scratch — container, weights, runtime. We snapshot your model's live execution state and restore from memory, not a cold boot.
Scale to zero. Pay nothing idle. Come back instantly.
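Mechanically, a restore is closer to a memcpy than a boot. The sketch below is a toy, not our snapshot engine (which also captures the CUDA context, allocator state, and in-flight runtime structures): it shows the core idea in plain CUDA C++, parking live device memory in pinned host RAM and bringing it back with a DMA copy instead of re-pulling a container and deserializing weights.

```cpp
// Toy sketch of snapshot/restore, assuming the model's live device state
// fits in a single buffer. A real snapshot also captures runtime state.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call) do { cudaError_t e = (call); \
  if (e != cudaSuccess) { fprintf(stderr, "%s\n", cudaGetErrorString(e)); exit(1); } } while (0)

int main() {
    const size_t bytes = 1ull << 30;                // stand-in for 1 GiB of live state

    void* device_state = nullptr;
    CHECK(cudaMalloc(&device_state, bytes));
    CHECK(cudaMemset(device_state, 0xAB, bytes));   // pretend: loaded weights, KV cache

    // Snapshot: copy live GPU state into pinned host memory, then release
    // the GPU entirely. This is "scale to zero".
    void* snapshot = nullptr;
    CHECK(cudaMallocHost(&snapshot, bytes));        // pinned => fast DMA path
    CHECK(cudaMemcpy(snapshot, device_state, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaFree(device_state));

    // Restore: one host-to-device copy. No container pull, no weight
    // deserialization, no warm-up work -- the state comes back as it was.
    CHECK(cudaMalloc(&device_state, bytes));
    CHECK(cudaMemcpy(device_state, snapshot, bytes, cudaMemcpyHostToDevice));

    CHECK(cudaFreeHost(snapshot));
    CHECK(cudaFree(device_state));
    printf("restored %zu bytes from a memory snapshot\n", bytes);
    return 0;
}
```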
You spend weeks training a LoRA. Then your serving platform treats it like an afterthought: slow adapter merging, no optimized kernels, cold starts that kill production traffic.
Cumulus serves fine-tuned models with the same cold-start guarantees and throughput as base models. One call. Production endpoint back.
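For the curious, merging a rank-r adapter into base weights is just W ← W + (α/r)·B·A. The naive CUDA sketch below shows that math and nothing else; a production path fuses it into a tuned GEMM, and every name and dimension here is illustrative.

```cpp
// Naive sketch: merge a LoRA adapter into base weights in place.
// B is [out_dim, r], A is [r, in_dim], W is [out_dim, in_dim].
#include <cuda_runtime.h>
#include <cstdio>

__global__ void merge_lora(float* W, const float* B, const float* A,
                           int out_dim, int in_dim, int r, float scale) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // output feature index
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // input feature index
    if (row >= out_dim || col >= in_dim) return;

    float delta = 0.f;
    for (int k = 0; k < r; ++k)                       // rank-r low-rank product
        delta += B[row * r + k] * A[k * in_dim + col];
    W[row * in_dim + col] += scale * delta;           // merged: no adapter overhead at serve time
}

int main() {
    const int out_dim = 1024, in_dim = 1024, r = 16;
    const float alpha = 32.f;

    float *W, *B, *A;
    cudaMalloc(&W, out_dim * in_dim * sizeof(float));
    cudaMalloc(&B, out_dim * r * sizeof(float));
    cudaMalloc(&A, r * in_dim * sizeof(float));
    // Weights would be loaded here; zero-init keeps the sketch short.
    cudaMemset(W, 0, out_dim * in_dim * sizeof(float));
    cudaMemset(B, 0, out_dim * r * sizeof(float));
    cudaMemset(A, 0, r * in_dim * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((in_dim + 15) / 16, (out_dim + 15) / 16);
    merge_lora<<<grid, block>>>(W, B, A, out_dim, in_dim, r, alpha / r);
    cudaDeviceSynchronize();
    printf("merged rank-%d adapter into %dx%d weights\n", r, out_dim, in_dim);

    cudaFree(W); cudaFree(B); cudaFree(A);
    return 0;
}
```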
ionattention is our custom C++ runtime built natively for NVIDIA Grace Hopper. GH200 isn't just a GPU; it's a system: 96 GB HBM3, 480 GB LPDDR5X, a 900 GB/s coherent CPU↔GPU link, 72 Arm cores. Most engines ignore all of that. We designed around it.
Coherent memory scheduling. Full SM utilization across all 132 streaming multiprocessors. Architecture-native kernels tuned for GH200's memory hierarchy — not a generic fallback.
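One kernel makes the coherence point concrete. On a platform with hardware address translation like GH200 (with heterogeneous memory management enabled in the driver), a GPU thread can dereference an ordinary malloc'd CPU pointer, pulling cache lines across NVLink-C2C. A minimal sketch, assuming such a coherent system; elsewhere you would fall back to cudaMallocManaged.

```cpp
// Minimal sketch of coherent CPU<->GPU memory on GH200: the kernel reads
// plain malloc'd host memory directly over the 900 GB/s NVLink-C2C link.
// Requires a coherent platform with HMM; not portable to ordinary PCIe boxes.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void sum_host_memory(const float* host_data, float* result, int n) {
    float acc = 0.f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += host_data[i];                 // direct load from CPU LPDDR5X
    atomicAdd(result, acc);
}

int main() {
    const int n = 1 << 20;
    float* data = (float*)malloc(n * sizeof(float));   // ordinary CPU allocation
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    float* result;
    cudaMallocManaged(&result, sizeof(float));
    *result = 0.f;

    sum_host_memory<<<1, 256>>>(data, result, n);      // GPU dereferences a CPU pointer
    cudaDeviceSynchronize();
    printf("sum = %.0f (expected %d)\n", *result, n);

    free(data);
    cudaFree(result);
    return 0;
}
```

This is the property that lets weights and KV cache spill into the 480 GB of CPU memory without a separate offload engine: the hardware moves the bytes, and the scheduler only decides what lives where.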
The stack should fit the silicon. If your engine doesn't change when the chip changes, you're leaving performance on the table.
You should pay only for what you use. Per-second billing. Scale to zero. No idle tax.
Fine-tuned models are the future. Every team will run their own weights. Infrastructure should make that trivial.