Serving ultra-long context genomic models on one GPU
A look at the tradeoffs behind running whole-genome context windows in production without a fleet of accelerators.
Long context is where genomic models earn their keep — regulatory elements act across tens of kilobases, so a model that can only see a few hundred base pairs is guessing. But long context is also where serving costs explode. Here’s how we keep whole-genome inference on a single A10G practical.
The problem
Attention scales quadratically with sequence length. Naively, doubling the context window quadruples the memory and compute per forward pass. For a model meant to reason over megabases, that’s not a knob you can just turn up.
Three levers we pull
- Windowed inference with overlap. Long inputs are chunked into overlapping windows, scored independently, and stitched back together. The overlap keeps edge effects from creating phantom boundaries.
- Memory-efficient attention kernels. Where the architecture allows, we compile fused attention against the serving torch build so the working set stays on-device.
- Model-aware batching. The scheduler groups requests by model and window size so the GPU is never waiting on a mismatched batch.
What we watch
The metric that matters is not raw throughput but predictable latency under mixed load. A tutorial user pasting 2 kb and a partner streaming a whole chromosome hit the same endpoint — the scheduler’s job is to keep neither starving the other.
More on the scheduler internals in a future post. If you’re building something latency-sensitive on top of the API, tell us about it.