Serving ultra-long context genomic models on one GPU

Long context is where genomic models earn their keep — regulatory elements act across tens of kilobases, so a model that can only see a few hundred base pairs is guessing. But long context is also where serving costs explode. Here’s how we keep whole-genome inference on a single A10G practical.

The problem

Attention scales quadratically with sequence length. Naively, doubling the context window quadruples the memory and compute per forward pass. For a model meant to reason over megabases, that’s not a knob you can just turn up.

Three levers we pull

Windowed inference with overlap. Long inputs are chunked into overlapping windows, scored independently, and stitched back together. The overlap keeps edge effects from creating phantom boundaries.
Memory-efficient attention kernels. Where the architecture allows, we compile fused attention against the serving torch build so the working set stays on-device.
Model-aware batching. The scheduler groups requests by model and window size so the GPU is never waiting on a mismatched batch.

What we watch

The metric that matters is not raw throughput but predictable latency under mixed load. A tutorial user pasting 2 kb and a partner streaming a whole chromosome hit the same endpoint — the scheduler’s job is to keep neither starving the other.

More on the scheduler internals in a future post. If you’re building something latency-sensitive on top of the API, tell us about it.