From raw DNA to protein: an agentic genome-annotation walkthrough

Genome annotation is one area where AI agents are starting to become genuinely useful in biology. Most genomic analysis starts with something we already know: reference gene models, transcript databases, RNA-seq evidence, or homology to known proteins. But what happens when we only have raw genomic DNA?

In our new ClawBio + Genomic Intelligence walkthrough, we take a 120 kb genomic region and treat it as if it were unannotated.

The workflow

Raw DNA sequence → transcript prediction → splice-site annotation → exon/intron reconstruction → ORF detection → protein translation → BLAST-based functional interpretation.
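The middle of this chain (exon/intron reconstruction, ORF detection, protein translation) can be illustrated on a toy sequence. The sketch below is not the ClawBio or GI implementation. The exon coordinates are hand-written stand-ins for model predictions, all function names are hypothetical, and it assumes forward-strand, 0-based half-open intervals and the standard genetic code.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice_exons(genomic, exons):
    """Join exon intervals (0-based, half-open, forward strand) into an mRNA."""
    return "".join(genomic[start:end] for start, end in exons)


def introns_are_canonical(genomic, exons):
    """For each intron between consecutive exons, check the GT...AG rule."""
    flags = []
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        intron = genomic[end1:start2]
        flags.append(intron.startswith("GT") and intron.endswith("AG"))
    return flags


def longest_orf(mrna):
    """Return the longest ATG...stop stretch over the three forward frames."""
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start > len(best):
                    best = mrna[start:i + 3]
                start = None
    return best


def translate(cds):
    """Translate a coding sequence and drop the trailing stop symbol."""
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons).rstrip("*")


# Toy locus (made up for illustration): exon 1, a GT...AG intron, exon 2.
genomic = "CCATGGCT" + "GTAAGTTTTCAG" + "AAATGAGG"
exons = [(0, 8), (20, 28)]  # stand-ins for predicted exon intervals

mrna = splice_exons(genomic, exons)
cds = longest_orf(mrna)
protein = translate(cds)
print(introns_are_canonical(genomic, exons))  # [True]
print(protein)  # MAK
```

A real run would also have to handle the reverse strand, non-canonical splice sites, and competing transcript candidates, which this sketch leaves out.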

Using Genomic Intelligence (GI) gene-finding and splicing models inside a ClawBio agentic workflow, the system:

predicted transcript intervals,
identified splice donor/acceptor signals,
reconstructed a candidate transcript,
translated a 439-amino-acid protein,
and identified it by BLAST as MYC / c-Myc with 99.77% sequence identity.
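As a sanity check on what a figure like 99.77% means: BLAST reports percent identity as identical positions divided by alignment length. If the alignment spanned all 439 residues (an assumption, since the alignment length is not stated here), 438 matches out of 439 would give 99.77%, which is a single mismatch. The helper below is a hypothetical illustration of that arithmetic, not BLAST's own code, and the sequences are placeholders rather than the real MYC protein.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns where both rows carry the same residue.

    Both inputs are rows of one pairwise alignment, so they must have equal
    length; gap columns ("-") count toward the length but never as matches.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    return 100.0 * matches / len(aligned_query)


# Placeholder alignment: 439 columns with exactly one mismatch.
query = "A" * 438 + "C"
subject = "A" * 439
print(round(percent_identity(query, subject), 2))  # 99.77
```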

We used a human region so we could validate the result against the known annotation. But the same logic applies to unannotated vertebrate genomes, where the real question is: what genes are here, what do they encode, and what biological function might they have?

Why agentic biology workflows matter

Not because agents replace biology, but because they can connect many fragile steps (sequence fetching, coordinate conversion, model calls, splice-site interpretation, ORF extraction, protein translation, and BLAST) into a reproducible workflow that scientists can inspect and reuse.
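One way to make a chain of steps inspectable is to record what each step produced. The sketch below is a generic illustration, not how ClawBio does it: `run_pipeline` and the two toy steps are hypothetical names, and hashing each intermediate result is just one possible provenance scheme.

```python
import hashlib


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def run_pipeline(data, steps):
    """Apply named steps in order and log a SHA-256 digest of each output.

    `steps` is a list of (name, function) pairs. The log lets a reader
    confirm that a rerun produced the same intermediate results.
    """
    log = []
    for name, func in steps:
        data = func(data)
        digest = hashlib.sha256(repr(data).encode("utf-8")).hexdigest()
        log.append({"step": name, "sha256": digest})
    return data, log


# Toy steps standing in for fetch, model calls, ORF extraction, and so on.
steps = [("uppercase", str.upper), ("reverse_complement", reverse_complement)]
result, log = run_pipeline("aacg", steps)
print(result)  # CGTT
for entry in log:
    print(entry["step"], entry["sha256"][:12])
```

Because every step here is deterministic, running the same input twice yields an identical log. Steps that call external services or stochastic models would need their versions and parameters recorded as well.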

Genome sequencing is no longer the bottleneck. Interpretation is. And genome-scale AI models + agentic workflows are becoming a practical path from raw DNA to biological hypothesis.