Caduceus introduces architectural components designed for two challenges specific to DNA sequence modeling: long-range interactions between tokens and the reverse complementarity of the double helix. Despite its smaller size, the model surpasses larger Transformer-based models on variant effect prediction, and its pre-training strategies and downstream results demonstrate its effectiveness in genomics applications.
Large-scale sequence modeling has advanced rapidly and is extending into biology and genomics. Genomic sequences pose challenges that general-purpose language models do not face: token interactions can span very long ranges, and the two strands of DNA carry equivalent information, so a sequence and its reverse complement (RC) should be treated the same. Caduceus is introduced as a family of RC-equivariant, bi-directional, long-range DNA language models that surpasses larger models on challenging downstream tasks.
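To make the RC symmetry concrete, here is a minimal sketch (not taken from the paper; it assumes one-hot encoded DNA with channel order A, C, G, T) showing that reverse complementation amounts to reversing both the sequence axis and the channel axis:

```python
import torch

BASES = "ACGT"  # assumed channel order for the one-hot encoding

def reverse_complement(one_hot: torch.Tensor) -> torch.Tensor:
    """one_hot: (..., length, 4). Flipping the channel axis swaps A<->T and C<->G;
    flipping the length axis reverses the sequence."""
    return one_hot.flip(dims=[-1, -2])

seq = "ACGTT"
x = torch.eye(4)[[BASES.index(b) for b in seq]]               # (5, 4) one-hot
rc = reverse_complement(x)
print("".join(BASES[i] for i in rc.argmax(dim=-1).tolist()))  # AACGT
```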
The proposed architecture builds on the Mamba block, extending it to support bi-directionality and reverse complement equivariance. Caduceus outperforms previous models on downstream benchmarks, and it excels in particular at tasks that require long-range modeling.
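As an illustration of how bi-directionality and RC equivariance can be combined, the following hypothetical sketch (a generic convolutional encoder stands in for the Mamba-based block; all names are placeholders, not the authors' implementation) runs a parameter-shared model over both the sequence and its reverse complement, then aligns and averages the per-position outputs, which makes the predictions RC-equivariant by construction:

```python
import torch
import torch.nn as nn

def reverse_complement(one_hot: torch.Tensor) -> torch.Tensor:
    """RC of (batch, length, 4) one-hot DNA with channels A, C, G, T:
    reverse the sequence axis and swap A<->T, C<->G via a channel flip."""
    return one_hot.flip(dims=[-2, -1])

class RCEquivariantWrapper(nn.Module):
    """Per-nucleotide classifier whose outputs are RC-equivariant by construction."""

    def __init__(self, d_model: int = 32, n_classes: int = 2):
        super().__init__()
        # Placeholder encoder; Caduceus uses bi-directional Mamba-based blocks here.
        self.encoder = nn.Conv1d(4, d_model, kernel_size=5, padding=2)
        self.head = nn.Conv1d(d_model, n_classes, kernel_size=1)

    def _score(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, 4) -> logits: (batch, length, n_classes)
        return self.head(torch.relu(self.encoder(x.transpose(1, 2)))).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        forward_logits = self._score(x)
        # Same (parameter-shared) model on the RC strand; flip back to align positions.
        rc_logits = self._score(reverse_complement(x)).flip(dims=[1])
        return 0.5 * (forward_logits + rc_logits)

model = RCEquivariantWrapper()
x = torch.eye(4)[torch.randint(0, 4, (2, 100))]  # random one-hot DNA, (2, 100, 4)
# RC-ing the input reverses the per-position outputs, i.e. f(RC(x)) == RC(f(x)).
print(torch.allclose(model(x).flip(dims=[1]), model(reverse_complement(x))))  # True
```

The wrapper above only illustrates the symmetry being enforced at the output level; Caduceus builds the equivariance into its sequence-modeling blocks themselves.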
The study compares Caduceus to HyenaDNA and Nucleotide Transformer models across a range of genomic benchmarks. Caduceus performs consistently well, and it is particularly strong at predicting the effect of genetic variants on gene expression, evaluated at varying distances from the transcription start site (TSS).
Key insights distilled from Yair Schiff et al., arxiv.org, 03-07-2024: https://arxiv.org/pdf/2403.03234.pdf