This paper introduces DiscDiff, a Latent Diffusion Model (LDM) designed specifically for generating discrete DNA sequences. DiscDiff consists of two key components: a Variational Autoencoder (VAE) that maps discrete DNA sequences to a continuous latent space, and a denoising model that learns to predict the noise added during the diffusion process.
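As a rough illustration of the second component, the sketch below applies the standard DDPM forward-noising step to a toy latent. It is not taken from the paper: the names `one_hot` and `add_noise` are hypothetical, the one-hot matrix only stands in for the VAE's continuous latent, and `alpha_bar` is an assumed scalar noise level. The returned `eps` is the quantity a denoising model of this kind would be trained to predict.

```python
import math
import random

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-dimensional one-hot rows."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq]

def add_noise(z0, alpha_bar, rng):
    """Standard DDPM forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps.

    z0 is a stand-in for the VAE latent (assumption); eps is the Gaussian
    noise that the denoising model would learn to predict.
    """
    eps = [[rng.gauss(0.0, 1.0) for _ in row] for row in z0]
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    zt = [[signal * v + noise * e for v, e in zip(row, erow)]
          for row, erow in zip(z0, eps)]
    return zt, eps

if __name__ == "__main__":
    latent = one_hot("ACGT")  # toy stand-in, not a real VAE encoding
    noisy, target = add_noise(latent, alpha_bar=0.5, rng=random.Random(0))
    print(noisy[0])
```

With `alpha_bar` close to 1 the latent is nearly untouched, and with `alpha_bar` close to 0 it is almost pure noise.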
The authors also propose the Absorb-Escape algorithm, a post-processing step that refines the generated sequences by detecting and correcting local errors made by the LDM. This algorithm leverages the strengths of both diffusion and autoregressive models to produce more realistic and coherent DNA sequences.
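One way such a refinement loop might look is sketched below. This is an illustrative reading of the description above, not the authors' exact procedure: it assumes the diffusion model exposes a per-position confidence (`dm_conf`), that a hypothetical callable `ar_next(prefix)` returns the autoregressive model's next token together with its probability, and that `t_absorb` is a tunable threshold. Positions where the diffusion model is unsure are handed to the autoregressive model ("absorb") until the diffusion model is at least as confident again ("escape").

```python
def absorb_escape(seq, dm_conf, ar_next, t_absorb=0.5):
    """Refine a diffusion-generated sequence with an autoregressive model.

    seq      -- string produced by the diffusion model
    dm_conf  -- assumed per-position confidence of the diffusion model
    ar_next  -- hypothetical callable: prefix -> (token, probability)
    t_absorb -- assumed confidence threshold that triggers a correction
    """
    out = list(seq)
    i = 0
    while i < len(out):
        if dm_conf[i] >= t_absorb:
            i += 1  # diffusion output is trusted at this position
            continue
        # Absorb: let the autoregressive model rewrite from position i.
        j = i
        while j < len(out):
            token, ar_prob = ar_next("".join(out[:j]))
            if j > i and ar_prob <= dm_conf[j]:
                break  # Escape: diffusion model is at least as confident
            out[j] = token
            j += 1
        i = j
    return "".join(out)

if __name__ == "__main__":
    # Toy autoregressive model that always proposes "A" with probability 0.9.
    toy_ar = lambda prefix: ("A", 0.9)
    conf = [0.9, 0.2, 0.3, 0.95, 0.9, 0.9]
    print(absorb_escape("ACGTAC", conf, toy_ar))
```

The threshold controls how aggressively the autoregressive model intervenes: a higher `t_absorb` rewrites more of the diffusion output.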
To evaluate the performance of DiscDiff, the authors introduce EPD-GenDNA, a large-scale, multi-species dataset for DNA generation, which includes 160,000 unique sequences from 15 different species. They compare DiscDiff and the Absorb-Escape algorithm against other state-of-the-art diffusion and autoregressive models on both unconditional and conditional DNA sequence generation tasks.
The results show that DiscDiff outperforms existing diffusion models in latent distance, motif frequency correlation, and diversity of the generated sequences. Applying the Absorb-Escape algorithm improves the quality of the generated sequences further, achieving the best performance across all metrics. The authors also demonstrate that the Absorb-Escape algorithm can be used to control the balance of different genetic motifs in the generated sequences, providing a useful tool for applications such as gene therapy and protein production.
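To make the motif frequency correlation metric concrete, the sketch below shows one plausible way to compute it. The definition is an assumption rather than the paper's exact formula: for a given motif, it builds a positional frequency profile over a set of equal-length sequences and then takes the Pearson correlation between the profiles of natural and generated sequences. The function names and the example sequences are hypothetical.

```python
import math

def motif_profile(seqs, motif):
    """Fraction of sequences in which `motif` starts at each position.

    Assumes all sequences in `seqs` have the same length.
    """
    n_positions = len(seqs[0]) - len(motif) + 1
    counts = [0] * n_positions
    for s in seqs:
        for pos in range(n_positions):
            if s.startswith(motif, pos):
                counts[pos] += 1
    return [c / len(seqs) for c in counts]

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

if __name__ == "__main__":
    natural = ["TATAGG", "CCTATA"]    # made-up examples
    generated = ["GGTATA", "GGTATA"]  # made-up examples
    score = pearson(motif_profile(natural, "TATA"),
                    motif_profile(generated, "TATA"))
    print(score)
```

A score near 1 would indicate that the generated sequences place the motif at roughly the same positions as the natural ones.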
Key Insights Distilled From
by Zehui Li, Yuh... at arxiv.org 04-18-2024
https://arxiv.org/pdf/2402.06079.pdf