Core Concepts
Discrete protein language representation for sequence-structure co-generation.
Abstract:
Introduces FoldTokenizer, a discrete protein language for joint sequence-structure representation.
Applies the learned protein language to backbone inpainting and antibody design tasks.
1. Introduction:
Sequence and structure modeling are both central to protein applications.
FoldTokenizer addresses the modality gap between sequence models and structure models.
2. Related Work:
Reviews co-modeling techniques that integrate pretrained models for predictive tasks.
3. Method:
The framework pairs FoldTokenizer (discrete tokenization of sequence and structure) with FoldGPT (generation over the resulting tokens) for sequence-structure co-generation.
4. Experiments:
Compares reconstruction quality across VQ methods on the CATH4.3 dataset.
Backbone inpainting results show FoldGPT outperforming baseline methods.
Evaluates antibody design performance against baselines in CDR regions.
Stats
Vanilla VQ (Van Den Oord et al., 2017) compresses latent representations to the nearest codebook vector.
Soft Conditional Vector Quantizer (SoftCVQ) achieves good performance on both protein reconstruction and generation tasks.
SoftGVQ exhibits a trade-off between reconstruction and generation performance.
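The hard-vs-soft distinction among these quantizers can be sketched in a few lines. Below is a minimal NumPy illustration, assuming nothing about the paper's actual implementation: `vanilla_vq` snaps each latent to its nearest codebook entry (the Van Den Oord et al. scheme), while `soft_vq` is a generic softmax-weighted mix over the codebook, a stand-in for the idea behind soft quantizers such as SoftCVQ; all names and shapes here are illustrative.

```python
import numpy as np

def vanilla_vq(z, codebook):
    """Hard VQ: map each latent vector to its nearest codebook entry."""
    # Squared distances between every latent and every code: (n_latents, n_codes)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d.argmin(axis=1)           # discrete token ids
    return ids, codebook[ids]        # token ids and quantized latents

def soft_vq(z, codebook, temperature=1.0):
    """Soft quantization: convex combination of codes, weighted by a
    softmax over negative distances (illustrative, not the paper's exact form)."""
    logits = -((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1) / temperature
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ codebook              # soft-quantized latents

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 8))          # 5 latent vectors, dim 8
codebook = rng.normal(size=(16, 8))  # 16 codebook entries
ids, zq = vanilla_vq(z, codebook)
zs = soft_vq(z, codebook)
```

Hard quantization yields the discrete tokens a GPT-style model needs, while the soft variant keeps gradients flowing through the codebook during training, which is the tension behind the reconstruction/generation trade-off noted above.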
Quotes
"Establishing a discrete protein language to bridge protein research with NLP remains an open challenge." - Pintea et al., 2023
"Our findings reveal a substantial enhancement in reconstruction quality with the proposed SoftCVQ method surpassing existing VQ methods." - Gao et al., 2023b