Core Concepts
Protein design can be significantly enhanced by leveraging textual annotations that directly describe protein functionalities and properties, enabling fine-grained control over the generation process.
Abstract
The core challenge of de novo protein design is creating proteins with specific functions or properties, guided by certain conditions. Current models generate proteins under structural and evolutionary guidance, which conditions on functions and properties only indirectly. However, textual annotations of proteins, especially the annotations for protein domains, directly describe a protein's high-level functionalities, properties, and their correlation with target amino acid sequences, yet they remain unexplored in the context of protein design tasks.
To address this, the paper proposes Protein-Annotation Alignment Generation (PAAG), a multi-modality protein design framework that integrates textual annotations extracted from protein databases for controllable generation in sequence space. Using a multi-level alignment module, PAAG can explicitly generate proteins containing specific domains conditioned on the corresponding domain annotations, and can even design novel proteins from flexible combinations of different kinds of annotations.
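This summary does not say which objective the alignment module optimizes. As an illustration only, one common way to align two modalities is a symmetric contrastive loss that pulls each annotation embedding toward its paired protein embedding and away from the others in the batch. The sketch below assumes that setup; the function names, the temperature value, and the toy 2-D embeddings are all hypothetical and are not taken from the paper.

```python
import math

def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def _diagonal_cross_entropy(logits):
    """Mean of -log softmax(row)[i] over rows i, i.e. row i should match column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(text_emb, prot_emb, temperature=0.07):
    """Hypothetical alignment loss (assumption, not PAAG's documented objective)."""
    t = [_normalize(v) for v in text_emb]
    p = [_normalize(v) for v in prot_emb]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(ti, pj)) / temperature for pj in p]
              for ti in t]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the text-to-protein and protein-to-text directions.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))

# Toy embeddings: row i of each list is assumed to be a matched pair.
text = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
prot_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
prot_shuffled = [prot_aligned[1], prot_aligned[2], prot_aligned[0]]

aligned_loss = symmetric_contrastive_loss(text, prot_aligned)
shuffled_loss = symmetric_contrastive_loss(text, prot_shuffled)
print(f"aligned: {aligned_loss:.4f}, shuffled: {shuffled_loss:.4f}")
```

With correctly paired rows the loss is close to zero, while shuffling the protein rows makes it much larger, which is the behavior an alignment objective of this kind relies on.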
The experimental results show that the aligned protein representations from PAAG outperform prior models on 7 prediction tasks. Furthermore, PAAG achieves a roughly five- to sixfold increase in generation success rate over the existing model (24.7% vs 4.7% for the zinc finger domain, and 54.3% vs 8.7% for the immunoglobulin domain). This showcases PAAG's ability to leverage knowledge from both textual annotations and proteins for improved protein design.
Stats
PAAG can generate proteins containing specific domains with a success rate of 54.3% for the immunoglobulin domain, compared to 8.7% for the existing model.
PAAG can generate proteins containing specific domains with a success rate of 24.7% for the zinc finger domain, compared to 4.7% for the existing model.
PAAG outperforms state-of-the-art models by an average relative improvement of 1.5% across 7 predictive downstream tasks.
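The success rates above can be turned into fold increases with a short calculation. The sketch below uses only the percentages reported in this summary; the dictionary layout and the `fold_increase` helper are illustrative names, not anything from the paper.

```python
# Success rates (%) as reported in the Stats section.
reported = {
    "zinc finger": {"paag": 24.7, "baseline": 8.7 * 0 + 4.7},
    "immunoglobulin": {"paag": 54.3, "baseline": 8.7},
}

def fold_increase(new_rate, old_rate):
    """Ratio of the new success rate to the old one."""
    return new_rate / old_rate

folds = {
    domain: fold_increase(rates["paag"], rates["baseline"])
    for domain, rates in reported.items()
}

for domain, fold in folds.items():
    print(f"{domain}: {fold:.2f}x")
```

This gives about 5.26x for the zinc finger domain and about 6.24x for the immunoglobulin domain, so the improvement is roughly five- to sixfold depending on the domain.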
Quotes
"The core challenge of de novo protein design lies in creating proteins with specific functions or properties, guided by certain conditions."
"Textual annotations of proteins, especially the annotations for protein domains, which directly describe the protein's high-level functionalities, properties, and their correlation with target amino acid sequences, remain unexplored in the context of protein design tasks."
"PAAG can explicitly generate proteins containing specific domains conditioned on the corresponding domain annotations, and can even design novel proteins with flexible combinations of different kinds of annotations."