核心概念
Integrating protein family classification information with masked language modeling improves the quality of protein representations, leading to state-of-the-art performance on various downstream tasks.
要約
The content discusses a novel approach to enhancing protein sequence modeling by combining graph-based clustering and masked language prediction. The key highlights are:
The authors propose a Community Propagation-Based Clustering Algorithm that incorporates protein family and superfamily information into the training process, improving the global representation of protein structures and functions.
This clustering approach is combined with a masked language modeling task, which refines the local accuracy of amino acid representations by predicting missing residues based on contextual cues.
The resulting model, called ComproESM, significantly outperforms the state-of-the-art ESM2 model on a range of downstream tasks, including protein classification, mutation effect prediction, activity prediction, protein-protein interaction, function prediction, and homology detection.
The authors demonstrate that the protein representations learned by ComproESM better capture the biochemical properties and structural-functional relationships of proteins, as evidenced by visualizations and ablation studies.
The proposed training methodology addresses the limitations of ESM2, which relies solely on statistical analysis of amino acid compositions, by integrating both global and local insights into the protein representation.
The Community Propagation-Based Clustering Algorithm is a novel, resource-efficient approach to training graph neural networks, which can be applied beyond the protein domain.
統計
The dataset consists of 540,601 protein samples from the UniProtKB/Swiss-Prot database, with 17,132 family categories and 3,189 superfamily categories.
The average length of the amino acid sequences is 367.01.
引用
"Integrating protein family classification information with masked language modeling improves the quality of protein representations, leading to state-of-the-art performance on various downstream tasks."
"The Community Propagation-Based Clustering Algorithm is a novel, resource-efficient approach to training graph neural networks, which can be applied beyond the protein domain."