Core Concepts
Existing data splitting strategies for protein-protein interaction benchmarks suffer from significant data leakage, leading to overoptimistic evaluation of model generalization. Directly comparing the structural similarity of protein interfaces is crucial for creating high-quality, non-leaking splits.
Summary
The paper examines the problem of data leakage in protein-protein interaction (PPI) benchmarks used to evaluate machine learning models. It shows that commonly used splitting strategies based on metadata (e.g., PDB codes) or sequence similarity are insufficient, because a high percentage of test examples have structurally near-duplicate counterparts in the training set at the level of protein interfaces.
The authors first quantify the data leakage in different splitting approaches using the iDist algorithm, which efficiently approximates the structural alignment of protein interfaces. They find that splits based on PPI codes, PDB codes, and sequence similarity lead to 86%, 65%, and 30% data leakage, respectively.
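The leakage measurement described above can be sketched as a nearest-neighbor test: a test example "leaks" if its closest training example, under some interface distance, falls below a similarity threshold. The embedding representation, the threshold value, and the function names below are illustrative assumptions, not the actual iDist implementation.

```python
# Hypothetical sketch: estimate data leakage as the fraction of test
# interfaces whose nearest training interface lies within a distance
# threshold (iDist-style near-duplicate detection). Embeddings,
# threshold, and helper names are illustrative assumptions.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def leakage_rate(train_embs, test_embs, threshold):
    """Fraction of test examples with a near-duplicate in training."""
    leaked = 0
    for t in test_embs:
        nearest = min(euclidean(t, tr) for tr in train_embs)
        if nearest < threshold:
            leaked += 1
    return leaked / len(test_embs)

# Toy example with 2-D "interface embeddings":
train = [(0.0, 0.0), (5.0, 5.0)]
test = [(0.1, 0.0), (10.0, 10.0)]
print(leakage_rate(train, test, threshold=0.5))  # 1 of 2 test points leaks -> 0.5
```

In practice one would use the structural interface distances computed by iDist (or full structural alignment) in place of the toy Euclidean distance.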
To address this issue, the authors review recent work that utilizes structural similarity of protein interfaces for creating non-leaking splits. Specifically, methods like Foldseek and iDist enable large-scale comparison of protein interfaces and can be used to construct benchmarks that effectively assess the generalization of machine learning models beyond the training data.
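One common way to turn pairwise interface similarities into a non-leaking split is to treat near-duplicate pairs as edges of a graph and assign whole connected components to either train or test, so that no near-duplicates cross the split boundary. The sketch below is an illustrative assumption of this general strategy, not the exact procedure used with Foldseek or iDist; the pair list and component-assignment rule are hypothetical.

```python
# Hypothetical sketch of a non-leaking split: connect any two interfaces
# whose distance is below a similarity threshold, then assign entire
# connected components to train or test. The union-find grouping keeps
# every cluster of near-duplicates on one side of the split.

def connected_components(n, similar_pairs):
    """Union-find over n interfaces; edges are near-duplicate pairs."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

# Interfaces 0-1 are near-duplicates, 2-3 are near-duplicates, 4 is unique.
components = connected_components(5, [(0, 1), (2, 3)])
# Assign whole components to one side, e.g. first component to test:
test_set = set(components[0])
train_set = {i for c in components[1:] for i in c}
```

Because every near-duplicate pair ends up in the same component, this construction guarantees 0% leakage by the same criterion used to build the graph.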
The authors also highlight the importance of leveraging domain expertise provided by dataset authors, as exemplified by the SKEMPI v2.0 dataset, where the expert-curated grouping of protein complexes leads to 0% data leakage compared to 56% leakage in a naive PPI code-based split.
Finally, the authors provide recommendations for the community, emphasizing the use of interface similarity as the standard criterion for splitting protein complexes, thoroughly reviewing dataset-specific information provided by experts, and quantifying data leakage when train-test splits are not under the researchers' control.
Statistics
Percentage of test PPIs with near-duplicate training examples:
PPI code-based split: 86%
PDB code-based split: 65%
Sequence similarity-based split: 30%
Quotes
"We find that splits based on PPI codes, on average, lead to 86% data leakage, which is expected due to the high redundancy in PDB. Splits based on PDB codes improve the situation, yet still lead to 65% data leakage."
"We find that this sequence-based splitting approach yields a substantial improvement in structural data leakage compared to metadata-based splits, with a leakage rate of 30%."