Core Concepts
Joint Multi-domain Pre-training (JMP) is a supervised pre-training strategy that trains a single model on diverse chemical datasets, learning generalizable representations that enable accurate and efficient atomic property prediction across multiple domains.
Abstract
The paper introduces Joint Multi-domain Pre-training (JMP), a supervised pre-training strategy for atomic property prediction that simultaneously trains on multiple datasets from different chemical domains. The key insights are:
JMP frames each chemical dataset as a separate pre-training task within a multi-task learning framework, allowing a single model to learn from diverse data sources.
JMP employs several techniques to address the challenges of multi-task pre-training, including data normalization, temperature-based dataset sampling, structure-wise loss reduction, and effective regularization (see the sketches after this list).
Evaluated on a comprehensive benchmark spanning small molecules, large molecules, and materials, JMP delivers significant improvements over training from scratch, matching or setting the state of the art on 34 out of 40 tasks.
JMP scales efficiently to larger models, avoiding the overfitting typically seen when large models are trained from scratch on small datasets. Fine-tuning the pre-trained JMP models requires roughly 12x less compute than training from scratch.
The authors conduct detailed ablation studies to understand the impact of various components of the JMP method, highlighting the importance of diverse multi-task pre-training and effective regularization strategies.
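The two sketches below illustrate how temperature-based dataset sampling and structure-wise loss reduction are commonly implemented. They are minimal illustrations of the general techniques, not the paper's actual code; the function names, dataset sizes, and the temperature value are assumptions for this example. Temperature-based sampling draws each pre-training batch from dataset i with probability proportional to |D_i|^(1/T), which flattens the size imbalance between datasets as T grows.

```python
import numpy as np

def temperature_sampling_probs(dataset_sizes, temperature=2.0):
    # p_i proportional to |D_i|^(1/T): T = 1 recovers size-proportional
    # sampling; larger T flattens the distribution so small datasets are
    # sampled more often than their raw size would dictate.
    sizes = np.asarray(dataset_sizes, dtype=np.float64)
    weights = sizes ** (1.0 / temperature)
    return weights / weights.sum()

# Hypothetical sizes for four pre-training datasets.
probs = temperature_sampling_probs([100e6, 10e6, 2e6, 0.5e6])

# Pick the dataset that supplies the next pre-training batch.
rng = np.random.default_rng(0)
dataset_idx = rng.choice(len(probs), p=probs)
```

Structure-wise loss reduction addresses a related imbalance within a batch: if the force loss is averaged over all atoms, structures with many atoms dominate the gradient. Averaging per structure first, then across structures, gives each structure equal weight. A minimal PyTorch sketch, again not the paper's exact implementation:

```python
import torch

def structure_wise_force_loss(pred, target, batch_idx, num_structures):
    # pred, target: (num_atoms, 3) force tensors.
    # batch_idx: (num_atoms,) long tensor mapping each atom to its structure.
    per_atom_err = (pred - target).norm(dim=-1)  # (num_atoms,)
    sums = torch.zeros(num_structures).index_add_(0, batch_idx, per_atom_err)
    counts = torch.zeros(num_structures).index_add_(
        0, batch_idx, torch.ones_like(per_atom_err))
    # Mean error per structure, then mean over structures.
    return (sums / counts.clamp(min=1)).mean()
```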
Overall, the paper presents a powerful pre-training approach that uses large, diverse chemical datasets to advance the state of the art in atomic property prediction, especially for low-data tasks and out-of-domain applications.
Stats
The combined pre-training dataset contains over 120M training examples with energy and force labels, the vast majority (>99%) of which come from non-equilibrium structures.
The fine-tuning datasets range from 600 to 130,000 examples, covering small molecules, large molecules, and materials.
Quotes
"Foundation models have been transformational in machine learning fields such as natural language processing and computer vision. Similar success in atomic property prediction has been limited due to the challenges of training effective models across multiple chemical domains."
"By pre-training large models on diverse chemical data, we believe JMP represents an important step towards the goal of a universal ML potential, and that the continued growth of available data and compute power will only improve JMP's ability to learn transferable atomic representations."