Core Concepts
Large language models can be effectively fine-tuned to perform regression tasks for predicting materials and molecular properties, providing a versatile regression tool that can rival or outperform standard machine learning models.
Abstract
This work explores the ability of large language models (LLMs), specifically the LLaMA 3 model, to perform regression tasks for predicting materials and molecular properties. The authors fine-tune LLaMA 3 using different input representations, such as SMILES strings, InChI strings, and explicit atomic coordinates, and evaluate its performance on molecular properties from the QM9 dataset and on 24 materials property datasets.
The key findings are:
LLaMA 3 can function as a useful regression model: it is competitive with or better than random forest models on many materials properties, though its errors on QM9 molecular properties are generally 5-10× higher than those of state-of-the-art deep neural networks that leverage detailed structural information.
The choice of input representation (SMILES vs. InChI) can have a modest but statistically significant impact on the regression performance, with SMILES strings generally performing better.
LLaMA 3 outperforms GPT-3.5 and GPT-4o in regression tasks, suggesting the choice of LLM can have a large impact on the quality of results.
LLMs like LLaMA 3 show promise as versatile regression tools that can be applied to complex physical phenomena in materials science and other scientific domains, without the need for extensive domain-specific feature engineering.
The work highlights the potential of leveraging LLMs for materials and molecular property prediction, while also identifying areas for further research, such as exploring the impact of structural information, prompt engineering, and transfer/multitask learning approaches.
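To make the setup above concrete, here is a minimal sketch of how a property-regression example can be serialized as a prompt/completion pair for LLM fine-tuning. The template wording, the `make_example` helper, and the three-decimal rounding are illustrative assumptions, not the authors' exact format.

```python
# Sketch: framing property regression as text-to-text fine-tuning data.
# An LLM emits text, so the numeric regression target must be serialized
# as a string in the completion. Prompt template is a hypothetical choice.

def make_example(representation: str, prop_name: str, value: float) -> dict:
    """Build one fine-tuning record mapping a molecular string to a number."""
    prompt = f"What is the {prop_name} of the molecule {representation}?"
    completion = f"{value:.3f}"  # regression target rendered as text
    return {"prompt": prompt, "completion": completion}

# Example: ethanol via its SMILES string; the gap value here is illustrative.
record = make_example("CCO", "HOMO-LUMO gap (eV)", 7.1234)
print(record["prompt"])
print(record["completion"])  # "7.123"
```

The same helper works for any of the input representations compared in the paper (SMILES, InChI, or a serialized coordinate list); only the `representation` string changes.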
Stats
The QM9 dataset contains 133,885 molecules with computed properties including formation energy, HOMO and LUMO energies, and the HOMO-LUMO gap.
The materials property datasets range from 137 to 643,916 data points, covering a diverse set of experimental and computed properties.
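Because the model's predictions come back as generated text, scoring them against numeric targets requires parsing them back into floats before computing an error metric. The sketch below shows one plausible way to do this with mean absolute error; the `float()` parsing and MAE choice are assumptions for illustration, not necessarily the paper's exact evaluation pipeline.

```python
# Sketch: scoring text-valued LLM outputs against numeric regression targets.

def mean_absolute_error(generated: list[str], targets: list[float]) -> float:
    """Parse each generated string back to a float and average |prediction - target|."""
    preds = [float(text.strip()) for text in generated]
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

# Illustrative values only: |7.123 - 7.0| = 0.123 and |2.5 - 2.0| = 0.5.
print(mean_absolute_error(["7.123", "2.500"], [7.0, 2.0]))  # ≈ 0.3115
```

In practice a generated string may fail to parse as a number at all, so a real pipeline would also need to count or handle malformed outputs.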
Quotes
"LLaMA 3 can function as a very good regression model for molecular properties, with generally low errors and only a handful of outlier points."
"LLaMA 3 is roughly 5-10× worse than state-of-the-art for predicting molecular properties in the QM9 dataset."
"LLaMA 3 provides improved predictions compared to GPT-3.5 and GPT-4o."