This work explores the ability of large language models (LLMs), specifically the LLaMA 3 model, to perform regression tasks for predicting materials and molecular properties. The authors fine-tune LLaMA 3 using different input representations, such as SMILES strings, InChI strings, and explicit atomic coordinates, and evaluate its performance on a range of molecular properties from the QM9 dataset and on 24 materials properties.
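As a rough illustration of how such fine-tuning data might be assembled, here is a minimal sketch (not the authors' code); the prompt template, file name, and (SMILES, value) pairs are hypothetical placeholders rather than QM9 entries:

```python
# Minimal sketch: serializing (representation, property) pairs into
# prompt/completion records for fine-tuning an LLM as a regressor.
# The SMILES strings, target values, prompt wording, and file name are
# hypothetical placeholders, not data from QM9.
import json

examples = [
    ("CCO", 0.1230),        # hypothetical (SMILES, property value) pair
    ("c1ccccc1", -0.4560),  # hypothetical (SMILES, property value) pair
]

records = []
for smiles, value in examples:
    records.append({
        # Assumed prompt template; the paper's exact wording may differ.
        "prompt": f"What is the target property of the molecule with SMILES {smiles}?",
        # The regression target is emitted as plain text for the LLM to generate.
        "completion": f"{value:.4f}",
    })

with open("finetune_data.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

At inference time the generated string can be parsed back to a float and scored with standard regression metrics such as MAE, which is how a text-generation model can be evaluated as a regressor.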
The key findings are:
LLaMA 3 can function as a useful regression model, with performance competitive with or better than random forest models on many materials properties, though its errors are generally 5-10x higher than those of state-of-the-art deep neural network models that leverage detailed structural information.
The choice of input representation (SMILES vs. InChI) has a modest but statistically significant impact on regression performance, with SMILES strings generally performing better (see the conversion sketch after these findings).
LLaMA 3 outperforms GPT-3.5 and GPT-4o on these regression tasks, suggesting that the choice of LLM can have a large impact on the quality of results.
LLMs like LLaMA 3 show promise as versatile regression tools that can be applied to complex physical phenomena in materials science and other scientific domains, without the need for extensive domain-specific feature engineering.
The work highlights the potential of leveraging LLMs for materials and molecular property prediction, while also identifying areas for further research, such as exploring the impact of structural information, prompt engineering, and transfer/multitask learning approaches.
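As a concrete note on the representation comparison above: evaluating SMILES against InChI requires both strings for each molecule, which can be obtained with standard cheminformatics tooling. A minimal sketch, assuming RDKit is available (the paper does not prescribe this particular library), with illustrative molecules not drawn from QM9:

```python
# Minimal sketch: deriving an InChI string from a SMILES string so the same
# molecule can be fed to the model under either representation.
from rdkit import Chem

smiles_list = ["CCO", "c1ccccc1O"]  # hypothetical example molecules

for smiles in smiles_list:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        continue  # skip strings that fail to parse
    inchi = Chem.MolToInchi(mol)
    # Both strings describe the same molecule; only the text fed to the
    # fine-tuned LLM differs.
    print(smiles, "->", inchi)
```

The same fine-tuning pipeline can then be run once per representation and the resulting errors compared, which is the kind of controlled comparison behind the SMILES-vs-InChI finding.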
Key ideas extracted from the source content by Ryan Jacobs et al., arxiv.org, 09-11-2024: https://arxiv.org/pdf/2409.06080.pdf