The paper introduces SMolInstruct, a large-scale, comprehensive, and high-quality instruction tuning dataset for chemistry tasks. It contains 14 selected chemistry tasks and over three million samples, covering a diverse range of molecular representations, properties, and reactions.
The authors fine-tune four open-source LLMs (Galactica, Llama 2, Code Llama, and Mistral) on SMolInstruct using LoRA, creating a series of LLMs named LlaSMol. Comprehensive experiments show that among the LlaSMol models, the Mistral-based model outperforms others by a substantial margin, highlighting the critical influence of base models on downstream chemistry tasks.
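The LoRA setup described above can be illustrated with a short sketch. This is a minimal example using the Hugging Face transformers and peft libraries; the base model name, rank, and target modules are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal LoRA fine-tuning sketch (hyperparameters and model name are
# illustrative assumptions, not the paper's exact configuration).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "mistralai/Mistral-7B-v0.1"  # one of the four base LLM families
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# LoRA injects small trainable low-rank matrices into selected projection
# layers, so only a tiny fraction of the parameters is actually updated.
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update (assumed)
    lora_alpha=32,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the small trainable fraction
```

From here, the wrapped model can be trained on SMolInstruct-style instruction–response pairs with any standard causal-LM training loop or trainer.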
The authors further demonstrate that using canonicalized SMILES during training and inference can improve performance, while using SELFIES does not provide significant advantages over SMILES. Additionally, the proposed SMolInstruct dataset plays a crucial role in driving the performance improvements, as models trained on it substantially outperform those trained on previous datasets.
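To make the canonicalization point concrete, here is a minimal sketch of converting arbitrary SMILES strings to a canonical form. It assumes RDKit as the tooling; the paper's exact preprocessing pipeline may differ.

```python
# Minimal SMILES canonicalization sketch using RDKit (tooling is an assumption).
from rdkit import Chem


def canonicalize_smiles(smiles: str):
    """Return the canonical SMILES for a molecule, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, canonical=True)


# Two different SMILES strings for ethanol map to the same canonical form.
print(canonicalize_smiles("OCC"))    # -> CCO
print(canonicalize_smiles("C(O)C"))  # -> CCO
```

Normalizing to a single canonical string per molecule removes spurious surface-form variation, which is one plausible reason it helps both training and inference.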
Although the LlaSMol models do not yet surpass state-of-the-art task-specific models, they approach their performance with only 0.58% of parameters fine-tuned, suggesting substantial room for further improvement and strong potential to serve as foundation models for the field of chemistry.
Key insights distilled from: Botao Yu et al., arxiv.org, 04-02-2024
https://arxiv.org/pdf/2402.09391.pdf