Core Concepts
By leveraging the proposed large-scale, comprehensive, and high-quality instruction tuning dataset SMolInstruct, the developed LLMs achieve strong results on a comprehensive set of chemistry tasks, outperforming the most advanced GPT-4 and Claude 3 Opus by a substantial margin.
Abstract
The paper introduces SMolInstruct, a large-scale, comprehensive, and high-quality instruction tuning dataset for chemistry tasks. It contains 14 selected chemistry tasks and over three million samples, covering a diverse range of molecular representations, properties, and reactions.
The authors fine-tune four open-source LLMs (Galactica, Llama 2, Code Llama, and Mistral) on SMolInstruct using LoRA, creating a series of LLMs named LlaSMol. Comprehensive experiments show that among the LlaSMol models, the Mistral-based model outperforms the others by a substantial margin, highlighting the critical influence of the base model on downstream chemistry tasks.
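To make the parameter-efficiency point concrete, here is a minimal sketch of the LoRA idea in plain NumPy: the pretrained weight W is frozen, and only a low-rank update B @ A is trained. The dimensions, rank, and scaling factor below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Illustrative dimensions and rank (assumptions, not the paper's settings).
d_in, d_out, r, alpha = 4096, 4096, 16, 32
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-initialized
                                       # so training starts from the base model

def lora_forward(x):
    # Base path plus the scaled low-rank update: (W + (alpha/r) * B @ A) @ x
    return W @ x + (alpha / r) * (B @ (A @ x))

# Only A and B are trained; the fraction of trainable parameters is tiny.
trainable = A.size + B.size
print(f"trainable fraction: {trainable / (W.size + trainable):.4%}")
```

Because B starts at zero, the adapted model initially computes exactly the frozen model's output; training then moves only the small A and B matrices, which is what keeps the fine-tuned fraction of parameters so low.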
The authors further demonstrate that using canonicalized SMILES during training and inference can improve performance, while using SELFIES does not provide significant advantages over SMILES. Additionally, the proposed SMolInstruct dataset plays a crucial role in driving the performance improvements, as models trained on it substantially outperform those trained on previous datasets.
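The canonicalization finding can be illustrated with RDKit, a standard cheminformatics library (its use here is an assumption for illustration; the paper's exact preprocessing pipeline may differ). Canonicalization maps the many valid SMILES strings for a molecule to a single normalized form:

```python
from rdkit import Chem

# Two different valid SMILES strings for the same molecule (phenol).
variants = ["C1=CC=CC=C1O", "Oc1ccccc1"]

# Canonicalization: parse each string into a molecule, then re-emit it
# in RDKit's canonical form, collapsing the variants to one representation.
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(canonical)  # a single canonical SMILES string
```

Training and querying the model with such a normalized form removes spurious variation in how the same molecule is written, which is one plausible reason canonical SMILES helps.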
Although the LlaSMol models do not yet surpass state-of-the-art task-specific models, they approach their performance with only 0.58% of their parameters fine-tuned, suggesting great potential for further improvement and for serving as strong foundation models for the field of chemistry.
Statistics
"SMolInstruct consists of 3.3M samples and 1.6M distinct molecules, with a diverse range of sizes, structures, and properties."
Quotes
"Chemistry is a fundamental science that underpins countless aspects of modern life, ranging from drug discovery and materials science to energy production."
"Large language models (LLMs) such as GPT-4, Llama series, and Mistral have emerged as general-purpose foundation models and demonstrate remarkable abilities on various natural language processing tasks."
"When applied to chemistry tasks, LLMs show only limited capabilities."
"Our developed LLMs can achieve very strong results on a comprehensive set of chemistry tasks, outperforming the most advanced GPT-4 and Claude 3 Opus by a substantial margin."