Core Concepts
SciAssess evaluates leading LLMs in scientific literature analysis, highlighting strengths and areas for improvement to advance research capabilities.
Abstract
SciAssess introduces a benchmark tailored for scientific literature analysis, evaluating LLMs' abilities in memorization, comprehension, and analysis within scientific contexts. The benchmark covers tasks from various scientific fields and ensures reliability through quality control measures.
Recent advances in Large Language Models (LLMs) have revolutionized natural language understanding and generation, yet their proficiency on scientific literature remains largely untested. SciAssess therefore draws its tasks from diverse scientific fields, including general chemistry, organic materials, and alloy materials, and applies rigorous quality control to ensure correctness, anonymization, and copyright compliance.
Existing benchmarks inadequately evaluate the proficiency of LLMs in the scientific domain. SciAssess aims to bridge this gap by providing a thorough assessment of LLMs' efficacy in scientific literature analysis. By focusing on memorization, comprehension, and analysis abilities within specific scientific domains, SciAssess offers valuable insights for advancing LLM applications in research.
The benchmark design rests on several critical considerations: delineating the model abilities to be measured, defining the scope and design of tasks across scientific domains, and enforcing stringent quality control protocols so that the results yield accurate insights. SciAssess aims to reveal the current performance of LLMs in the scientific domain and thereby foster their development, enhancing research capabilities across disciplines.
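To make this structure concrete, below is a minimal sketch of how a SciAssess-style harness might organize tasks by domain and ability and score a model against them. The class names, the `exact_match` scorer, and the `evaluate` loop are hypothetical illustrations under assumed interfaces, not the benchmark's actual code.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical ability labels mirroring the three-level delineation.
ABILITIES = ("memorization", "comprehension", "analysis")

@dataclass
class Task:
    """One benchmark task: a domain, an ability level, and a scorer."""
    name: str
    domain: str          # e.g. "general_chemistry", "organic_materials"
    ability: str         # one of ABILITIES
    score: Callable[[str, str], float]  # (prediction, reference) -> [0, 1]

def exact_match(prediction: str, reference: str) -> float:
    """Simplest scorer: 1.0 if normalized strings match, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(model: Callable[[str], str], tasks: list[Task],
             examples: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    """Run each task's examples through the model; average per-task scores."""
    results = {}
    for task in tasks:
        scores = [task.score(model(prompt), ref)
                  for prompt, ref in examples[task.name]]
        results[task.name] = sum(scores) / len(scores) if scores else 0.0
    return results

# Example (hypothetical names): one comprehension task in organic materials.
# tasks = [Task("electrolyte_table_qa", "organic_materials",
#               "comprehension", exact_match)]
```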
Stats
GPT-4 excels on the MMLU High-School Chemistry task with an accuracy of 59.1%.
Gemini shows strength on the Electrolyte Table QA task with an accuracy of 23.3%.
GPT-3.5 leads the Affinity Data Extraction task with a value recall of 35.9% (a sketch of one way to compute such a metric follows).
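For illustration, here is one plausible way a value-recall metric for an extraction task could be computed; the relative-tolerance matching rule below is an assumption made for this sketch, not SciAssess's published scoring procedure.

```python
def value_recall(extracted: list[float], reference: list[float],
                 rel_tol: float = 1e-3) -> float:
    """Fraction of reference values recovered by the extraction.

    A reference value counts as recalled if some extracted value matches
    it within a relative tolerance (assumed rule); each extracted value
    is consumed at most once.
    """
    remaining = list(extracted)
    hits = 0
    for ref in reference:
        for i, val in enumerate(remaining):
            if abs(val - ref) <= rel_tol * max(abs(ref), 1e-12):
                hits += 1
                del remaining[i]
                break
    return hits / len(reference) if reference else 0.0

# Example: one of three reference affinity values is recovered exactly.
print(value_recall([6.2, 7.85, 9.1], [6.2, 7.9, 8.4]))  # ~0.333
```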