The STEM dataset is introduced as a new challenge to test the STEM skills of neural models. It is the largest multimodal dataset covering all STEM subjects, with 448 skills and 1,073,146 questions spanning from pre-K to 8th grade.
The dataset is designed to focus on fundamental STEM skills based on the K-12 curriculum, enabling diverse and comprehensive tests across all STEM subjects. This is in contrast to existing datasets that often concentrate on evaluating one STEM subject or expert-level abilities.
The STEM dataset is challenging for current neural models. State-of-the-art foundation models such as CLIP and GPT-3.5-Turbo outperform random guessing, but their performance still falls far short of that of average elementary students, trailing by 54.7% on average. The models struggle especially with math skills that require complex reasoning and abstract knowledge.
Fine-tuning the models on the STEM training set helps, but performance remains low relative to the human reference. The results suggest that new algorithmic innovations are needed to solve the multimodal problems in the STEM dataset.
The dataset also supports fine-grained performance analysis at different granularities, such as by subject, skill, or grade level. Such analysis reveals important shortcomings of existing models and provides insights for future research directions.
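As a rough illustration of what analysis at different granularities could look like, the sketch below groups per-question results by subject, skill, or grade and computes accuracy for each group. The record layout, the field names (`subject`, `skill`, `grade`, `answer`, `predicted`), and the `accuracy_by` helper are assumptions made for this example; they are not the dataset's actual schema or the authors' evaluation code, and the toy records are invented.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group value: accuracy} for records grouped by `key`.

    Hypothetical helper: each record is assumed to be a dict holding the
    grouping field plus `answer` and `predicted` choice indices.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["predicted"] == record["answer"])
    return {group: hits[group] / totals[group] for group in totals}


# Invented toy records for illustration only; not taken from the dataset.
records = [
    {"subject": "math", "skill": "add-fractions", "grade": "grade-4",
     "answer": 1, "predicted": 1},
    {"subject": "math", "skill": "add-fractions", "grade": "grade-4",
     "answer": 0, "predicted": 2},
    {"subject": "science", "skill": "states-of-matter", "grade": "grade-2",
     "answer": 2, "predicted": 2},
    {"subject": "science", "skill": "food-chains", "grade": "grade-4",
     "answer": 1, "predicted": 1},
]

# Report accuracy at each granularity in turn.
for key in ("subject", "skill", "grade"):
    print(key, accuracy_by(records, key))
```

Keeping the grouping key as a parameter means the same predictions can be sliced by subject, skill, or grade without re-running the model.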
Source: arxiv.org