Core Concepts
The 2nd BabyLM Challenge aims to incentivize researchers to focus on optimizing language model pretraining under data limitations inspired by human language development, and to democratize pretraining research by drawing attention to open problems that can be tackled on a university budget.
Abstract
The 2nd BabyLM Challenge will be hosted in 2024/2025, with some key changes from the previous year:
A new paper-only track is introduced to encourage contributions that relate to the challenge's goals but are not direct competition entries. Examples include novel cognitively inspired evaluation metrics and in-depth analyses of BabyLM models.
The requirement to use a fixed pretraining corpus has been relaxed. Participants may now construct their own datasets, provided they stay within the 100M-word or 10M-word budget, and must supply a datasheet for any self-constructed dataset; a budget-check sketch follows these notes.
A new vision-language track is introduced, with a provided corpus that is 50% text-only and 50% paired image-text data to facilitate participation.
The challenge includes three tracks: STRICT (100M words or fewer), STRICT-SMALL (10M words or fewer), and VISION (multimodal image-text models). Participants are free to use any training procedure, as long as their models can assign (pseudo) log-likelihoods to text, conditioned on an image in the VISION track; a minimal scoring sketch also appears below.
Baseline models for each track will be released, based on the winning submissions from the previous year's challenge. Submissions consist of model predictions, a download link for the model, and a datasheet for any self-constructed dataset.
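Since self-constructed datasets must stay within the word budget, a quick size check is useful before training. Below is a minimal sketch, assuming the corpus is a directory of plain-text files and that whitespace-delimited tokens approximate the challenge's word count; the directory name and counting heuristic are illustrative, not the official accounting.

```python
import pathlib

# Illustrative budgets for the two text-only tracks (words).
BUDGETS = {"STRICT": 100_000_000, "STRICT-SMALL": 10_000_000}

def count_words(corpus_dir: str) -> int:
    """Count whitespace-delimited words across all .txt files in a directory tree."""
    total = 0
    for path in pathlib.Path(corpus_dir).glob("**/*.txt"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                total += len(line.split())
    return total

if __name__ == "__main__":
    n = count_words("my_babylm_corpus")  # hypothetical corpus directory
    for track, budget in BUDGETS.items():
        status = "within" if n <= budget else "OVER"
        print(f"{track}: {n:,} words ({status} the {budget:,}-word budget)")
```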
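As an illustration of the scoring requirement, here is a minimal sketch of assigning a log-likelihood to a text string with a causal language model, using GPT-2 from Hugging Face transformers as a stand-in. This is not the official evaluation pipeline, and image conditioning for the VISION track is omitted; for masked language models, the analogous quantity is a pseudo log-likelihood obtained by scoring each token with it masked in turn.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def text_log_likelihood(text: str) -> float:
    """Sum of token log-probabilities under a causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position's logits predict the next token, so shift by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = ids[:, 1:]
    token_ll = log_probs.gather(2, target.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

print(text_log_likelihood("The child is reading a book."))
```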
Stats
The text-only dataset has been updated, with the QED portion replaced by data from CHILDES.
The multimodal dataset includes 50M words of text-only data and 50M words of paired image-text data, drawn from Localized Narratives and Conceptual Captions 3M.