Key Concepts
Language models vary in their ability to tell plausible scenarios from less plausible ones, an ability distinct from factual accuracy. This calls for benchmarks that assess how models apply their world knowledge.
Summary
The paper introduces PROBELM (Plausibility Ranking Evaluation for Language Models), a novel benchmark designed to assess language models' ability to prioritize plausible scenarios over less plausible alternatives based on their parametric knowledge.
Key highlights:
- Existing benchmarks often focus on factual accuracy or reasoning without explicitly incorporating broader world knowledge. PROBELM aims to bridge this gap.
- PROBELM utilizes scenarios collected from Wikidata, comprising new facts unknown to models due to the timeframe of their training data, and automatically generated less plausible scenarios.
- Models are evaluated across three prompt types (statements, text completions, and question-answering). For each prompt, the candidate scenarios are ranked by the perplexity the model assigns to them.
- Experiments with 10 models of varying sizes and architectures show that factual accuracy does not directly correlate with plausibility performance. Model architecture and training methodology also influence plausibility inference, independently of model size.
- The greater the temporal gap between a model's training data and the evaluation set, the poorer the model's performance on PROBELM, underscoring the importance of up-to-date knowledge in plausibility assessment.
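The perplexity-based ranking mentioned in the highlights can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, scenario labels, and per-token log-probabilities below are all hypothetical, and in practice the log-probabilities would come from the language model under evaluation. The sketch assumes the standard definition of perplexity as the exponential of the negative mean token log-probability, with lower perplexity read as "judged more plausible".

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def rank_scenarios(scored):
    """Return scenario names sorted by ascending perplexity.

    `scored` maps a scenario name to its list of token log-probabilities.
    The first name in the result is the one the model finds most plausible.
    """
    return sorted(scored, key=lambda name: perplexity(scored[name]))


# Hypothetical log-probabilities, invented for illustration only.
scenarios = {
    "plausible": [-1.0, -0.5, -1.5],
    "less_plausible_a": [-2.0, -2.5, -1.5],
    "less_plausible_b": [-3.0, -2.0, -4.0],
}

ranking = rank_scenarios(scenarios)
# 1-based position of the intended plausible scenario in the ranking.
rank_of_plausible = ranking.index("plausible") + 1
print(ranking, rank_of_plausible)
```

A model does well on one such item when the plausible scenario lands at or near rank 1. How per-item ranks are aggregated into a benchmark score is not described in this summary, so the sketch stops at the single-item rank.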
Statistics
Marcelo is a Brazilian footballer.
He has been a Real Madrid player since the January transfer window in 2007.
Ibero-America includes Brazil.
Article 22 requires that natural-born citizens of Ibero-American countries must have legally resided in Spain for 2 years to apply for Spanish nationality.