A sample-efficient human evaluation approach based on maximum discrepancy competition is proposed to fairly assess and rank the performance of large language models across diverse scenarios, including scientific knowledge understanding, mathematical reasoning, creative writing, and code generation.
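To make the idea concrete, here is a minimal sketch of maximum-discrepancy sample selection: spend the human-annotation budget only on prompts where two models disagree most. The similarity measure, the selection budget, and the toy model stubs are all assumptions for illustration, not the paper's actual procedure.

```python
# Sketch of maximum-discrepancy prompt selection for sample-efficient human evaluation.
# Hypothetical illustration: the disagreement measure and budget are assumptions.
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

def select_max_discrepancy_prompts(
    prompts: List[str],
    model_a: Callable[[str], str],   # wrapper around one LLM
    model_b: Callable[[str], str],   # wrapper around a competing LLM
    budget: int = 20,                # number of prompts sent to human raters
) -> List[Tuple[str, str, str]]:
    """Keep only the prompts where the two models' answers differ the most,
    so limited human judgments are spent where they are most informative."""
    scored = []
    for p in prompts:
        resp_a, resp_b = model_a(p), model_b(p)
        # Crude textual disagreement: 1 - string similarity of the two answers.
        disagreement = 1.0 - SequenceMatcher(None, resp_a, resp_b).ratio()
        scored.append((disagreement, p, resp_a, resp_b))
    scored.sort(reverse=True)                      # most-discrepant first
    return [(p, a, b) for _, p, a, b in scored[:budget]]

if __name__ == "__main__":
    # Toy stand-ins for real model calls.
    echo = lambda p: p.upper()
    mirror = lambda p: p[::-1]
    pool = ["2 + 2 = ?", "Write a haiku about rain.", "Explain entropy."]
    for prompt, a, b in select_max_discrepancy_prompts(pool, echo, mirror, budget=2):
        print(prompt, "->", a, "|", b)
```

The selected pairs would then go to human raters, whose pairwise verdicts can be aggregated into a global ranking.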
UltraEval is a lightweight, user-friendly, and comprehensive framework for evaluating the capabilities of large language models, featuring modular design, efficient inference, and extensive benchmark coverage.
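The modular design the summary mentions can be pictured as pluggable benchmark and scorer components behind a single runner. The sketch below is not UltraEval's actual API, only a generic illustration of that pattern with made-up names.

```python
# Generic modular evaluation-harness pattern (illustrative only; NOT UltraEval's API).
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Benchmark:
    name: str
    examples: List[Dict[str, str]]        # each example has "input" and "target"
    scorer: Callable[[str, str], float]   # (prediction, target) -> score

def exact_match(prediction: str, target: str) -> float:
    return float(prediction.strip() == target.strip())

def run_suite(model: Callable[[str], str], suite: List[Benchmark]) -> Dict[str, float]:
    """Run every registered benchmark against one model and report mean scores."""
    report = {}
    for bench in suite:
        scores = [bench.scorer(model(ex["input"]), ex["target"]) for ex in bench.examples]
        report[bench.name] = sum(scores) / len(scores)
    return report

if __name__ == "__main__":
    toy = Benchmark("toy_arithmetic", [{"input": "2+2=", "target": "4"}], exact_match)
    print(run_suite(lambda prompt: "4", [toy]))
```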
Large language models can learn new tasks through in-context learning, but their ability to generalize beyond the provided examples in a robust, syntax-aware manner is limited. Models pre-trained on code demonstrate better out-of-distribution generalization compared to those trained only on natural language.
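One common way to probe this kind of generalization is to build a few-shot prompt for a synthetic task and then query with inputs that fall outside the demonstration distribution. The sketch below assumes a toy list-reversal task, a simple prompt format, and an oracle stub in place of a real model; none of these come from the paper itself.

```python
# Hypothetical probe of out-of-distribution generalization under in-context learning.
from typing import Callable, List

def build_prompt(demos: List[List[str]], query: List[str]) -> str:
    """Few-shot prompt: each demo shows a list and its reversal; the query is left open."""
    lines = []
    for xs in demos:
        lines.append(f"Input: {' '.join(xs)}\nOutput: {' '.join(reversed(xs))}")
    lines.append(f"Input: {' '.join(query)}\nOutput:")
    return "\n\n".join(lines)

def ood_accuracy(model: Callable[[str], str],
                 demos: List[List[str]],
                 queries: List[List[str]]) -> float:
    """Fraction of out-of-distribution queries the model reverses correctly."""
    correct = 0
    for q in queries:
        prediction = model(build_prompt(demos, q)).strip()
        correct += prediction == " ".join(reversed(q))
    return correct / len(queries)

if __name__ == "__main__":
    demos = [["a", "b"], ["c", "d"]]                    # short in-distribution demos
    queries = [["p", "q", "r", "s"], ["x", "y", "z"]]   # longer, out-of-distribution
    # Oracle stub that always reverses the queried list, standing in for a real model.
    oracle = lambda prompt: " ".join(reversed(prompt.rsplit("Input: ", 1)[1]
                                              .split("\nOutput:")[0].split()))
    print(ood_accuracy(oracle, demos, queries))          # 1.0 for the oracle stub
```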
A significant fraction of sentences generated by retrieval-augmented language models, even those containing correct answers, are not grounded in the provided context or the models' pre-training data.
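As a rough illustration of what "grounded in the provided context" means operationally, the sketch below scores each generated sentence by how much of its content overlaps with the retrieved passage. Real studies typically use entailment models or human annotation; the lexical-overlap proxy and the threshold here are arbitrary assumptions.

```python
# Crude lexical-overlap proxy for sentence-level groundedness in a retrieved context.
import re
from typing import List

def sentences(text: str) -> List[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def grounded_fraction(answer: str, context: str, threshold: float = 0.6) -> float:
    """Fraction of answer sentences whose content words mostly appear in the context."""
    context_tokens = set(re.findall(r"\w+", context.lower()))
    sents = sentences(answer)
    grounded = 0
    for sent in sents:
        tokens = [t for t in re.findall(r"\w+", sent.lower()) if len(t) > 3]
        if not tokens:
            continue
        overlap = sum(t in context_tokens for t in tokens) / len(tokens)
        grounded += overlap >= threshold
    return grounded / max(len(sents), 1)

if __name__ == "__main__":
    ctx = "The Eiffel Tower was completed in 1889 and stands in Paris."
    ans = "The Eiffel Tower was completed in 1889. It is painted every seven years."
    print(grounded_fraction(ans, ctx))   # the second sentence is not supported by ctx
```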
Language models exhibit alarming inconsistencies in their predictions when dealing with simplified text inputs, with prediction change rates up to 50% across multiple languages and tasks.
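The headline metric here is simple to state: the share of inputs whose predicted label flips when the text is simplified. The classifier stub and the paired examples below are placeholders, not the paper's actual pipeline.

```python
# Minimal sketch of a prediction change rate over (original, simplified) text pairs.
from typing import Callable, List, Tuple

def prediction_change_rate(classify: Callable[[str], str],
                           pairs: List[Tuple[str, str]]) -> float:
    """pairs holds (original_text, simplified_text); return the fraction of label flips."""
    flips = sum(classify(orig) != classify(simple) for orig, simple in pairs)
    return flips / len(pairs)

if __name__ == "__main__":
    # Toy classifier that is sensitive to sentence length, so simplification flips it.
    classify = lambda text: "complex" if len(text.split()) > 8 else "simple"
    data = [
        ("The committee reached a unanimous decision after lengthy deliberations yesterday evening.",
         "The committee agreed yesterday."),
        ("Cats sleep a lot.", "Cats sleep a lot."),
    ]
    print(prediction_change_rate(classify, data))   # 0.5 for this toy example
```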
Multilingual masked language models exhibit varying degrees of gender bias, which can be more reliably assessed using a novel model-based sentence generation method and strict bias metrics.
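A strict pairwise bias metric of the kind the summary alludes to can be sketched as follows: for each minimally different gendered sentence pair, count how often the model scores the stereotype-consistent variant strictly higher. The scoring function is abstracted away (in practice it would be a pseudo-log-likelihood from a masked language model), and the pairs below are toy examples, not the paper's generated data.

```python
# Sketch of a strict pairwise bias metric over gendered sentence pairs.
from typing import Callable, List, Tuple

def strict_bias_score(score: Callable[[str], float],
                      pairs: List[Tuple[str, str]]) -> float:
    """pairs holds (stereotype_sentence, anti_stereotype_sentence).
    Returns the fraction of pairs where the stereotype variant wins outright;
    a value near 0.5 would indicate no systematic preference."""
    wins = sum(score(stereo) > score(anti) for stereo, anti in pairs)
    return wins / len(pairs)

if __name__ == "__main__":
    # Stand-in scorer; a real setup would use an MLM's pseudo-log-likelihood.
    fake_pll = lambda s: float(len(s))
    pairs = [
        ("He is a brilliant engineer.", "She is a brilliant engineer."),
        ("She is a caring nurse.", "He is a caring nurse."),
    ]
    print(strict_bias_score(fake_pll, pairs))
```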
FairPair, a robust evaluation framework, measures differential treatment in language models by constructing counterfactual pairs grounded in the same demographic group and accounting for inherent generation variability.
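One way to read that summary in code: sample several continuations for a prompt and for its counterfactual (e.g., name-swapped) twin, then compare the between-pair gap of some scalar feature against the within-prompt sampling variability, so ordinary generation noise is not mistaken for differential treatment. The feature choice, prompts, and ratio statistic below are assumptions, not FairPair's exact definitions.

```python
# Illustrative counterfactual-pair disparity measure that accounts for sampling noise.
import statistics
from typing import Callable, List

def paired_disparity(generate: Callable[[str, int], List[str]],
                     prompt: str, counterfactual_prompt: str,
                     feature: Callable[[str], float], n: int = 8) -> float:
    """Return |mean feature gap| divided by the pooled within-prompt std deviation."""
    a = [feature(t) for t in generate(prompt, n)]
    b = [feature(t) for t in generate(counterfactual_prompt, n)]
    gap = abs(statistics.mean(a) - statistics.mean(b))
    noise = statistics.mean([statistics.stdev(a), statistics.stdev(b)]) or 1e-9
    return gap / noise

if __name__ == "__main__":
    import random
    random.seed(0)
    # Toy generator: continuation length is random noise plus a name-dependent shift.
    def generate(prompt: str, n: int) -> List[str]:
        shift = 2 if "John" in prompt else 0
        return ["great " * (shift + random.randint(1, 3)) for _ in range(n)]
    word_count = lambda text: float(len(text.split()))
    print(paired_disparity(generate, "John is a doctor.", "Jane is a doctor.", word_count))
```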
Large language models often fail to correctly understand non-affirmative statements, particularly those involving hypothetical scenarios, and are susceptible to knowledge conflicts when answering questions based on such contexts.
The Hallucinations Leaderboard is an open initiative to quantitatively measure and compare the tendency of large language models to produce hallucinations: outputs that do not align with factual reality or the input context.
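A leaderboard of this kind boils down to aggregating per-benchmark hallucination rates into a ranking where lower is better. The model names, benchmarks, and numbers in the sketch below are made up purely for illustration.

```python
# Sketch of leaderboard-style aggregation of hallucination rates across benchmarks.
from typing import Dict, List, Tuple

def rank_models(results: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """results maps model -> {benchmark: hallucination_rate}; returns an ascending ranking."""
    averages = {m: sum(r.values()) / len(r) for m, r in results.items()}
    return sorted(averages.items(), key=lambda kv: kv[1])

if __name__ == "__main__":
    toy_results = {
        "model-a": {"closed_book_qa": 0.22, "summarisation": 0.15},
        "model-b": {"closed_book_qa": 0.31, "summarisation": 0.12},
    }
    for rank, (model, rate) in enumerate(rank_models(toy_results), start=1):
        print(rank, model, f"{rate:.3f}")
```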
A simple regression-based approach is proposed to control for length bias in the AlpacaEval automated evaluation metric, yielding a more robust and accurate measure of chatbot performance.
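In the spirit of that summary, the sketch below fits a logistic regression of "judge preferred the model" on the standardized length difference between the two answers, then reads off the predicted win rate at zero length difference. This is a simplified stand-in for the idea, not AlpacaEval's exact length-controlled estimator.

```python
# Sketch of regression-based length debiasing of a pairwise win rate.
import math
from typing import List, Tuple

def length_controlled_win_rate(data: List[Tuple[int, float]],
                               lr: float = 0.1, steps: int = 5000) -> float:
    """data holds (win, length_diff) pairs: win in {0, 1}, length_diff = model - baseline."""
    diffs = [d for _, d in data]
    mean = sum(diffs) / len(diffs)
    std = (sum((d - mean) ** 2 for d in diffs) / len(diffs)) ** 0.5 or 1.0
    xs = [(d - mean) / std for d in diffs]
    b0, b1 = 0.0, 0.0                       # intercept, length coefficient
    for _ in range(steps):                  # plain gradient ascent on the log-likelihood
        g0 = g1 = 0.0
        for (win, _), x in zip(data, xs):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += win - p
            g1 += (win - p) * x
        b0 += lr * g0 / len(data)
        b1 += lr * g1 / len(data)
    # Predicted win probability at zero raw length difference (length advantage removed).
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * (0.0 - mean) / std)))

if __name__ == "__main__":
    # Toy judgements where longer answers tend to win regardless of quality.
    toy = [(1, 120.0), (1, 80.0), (0, -40.0), (1, 30.0), (0, -90.0), (0, 10.0)]
    raw = sum(w for w, _ in toy) / len(toy)
    print(f"raw win rate {raw:.2f}, length-controlled {length_controlled_win_rate(toy):.2f}")
```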