Current large language models (LLMs) struggle to demonstrate adequate understanding of Indic languages and cultures, highlighting the need for a dedicated benchmark like MILU to drive progress in this area.
Language models' (LMs) probability scores correlate more closely with human acceptability judgments when mapped through MORCELA, a new linking theory that accounts for model-specific variation in sensitivity to sentence length and word frequency.
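As a rough illustration of this kind of linking function (not MORCELA's actual formulation), the sketch below fits per-model correction coefficients that map an LM's sentence log probability, together with sentence length and unigram log frequency, onto human acceptability ratings; all function and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_linking_model(logprobs, lengths, unigram_logprobs, human_ratings):
    """Fit per-model correction coefficients on held-out judgment data.

    Unlike fixed corrections (e.g., SLOR), the weights on length and
    unigram frequency are learned separately for each language model.
    """
    X = np.column_stack([logprobs, lengths, unigram_logprobs])
    return LinearRegression().fit(X, human_ratings)

def predicted_acceptability(linking_model, logprob, length, unigram_logprob):
    """Map a new sentence's LM score to a predicted acceptability rating."""
    return linking_model.predict([[logprob, length, unigram_logprob]])[0]
```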
This article introduces a statistically rigorous framework for analyzing language model evaluations, advocating for the use of confidence intervals, paired statistical tests, and power analysis to improve the reliability and informativeness of model comparisons.
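A minimal sketch of the paired-comparison idea, assuming two models are scored on the same evaluation items: report the mean score difference with a confidence interval and a paired t-test rather than comparing raw averages. The function name and interface are illustrative, not the article's code.

```python
import numpy as np
from scipy import stats

def paired_model_comparison(scores_a, scores_b, alpha=0.05):
    """Compare two models scored on the same eval items (paired design).

    Returns the mean score difference, a (1 - alpha) confidence interval
    based on the standard error of the paired differences, and the paired
    t-test p-value.
    """
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    mean_diff = diffs.mean()
    sem = diffs.std(ddof=1) / np.sqrt(len(diffs))
    t_crit = stats.t.ppf(1 - alpha / 2, df=len(diffs) - 1)
    ci = (mean_diff - t_crit * sem, mean_diff + t_crit * sem)
    _, p_value = stats.ttest_rel(scores_a, scores_b)
    return mean_diff, ci, p_value

# Example: per-question accuracies (0/1) for two models on a shared benchmark.
# mean_diff, ci, p = paired_model_comparison([1, 0, 1, 1, 0], [1, 1, 0, 1, 0])
```

Pairing by item removes variance that both models share on the same questions, which is what makes the resulting intervals and tests more informative than comparing two independent accuracy numbers.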
LINGOLY, a novel benchmark built from Linguistics Olympiad puzzles, reveals that even state-of-the-art LLMs struggle with multi-step reasoning in low-resource languages, particularly once memorization is controlled for.
Inconsistent human evaluations of language models, particularly in pairwise comparisons, often arise when model outputs are hard to tell apart. The SEPARABILITY metric addresses this by quantifying how distinguishable two models' generations are on a given input, providing a measure of evaluation reliability and enabling more robust model comparisons.
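One way to make this concrete (a hedged sketch, not the paper's actual metric): sample several generations per model for the same input and compare within-model similarity to cross-model similarity. The `similarity` callable is a placeholder for any text-similarity measure.

```python
from itertools import combinations, product

def separability_sketch(gens_a, gens_b, similarity):
    """Illustrative distinguishability score for one input.

    gens_a, gens_b: lists (length >= 2) of sampled generations from models A and B.
    similarity: callable mapping (text, text) -> float in [0, 1].

    Positive values suggest the two models' outputs are distinguishable on this
    input; values near zero suggest pairwise human judgments may be unreliable.
    """
    within = [similarity(x, y) for g in (gens_a, gens_b)
              for x, y in combinations(g, 2)]
    across = [similarity(x, y) for x, y in product(gens_a, gens_b)]
    return sum(within) / len(within) - sum(across) / len(across)
```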
Existing benchmarks for evaluating language models as judges of text quality primarily focus on English, hindering the assessment of these models' effectiveness in multilingual contexts. MM-Eval addresses this gap by introducing a multilingual benchmark covering 18 languages and various linguistic challenges, revealing that both proprietary and open-source language models have significant room for improvement in multilingual settings.
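The core meta-evaluation step such a benchmark supports can be sketched as follows (an illustrative loop, not MM-Eval's actual harness or data format): measure how often a judge model's verdict matches the human-preferred response, broken down by language.

```python
from collections import defaultdict

def judge_accuracy_by_language(examples):
    """Per-language agreement between a judge model and human preferences.

    Each example is assumed to be a dict with a 'language', the human-preferred
    response id ('gold'), and the judge model's verdict ('judged').
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["language"]] += 1
        correct[ex["language"]] += int(ex["judged"] == ex["gold"])
    return {lang: correct[lang] / total[lang] for lang in total}
```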
DOLOMITES is a novel benchmark designed to evaluate the capabilities of language models in assisting experts with complex, domain-specific writing tasks, revealing significant room for improvement in both model performance and automatic evaluation methods.
Large language models demonstrate some sensitivity to argument roles in sentence processing, but their performance differs significantly from human behavior, suggesting a reliance on lexical cues rather than a deep understanding of syntactic structure and argument role relationships.
This paper introduces Fisher susceptibility, an efficient method for estimating the sensitivity of language models to input context, offering a faster alternative to the computationally expensive Monte Carlo approximation.
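For context, the baseline the paper aims to accelerate can be sketched as a Monte Carlo estimate: average the divergence between the model's answer distribution without context and its distributions under sampled contexts. The KL-based formulation and names below are assumptions for illustration; the Fisher-information approximation itself is not reproduced here.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions over the same support."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def monte_carlo_susceptibility(answer_dist_no_context, answer_dists_with_context):
    """Illustrative Monte Carlo estimate of context sensitivity.

    Averages the divergence between the context-free answer distribution and
    the distributions obtained under sampled contexts; Fisher susceptibility
    is intended to approximate this quantity without the expensive sampling.
    """
    return float(np.mean([kl_divergence(d, answer_dist_no_context)
                          for d in answer_dists_with_context]))
```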
While large language models (LLMs) have made significant strides in various language tasks, their developmental trajectory does not mirror human language acquisition: their capabilities are shaped more by training data and architecture than by any progression through the stages of human language development.