Existing large language models (LLMs) struggle to use tools effectively in complex, real-world scenarios, highlighting the need for more robust benchmarks and evaluation metrics for tool-augmented LLMs.
Natural language inference (NLI) benchmarks, though less popular in recent times, remain valuable for evaluating and improving large language models (LLMs), offering insights into model discriminability, training progression, and alignment with human judgment distributions.
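As a rough illustration of that last point (not taken from the paper), alignment with human judgment distributions can be measured by comparing a model's per-item label probabilities against the distribution of human annotator labels, for example with Jensen-Shannon divergence. The label order and probabilities below are hypothetical.

```python
# Illustrative sketch: compare a model's NLI label distribution with a human
# judgment distribution for a single item via Jensen-Shannon divergence.
import math

LABELS = ("entailment", "neutral", "contradiction")

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (base 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical per-item distributions: fraction of annotators choosing each
# label vs. the model's (softmaxed) label probabilities.
human = (0.6, 0.3, 0.1)
model = (0.8, 0.15, 0.05)

print(f"JS divergence: {js_divergence(human, model):.4f}")  # lower = closer to human judgments
```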
This paper introduces P-MMEval, a new benchmark designed to comprehensively evaluate the multilingual capabilities of large language models (LLMs) across a variety of tasks, addressing the limitations of existing benchmarks that primarily focus on English or specific aspects of language processing.
This research introduces a novel approach to assessing and quantifying the risk-taking behaviors and inherent biases of Large Language Models (LLMs) through role-playing scenarios and specialized ethical scales, revealing potential ethical concerns and avenues for improvement in LLM development.
Large language models (LLMs) demonstrate a stronger grasp of linguistic form (grammar) than of meaning (semantics), suggesting that their handling of language relies heavily on statistical correlations rather than genuine conceptual understanding.
RoCar is a novel evaluation method for Large Language Models (LLMs) that uses randomly generated social network graphs to assess reasoning and memory capabilities, promoting fairness by minimizing the chance that an LLM has prior exposure to the evaluation tasks.
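For intuition only, the sketch below shows one way randomly generated social-graph test items could be built; the names, relations, prompt format, and graph size are assumptions for illustration, not RoCar's actual construction.

```python
# Illustrative sketch (not RoCar's actual procedure): generate a small random
# social graph and turn one edge into a reasoning question, so the test items
# are freshly created and unlikely to exist in any training corpus.
import random

PEOPLE = ["Alice", "Bob", "Carol", "Dave", "Erin", "Frank"]  # hypothetical names
RELATIONS = ["colleague", "neighbor", "cousin", "friend"]    # hypothetical relations

def random_social_graph(n_edges=5, seed=None):
    """Return a list of (person_a, person_b, relation) edges."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        a, b = rng.sample(PEOPLE, 2)
        edges.add((a, b, rng.choice(RELATIONS)))
    return sorted(edges)

def build_prompt(edges):
    """Turn the graph into a memory/reasoning question about one edge."""
    facts = "\n".join(f"- {a} is {b}'s {rel}." for a, b, rel in edges)
    a, b, rel = edges[0]
    question = f"Based only on the facts above, what is {a} to {b}?"
    return f"{facts}\n\n{question}", rel  # prompt plus expected answer

prompt, expected = build_prompt(random_social_graph(seed=7))
print(prompt)
print("Expected answer:", expected)
```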
Existing mathematical reasoning benchmarks for Large Language Models (LLMs) are limited in their ability to assess true reasoning capabilities, leading to the development of UTMath, a novel benchmark that utilizes unit tests and a Reasoning-to-Coding of Thoughts (RCoT) approach to robustly evaluate LLM reasoning skills.
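As a hedged illustration of unit-test-based scoring (the task, the `solve` function name, and the harness are assumptions, not UTMath's actual pipeline), the idea is that the model must produce a general solver that passes every test case, including inputs large enough that memorized answers will not suffice.

```python
# Minimal sketch of unit-test-based evaluation: execute the model's candidate
# code and accept it only if it passes all (input, expected) test cases.
def run_unit_tests(candidate_src, test_cases, func_name="solve"):
    namespace = {}
    try:
        exec(candidate_src, namespace)  # in practice this should run in a sandbox
        solver = namespace[func_name]
        return all(solver(x) == expected for x, expected in test_cases)
    except Exception:
        return False

# Hypothetical task: n-th triangular number, with a large test case so that a
# memorized lookup table would fail.
candidate = "def solve(n):\n    return n * (n + 1) // 2\n"
tests = [(1, 1), (4, 10), (10_000, 50_005_000)]
print("passed" if run_unit_tests(candidate, tests) else "failed")
```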
This paper introduces Chinese SimpleQA, a new benchmark designed to evaluate the factuality of large language models (LLMs) when answering short questions in Chinese.
This paper introduces LIFBench, a novel benchmark designed to evaluate the instruction-following capabilities and stability of Large Language Models (LLMs) in long-context scenarios, along with LIFEval, a rubric-based evaluation framework for accurate and efficient assessment of LLM performance.
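For illustration, a rubric-style check might look like the sketch below; the criteria, their names, and the scoring rule are hypothetical, not LIFEval's actual rubric. The appeal of this style of scoring is that it needs no judge model: each criterion is a deterministic predicate over the response.

```python
# Illustrative rubric-based scorer: each criterion is a named predicate over
# the model's response, and the score is the fraction of criteria satisfied.
RUBRIC = {
    "mentions_required_keywords": lambda r: all(k in r for k in ("summary", "sources")),
    "respects_length_limit": lambda r: len(r.split()) <= 120,
    "uses_numbered_list": lambda r: r.lstrip().startswith("1."),
}

def rubric_score(response):
    results = {name: bool(check(response)) for name, check in RUBRIC.items()}
    return sum(results.values()) / len(results), results

response = "1. summary of findings\n2. sources listed at the end"
score, detail = rubric_score(response)
print(f"score={score:.2f}", detail)
```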
The OpenAI o1-mini model demonstrates strong intuitive reasoning and problem-solving abilities in mathematics, performing comparably on public and private datasets, which suggests its capabilities extend beyond memorization; however, it often struggles to produce complete, rigorous proofs.