FreeEval is a modular and extensible framework that enables trustworthy and efficient automatic evaluation of Large Language Models (LLMs) by providing a unified implementation of diverse evaluation methods, incorporating meta-evaluation techniques, and leveraging high-performance inference backends.
S3EVAL is a scalable, synthetic, and systematic evaluation suite that uses SQL execution as a proxy task to comprehensively assess the long-context reasoning capabilities of large language models.
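To illustrate the SQL-execution-as-proxy idea, the sketch below builds a small synthetic table, executes a query with Python's built-in sqlite3 module to obtain a programmatically verifiable gold answer, and renders the table as prompt text whose length can be scaled. The schema, prompt format, and helper names are illustrative assumptions, not S3EVAL's actual implementation.

```python
# Minimal sketch of "SQL execution as a proxy task" (assumed example,
# not S3EVAL's real pipeline): the gold answer comes from actually
# executing the query, so evaluation needs no human annotation.
import random
import sqlite3


def make_synthetic_table(n_rows: int):
    """Create an in-memory table with random rows; more rows means a
    longer context, which is one way to scale difficulty."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    rows = [(random.choice(["north", "south", "east", "west"]),
             random.randint(1, 100)) for _ in range(n_rows)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn, rows


def build_example(n_rows: int = 50):
    """Return (prompt, gold_answer) for one synthetic evaluation item."""
    conn, rows = make_synthetic_table(n_rows)
    query = ("SELECT region, SUM(amount) FROM sales "
             "GROUP BY region ORDER BY region")
    gold = conn.execute(query).fetchall()  # executed gold answer
    table_text = "\n".join(f"{region} | {amount}" for region, amount in rows)
    prompt = (
        "Table `sales` (region | amount):\n"
        f"{table_text}\n\n"
        f"Answer this SQL query over the table: {query}"
    )
    return prompt, gold


prompt, gold = build_example()
print(prompt[:120], "...")
print("gold answer:", gold)
```

The model's free-form answer would then be compared against the executed result, which is what makes the suite scalable and systematic: new items are generated on demand rather than curated.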
Evalverse is a novel library that streamlines the evaluation of Large Language Models (LLMs) by unifying disparate evaluation tools into a single, user-friendly framework, enabling both researchers and practitioners to comprehensively assess LLM performance.
Data contamination, in which evaluation data is included in a model's training data, poses a critical challenge to the integrity and reliability of large language model (LLM) evaluations. This paper provides a comprehensive survey of data and model contamination detection methods and introduces the open-source LLMSanitize library to help the community centralize and share implementations of contamination detection algorithms.
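As a concrete example of one classic family of detection methods surveyed in such work, the sketch below flags an evaluation sample whose word-level n-grams overlap heavily with a training document. The function names, n-gram size, and threshold are assumptions for illustration, not the LLMSanitize API.

```python
# Minimal sketch of an n-gram-overlap contamination check (assumed
# example; not LLMSanitize's actual interface or algorithm choices).
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams of a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(eval_sample: str,
                    training_docs: Iterable[str],
                    n: int = 8,
                    threshold: float = 0.5) -> bool:
    """Flag the sample if a large fraction of its n-grams also appear
    verbatim in any single training document."""
    sample_grams = ngrams(eval_sample, n)
    if not sample_grams:
        return False
    for doc in training_docs:
        overlap = len(sample_grams & ngrams(doc, n)) / len(sample_grams)
        if overlap >= threshold:
            return True
    return False


# Usage: screen a benchmark question against (a sample of) training data.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
question = "The quick brown fox jumps over the lazy dog near the river bank today."
print(is_contaminated(question, corpus))  # True: near-verbatim overlap
```

String-matching checks like this cover data contamination when the training corpus is accessible; model-contamination methods instead probe the model itself (for example, by testing whether it can reproduce benchmark samples), which is why a shared library of both families of detectors is useful.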