FreeEval is a modular and extensible framework that enables trustworthy and efficient automatic evaluation of Large Language Models (LLMs) by providing a unified implementation of diverse evaluation methods, incorporating meta-evaluation techniques, and leveraging high-performance inference backends.
S3EVAL is a scalable, synthetic, and systematic evaluation suite that uses SQL execution as a proxy task to comprehensively assess the long-context reasoning capabilities of large language models.
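To illustrate the SQL-execution-as-proxy idea, the sketch below builds a small synthetic table, executes a query with Python's built-in sqlite3 module to obtain a programmatically verifiable gold answer, and renders the table as prompt text whose length can be scaled. The schema, prompt format, and helper names are illustrative assumptions, not S3EVAL's actual implementation.

```python
# Minimal sketch of "SQL execution as a proxy task" (assumed example,
# not S3EVAL's real pipeline): the gold answer comes from actually
# executing the query, so evaluation needs no human annotation.
import random
import sqlite3


def make_synthetic_table(n_rows: int):
    """Create an in-memory table with random rows; more rows means a
    longer context, which is one way to scale difficulty."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    rows = [(random.choice(["north", "south", "east", "west"]),
             random.randint(1, 100)) for _ in range(n_rows)]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn, rows


def build_example(n_rows: int = 50):
    """Return (prompt, gold_answer) for one synthetic evaluation item."""
    conn, rows = make_synthetic_table(n_rows)
    query = ("SELECT region, SUM(amount) FROM sales "
             "GROUP BY region ORDER BY region")
    gold = conn.execute(query).fetchall()  # executed gold answer
    table_text = "\n".join(f"{region} | {amount}" for region, amount in rows)
    prompt = (
        "Table `sales` (region | amount):\n"
        f"{table_text}\n\n"
        f"Answer this SQL query over the table: {query}"
    )
    return prompt, gold


prompt, gold = build_example()
print(prompt[:120], "...")
print("gold answer:", gold)
```

The model's free-form answer would then be compared against the executed result, which is what makes the suite scalable and systematic: new items are generated on demand rather than curated.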
Evalverse is a novel library that streamlines the evaluation of Large Language Models (LLMs) by unifying disparate evaluation tools into a single, user-friendly framework, enabling both researchers and practitioners to comprehensively assess LLM performance.
Data contamination, in which evaluation data is included in a model's training data, poses a critical challenge to the integrity and reliability of large language model (LLM) evaluations. This paper provides a comprehensive survey of data and model contamination detection methods and introduces the open-source LLMSanitize library to help the community centralize and share implementations of contamination detection algorithms.
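As a concrete example of one classic family of detection methods surveyed in such work, the sketch below flags an evaluation sample whose word-level n-grams overlap heavily with a training document. The function names, n-gram size, and threshold are assumptions for illustration, not the LLMSanitize API.

```python
# Minimal sketch of an n-gram-overlap contamination check (assumed
# example; not LLMSanitize's actual interface or algorithm choices).
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams of a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(eval_sample: str,
                    training_docs: Iterable[str],
                    n: int = 8,
                    threshold: float = 0.5) -> bool:
    """Flag the sample if a large fraction of its n-grams also appear
    verbatim in any single training document."""
    sample_grams = ngrams(eval_sample, n)
    if not sample_grams:
        return False
    for doc in training_docs:
        overlap = len(sample_grams & ngrams(doc, n)) / len(sample_grams)
        if overlap >= threshold:
            return True
    return False


# Usage: screen a benchmark question against (a sample of) training data.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
question = "The quick brown fox jumps over the lazy dog near the river bank today."
print(is_contaminated(question, corpus))  # True: near-verbatim overlap
```

String-matching checks like this cover data contamination when the training corpus is accessible; model-contamination methods instead probe the model itself (for example, by testing whether it can reproduce benchmark samples), which is why a shared library of both families of detectors is useful.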