Core Concepts
BAMBOO provides a comprehensive evaluation benchmark for assessing the long text modeling capacities of Large Language Models (LLMs) across various tasks and domains.
Abstract
BAMBOO introduces a multi-task long-context benchmark with 10 datasets covering question answering, hallucination detection, text sorting, language modeling, and code completion. It aims to evaluate LLMs' abilities to capture long-range dependencies and fine-grained details in lengthy texts. The benchmark is constructed to avoid data contamination, to support accurate automatic evaluation, and to cover different length levels, allowing a more reliable assessment of LLM performance. Experimental results show ChatGPT-16k consistently outperforming the other evaluated models, though it still struggles on less common tasks. The study highlights remaining challenges such as instruction forgetting and format errors, and points to the need for more diverse training data to improve LLMs' long-context capabilities.
Stats
BAMBOO consists of 10 datasets spanning five task types: question answering, hallucination detection, text sorting, language modeling, and code completion.
ChatGPT-16k achieves the best performance on most datasets.
Vicuna-16k struggles on uncommon tasks like text sorting and code completion.
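To make the benchmark's structure concrete, below is a minimal sketch of how an evaluation loop over a BAMBOO-style multi-task suite could be organized, grouping datasets by task type and aggregating a simple automatic score per task. The task-to-dataset mapping, dataset names, example fields, model call, and exact-match metric are all illustrative assumptions, not the benchmark's actual datasets, metrics, or API.

```python
# Hypothetical sketch of a BAMBOO-style evaluation harness.
# All names below (datasets, fields, metric) are illustrative, not the real API.
from collections import defaultdict

# Hypothetical mapping from the five task types to long-context datasets.
TASKS = {
    "question_answering": ["qa_long_16k"],
    "hallucination_detection": ["halluc_16k"],
    "text_sorting": ["sort_16k"],
    "language_modeling": ["lm_16k"],
    "code_completion": ["code_16k"],
}

def dummy_model(prompt: str) -> str:
    """Stand-in for a call to a long-context LLM (e.g. a 16k-context chat model)."""
    return "placeholder answer"

def exact_match(prediction: str, reference: str) -> float:
    """Toy automatic metric: 1.0 if normalized strings match, else 0.0.
    A real benchmark would use task-specific automatic metrics instead."""
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(examples_by_dataset: dict) -> dict:
    """Score each dataset and aggregate results per task type."""
    task_scores = defaultdict(list)
    for task, datasets in TASKS.items():
        for name in datasets:
            for example in examples_by_dataset.get(name, []):
                pred = dummy_model(example["prompt"])
                task_scores[task].append(exact_match(pred, example["answer"]))
    return {task: sum(s) / len(s) for task, s in task_scores.items() if s}

if __name__ == "__main__":
    toy_data = {
        "qa_long_16k": [
            {"prompt": "Who chaired the meeting?", "answer": "placeholder answer"}
        ]
    }
    print(evaluate(toy_data))  # e.g. {'question_answering': 1.0}
```

Reporting one aggregate score per task type, rather than a single overall number, mirrors the paper's emphasis on comparing models across distinct long-context abilities and length levels.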