Evaluating Long-Context Capabilities of Large Language Models with Adaptable Benchmarks
Ada-LEval, a length-adaptable benchmark, is introduced to rigorously evaluate the long-context capabilities of large language models; evaluation on it reveals significant limitations in model performance, especially in ultra-long-context settings.
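To make the "length-adaptable" idea concrete, here is a minimal sketch of how a benchmark might generate test samples at an arbitrary target context length by padding a task with distractor text. Everything here is illustrative: `make_sample`, the filler scheme, and the word-count-to-token heuristic are assumptions for demonstration, not Ada-LEval's actual task construction.

```python
import random

def make_sample(question: str, gold: str, target_tokens: int,
                tokens_per_word: float = 1.3) -> str:
    """Build a prompt of roughly `target_tokens` tokens by padding a
    question and its gold passage with synthetic distractor sentences.
    Token length is approximated from word count (a rough heuristic)."""
    filler = "This is an irrelevant distractor sentence used as padding."
    budget = int(target_tokens / tokens_per_word)
    parts = [question]
    used = len(question.split()) + len(gold.split())
    # Append distractors until the approximate token budget is reached.
    while used + len(filler.split()) <= budget:
        parts.append(filler)
        used += len(filler.split())
    # Hide the gold passage at a random position among the distractors,
    # so the model must actually search the full context.
    parts.insert(random.randrange(1, len(parts) + 1), gold)
    return "\n\n".join(parts)

# Probe the same underlying task at increasing context lengths.
for n in (2_000, 8_000, 32_000, 128_000):
    prompt = make_sample("Which passage mentions the launch date?",
                         "GOLD: The launch date was 12 March.", n)
    print(n, "->", len(prompt.split()), "words")
```

Because the same task is instantiated at every length, any drop in accuracy as `target_tokens` grows can be attributed to context length rather than task difficulty, which is what makes length-adaptable evaluation informative for ultra-long-context settings.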