S3Eval: A Scalable, Synthetic, and Systematic Evaluation Suite for Assessing Long-Context Reasoning Capabilities of Large Language Models
S3Eval is a scalable, synthetic, and systematic evaluation suite that uses SQL execution as a proxy task to assess the long-context reasoning capabilities of large language models.
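To make the proxy task concrete, here is a minimal sketch of how one such example could be built: a table is generated at random, a SQL query is executed against it, and the execution result serves as the reference answer. This is an illustration only, not the suite's actual generator. The schema (`id`, `city`, `score`), the fixed query, the prompt layout, and the function names `build_random_table` and `make_example` are all assumptions introduced here.

```python
import random
import sqlite3


def build_random_table(num_rows, seed=0):
    """Return rows for a hypothetical synthetic table (id, city, score)."""
    rng = random.Random(seed)
    cities = ["amber", "birch", "cedar", "delta"]  # made-up cell values
    return [(i, rng.choice(cities), rng.randint(0, 100)) for i in range(num_rows)]


def make_example(num_rows, seed=0):
    """Build one (prompt, answer, rows) triple for the proxy task."""
    rows = build_random_table(num_rows, seed)
    query = "SELECT COUNT(*) FROM t WHERE score > 50"  # assumed example query

    # Execute the query with SQLite to obtain the reference answer.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, city TEXT, score INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    answer = conn.execute(query).fetchone()[0]
    conn.close()

    # Serialize the table as plain text; more rows give a longer context.
    header = "id | city | score"
    body = "\n".join(f"{i} | {c} | {s}" for i, c, s in rows)
    prompt = f"{header}\n{body}\n\nSQL: {query}\nAnswer:"
    return prompt, answer, rows


if __name__ == "__main__":
    prompt, answer, _ = make_example(num_rows=5, seed=1)
    print(prompt)
    print("reference answer:", answer)
```

In this sketch, context length is controlled by `num_rows` alone, and the reference answer comes from the database engine rather than from human annotation, which is what makes this kind of setup straightforward to scale.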