ERBench leverages relational databases and the entity-relationship (ER) model to evaluate LLMs, automatically generating complex questions whose answers can be verified against the underlying records.
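As a rough illustration of how a relational record can be turned into an automatically verifiable question, the sketch below uses an in-memory SQLite table; the schema, question template, and `check_answer` helper are hypothetical stand-ins, not ERBench's actual pipeline.

```python
# Minimal sketch: a question is templated from a relational record, and the
# stored attribute value serves as the automatically verifiable ground truth.
# The table, template, and check_answer helper are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (title TEXT PRIMARY KEY, director TEXT, year INT)")
conn.execute("INSERT INTO movie VALUES ('Inception', 'Christopher Nolan', 2010)")

def make_question(title: str) -> tuple[str, str]:
    """Build a question from a record; the answer is verifiable against the DB."""
    row = conn.execute(
        "SELECT director FROM movie WHERE title = ?", (title,)
    ).fetchone()
    question = f"Who directed the movie '{title}'?"
    return question, row[0]  # ground truth taken directly from the relation

def check_answer(model_answer: str, ground_truth: str) -> bool:
    """Automatic verification: does the model's answer contain the stored value?"""
    return ground_truth.lower() in model_answer.lower()

question, truth = make_question("Inception")
print(question)
print(check_answer("It was directed by Christopher Nolan.", truth))  # True
```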
The MINT benchmark evaluates LLMs in multi-turn interactions, showing that they benefit from using tools and receiving natural language feedback.
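A minimal sketch of a multi-turn evaluation loop in this spirit is shown below, assuming hypothetical `call_model` and `give_feedback` callables that stand in for the LLM and the feedback provider; MINT's actual harness, tool sandbox, and feedback prompts differ.

```python
# Sketch of a multi-turn loop: the model proposes Python code as its tool call,
# the harness executes it, and natural-language feedback is appended to the
# transcript until the task is solved or the turn budget runs out.
from typing import Callable

def run_episode(task: str,
                call_model: Callable[[str], str],
                give_feedback: Callable[[str, str], str],
                max_turns: int = 5) -> bool:
    transcript = f"Task: {task}\n"
    for _ in range(max_turns):
        proposal = call_model(transcript)      # model proposes Python code as its "tool call"
        try:
            scope: dict = {}
            exec(proposal, scope)              # execute the proposed code in a scratch namespace
            result = str(scope.get("answer"))  # convention: the code stores its result in `answer`
        except Exception as err:               # execution errors are surfaced back to the model
            result = f"Execution error: {err}"
        feedback = give_feedback(task, result)  # natural-language feedback on the observed result
        if feedback.strip().lower() == "correct":
            return True
        transcript += f"\nObservation: {result}\nFeedback: {feedback}\n"
    return False
```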
MRKE is a new benchmark that evaluates LLM performance on multi-hop question answering, highlighting the need for trustworthy evaluation of reasoning abilities.
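The sketch below illustrates one way a multi-hop benchmark can separately score the final answer and each intermediate hop, using a hypothetical example format and a hypothetical `call_model` callable; it is not MRKE's actual data schema or scorer.

```python
# Sketch of hop-wise evaluation: each item carries the final answer plus the
# intermediate sub-questions (hops), which are checked independently to see
# whether the model actually reasons through the chain.
from typing import Callable

example = {
    "question": "In which country was the director of 'Inception' born?",
    "answer": "United Kingdom",
    "hops": [
        {"sub_question": "Who directed 'Inception'?", "answer": "Christopher Nolan"},
        {"sub_question": "In which country was Christopher Nolan born?", "answer": "United Kingdom"},
    ],
}

def evaluate(example: dict, call_model: Callable[[str], str]) -> dict:
    """Score the final answer and every intermediate hop independently."""
    def correct(question: str, gold: str) -> bool:
        return gold.lower() in call_model(question).lower()
    return {
        "final_correct": correct(example["question"], example["answer"]),
        "hop_correct": [correct(h["sub_question"], h["answer"]) for h in example["hops"]],
    }
```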