Core Concepts
CausalBench is a comprehensive benchmark for evaluating the causal learning capabilities of large language models (LLMs) across diverse datasets, tasks, and prompt formats.
Abstract
CausalBench is a benchmark for evaluating the causal learning capabilities of large language models (LLMs). It comprises three complementary views:
Data View:
CausalBench incorporates 15 commonly used real-world causal learning datasets, ranging in scale from 2 to 109 nodes, enabling evaluation of LLM capabilities across a wide range of graph scales and complexities.
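As a rough illustration of this range, the sketch below organizes a few representative datasets in a hypothetical registry. The node and edge counts are those reported in the bnlearn repository for these standard networks; the `CausalDataset` structure and `by_scale` helper are illustrative, not part of CausalBench's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CausalDataset:
    name: str        # dataset identifier
    num_nodes: int   # number of variables in the ground-truth graph
    num_edges: int   # number of directed edges in the ground-truth graph

# A few representative entries spanning the 2-to-109-node range
# (counts as reported in the bnlearn repository).
REGISTRY = [
    CausalDataset("Asia",       num_nodes=8,   num_edges=8),
    CausalDataset("Sachs",      num_nodes=11,  num_edges=17),
    CausalDataset("Child",      num_nodes=20,  num_edges=25),
    CausalDataset("Alarm",      num_nodes=37,  num_edges=46),
    CausalDataset("Pathfinder", num_nodes=109, num_edges=195),
]

def by_scale(max_nodes: int) -> list[CausalDataset]:
    """Select datasets at or below a given scale, e.g. for a small-scale run."""
    return [d for d in REGISTRY if d.num_nodes <= max_nodes]

print([d.name for d in by_scale(40)])  # -> ['Asia', 'Sachs', 'Child', 'Alarm']
```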
Task View:
CausalBench establishes three core evaluation tasks: identifying correlation, causal skeleton, and causality. These tasks probe an LLM's understanding of causal relationships at increasing depths and difficulties.
An additional chain-of-thought task further evaluates LLM reasoning abilities for causal discovery.
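To make the difference in depth concrete, here is a minimal scoring sketch (not the benchmark's actual evaluation code): given a ground-truth directed adjacency matrix and an LLM-derived prediction, the skeleton task is direction-agnostic, while the causality task also penalizes reversed edges. The correlation task, which judges pairwise association, is omitted for brevity.

```python
import numpy as np

def f1(true_edges: np.ndarray, pred_edges: np.ndarray) -> float:
    """F1 score over binary edge indicators."""
    tp = np.logical_and(true_edges, pred_edges).sum()
    fp = np.logical_and(~true_edges, pred_edges).sum()
    fn = np.logical_and(true_edges, ~pred_edges).sum()
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def score(G: np.ndarray, P: np.ndarray) -> dict:
    """Score a prediction at two depths: undirected skeleton and directed causality."""
    skel_true = np.triu(G | G.T, k=1).astype(bool)  # undirected edges
    skel_pred = np.triu(P | P.T, k=1).astype(bool)
    return {
        "skeleton":  f1(skel_true, skel_pred),            # direction-agnostic
        "causality": f1(G.astype(bool), P.astype(bool)),  # direction-sensitive
    }

# Toy 3-variable example: ground truth A->B->C; prediction reverses B->C.
G = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
P = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]])
print(score(G, P))  # skeleton F1 = 1.0, causality F1 = 0.5
```

The reversed edge leaves the skeleton score perfect but halves the causality score, which is exactly the distinction between the two task depths.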
Prompt View:
CausalBench uses four prompt formats: variable names alone, variable names with background knowledge, variable names with structured data, and variable names with both background knowledge and structured data. Together these formats exercise LLM capabilities in prior-knowledge integration and long-text comprehension.
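A minimal sketch of how the four formats might be composed follows, assuming a hypothetical `build_prompt` helper; the wording and field names are illustrative, not CausalBench's exact templates.

```python
def build_prompt(var_a: str, var_b: str,
                 background: str | None = None,
                 data_rows: list[str] | None = None) -> str:
    """Compose a causality query from variable names plus optional context."""
    parts = [f"Do changes in '{var_a}' cause changes in '{var_b}'?"]
    if background:  # adds background knowledge to the variable names
        parts.insert(0, f"Background: {background}")
    if data_rows:   # adds structured observational data to the variable names
        parts.insert(0, "Observations:\n" + "\n".join(data_rows))
    parts.append("Answer 'yes' or 'no'.")
    return "\n".join(parts)

# The four formats arise from toggling the two optional arguments:
p1 = build_prompt("smoking", "lung cancer")                       # names only
p2 = build_prompt("smoking", "lung cancer",
                  background="Data from a chest-clinic survey.")  # + background
p3 = build_prompt("smoking", "lung cancer",
                  data_rows=["smoking=1, lung cancer=1",
                             "smoking=0, lung cancer=0"])         # + data
p4 = build_prompt("smoking", "lung cancer",
                  background="Data from a chest-clinic survey.",
                  data_rows=["smoking=1, lung cancer=1"])         # + both
```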
The evaluation results show that:
Closed-source LLMs significantly outperform open-source models but still fall short of both classic and state-of-the-art causal learning methods.
LLM performance declines as dataset scale and complexity increase; models identify correlation and causal skeleton more reliably than causality.
Background knowledge and structured data have varying impacts on LLM causal learning, depending on dataset characteristics and LLM capabilities.
LLMs exhibit strengths in chain-of-thought reasoning for causal discovery tasks.
Overall, CausalBench provides a comprehensive framework to rigorously evaluate and understand the causal learning capabilities of LLMs, paving the way for further advancements in this critical area.