Core Concepts
Factcheck-Bench is a fine-grained annotation framework and benchmark for evaluating automatic fact-checking systems on the outputs of large language models (LLMs). It covers detailed labeling of factual claims, evidence retrieval and stance detection, claim correction, and response revision.
Abstract
The authors present Factcheck-Bench, a comprehensive framework and benchmark for evaluating automatic fact-checking systems on the outputs of large language models (LLMs). The framework decomposes the fact-checking process into eight subtasks: decomposition, decontextualization, checkworthiness identification, evidence retrieval and collection, stance detection, correction determination, claim correction, and final response revision.
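To make the pipeline concrete, here is a minimal sketch of how the eight subtasks might compose, with stub bodies standing in for the LLM prompts, retrievers, and classifiers a real system would use. All function names, the Claim dataclass, and the stub logic are illustrative assumptions, not the paper's reference implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stance(Enum):
    SUPPORT = "support"
    REFUTE = "refute"
    IRRELEVANT = "irrelevant"


@dataclass
class Claim:
    text: str
    checkworthy: bool = True
    evidence: list[str] = field(default_factory=list)
    stances: list[Stance] = field(default_factory=list)
    corrected: str | None = None


def decompose(response: str) -> list[str]:
    """1. Decomposition: split the response into atomic claims (crude stub)."""
    return [s.strip() + "." for s in response.split(".") if s.strip()]


def decontextualize(claim: str, context: str) -> str:
    """2. Decontextualization: make the claim self-contained.
    No-op here; in practice an LLM resolves pronouns against the context."""
    return claim


def is_checkworthy(claim: str) -> bool:
    """3. Checkworthiness identification: filter opinions, hedges, disclaimers."""
    return not claim.lower().startswith(("i think", "as an ai"))


def retrieve_evidence(claim: str) -> list[str]:
    """4. Evidence retrieval and collection: query a search engine (stub)."""
    return []


def detect_stance(claim: str, evidence: str) -> Stance:
    """5. Stance detection: does the evidence support or refute the claim?"""
    return Stance.IRRELEVANT


def correct_claim(claim: str, evidence: list[str]) -> str:
    """7. Claim correction: rewrite the claim to agree with the evidence."""
    return claim


def fact_check(response: str) -> list[Claim]:
    claims = [Claim(decontextualize(c, response)) for c in decompose(response)]
    for claim in claims:
        claim.checkworthy = is_checkworthy(claim.text)
        if not claim.checkworthy:
            continue
        claim.evidence = retrieve_evidence(claim.text)
        claim.stances = [detect_stance(claim.text, e) for e in claim.evidence]
        # 6. Correction determination: only refuted claims need correction.
        if any(s is Stance.REFUTE for s in claim.stances):
            claim.corrected = correct_claim(claim.text, claim.evidence)
    # 8. Final response revision: merge corrected claims back into the answer.
    return claims
```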
The authors construct a dataset of 94 (question, LLM response) pairs, roughly two-thirds of which contain factual errors in the response. Each example is annotated with detailed labels covering all eight subtasks, enabling fine-grained evaluation of fact-checking systems.
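A hypothetical annotation record makes that structure concrete; the field names and example content below are illustrative assumptions, not the dataset's actual schema.

```python
example = {
    "question": "Who wrote the novel 1984?",
    "response": "...",  # the full LLM answer being fact-checked
    "claims": [
        {
            "text": "George Orwell wrote the novel 1984.",  # decontextualized claim
            "checkworthy": True,
            # Each claim pairs with retrieved evidence and a stance label,
            # yielding (claim, evidence, stance) triplets.
            "evidence": [
                {"snippet": "Nineteen Eighty-Four is a novel by George Orwell.",
                 "stance": "support"},
            ],
            "needs_correction": False,
            "corrected_text": None,
        },
    ],
    "revised_response": "...",  # the final, corrected answer
}
```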
The key findings from the dataset analysis include:
More than half of the LLM responses contain false claims, with up to five false claims in a single response.
Automatic methods struggle to identify false claims: the best verifier, GPT-4, reaches an F1-score of only 0.63.
Intrinsic metrics such as edit distance and semantic similarity correlate poorly with human preferences when judging the quality of revised responses (both kinds of metric are sketched after this list).
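For concreteness, here is one way these scores could be computed, assuming claim-level boolean labels and plain-string responses. The helpers f1_false_claims and edit_similarity are illustrative, not the paper's evaluation code; a semantic-similarity variant would compare sentence embeddings instead of characters.

```python
from difflib import SequenceMatcher


def f1_false_claims(gold: list[bool], pred: list[bool]) -> float:
    """F1 for identifying false claims (True = claim judged false)."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum(p and not g for g, p in zip(gold, pred))
    fn = sum(g and not p for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def edit_similarity(original: str, revised: str) -> float:
    """Intrinsic metric: normalized character overlap between responses."""
    return SequenceMatcher(None, original, revised).ratio()


# A verifier that misses one of three false claims:
gold = [True, False, True, True]   # human labels
pred = [True, False, False, True]  # verifier predictions
print(f1_false_claims(gold, pred))  # ≈ 0.8

# A one-character fix that corrects a factual error barely moves
# the intrinsic score:
print(edit_similarity("1984 was published in 1948.",
                      "1984 was published in 1949."))  # ≈ 0.96
```

The last call illustrates the finding above: a tiny edit that repairs a factual error leaves the intrinsic score near 1, so such scores say little about whether a revision matches human judgments.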
The authors also discuss the limitations of the dataset, including its small scale, the difficulty of handling inter-claim dependencies, and the variable quality of automatically retrieved evidence. They highlight potential biases and stress the importance of maintaining public trust in fact-checking systems.
Stats
94 (question, LLM response) pairs
661 checkworthy claims out of 678 total claims
3,305 (claim, evidence, stance) triplets
61 examples contain false claims, with up to 5 false claims per example
Quotes
"The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs."
"How to evaluate and improve the accuracy of automated fact-checkers is critical to produce dependable LLM factuality evaluations."
"Desirable guidance for how to improve fact-checking pipelines is under-explored."