The authors present Factcheck-Bench, a comprehensive framework and benchmark for evaluating automatic fact-checking systems on the outputs of large language models (LLMs). The framework decomposes the fact-checking process into eight subtasks: decomposition, decontextualization, checkworthiness identification, evidence retrieval and collection, stance detection, correction determination, claim correction, and final response revision.
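To make the eight-stage pipeline concrete, the sketch below wires the subtasks into a sequential flow. This is a minimal illustration under assumed interfaces, not the authors' implementation: every function here is a trivial stand-in, and the names are hypothetical, not Factcheck-Bench's API.

```python
# Minimal sketch of the eight-stage fact-checking pipeline described above.
# All stage implementations are trivial stand-ins (hypothetical, not the
# authors' code); a real system would back each one with an LLM or retriever.

def decompose(response: str) -> list[str]:
    # 1. Decomposition: split the response into atomic claims (naive sentence split).
    return [s.strip() for s in response.split(".") if s.strip()]

def decontextualize(claim: str, response: str) -> str:
    # 2. Decontextualization: make the claim self-contained (identity stand-in).
    return claim

def is_checkworthy(claim: str) -> bool:
    # 3. Checkworthiness identification: keep factual statements, drop opinions (toy heuristic).
    return not claim.lower().startswith(("i think", "in my opinion"))

def retrieve_evidence(claim: str) -> list[str]:
    # 4. Evidence retrieval and collection (empty stand-in; a real system queries the web).
    return []

def detect_stance(claim: str, evidence: list[str]) -> str:
    # 5. Stance detection: does the evidence support, refute, or not cover the claim?
    return "supports"  # stand-in

def needs_correction(stance: str) -> bool:
    # 6. Correction determination: fix only claims the evidence refutes.
    return stance == "refutes"

def correct_claim(claim: str, evidence: list[str]) -> str:
    # 7. Claim correction: rewrite the claim to agree with the evidence (identity stand-in).
    return claim

def fact_check(question: str, response: str) -> str:
    # 8. Final response revision: stitch the (possibly corrected) claims back together.
    # `question` is unused by these stand-ins, but a real checker may use it for context.
    revised = []
    for claim in decompose(response):
        claim = decontextualize(claim, response)
        if is_checkworthy(claim):
            evidence = retrieve_evidence(claim)
            if needs_correction(detect_stance(claim, evidence)):
                claim = correct_claim(claim, evidence)
        revised.append(claim)
    return ". ".join(revised) + "."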
The authors construct a dataset of 94 (question, LLM response) pairs, where the responses contain a significant number of factual errors. Each example is annotated with detailed labels covering the eight subtasks, enabling fine-grained evaluation of fact-checking systems.
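To illustrate what such fine-grained annotation might look like, the record below sketches one hypothetical example spanning the subtask labels. The field names and values are illustrative assumptions and do not reproduce the benchmark's actual release schema.

```python
# Hypothetical shape of one annotated (question, LLM response) example;
# field names are illustrative, not Factcheck-Bench's actual format.
example = {
    "question": "Who wrote the novel 1984?",
    "response": "1984 was written by Aldous Huxley in 1949.",
    "claims": [
        {
            "text": "1984 was written by Aldous Huxley.",  # after decomposition + decontextualization
            "checkworthy": True,
            "evidence": ["George Orwell published Nineteen Eighty-Four in 1949."],
            "stance": "refutes",
            "needs_correction": True,
            "corrected": "1984 was written by George Orwell.",
        },
        {
            "text": "1984 was published in 1949.",
            "checkworthy": True,
            "evidence": ["Nineteen Eighty-Four was published on 8 June 1949."],
            "stance": "supports",
            "needs_correction": False,
            "corrected": None,
        },
    ],
    "revised_response": "1984 was written by George Orwell in 1949.",
}
```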
The authors also report several key findings from their analysis of the dataset.
The authors discuss the limitations of the dataset as well: its small scale, the difficulty of handling inter-claim dependencies, and the variable quality of automatically retrieved evidence. They further note potential biases and stress the importance of maintaining public trust in fact-checking systems.
Source: Yuxia Wang et al., arxiv.org, https://arxiv.org/pdf/2311.09000.pdf