Key concepts
Large language models struggle to reliably detect self-contradictions in long documents, with GPT4 performing the best among the evaluated models.
Summary
The paper introduces CONTRADOC, a new human-annotated dataset for studying self-contradictions in long documents across multiple domains, document lengths, self-contradiction types, and appearance scopes. The dataset is used to evaluate the performance of four state-of-the-art large language models (GPT3.5, GPT4, PaLM2, and LLaMAv2) on three different tasks: binary judgment, top-k evidence selection, and judge-then-find.
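As a rough illustration of how the three tasks differ, the sketch below scores a single model response under each of them. This summary does not describe the paper's scoring rules, so the interpretation of each task, the `Document` class, and every function name here are assumptions made for illustration only; the filler sentence in the sample document is also invented.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Document:
    """Hypothetical container for one annotated document."""
    sentences: List[str]
    # Indices of the sentences involved in the self-contradiction;
    # empty if the document contains none.
    evidence: Set[int]


def binary_judgment_correct(doc: Document, says_contradictory: bool) -> bool:
    """Assumed reading: the model only answers yes/no for the whole document."""
    return says_contradictory == bool(doc.evidence)


def top_k_hit(doc: Document, ranked: List[int], k: int) -> bool:
    """Assumed reading: the model ranks sentence indices, and the response
    counts if any of its first k picks is an annotated evidence sentence."""
    return any(i in doc.evidence for i in ranked[:k])


def judge_then_find_correct(doc: Document, says_contradictory: bool,
                            found: List[int]) -> bool:
    """Assumed reading: the model must judge correctly and, when a
    contradiction exists, also point at one of the evidence sentences."""
    if not doc.evidence:
        return not says_contradictory
    return says_contradictory and any(i in doc.evidence for i in found)


# Sample document built from one of the example pairs in this summary;
# the middle sentence is invented filler.
doc = Document(
    sentences=[
        "Zully donated her kidney.",
        "She recovered quickly.",
        "Zully never donated her kidney.",
    ],
    evidence={0, 2},
)
```

Under these assumptions, a model that answers "yes" but points only at sentence 1 passes binary judgment and fails judge-then-find, which is one way the stricter task can separate guessing from actually locating the contradiction.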
The key findings are:
GPT4 outperforms the other models, but still struggles to reliably detect self-contradictions, especially those that require more nuance and context.
Models perform better on detecting self-contradictions related to objective facts (e.g., numeric, negation) compared to more subjective ones (e.g., emotion, perspective).
Self-contradictions that are farther apart in the document are not necessarily harder to detect than those that are closer.
Document type (news, wiki, story) has a significant impact on model performance, with wiki documents being the easiest and story documents being the hardest.
The paper highlights the need for further research to improve large language models' capabilities in long-form reasoning and understanding self-contradictions in text.
Statistics
"The road T51 was located in New York."
"The road T51 was located in California."
"The doctor spoke highly of the project and called it 'a breakthrough'."
"The doctor disliked the project, saying it had no impact at all."
"Zully donated her kidney."
"Zully never donated her kidney."
Quotes
"Detecting contradictions in texts has long been pivotal in natural language understanding (NLU), with most of the works falling under the umbrella of natural language inference (NLI)."
"Psychological research (Graesser and McMahen, 1993; Otero and Kintsch, 1992) indicates that humans struggle to identify contradictions in unfamiliar, informative texts, particularly when contradictions are widely separated in long documents, underscoring the need for automated text analysis tools to tackle this challenge."