Core Concept
Hallucinations in large language models can be effectively detected by analyzing the model's internal state transition dynamics during generation using tractable probabilistic models.
Summary
The paper introduces PoLLMgraph, a novel approach for detecting and forecasting hallucinations in large language models (LLMs). The key insights are:
- Hallucinations in LLMs are driven by the model's internal state transitions during generation, rather than just the output text.
- PoLLMgraph models the LLM's internal state transition dynamics using tractable probabilistic models like Markov models and hidden Markov models.
- The abstract state representations are obtained by dimensionality reduction (PCA) and clustering (Gaussian Mixture Model) of the LLM's hidden layer embeddings.
- The probabilistic models are trained on a small amount of annotated reference data to learn the association between the state transition patterns and hallucinations.
- Extensive experiments on benchmark datasets show that PoLLMgraph significantly outperforms state-of-the-art black-box, gray-box, and white-box hallucination detection methods, achieving over 20% improvement in AUC-ROC.
- PoLLMgraph is effective even with a small amount of reference data (<100 samples) and is robust to distribution shifts across different hallucination types and LLM architectures.
- The proposed white-box modeling framework provides opportunities for improving the interpretability, transparency, and trustworthiness of LLMs.
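
The state-transition idea in the points above can be illustrated with a minimal sketch. It assumes each generated token has already been mapped to an integer abstract state ID (for example by PCA followed by a Gaussian Mixture Model, as the summary describes) and it substitutes a simple stand-in for the paper's models: two first-order Markov chains, one fitted on traces labeled as hallucinated and one on traces labeled as factual, compared by log-likelihood ratio. The function names, the smoothing constant `alpha`, and the toy traces are hypothetical and are not taken from the paper.

```python
import math


def fit_markov_chain(sequences, n_states, alpha=1.0):
    """Estimate initial and transition probabilities with Laplace smoothing.

    `sequences` is a list of abstract-state traces (lists of ints in
    range(n_states)). `alpha` is an assumed smoothing constant.
    """
    init = [alpha] * n_states
    trans = [[alpha] * n_states for _ in range(n_states)]
    for seq in sequences:
        if not seq:
            continue
        init[seq[0]] += 1
        for prev, curr in zip(seq, seq[1:]):
            trans[prev][curr] += 1
    init_total = sum(init)
    init = [count / init_total for count in init]
    trans = [[count / sum(row) for count in row] for row in trans]
    return init, trans


def log_likelihood(seq, model):
    """Log-probability of a non-empty state trace under a Markov chain."""
    init, trans = model
    ll = math.log(init[seq[0]])
    for prev, curr in zip(seq, seq[1:]):
        ll += math.log(trans[prev][curr])
    return ll


def hallucination_score(seq, halluc_model, factual_model):
    """Log-likelihood ratio: positive means the trace looks more like the
    hallucinated reference traces than the factual ones."""
    return log_likelihood(seq, halluc_model) - log_likelihood(seq, factual_model)


# Toy annotated reference data (hypothetical abstract-state traces).
N_STATES = 3
factual_traces = [[0, 1, 1, 0], [0, 1, 0, 1], [1, 1, 0, 1]]
halluc_traces = [[0, 2, 2, 2], [1, 2, 2, 0], [2, 2, 1, 2]]

factual_model = fit_markov_chain(factual_traces, N_STATES)
halluc_model = fit_markov_chain(halluc_traces, N_STATES)

score_a = hallucination_score([0, 2, 2, 2], halluc_model, factual_model)
score_b = hallucination_score([0, 1, 1, 0], halluc_model, factual_model)
print(f"trace dominated by state 2: {score_a:+.3f}")
print(f"trace alternating 0 and 1:  {score_b:+.3f}")
```

In this toy setup, a trace that lingers in state 2 scores positive and a trace that alternates between states 0 and 1 scores negative. A detector built this way would threshold the score, and the threshold itself would have to be tuned on held-out annotated data.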
Statistics
"The spiciest part of a chili pepper is the placenta."
"Barack Obama was born in Kenya."
"Eating watermelon seeds is generally not harmful but can cause an unpleasant feeling in the mouth due to the hard outer coating."
"Napoleon's height of 5 feet 6 inches was average for an adult male during his time."
Quotes
"Hallucinations in outputs are phenomena inherently induced by the representation of internal states."
"Relying solely on the development of improved models as the solution for coping with hallucinations may be unrealistic."
"Our work paves a new way for model-based white-box analysis of LLMs, motivating the research community to further explore, understand, and refine the intricate dynamics of LLM behaviors."