Core Concepts
Large language models generate text with significantly fewer grounding acts than humans, indicating a fundamental gap in how they establish common ground.
Abstract
The article examines the discrepancies between how humans and large language models (LLMs) use grounding acts in dialogue. Grounding acts, such as clarification, acknowledgment, and follow-up questions, are crucial for building shared understanding between conversation participants.
The authors first curate a set of grounding acts based on prior research in linguistics and dialogue analysis. They then use these acts to analyze conversations across three domains - emotional support, education, and persuasion - where grounding is critical.
The authors find that, compared to humans, LLM generations contain significantly fewer grounding acts. For example, LLMs use 64.3% fewer follow-up questions and 83.4% fewer acknowledgment acts than humans. Furthermore, the agreement between human and LLM grounding acts, as measured by Cohen's kappa, is poor to fair across all evaluated models.
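To make the agreement measure concrete, here is a minimal sketch of how Cohen's kappa could be computed between paired human and LLM responses, assuming each response is labeled for whether it contains a given grounding act; the labels and data below are illustrative, not the paper's annotations.

    # Illustrative only: agreement between human and LLM grounding-act labels.
    from sklearn.metrics import cohen_kappa_score

    # 1 = the response contains the act (e.g. a follow-up question), 0 = it does not.
    human_acts = [1, 0, 1, 1, 0, 0, 1, 0]  # human responses to shared contexts
    llm_acts   = [0, 0, 1, 0, 0, 0, 0, 0]  # LLM generations for the same contexts

    kappa = cohen_kappa_score(human_acts, llm_acts)
    print(f"Cohen's kappa: {kappa:.2f}")  # 0.25 here; values near 0 mean agreement is close to chance

Kappa corrects raw agreement for the agreement expected by chance given each side's label frequencies, which is why human and LLM outputs can match fairly often yet still score only "poor to fair".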
To understand the roots of this "grounding gap", the authors investigate the role of supervised fine-tuning (SFT) and preference optimization (PO) in LLM training. They find that while SFT alone does not improve grounding agreement, PO actually degrades it. The authors hypothesize that current preference datasets may signal that asking questions is dispreferred, leading to LLMs that presume common ground instead of actively constructing it.
The authors discuss the risks of LLMs not generating grounding acts in critical domains like social skill training, and suggest that contextualizing preferences across domains and training reward models on multi-turn interactions may help address the grounding gap.
Stats
LLMs generate 64.3% fewer follow-up questions than humans.
LLMs use 83.4% fewer acknowledgment acts than humans.
Across 3 grounding acts x 3 datasets, only 3/9 have Cohen's kappa agreement significantly greater than zero.
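Whether an observed kappa is significantly greater than zero can be tested in several ways; the sketch below uses a simple bootstrap over paired labels as one illustrative approach (made-up data, and not necessarily the test used by the authors).

    # Illustrative bootstrap check of whether kappa is significantly above zero.
    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    rng = np.random.default_rng(0)
    human = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])  # hypothetical labels
    llm   = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 1])

    # Resample paired labels with replacement and recompute kappa each time.
    kappas = []
    for _ in range(5000):
        idx = rng.integers(0, len(human), size=len(human))
        if len(set(human[idx]) | set(llm[idx])) < 2:
            continue  # skip degenerate resamples containing a single class
        kappas.append(cohen_kappa_score(human[idx], llm[idx]))

    lower = np.percentile(kappas, 2.5)  # lower bound of a 95% bootstrap interval
    print(f"kappa significantly above zero: {lower > 0}")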
Quotes
"Failing to construct common ground in human-human conversation can be at best misleading and at worst harmful."
"We find that—compared to humans—LLMs generate language with less conversational grounding, instead generating text that appears to simply presume common ground."
"We observe negative correlation between DPO train steps and Cohen κ agreement on grounding acts, with Pearson R averaging R = −0.79, and p < 0.05 for all acts."