Core Concepts
Language models with prompts leveraging insights from social science can provide specific and fair feedback as conversational moderators, but encouraging users to become more respectful and cooperative remains challenging.
Abstract
This paper establishes a systematic definition of conversational moderation, which aims to guide problematic users toward more constructive behavior through interactive interventions, together with a framework for evaluating its effectiveness. The authors identify four key metrics of moderation effectiveness: specificity, fairness, cooperativeness, and respectfulness.
The paper then proposes an evaluation framework that uses controversial conversation stubs from Reddit to create realistic yet safe scenarios for testing language model-based moderators. The framework involves participants continuing a conversation with a moderator bot and then providing feedback on the bot's performance.
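The setup described above can be sketched as a simple prompt-assembly step: a conversation stub is formatted into a single prompt for a language-model moderator. The helper name, the turn format, and the instruction text below are illustrative assumptions, not the paper's actual prompts.

```python
# Minimal sketch of assembling a moderator prompt from a conversation stub.
# The guideline wording and turn formatting are assumptions for illustration.

def build_moderator_prompt(stub_turns, guideline):
    """Format a controversial conversation stub plus a moderation
    guideline into one prompt for a language-model moderator."""
    # Alternate speaker labels across the stub's turns.
    history = "\n".join(
        f"User {i % 2 + 1}: {turn}" for i, turn in enumerate(stub_turns)
    )
    return (
        "You are a conversational moderator. "
        f"{guideline}\n\n"
        f"Conversation so far:\n{history}\n\nModerator:"
    )

stub = [
    "This policy is obviously wrong.",
    "Only someone clueless would say that.",
]
prompt = build_moderator_prompt(
    stub, "Give specific, fair feedback and ask a clarifying question."
)
print(prompt)
```

In the paper's protocol, a participant would then continue the conversation from the moderator's reply and afterwards rate the bot on the survey metrics.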
The authors evaluate several approaches, including prosocial dialogue models and prompted language models informed by conflict resolution and cognitive behavioral therapy techniques. The results show that the prompted language models can provide specific and fair feedback, but improving user cooperativeness and respectfulness remains challenging. Interestingly, the perceived effectiveness of the moderators varies depending on whether the evaluator is the moderated user or an observer of the conversation.
The paper also explores non-survey metrics, such as user word count, as proxies for moderation effectiveness, but finds that they correlate only weakly with the four key metrics. Additionally, the authors analyze the impact of confounding factors, such as user agreement and likeability, on the evaluation.
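The kind of check described above amounts to correlating a behavioral signal with survey ratings. A minimal sketch, with made-up numbers purely for demonstration:

```python
# Illustrative correlation between a non-survey metric (user word count)
# and a survey metric (a 1-5 cooperativeness rating). All data below is
# fabricated for demonstration; it is not from the paper.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

word_counts = [12, 45, 30, 8, 27, 50]   # words per user reply
cooperativeness = [2, 4, 5, 1, 2, 3]    # hypothetical 1-5 survey ratings
r = pearson(word_counts, cooperativeness)
print(f"Pearson r = {r:.2f}")
```

A weak correlation here would suggest, as the paper reports, that such behavioral proxies cannot substitute for the survey-based metrics.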
Overall, this work provides a valuable foundation for research on scaling up conversational moderation using language models, while highlighting the complexities and challenges involved in this task.
Stats
Controversial conversations from Reddit have an average of 3 turns between users.
Participants on average produced 1.5 times more words when interacting with the GPT-Socratic moderator compared to the Cosmo-XL moderator.
Quotes
"Language models with prompts leveraging insights from social science can provide specific and fair feedback as conversational moderators, but encouraging users to become more respectful and cooperative remains challenging."
"The perceived effectiveness of the moderators varies depending on whether the evaluator is the moderated user or an observer of the conversation."