Core Concepts
AdvisorQA is the first benchmark for evaluating and improving large language models' ability to offer personalized, actionable, and empathetic advice grounded in deeply personal experiences.
Summary
The paper introduces AdvisorQA, a dataset for advice-seeking question answering that focuses on questions rooted in personalized experiences and the corresponding advice, ranked by collective intelligence. The key highlights are:
AdvisorQA is designed to address the challenge of evaluating subjective helpfulness, which is determined not by objective criteria like correctness but by personal preferences. It leverages upvote rankings from the LifeProTips subreddit as a proxy for majority preference.
The dataset contains 10,350 advice-seeking questions with an average of 8.9 answers per question, reflecting the diverse perspectives on subjective issues. The questions are highly complex, with an average length of 75.2 tokens, covering a wide range of daily life topics.
The paper proposes two evaluation metrics: helpfulness and harmlessness. The helpfulness metric is based on the Plackett-Luce model, which ranks advice according to majority preference; the harmlessness metric uses the LifeTox moderator to assess the safety of the advice.
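To make the Plackett-Luce idea concrete, here is a minimal sketch of the ranking log-likelihood it defines. This is an illustrative standalone function, not the paper's implementation: it assumes each answer has been assigned a scalar score, and scores the probability of the observed upvote ordering as a sequence of softmax choices over the answers not yet placed.

```python
import math

def plackett_luce_log_likelihood(scores):
    """Log-likelihood of an observed ranking under the Plackett-Luce model.

    `scores` lists scalar scores for answers in their observed
    (upvote-ranked) order, best first. At each position, the next
    answer is chosen with probability proportional to exp(score)
    among all answers not yet placed.
    """
    log_lik = 0.0
    for i in range(len(scores)):
        # Softmax probability of picking answer i from the remaining pool.
        denom = sum(math.exp(s) for s in scores[i:])
        log_lik += scores[i] - math.log(denom)
    return log_lik
```

A metric built on this can reward a model whose scores place the community's top-voted advice first: orderings that agree with the scores receive a higher log-likelihood than orderings that invert them.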
Baseline experiments are conducted with popular language models, including GPT-3.5, GPT-4, Flan-T5, Llama, and Mistral. The results show that while larger models tend to be more helpful, they also tend to be less harmless. Fine-tuning the models with supervised learning and reinforcement learning further reveals the trade-off between helpfulness and harmlessness.
The analysis of the trained models suggests that different training approaches, such as Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), yield advice with distinct characteristics: PPO-trained models produce more diverse but less safe advice, while DPO-trained models stay closer to the demonstrations but are less diverse.
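For reference, the DPO objective contrasted above can be sketched as a per-pair scalar loss. The function below is an illustrative simplification (the name and arguments are hypothetical, and real training operates on batched sequence log-probabilities): it rewards the policy for preferring the chosen answer over the rejected one relative to a frozen reference model, with no explicit reward model.

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss.

    pi_logp_*  : policy log-probabilities of the chosen (w) and rejected (l) answers
    ref_logp_* : the same quantities under the frozen reference model
    beta       : temperature controlling deviation from the reference
    """
    # Implicit reward margin: how much more the policy (vs. the reference)
    # prefers the chosen answer over the rejected one.
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the margin is large and positive.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Because the loss anchors the policy to the reference model's log-probabilities, it tends to keep generations close to the demonstration data, which is consistent with the lower diversity observed for DPO-trained models above.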
The AdvisorQA benchmark and the insights from the experiments mark a significant step towards enhancing question answering systems that provide personalized, empathetic, and safe advice, demonstrating large language models' improved understanding of human subjectivity.
Stats
The average number of answers per advice-seeking question is 8.9.
The top-ranked advice receives an average of 71.4 upvotes, and all advice in a thread receives an average of 164.2 upvotes in total.
The average token length of the advice-seeking questions is 75.2.
Quotes
"As the integration of large language models into daily life is on the rise, there is a clear gap in benchmarks for advising on subjective and personal dilemmas."
"AdvisorQA marks a significant leap in enhancing QA systems for providing personalized, empathetic advice, showcasing LLMs' improved understanding of human subjectivity."