This study investigates the effect of user feedback from follow-up utterances on the evaluation of task-oriented dialogue systems (TDSs) by both crowdworkers and large language models (LLMs). The researchers conducted experiments with two setups: one that provided only the initial user query and the system's response, and another that included the user's follow-up utterance.
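To illustrate the two rating conditions, here is a minimal Python sketch that builds the text a rater (crowdworker or LLM) would see with and without the follow-up utterance. The dialogue fields, prompt wording, and example turns are illustrative assumptions, not the authors' actual materials.

```python
# Minimal sketch of the two evaluation setups described above:
# Setup 1 shows the rater only the user query and the system response;
# Setup 2 additionally shows the user's follow-up utterance.
# All field names, prompt wording, and example content are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DialogueSnippet:
    user_query: str           # initial user request
    system_response: str      # TDS answer being evaluated
    follow_up: Optional[str]  # user's reaction to the response (may be absent)

def build_rating_prompt(snippet: DialogueSnippet, include_follow_up: bool) -> str:
    """Compose the text shown to a crowdworker or LLM rater."""
    lines = [
        "Rate the system response for relevance, usefulness, and interestingness (1-5).",
        f"User: {snippet.user_query}",
        f"System: {snippet.system_response}",
    ]
    if include_follow_up and snippet.follow_up:
        lines.append(f"User (follow-up): {snippet.follow_up}")
    return "\n".join(lines)

example = DialogueSnippet(
    user_query="Can you recommend a quiet cafe near the station?",
    system_response="Sure, Cafe Lumen is a 5-minute walk from the station.",
    follow_up="That one is always packed, I was hoping for somewhere calmer.",
)

print(build_rating_prompt(example, include_follow_up=False))  # Setup 1
print("---")
print(build_rating_prompt(example, include_follow_up=True))   # Setup 2
```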
The key findings are:
Both crowdworkers and LLMs are sensitive to user feedback from follow-up utterances, showing significant differences in their ratings between the two setups for all aspects except relevance.
Crowdworkers are more susceptible to user feedback on usefulness and interestingness, whereas LLMs are more strongly influenced by feedback on interestingness and relevance.
User feedback leads crowdworkers to a more personalized assessment of usefulness, one that aligns closely with the user's explicit feedback.
In cases of ambiguous or complex user requests, user feedback improves agreement among crowdworkers, helping to clarify the user's intent.
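To make the notion of agreement concrete, the sketch below quantifies how closely crowdworkers' ratings align with and without the follow-up utterance. It uses mean pairwise Cohen's kappa on made-up toy ratings as an illustrative stand-in; the paper's actual agreement metric and data are not reproduced here.

```python
# Illustrative comparison of rater agreement under the two setups.
# Mean pairwise Cohen's kappa is an assumption chosen for simplicity;
# the rating arrays below are toy data, not results from the paper.
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(ratings: np.ndarray) -> float:
    """ratings: (n_raters, n_items) array of categorical labels."""
    pairs = combinations(range(ratings.shape[0]), 2)
    return float(np.mean([
        cohen_kappa_score(ratings[i], ratings[j]) for i, j in pairs
    ]))

# Toy usefulness ratings (3 crowdworkers x 6 responses) on a 1-3 scale.
without_follow_up = np.array([
    [1, 2, 3, 2, 1, 3],
    [2, 2, 1, 3, 1, 2],
    [1, 3, 3, 2, 2, 3],
])
with_follow_up = np.array([
    [1, 2, 3, 2, 1, 3],
    [1, 2, 3, 2, 1, 2],
    [1, 2, 3, 3, 1, 3],
])

print("kappa without follow-up:", round(mean_pairwise_kappa(without_follow_up), 3))
print("kappa with follow-up:   ", round(mean_pairwise_kappa(with_follow_up), 3))
```

In this toy example, agreement is higher when raters also see the follow-up utterance, mirroring the finding above for ambiguous or complex requests.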
These findings highlight the importance of incorporating user feedback in the evaluation of dialogue systems and suggest the potential for automated feedback integration in future research.
Source: key ideas extracted from https://arxiv.org/pdf/2404.12994.pdf (arxiv.org, by Clemencia Si..., 04-22-2024).