GPT-4 models can closely approximate human-level performance in evaluating the quality of multi-party conversations and identifying errors in dyadic dialogues, demonstrating the potential for automated dialogue assessment.
User feedback expressed in follow-up utterances significantly influences how both crowdworkers and large language models evaluate dialogue systems.
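For illustration, the sketch below shows one way such LLM-based dialogue assessment could be set up. It is a minimal example assuming the OpenAI Python SDK, the `gpt-4` chat model, and a hypothetical 1-to-5 quality rubric; the actual prompts, rubrics, and evaluation protocol used in the studies are not reproduced here. The optional `followup` argument mirrors the second finding: a user's follow-up utterance can be supplied as implicit feedback and may shift the resulting judgment.

```python
# Minimal sketch of LLM-based dialogue evaluation.
# Assumptions: OpenAI Python SDK (v1), a hypothetical 1-5 rubric prompt;
# not the exact prompts or criteria used in the original work.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def rate_dialogue(turns: list[dict], followup: str | None = None) -> str:
    """Ask GPT-4 to rate a dialogue and flag errors, optionally
    conditioning on the user's follow-up utterance as implicit feedback."""
    transcript = "\n".join(f"{t['speaker']}: {t['text']}" for t in turns)
    prompt = (
        "Rate the quality of the system's responses in the dialogue below "
        "on a scale of 1 (poor) to 5 (excellent), and point out any errors.\n\n"
        f"{transcript}"
    )
    if followup is not None:
        # Follow-up user utterances serve as implicit feedback on the response.
        prompt += f'\n\nThe user\'s follow-up utterance was: "{followup}"'
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content


# Example usage with a toy dyadic exchange containing a factual error.
dialogue = [
    {"speaker": "User", "text": "What's the capital of Australia?"},
    {"speaker": "System", "text": "The capital of Australia is Sydney."},
]
print(rate_dialogue(dialogue, followup="Are you sure? I thought it was Canberra."))
```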