Combining domain-specific language models with AMR graph information and large language models can improve the robustness of open-domain dialogue evaluation, especially for discriminating adversarial negative responses.
A novel dialogue evaluation metric, PAIREVAL, assesses a response by comparing its quality against a small set of comparison responses, and outperforms previous evaluation metrics.
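The comparative idea behind such a metric can be sketched as follows: score a response by the fraction of pairwise comparisons it wins against a few comparison responses. This is an illustrative simplification, not PAIREVAL's actual implementation; the function names and the toy length-based preference function are hypothetical stand-ins (in practice the comparator would be a learned or LLM-based judge).

```python
from typing import Callable, List

def pairwise_comparison_score(
    response: str,
    comparison_responses: List[str],
    prefers_first: Callable[[str, str], bool],
) -> float:
    """Score a response as the fraction of pairwise comparisons it wins
    against a small set of comparison responses (illustrative sketch)."""
    if not comparison_responses:
        return 0.0
    wins = sum(prefers_first(response, other) for other in comparison_responses)
    return wins / len(comparison_responses)

# Toy preference function standing in for a learned comparator:
# here we simply prefer the longer response.
def longer_is_better(a: str, b: str) -> bool:
    return len(a) > len(b)

score = pairwise_comparison_score(
    "That sounds great, tell me more about your trip!",
    ["ok", "I don't know", "Sure."],
    longer_is_better,
)
print(score)  # 1.0 (wins all three toy comparisons)
```

Using relative comparisons rather than absolute scores sidesteps the difficulty of calibrating a single quality scale across diverse dialogue contexts.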