
Generalizable Multi-aspect Text Evaluation via Augmented Instruction Tuning with Auxiliary Evaluation Aspects

Core Concepts
X-EVAL, a two-stage instruction tuning framework, can evaluate text in both seen and unseen aspects customized by end users.
The paper introduces X-EVAL, a two-stage instruction tuning framework for multi-aspect text evaluation. The first stage equips the model with the ability to follow instructions across diverse evaluation tasks, including scoring, comparison, ranking, and Boolean QA. The second stage further enhances the model by exploiting the connections between fine-grained evaluation aspects: it incorporates the evaluation results of a set of auxiliary aspects into the instructions, providing clues for evaluating the target aspect and encouraging consistent evaluations across multiple aspects.

To support the training of X-EVAL, the authors collect ASPECTINSTRUCT, the first instruction tuning dataset tailored for multi-aspect NLG evaluation, spanning 27 diverse evaluation aspects with 65 tasks. They also devise an augmentation strategy that converts human rating annotations into diverse forms of NLG evaluation tasks.

Extensive experiments across three essential categories of NLG tasks (dialogue generation, summarization, and data-to-text), covering 21 aspects in meta-evaluation, demonstrate that X-EVAL enables even a lightweight language model to achieve correlations with human judgments that are comparable to, if not higher than, those of state-of-the-art NLG evaluators such as GPT-4.
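The second-stage mechanism described above can be illustrated with a small sketch: auxiliary aspect results are verbalized and prepended to the instruction for the target aspect. The aspect names, ratings, and template wording below are illustrative assumptions, not the paper's exact prompts.

```python
# Hypothetical sketch of X-EVAL's second-stage prompt construction:
# results for auxiliary aspects are turned into natural-language clues
# and included in the instruction for the target aspect.

def verbalize(aspect: str, result: str) -> str:
    """Turn an auxiliary evaluation result into a natural-language clue."""
    return f"The {aspect} of the response is rated as {result}."

def build_instruction(target_aspect: str,
                      context: str,
                      response: str,
                      auxiliary_results: dict[str, str]) -> str:
    clues = " ".join(verbalize(a, r) for a, r in auxiliary_results.items())
    return (
        f'Given the dialogue context: "{context}"\n'
        f'and the response: "{response}"\n'
        f"{clues}\n"
        f"Rate the {target_aspect} of the response on a scale of 1 to 5."
    )

prompt = build_instruction(
    target_aspect="engagingness",
    context="What did you do this weekend?",
    response="I went hiking and saw a bald eagle!",
    auxiliary_results={"naturalness": "good", "groundedness": "fair"},
)
print(prompt)
```

At inference time, the model would first predict the auxiliary aspects, then feed those predictions back through a template like this one when scoring the target aspect.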
The proposed X-EVAL framework achieves a Spearman correlation of 0.605 on the Topical-Chat dialogue evaluation dataset, outperforming the state-of-the-art GPT-based evaluators. On the SummEval summarization dataset, X-EVAL achieves an average Spearman correlation of 0.480 across four aspects, surpassing the lightweight baselines and matching the performance of the GPT-3.5-based evaluator. On the data-to-text generation task, X-EVAL outperforms all the lightweight baselines with an average Spearman correlation of 0.303.
"To obtain a more comprehensive assessment of text quality, multi-aspect evaluation (Fabbri et al., 2021) has been proposed to evaluate the generated text from multiple fine-grained evaluation aspects, such as fluency and consistency."

"Recent studies (Fu et al., 2023; Liu et al., 2023) harness proprietary LLMs to perform fine-grained evaluation in a zero-shot manner. However, due to the closed-source nature, these evaluation metrics suffer from issues of reproducibility and are prohibitively expensive."

Deeper Inquiries

How can the proposed X-EVAL framework be extended to support multi-lingual text evaluation?

To extend the X-EVAL framework for multi-lingual text evaluation, several steps can be taken:

- Dataset Expansion: Collect multi-lingual datasets with annotations for various evaluation aspects in different languages.
- Language Model Training: Fine-tune language models on multi-lingual data to understand and generate text in multiple languages.
- Instruction Tuning: Adapt the instruction tuning process to include instructions in different languages for evaluating text.
- Aspect Verbalization: Develop verbalizers for evaluation aspects in multiple languages to incorporate them into the evaluation process.
- Inference Pipeline: Modify the inference pipeline to handle multi-lingual inputs and outputs for a comprehensive evaluation.

What are the potential limitations of the current instruction-based approach, and how can they be addressed to further improve the generalization ability?

Limitations of the instruction-based approach include:

- Error Propagation: Errors in auxiliary aspect evaluations can impact the final evaluation. Address this by refining the auxiliary aspect selection strategy and improving the inference algorithm.
- Inference Efficiency: Multiple rounds of predictions for auxiliary aspects can increase computational costs. Improve efficiency by optimizing the inference process.
- Robustness: The model may not generalize well to all unseen aspects. Enhance robustness through more diverse training data and advanced model architectures.
- Language Dependency: The approach may be language-specific. Overcome this by training on multi-lingual data and incorporating language-agnostic features.
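One simple way to mitigate the inference-efficiency concern above is to cache auxiliary-aspect predictions so each aspect is evaluated only once per example, even when it serves as a clue for several target aspects. The sketch below is a minimal illustration; `evaluate_aspect` is a hypothetical stand-in for a model call, not part of X-EVAL's actual pipeline.

```python
# Minimal sketch: memoize auxiliary-aspect evaluations so repeated
# clues across target aspects do not trigger repeated model calls.
from functools import lru_cache

CALLS = 0  # counts how many (simulated) model inferences actually run

@lru_cache(maxsize=None)
def evaluate_aspect(example_id: str, aspect: str) -> str:
    """Placeholder for a model inference call on one aspect."""
    global CALLS
    CALLS += 1
    return f"{aspect}:good"

def evaluate_target(example_id: str, target: str, auxiliaries: list[str]) -> str:
    clues = [evaluate_aspect(example_id, a) for a in auxiliaries]
    return f"evaluating {target} with clues {clues}"

# "naturalness" is auxiliary for both targets but is only computed once.
evaluate_target("ex1", "engagingness", ["naturalness", "groundedness"])
evaluate_target("ex1", "coherence", ["naturalness", "fluency"])
print(CALLS)  # 3 distinct auxiliary evaluations instead of 4
```

In a real deployment the cache would key on the generated text (or a hash of it) rather than an example ID, but the saving is the same whenever auxiliary aspects overlap across targets.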

Given the observed connections between evaluation aspects, can we leverage knowledge distillation or meta-learning techniques to better capture the inter-dependencies and further boost the performance?

Knowledge distillation and meta-learning techniques can be beneficial in capturing inter-dependencies between evaluation aspects:

- Knowledge Distillation: Transfer knowledge from a larger model to a smaller one to improve generalization and performance.
- Meta-Learning: Adapt the model to new evaluation aspects quickly by meta-learning on a diverse set of tasks and aspects.
- Inter-Aspect Relationships: Develop models that explicitly learn the relationships between evaluation aspects to enhance performance and provide more nuanced evaluations.
- Continual Learning: Implement continual learning strategies to adapt the model over time and incorporate new knowledge about inter-dependencies between aspects.
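The knowledge-distillation idea above can be sketched with a standard-library-only example: a large teacher evaluator's logits over score classes (e.g. ratings 1 to 5) supervise a smaller student via a temperature-softened KL divergence. The logit values are illustrative; this is the generic distillation loss, not a method from the paper.

```python
# Minimal sketch of knowledge distillation over evaluation-score
# distributions, using only the standard library.
import math

def softmax(logits, temperature=1.0):
    """Convert logits into a probability distribution, optionally softened."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

teacher = [0.1, 0.3, 2.0, 0.8, 0.2]   # teacher's logits over scores 1-5
student = [0.0, 0.2, 1.5, 1.0, 0.1]   # smaller student's logits
loss = kd_loss(teacher, student)
print(round(loss, 4))
```

In practice this loss would be combined with the standard supervised objective, and the gradient would update only the student; the teacher here could be a strong proprietary evaluator whose score distributions are distilled into a lightweight open model.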