Bibliographic Information: Anugraha, D., Kuwanto, G., Susanto, L., Wijaya, D.T., & Winata, G.I. (2024). METAMETRICS-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration. arXiv preprint arXiv:2411.00390v1.
Research Objective: This paper introduces METAMETRICS-MT, a new metric for evaluating machine translation (MT) systems, aiming to address the limitations of existing single-metric evaluations by combining multiple metrics and optimizing their weights to better reflect human judgments of translation quality.
Methodology: The researchers developed METAMETRICS-MT by integrating various existing MT metrics, including lexical-based, semantic-based, and neural-based metrics. They employed Bayesian optimization with Gaussian Processes (GP) to determine the optimal weight for each component metric, maximizing the correlation between the combined METAMETRICS-MT scores and human assessment scores. The metric was trained and evaluated using the WMT24 metrics shared task dataset, encompassing multiple language pairs and translation tasks.
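The core idea above, scoring each translation as a weighted sum of component metrics and tuning the weights to maximize rank correlation with human judgments, can be sketched as follows. This is a minimal illustration, not the authors' implementation: plain random search stands in for the paper's GP-based Bayesian optimization, and the metric scores and human ratings are invented toy values.

```python
import itertools
import random

def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length score lists."""
    n = len(a)
    concordant = discordant = 0
    for i, j in itertools.combinations(range(n), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def combine(weights, metric_scores):
    """Weighted sum of per-metric scores for each translation."""
    return [sum(w * m[i] for w, m in zip(weights, metric_scores))
            for i in range(len(metric_scores[0]))]

def tune_weights(metric_scores, human_scores, n_trials=2000, seed=0):
    """Search weights in [0, 1] that maximize correlation with human scores.
    Random search here is a lightweight stand-in for the GP-based
    Bayesian optimization used in the paper."""
    rng = random.Random(seed)
    best_w, best_tau = None, -2.0
    for _ in range(n_trials):
        w = [rng.random() for _ in metric_scores]
        tau = kendall_tau(combine(w, metric_scores), human_scores)
        if tau > best_tau:
            best_w, best_tau = w, tau
    return best_w, best_tau

# Toy data: three hypothetical component metrics scoring five
# translations, plus human ratings (all values are illustrative only).
metric_scores = [
    [0.9, 0.7, 0.8, 0.3, 0.5],   # e.g. a neural-based metric
    [0.6, 0.8, 0.4, 0.2, 0.7],   # e.g. a lexical-based metric
    [0.5, 0.5, 0.9, 0.1, 0.6],   # e.g. a semantic-based metric
]
human_scores = [0.95, 0.75, 0.85, 0.20, 0.60]

weights, tau = tune_weights(metric_scores, human_scores)
print(round(tau, 2))
```

On this toy data the search recovers a weighting whose combined ranking matches the human ranking exactly (tau of 1.0); the paper additionally reports that the optimization tends to concentrate weight on the strongest variant of each metric.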
Key Findings: METAMETRICS-MT outperformed all existing baselines in the reference-based setting of the WMT24 metrics shared task, achieving state-of-the-art performance. It also demonstrated competitive results in the reference-free setting, comparable to leading reference-free metrics. The study found that the optimization process consistently selected the highest-performing variant of each metric, leading to a more efficient and robust evaluation.
Main Conclusions: METAMETRICS-MT offers a more accurate and reliable evaluation of MT systems by aligning closely with human preferences. Its flexibility, adaptability, and efficiency make it a valuable tool for researchers and developers working on MT tasks.
Significance: This research significantly contributes to the field of MT evaluation by introducing a novel metric that surpasses existing methods in aligning with human judgments. The proposed approach of combining and optimizing multiple metrics based on human feedback has the potential to improve the evaluation of other natural language processing tasks as well.
Limitations and Future Research: The study acknowledges limitations in terms of computational constraints, which prevented the inclusion of certain high-memory models in the evaluation. Future research could explore incorporating these models and extending the optimization process to other objective functions or system-level settings. Further investigation into the generalizability of METAMETRICS-MT across different domains and evaluation criteria is also warranted.