This paper presents a partial reproduction of the work by Atanasova et al. (2020) on generating fact-checking explanations, carried out as part of the ReproHum shared task. The authors focus on reproducing the human evaluation results for the Coverage criterion.
The original study by Atanasova et al. (2020) aimed to address the lack of transparency in fact-checking systems by generating natural language explanations to justify the assigned veracity labels. They used a multi-task learning framework to jointly optimize explanation generation and veracity prediction.
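As a rough illustration of what such a joint objective can look like, the sketch below combines an extractive explanation-scoring loss with a veracity-classification loss under a shared encoder. The module names, dimensions, and loss weighting are assumptions for exposition, not the authors' actual implementation.

```python
# Minimal sketch of a multi-task setup in the spirit of Explain-MT:
# one shared representation feeds an explanation-selection head and a
# veracity-prediction head. All names and sizes are illustrative.
import torch
import torch.nn as nn

class JointFactCheckModel(nn.Module):
    def __init__(self, encoder_dim: int = 768, num_labels: int = 3):
        super().__init__()
        # Stand-in for a shared sentence encoder (e.g. a transformer).
        self.shared = nn.Linear(encoder_dim, encoder_dim)
        # Head 1: scores each evidence sentence for inclusion in the explanation.
        self.explain_head = nn.Linear(encoder_dim, 1)
        # Head 2: predicts the claim's veracity label.
        self.veracity_head = nn.Linear(encoder_dim, num_labels)

    def forward(self, sentence_reprs: torch.Tensor):
        h = torch.relu(self.shared(sentence_reprs))        # (num_sents, dim)
        sent_scores = self.explain_head(h).squeeze(-1)     # (num_sents,)
        claim_repr = h.mean(dim=0)                         # pooled representation
        veracity_logits = self.veracity_head(claim_repr)   # (num_labels,)
        return sent_scores, veracity_logits

def joint_loss(sent_scores, sent_targets, veracity_logits, veracity_target, alpha=0.5):
    # Weighted sum of the two task losses; alpha is a hypothetical weight.
    explain_loss = nn.functional.binary_cross_entropy_with_logits(sent_scores, sent_targets)
    veracity_loss = nn.functional.cross_entropy(veracity_logits.unsqueeze(0), veracity_target)
    return alpha * explain_loss + (1 - alpha) * veracity_loss
```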
For the human evaluation, the original study assessed the generated explanations on criteria such as Coverage, Non-redundancy, and Non-contradiction. In the reproduction, the authors focused solely on the Coverage criterion, following the instructions provided by the ReproHum team.
The reproduction study involved three NLP PhD students as evaluators, who ranked the outputs of three systems (a gold standard and two models) on the Coverage criterion. Comparing the reproduction results to the original findings, the authors observed some differences in the overall rankings: in their reproduction, one of the proposed models ranked higher than the gold standard. However, the reproduction supported the original finding that the model trained jointly for explanation generation and veracity prediction (Explain-MT) outperformed the model trained to generate explanations in isolation (Explain-Extr). One common way to obtain such an overall ordering from per-item judgments is sketched below.
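The following sketch aggregates per-item rankings from evaluators into mean ranks per system, which is one straightforward way to compare overall system orderings. The example data is hypothetical and does not reflect the study's actual judgments or results.

```python
# Illustrative only: aggregate per-item rankings (1 = best Coverage)
# into a mean rank per system. The rankings below are made up.
from statistics import mean

rankings = [
    {"Gold": 2, "Explain-MT": 1, "Explain-Extr": 3},
    {"Gold": 1, "Explain-MT": 2, "Explain-Extr": 3},
    {"Gold": 2, "Explain-MT": 1, "Explain-Extr": 3},
]

mean_ranks = {
    system: mean(item[system] for item in rankings)
    for system in rankings[0]
}

# Lower mean rank = better perceived Coverage overall.
for system, rank in sorted(mean_ranks.items(), key=lambda kv: kv[1]):
    print(f"{system}: mean rank {rank:.2f}")
```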
The authors note that their reproduction covers only one of the multiple human evaluation criteria used in the original work, and the patterns observed may not necessarily hold across all criteria. They emphasize the importance of conducting reproduction studies to assess the reproducibility of findings in the field of NLP.