The authors systematically evaluate five state-of-the-art coreference resolution (CR) models, controlling for the choice of language model and other factors. They find that:
When controlling for language model size, encoder-based CR models outperform more recent decoder-based models in both accuracy and inference speed.
Surprisingly, among encoder-based CR models, more recent models are not always more accurate than the oldest model tested (C2F), which also generalizes best to out-of-domain textual genres.
The authors scale up the encoder-based LingMess model to 1.5B parameters and find that it matches the accuracy of the 11B-parameter decoder-based ASP model.
The authors conclude that controlling for the choice of language model reduces most, but not all, of the increase in F1 score reported over the past five years, suggesting that many reported improvements are attributable to the use of a stronger language model rather than to architectural changes.
The authors emphasize the need for more holistic evaluations of coreference resolution beyond a single accuracy metric, and the importance of carefully considering confounding factors when comparing models.
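For context, the single accuracy metric conventionally reported for coreference resolution is the CoNLL F1 score: the unweighted average of the MUC, B³, and CEAF-e F1 scores. A minimal Python sketch (with illustrative placeholder numbers, not values from the paper) shows how this averaging can hide per-metric differences between models:

```python
# The CoNLL F1 score is the unweighted mean of three standard coreference
# metrics (MUC, B-cubed, and CEAF-e). The values below are illustrative
# placeholders, not results reported in the paper.

def conll_f1(muc_f1: float, b3_f1: float, ceafe_f1: float) -> float:
    """Average the three standard coreference F1 metrics into one score."""
    return (muc_f1 + b3_f1 + ceafe_f1) / 3

# Two hypothetical systems with identical aggregate scores can differ
# substantially on individual metrics, which the single number hides.
print(conll_f1(0.85, 0.78, 0.74))  # ~0.79
print(conll_f1(0.80, 0.79, 0.78))  # ~0.79
```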