The authors investigated the generalizability of sarcasm detection models by testing their performance on different sarcasm datasets. They found that:
For intra-dataset predictions (training and testing on the same dataset), models performed best on the Sarcasm Corpus V2 dataset, followed by the Conversation Sarcasm Corpus (CSC) with third-party labels. Models performed worst on the iSarcasmEval dataset, which provides only author labels.
For cross-dataset predictions, most models failed to generalize well, suggesting that no single dataset can represent the diverse styles and domains of sarcasm.
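The following is a minimal sketch of the intra- vs. cross-dataset evaluation protocol summarized above. The dataset names and the tiny toy corpora are placeholders standing in for Sarcasm Corpus V2, CSC, and iSarcasmEval; the paper fine-tunes transformer models, whereas this sketch uses a TF-IDF plus logistic regression classifier so the protocol itself stays easy to follow.

```python
# Sketch of intra- vs. cross-dataset evaluation (assumed setup, not the paper's code).
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Placeholder corpora: (texts, labels) with 1 = sarcastic, 0 = literal.
TOY_DATA = {
    "SarcasmCorpusV2": (
        ["oh great, another meeting", "the report is attached",
         "what a flawless plan, truly", "see you at noon"],
        [1, 0, 1, 0],
    ),
    "CSC": (
        ["wow, you are SO helpful", "thanks for the help today",
         "nice job breaking the build again", "the build passed"],
        [1, 0, 1, 0],
    ),
    "iSarcasmEval": (
        ["love waiting in line for hours", "the queue moved quickly",
         "best commute of my life, obviously", "my commute was short"],
        [1, 0, 1, 0],
    ),
}

def split(name):
    """Hypothetical train/test split: first half train, second half test."""
    texts, labels = TOY_DATA[name]
    mid = len(texts) // 2
    return texts[:mid], labels[:mid], texts[mid:], labels[mid:]

results = {}
for train_name, test_name in product(TOY_DATA, TOY_DATA):
    X_tr, y_tr, _, _ = split(train_name)
    _, _, X_te, y_te = split(test_name)

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)

    # Diagonal entries (train == test) are intra-dataset scores;
    # off-diagonal entries measure cross-dataset generalization.
    results[(train_name, test_name)] = f1_score(y_te, clf.predict(X_te),
                                                zero_division=0)

for (tr, te), score in results.items():
    kind = "intra" if tr == te else "cross"
    print(f"{tr:>16} -> {te:<16} [{kind}] F1 = {score:.2f}")
```

With real corpora, averaging the off-diagonal scores per training set gives a simple measure of how well a model trained on one dataset transfers to the others.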
Models fine-tuned on the new CSC dataset generalized best to the other datasets, even though CSC is not the largest of them. The authors attribute this to the psycholinguistically motivated data collection methodology used for CSC.
The source of sarcasm labels (author vs. third-party) consistently affected model performance, with third-party labels leading to better results.
A post-hoc analysis revealed that the datasets contain sarcasm with distinct linguistic properties, such as negative emotions, social issues, and religious references, which models pick up during fine-tuning.
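Below is a hypothetical sketch of one way to profile such properties per dataset, assuming simple keyword lexicons for the categories mentioned above; the paper's actual post-hoc analysis may use different features and tooling, so this only illustrates the idea of comparing how often each property surfaces in each corpus.

```python
# Assumed per-dataset lexical profiling; lexicons and corpora are toy placeholders.
from collections import Counter
import re

# Toy lexicons; a real analysis would rely on validated resources.
LEXICONS = {
    "negative_emotion": {"hate", "awful", "terrible", "angry", "sad"},
    "social_issues": {"poverty", "inequality", "racism", "unemployment"},
    "religious": {"god", "church", "pray", "faith"},
}

def profile(texts):
    """Count lexicon hits per 1,000 tokens across a list of texts."""
    tokens = [t for text in texts for t in re.findall(r"[a-z']+", text.lower())]
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return {
        name: 1000 * sum(counts[w] for w in words) / total
        for name, words in LEXICONS.items()
    }

# Placeholder corpora standing in for the datasets compared in the paper.
datasets = {
    "CSC": ["i hate waiting, truly awful service",
            "pray for better luck next time"],
    "iSarcasmEval": ["unemployment is such a fun topic",
                     "god, what a terrible day"],
}

for name, texts in datasets.items():
    print(name, profile(texts))
```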
The authors conclude that future sarcasm research should account for the broad scope and diversity of sarcasm, rather than focusing on a narrow definition, to build more robust sarcasm detection systems.