The paper introduces MSciNLI, a diverse benchmark for scientific natural language inference (NLI) that covers five scientific domains: Hardware, Networks, Software & its Engineering, Security & Privacy, and NeurIPS. This contrasts with the existing SciNLI dataset, which is limited to the computational linguistics domain.
The authors first describe the data extraction and automatic labeling process used to create the MSciNLI dataset, which leverages linking phrases between adjacent sentences to assign the NLI relations Entailment, Reasoning, and Contrasting, with Neutral pairs drawn from sentences not joined by such phrases. They then manually annotate the test and development sets to ensure high-quality evaluation data.
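As a rough illustration of this distant-supervision step, the sketch below maps sentence-initial linking phrases to candidate labels and strips the phrase from the hypothesis so that models cannot exploit it as a lexical shortcut. The phrase inventory and function name here are illustrative assumptions, not the paper's actual lists.

```python
import re

# Hypothetical linking-phrase lists; the paper's curated inventory is larger.
PHRASE_TO_LABEL = {
    "therefore": "Reasoning",
    "thus": "Reasoning",
    "however": "Contrasting",
    "in contrast": "Contrasting",
    "in other words": "Entailment",
    "that is": "Entailment",
}

def label_sentence_pair(prev_sent: str, sent: str):
    """Return (premise, hypothesis, label) if `sent` opens with a known
    linking phrase; otherwise None."""
    lowered = sent.lower()
    for phrase, label in PHRASE_TO_LABEL.items():
        if lowered.startswith(phrase):
            # Strip the linking phrase (plus trailing comma/whitespace)
            # so the label cannot be read off the hypothesis surface form.
            hypothesis = re.sub(rf"^{re.escape(phrase)}[,\s]*", "", sent,
                                flags=re.IGNORECASE)
            return prev_sent, hypothesis, label
    return None  # unlinked pairs are candidates for the Neutral class

print(label_sentence_pair(
    "The cache hit rate drops sharply under random access.",
    "Therefore, locality is essential for cache performance."))
```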
The authors assess the difficulty of MSciNLI by experimenting with a BiLSTM model and find that MSciNLI is more challenging than SciNLI. They then establish strong baselines on MSciNLI using four pre-trained language models (BERT, SciBERT, RoBERTa, XLNet) and two large language models (Llama-2, Mistral). The best-performing pre-trained model, RoBERTa, achieves a Macro F1 of 77.21%, while the best-performing large language model, Llama-2, achieves a Macro F1 of 51.77%, underscoring the challenging nature of the task.
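The pre-trained-model baselines amount to standard sequence-pair fine-tuning. Below is a minimal sketch using the Hugging Face `transformers` Trainer with RoBERTa and a Macro F1 metric; the placeholder examples and hyperparameters are assumptions for illustration, not the paper's exact configuration.

```python
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABEL2ID = {"Entailment": 0, "Reasoning": 1, "Contrasting": 2, "Neutral": 3}

# Tiny placeholder split; in practice, load the MSciNLI train/dev files.
train_pairs = [
    {"premise": "The cache hit rate drops sharply under random access.",
     "hypothesis": "locality is essential for cache performance.",
     "label": LABEL2ID["Reasoning"]},
    {"premise": "The protocol assumes a synchronous network.",
     "hypothesis": "it also works in asynchronous settings.",
     "label": LABEL2ID["Contrasting"]},
]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABEL2ID))

def encode(batch):
    # Premise and hypothesis are encoded jointly as a sequence pair.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=256)

train_ds = Dataset.from_list(train_pairs).map(encode, batched=True)

def macro_f1(eval_pred):
    logits, labels = eval_pred
    return {"macro_f1": f1_score(labels, logits.argmax(axis=-1),
                                 average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mscinli-roberta",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=train_ds,
    tokenizer=tokenizer,
    compute_metrics=macro_f1,
)
trainer.train()
# In practice, evaluate on the manually annotated dev/test splits.
print(trainer.evaluate(eval_dataset=train_ds))
```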
The authors further analyze RoBERTa by training it on different subsets of the training data, and find that "ambiguous" examples help train stronger models. They also show that domain shift at test time degrades model performance.
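The subset analysis is in the spirit of dataset cartography: examples are characterized by their training dynamics, with high-variability examples treated as "ambiguous." The sketch below assumes the gold-label probability of every training example has been logged at each epoch; this framing is an assumption about the analysis, not the authors' released code.

```python
import numpy as np

def characterize(gold_probs: np.ndarray):
    """gold_probs has shape (num_epochs, num_examples) and holds the model's
    probability for the gold label at the end of each training epoch."""
    confidence = gold_probs.mean(axis=0)   # high -> easy-to-learn
    variability = gold_probs.std(axis=0)   # high -> ambiguous
    return confidence, variability

def most_ambiguous(gold_probs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest-variability (most ambiguous) examples."""
    _, variability = characterize(gold_probs)
    return np.argsort(variability)[-k:]
```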
Finally, the authors explore the use of SciNLI and MSciNLI as intermediate tasks to improve performance on downstream tasks in the scientific domain, and find that the domain diversity of the data is beneficial in this transfer learning setting.
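Operationally, intermediate-task transfer is two rounds of fine-tuning: first on the NLI data, then on the target task with a freshly sized classification head. A minimal sketch, again assuming Hugging Face `transformers`; the three-class downstream task is hypothetical.

```python
from transformers import AutoModelForSequenceClassification

# Stage 1: fine-tune on the intermediate NLI task (4 classes),
# e.g. with the Trainer recipe sketched above, then save the result.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=4)
# ... fine-tune on SciNLI and/or MSciNLI ...
model.save_pretrained("roberta-nli-intermediate")

# Stage 2: reload the NLI-tuned encoder with a fresh head sized for the
# downstream task; the mismatched 4-way NLI head is discarded.
downstream = AutoModelForSequenceClassification.from_pretrained(
    "roberta-nli-intermediate", num_labels=3,
    ignore_mismatched_sizes=True)
# ... fine-tune `downstream` on the target task's training data ...
```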
Source: Mobashir Sad..., arxiv.org, 04-15-2024, https://arxiv.org/pdf/2404.08066.pdf