The article introduces CrisisTransformers, a family of pre-trained language models and sentence encoders designed to process and analyze crisis-related social media texts. Key highlights:
Curation of a large-scale corpus of over 15 billion word tokens from tweets associated with more than 30 crisis events, including disease outbreaks, natural disasters, conflicts, and other critical incidents.
Experimentation with multiple state-of-the-art pre-training approaches, including MPNet, BERTweet, BERT, RoBERTa, XLM-RoBERTa, ALBERT, and ELECTRA, to determine the optimal pre-training procedure for CrisisTransformers.
Evaluation of CrisisTransformers and existing pre-trained models on 18 crisis-specific public datasets for text classification. CrisisTransformers outperform strong baselines across all 18 datasets (a hedged fine-tuning sketch follows this list).
Development of CrisisTransformers-based sentence encoders trained with contrastive learning objectives (Multiple Negatives Ranking (MNR) loss, with and without hard negatives) to generate semantically rich sentence embeddings. These encoders significantly outperform existing sentence embedding models, improving the state-of-the-art by 17.43% (a contrastive-training sketch follows this list).
Analysis of the impact of model initialization on convergence, highlighting the advantages of leveraging domain-specific pre-trained weights compared to random initialization.
Public release of the CrisisTransformers models, which can be loaded with the Hugging Face Transformers library (a minimal loading example follows this list), to serve as robust baselines for tasks involving crisis-related social media text analysis.
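For context, a minimal fine-tuning sketch, assuming the released checkpoints load as standard Hugging Face models. The model ID, label scheme, and example tweets are illustrative placeholders, not the paper's exact evaluation setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint name; substitute a released CrisisTransformers model ID.
MODEL_ID = "crisistransformers/CT-M1-Complete"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# num_labels=2 assumes a binary task such as informative vs. not informative.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

texts = ["Bridge collapsed on Highway 9, avoid the area.",
         "Having coffee and watching the rain."]
labels = torch.tensor([1, 0])  # illustrative labels: 1 = crisis-relevant

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one gradient step; wrap in an optimizer loop in practice
```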
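The contrastive-training setup can be sketched with the sentence-transformers library, which implements the MNR objective as MultipleNegativesRankingLoss. The base checkpoint and training pairs below are placeholders; the paper's actual data and hyperparameters may differ.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Placeholder base checkpoint; the paper builds its encoders on CrisisTransformers weights.
base = models.Transformer("crisistransformers/CT-M1-Complete", max_seq_length=128)
pooling = models.Pooling(base.get_word_embedding_dimension(), pooling_mode="mean")
encoder = SentenceTransformer(modules=[base, pooling])

# Each example is (anchor, positive, hard negative); drop the third text to rely
# on in-batch negatives only, i.e., plain MNR without hard negatives.
train_examples = [
    InputExample(texts=["Earthquake hits the coast",
                        "Strong quake reported near the shore",
                        "New phone released today"]),
    InputExample(texts=["Wildfire spreading fast",
                        "Fire crews battle a growing blaze",
                        "Great pasta recipe for dinner"]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(encoder)
encoder.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```

MNR pulls each anchor toward its positive while pushing it away from every other text in the batch, so larger batches implicitly supply more negatives.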
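Finally, a minimal loading sketch with the Transformers library, again with a placeholder model ID; mask-aware mean pooling turns token embeddings into one vector per tweet.

```python
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "crisistransformers/CT-M1-Complete"  # placeholder; use a released checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

tweets = ["Flood waters rising near the river bank, evacuation underway."]
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Mask-aware mean pooling over token embeddings -> one vector per tweet.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (1, hidden_size)
```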
Source: Rabindra Lam... at arxiv.org, 04-12-2024, https://arxiv.org/pdf/2309.05494.pdf