Core Concepts
Effective text sampling methods are crucial for selectively fine-tuning language models such as SBERT in text stream mining, improving classification performance and adapting to concept drift.
Abstract
This study explores the efficacy of seven text sampling methods for selectively fine-tuning language models under concept drift. Evaluation focused on Macro F1-score and elapsed time, using two text stream datasets and an incremental SVM classifier. Findings indicate that certain loss functions, namely Softmax loss (SL) and Batch All Triplets loss (BATL), are particularly effective for text stream classification. The proposed WordPieceToken ratio sampling method significantly enhances performance when combined with these loss functions.
Introduction
Text streams pose challenges because data arrives sequentially and its underlying distribution may shift over time (concept drift).
Pre-trained language models like SBERT save training time but may need adaptation as the stream evolves.
Background
Text stream mining involves real-time analysis of continuously arriving textual data.
Pre-trained models like SBERT are popular because they avoid costly training from scratch.
Text-based Sampling Methods
Length-based, random, TF-IDF-based, and WordPieceToken ratio sampling methods were evaluated.
The proposed WordPieceToken ratio method, which selects texts by their ratio of wordpieces to tokens, shows promise.
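A minimal sketch of the WordPieceToken ratio idea, not the authors' code: the ratio is taken here as the number of wordpieces divided by the number of whitespace tokens, so texts with a higher ratio contain more words the subword vocabulary splits apart (often rare or novel terms). The toy greedy tokenizer and the tiny vocabulary below are illustrative assumptions; real pipelines would use SBERT's own tokenizer.

```python
# Sketch of WordPieceToken ratio sampling (assumed formulation, toy tokenizer).

def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece split (toy implementation)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub          # continuation-piece convention
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:                 # no match at all -> unknown token
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

def wordpiece_token_ratio(text, vocab):
    """(# wordpieces) / (# whitespace tokens); higher = more rare words."""
    tokens = text.split()
    pieces = [p for w in tokens for p in wordpiece_tokenize(w.lower(), vocab)]
    return len(pieces) / max(len(tokens), 1)

# Hypothetical tiny vocabulary, for illustration only.
VOCAB = {"the", "host", "was", "friend", "##ly", "##s", "stream", "##ing"}

texts = ["the host was friendly", "streaming hosts"]
# Rank candidate texts for fine-tuning by descending ratio.
ranked = sorted(texts, key=lambda t: wordpiece_token_ratio(t, VOCAB), reverse=True)
```

Under this assumed scoring, "streaming hosts" (every word split into two pieces, ratio 2.0) would be sampled before "the host was friendly" (ratio 1.25).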
Experimental Results
Datasets from Airbnb and Yelp were used with different sample sizes.
Loss functions such as Batch All Triplets loss (BATL) and Softmax loss (SL) showed improved performance over the baseline.
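For intuition, here is a hedged plain-Python sketch of Batch All Triplets loss: for every valid (anchor, positive, negative) triplet in a batch, it takes a hinge on the distance margin and averages the non-zero losses. The experiments themselves use SBERT's implementation; Euclidean distance and the margin value of 1.0 are illustrative assumptions here.

```python
# Sketch of Batch All Triplets loss (BATL); assumed Euclidean metric and margin.
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_all_triplet_loss(embeddings, labels, margin=1.0):
    """Average hinge loss over all valid (anchor, positive, negative) triplets."""
    losses = []
    n = len(embeddings)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue                      # positive must share anchor's label
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue                  # negative must have another label
                loss = (euclidean(embeddings[a], embeddings[p])
                        - euclidean(embeddings[a], embeddings[neg]) + margin)
                losses.append(max(loss, 0.0))
    positive = [l for l in losses if l > 0]   # "batch all": average non-zero losses
    return sum(positive) / len(positive) if positive else 0.0
```

When same-class embeddings are already much closer than cross-class ones, every triplet satisfies the margin and the loss is zero; otherwise the hinge pushes anchors toward positives and away from negatives.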
Conclusion
Effective text sampling methods combined with suitable loss functions can enhance the performance of language models in text stream mining scenarios.
Stats
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
Larger sample sizes generally correlate with improved macro F1-scores.
Quotes
"Pre-trained language models have become popular in batch and stream scenarios due to their time-saving characteristics."
"Updating (or fine-tuning) the language model is generally costly if all new data is considered."