Key Concepts
Extractive summarization outperforms most truncation strategies in text classification, but the single best strategy is a truncation one: taking the head of the document.
Summary
The study compares document truncation and summarization as text shortening strategies for text classification, using the IndoSum dataset. Extractive summarization performs better than most truncation methods, although taking the beginning of the document remains the top-performing strategy overall. The research highlights that an effective shortening strategy depends on knowing where the key information lies within a document.
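The truncation side of this comparison can be illustrated with a minimal sketch. The function names, the whitespace tokenization in the usage note, and the combined head-and-tail variant are assumptions made for illustration; the summary does not list the exact truncation variants the study tested.

```python
def truncate_head(tokens, max_len):
    """Keep the first max_len tokens (the 'head' of the document)."""
    return tokens[:max_len]


def truncate_tail(tokens, max_len):
    """Keep the last max_len tokens (the 'tail' of the document)."""
    return tokens[-max_len:] if max_len > 0 else []


def truncate_head_tail(tokens, max_len, head_len):
    """Keep head_len tokens from the start and fill the rest from the end.

    Hypothetical variant, not confirmed as one of the study's strategies.
    """
    if len(tokens) <= max_len:
        return list(tokens)
    head_len = min(head_len, max_len)
    tail_len = max_len - head_len
    tail = tokens[-tail_len:] if tail_len > 0 else []
    return tokens[:head_len] + tail
```

For example, with `tokens = "a b c d e f".split()`, `truncate_head(tokens, 3)` keeps `a b c` and `truncate_tail(tokens, 2)` keeps `e f`. In practice the token list would come from the model's own tokenizer rather than from whitespace splitting.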
- Investigating Text Shortening Strategy in BERT:
  - Examines document truncation and summarization.
- Performance Evaluation:
  - Summaries outperform most truncation variations.
  - Extractive summarization is a feasible shortening strategy.
- Best Strategies:
  - Taking the head of the document yields the best results.
- Dataset Exploration:
  - Uses the IndoSum dataset for news article classification.
- Preprocessing and Variations:
  - Tests 10 different shortening strategies, including extractive and abstractive summarization.
- Text Classification:
  - Fine-tunes a DistilBERT model for text classification.
- Result Analysis:
  - Extractive summarization ranks among the top strategies.
- Conclusion and Recommendation:
  - Recommends considering where the important information lies within a document when choosing a shortening strategy.
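To make the idea of extractive summarization as a shortening strategy concrete, here is a minimal sketch that scores sentences by average word frequency and keeps the top-ranked ones in their original order. This is a simple stand-in technique chosen for illustration, not the summarizer used in the study; the function name `extractive_summary` and the regex-based sentence splitting are assumptions.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=2):
    """Return the num_sentences highest-scoring sentences, in document order.

    Illustrative frequency-based scoring; not the study's summarizer.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole document.
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        toks = re.findall(r"\w+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:num_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)
```

Unlike head truncation, this kind of selection can pick sentences from anywhere in the document, which is why it helps most when the key information is not at the beginning.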
Statistics
"The best strategy obtained in this study is taking the head of the document."
"The average number of tokens in a document is 346 tokens."
"Filtered IndoSum has around 13 thousand (13K) articles."
Quotes
"This study concludes that extractive summaries as an alternative shortening strategy has great potential for a Transformer-based classification model."
"In general, whenever the location of the main idea of the document is unknown, taking the first part of the document seems to be the best assumption."