Core Concepts
The study adapts and evaluates topic detection algorithms for the Persian language and emphasizes the importance of hybrid methods.
Abstract
This study examines the challenges of topic detection in Persian text streams, including morphological complexity, scarce language resources, and contextual ambiguity. It adapts existing algorithms to better suit Persian and evaluates their performance with a new multiclass evaluation criterion called FS. Ten methods from three categories (Frequent Pattern Mining, Clustering, and Hybrid) are studied and implemented from scratch. The dataset consists of Persian-language posts from Telegram channels. The main contributions are a broad overview of topic detection methods, experiments on a large dataset in which approximately 1.4 billion tokens are processed, a comparison of the different categories of methods, and a focus on Persian language processing.
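As a rough illustration of the Frequent Pattern Mining category named above, the following Python sketch counts term pairs that co-occur in several posts and treats the frequent ones as candidate topics. It is not the paper's implementation: the function name `frequent_term_pairs`, the `min_support` threshold, the whitespace tokenization, and the toy English posts are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_term_pairs(posts, min_support=2):
    """Return term pairs that co-occur in at least `min_support` posts."""
    pair_counts = Counter()
    for post in posts:
        # Whitespace splitting is a placeholder (assumption); real Persian
        # text would need normalization and a proper tokenizer.
        terms = sorted(set(post.split()))
        # Count each unordered pair of distinct terms once per post.
        pair_counts.update(combinations(terms, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical toy posts (English stand-ins for a Persian text stream).
posts = [
    "earthquake tehran damage",
    "tehran earthquake rescue",
    "football match result",
    "earthquake rescue teams",
]

topics = frequent_term_pairs(posts, min_support=2)
print(topics)
```

On these toy posts, only the pairs that share "earthquake" reach the support threshold, so they surface as one candidate topic while the single football post does not. Clustering and hybrid methods would group posts or terms differently; this sketch covers only the pattern-counting idea.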
Stats
Approximately 1.4 billion tokens are processed during the experiments.
Quotes
"Persian text exhibits complex morphological features."
"Persian is considered a low-resource language."
"A new multiclass evaluation criterion called FS is used in this paper."