The paper revisits the problem of hierarchical text classification (HTC), which involves assigning labels to text within a structured hierarchy. The authors identify two key challenges in HTC evaluation: (1) the choice of appropriate hierarchical metrics that account for the severity of prediction errors, and (2) the inference method used to produce predictions from the estimated probability distribution.
The authors first propose to evaluate HTC models with hierarchical metrics, such as the hierarchical F1-score (hF1), which gives partial credit to predictions that are close to the ground truth in the hierarchy rather than treating every error as equally severe. They argue that the commonly used flat multi-label metrics and inference rules (e.g., thresholding each label probability at 0.5) are suboptimal for HTC.
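For reference, the standard hierarchical F1 used in most HTC work augments both the predicted and the ground-truth label sets with all of their ancestors before computing precision and recall; this is a sketch of that common definition and not necessarily the exact variant adopted in the paper:

$$
\mathrm{hP} = \frac{|\hat{Y}_{\mathrm{aug}} \cap Y_{\mathrm{aug}}|}{|\hat{Y}_{\mathrm{aug}}|}, \qquad
\mathrm{hR} = \frac{|\hat{Y}_{\mathrm{aug}} \cap Y_{\mathrm{aug}}|}{|Y_{\mathrm{aug}}|}, \qquad
\mathrm{hF1} = \frac{2\,\mathrm{hP}\,\mathrm{hR}}{\mathrm{hP} + \mathrm{hR}},
$$

where $\hat{Y}_{\mathrm{aug}}$ and $Y_{\mathrm{aug}}$ denote the predicted and true label sets extended with all their ancestors in the hierarchy. Under this definition, predicting a sibling or parent of the correct node still shares ancestors with the truth and is penalized less than a prediction in an unrelated branch.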
To further investigate these issues, the authors introduce a new, more challenging HTC dataset called Hierarchical Wikivitals (HWV), which has a deeper and more complex hierarchy compared to existing benchmarks.
The experimental results on HWV and other datasets show that state-of-the-art HTC models do not necessarily encode hierarchical information well and can be outperformed by simpler baselines, such as a BERT model trained with a conditional softmax loss that directly incorporates the hierarchy structure. The authors hypothesize that the advantage of the conditional softmax approach comes from its ability to better capture the complexity of the hierarchy, especially for labels that sit deep in the tree and have few training examples.
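The conditional softmax idea can be sketched as follows: each internal node of the hierarchy carries a local softmax over its children, so the probability of any label factorizes into conditional probabilities along its root-to-label path, and training minimizes the summed negative log conditional probabilities along the ground-truth path. The snippet below is a minimal illustration of that construction on top of a fixed encoder output (e.g., a BERT [CLS] vector); the class name, toy hierarchy, and hidden size are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a conditional-softmax head over a label hierarchy (assumed names).
import torch
import torch.nn as nn

class ConditionalSoftmaxHead(nn.Module):
    def __init__(self, hidden_dim, children):
        # children: dict mapping each internal node to the list of its child nodes
        super().__init__()
        self.children_of = children
        # one local linear classifier per internal node, scoring its children
        self.local = nn.ModuleDict({
            parent: nn.Linear(hidden_dim, len(kids)) for parent, kids in children.items()
        })
        self.child_index = {
            parent: {kid: i for i, kid in enumerate(kids)} for parent, kids in children.items()
        }

    def loss(self, h, path):
        # h: (hidden_dim,) encoder output; path: root-to-label list of node names
        nll = h.new_zeros(())
        for parent, child in zip(path[:-1], path[1:]):
            log_probs = torch.log_softmax(self.local[parent](h), dim=-1)
            # accumulate -log P(child | parent) along the ground-truth path
            nll = nll - log_probs[self.child_index[parent][child]]
        return nll

# Toy usage with an assumed two-level hierarchy.
children = {"root": ["science", "sports"], "science": ["physics", "biology"]}
head = ConditionalSoftmaxHead(hidden_dim=8, children=children)
h = torch.randn(8)
print(head.loss(h, ["root", "science", "physics"]))
```

Because the loss is defined per parent node over its children, the hierarchy is built into the model's output distribution rather than imposed only through a post-hoc decoding rule.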
The paper concludes by emphasizing the importance of carefully considering the evaluation methodology, including both the choice of metrics and the inference rules, when proposing new HTC methods. The authors identify the design of inference rules tailored to hierarchical metrics as a direction for future work.