Key concepts
AnchorAL is a computationally efficient active learning method that addresses class imbalance by dynamically selecting a small, balanced subpool of instances on which to run the active learning strategy, which promotes the discovery of minority instances.
Summary
The paper proposes AnchorAL, a novel active learning (AL) method designed to scale to large and imbalanced datasets. Standard pool-based AL struggles with large pools because repeatedly evaluating the model on the entire pool is computationally expensive. It also struggles with imbalanced datasets, because the AL strategy tends to overfit the initial decision boundary and fails to explore the input space to discover minority instances.
AnchorAL addresses these challenges as follows:
- It filters the pool before running the AL strategy, which reduces computational and annotation costs. It selects a small, fixed-size subpool by:
  - Choosing a set of "anchor" instances from the labelled set, selected dynamically to promote diversity.
  - Retrieving the unlabelled instances most similar to the anchors to form the subpool.
- The subpool is more balanced than the original pool, allowing the AL strategy to discover minority instances more effectively.
- In the experiments, AnchorAL is the fastest method, reducing the total instance selection time from hours to minutes, and often the best-performing, reaching higher predictive performance with fewer annotations than the baselines.
The key intuition behind AnchorAL is that biasing the subpool towards smaller regions of the input space, and dynamically changing these regions across iterations, promotes exploration of the input space and the discovery of minority instances. This avoids the need for a large subpool, which would reduce computational efficiency.
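The filtering step described above can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the summary does not say how diversity among anchors is promoted or which similarity measure is used, so the code assumes greedy farthest-point selection for the anchors and cosine similarity over precomputed embeddings. The function names `normalize`, `select_anchors`, and `build_subpool`, and all sizes in the usage lines, are hypothetical.

```python
import numpy as np


def normalize(x):
    """L2-normalise rows so that dot products equal cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def select_anchors(labelled, num_anchors, rng):
    """Pick diverse anchors from the labelled embeddings.

    Assumption: greedy farthest-point selection stands in for whatever
    diversity-promoting rule the method actually uses.
    """
    unit = normalize(labelled)
    num_anchors = min(num_anchors, len(unit))
    first = int(rng.integers(len(unit)))  # random starting anchor
    chosen = [first]
    # For each labelled point: highest similarity to any anchor chosen so far.
    max_sim = unit @ unit[first]
    max_sim[first] = np.inf  # exclude already-chosen points
    while len(chosen) < num_anchors:
        nxt = int(np.argmin(max_sim))  # least similar to current anchors
        chosen.append(nxt)
        max_sim = np.maximum(max_sim, unit @ unit[nxt])
        max_sim[nxt] = np.inf
    return np.array(chosen)


def build_subpool(labelled, unlabelled, anchor_idx, subpool_size):
    """Return indices of the unlabelled points most similar to any anchor."""
    anchors = normalize(labelled)[anchor_idx]
    sims = normalize(unlabelled) @ anchors.T  # shape: (n_unlabelled, n_anchors)
    score = sims.max(axis=1)                  # best-matching anchor per point
    subpool_size = min(subpool_size, len(unlabelled))
    top = np.argpartition(-score, subpool_size - 1)[:subpool_size]
    return top[np.argsort(-score[top])]       # most similar first


# Toy usage with random embeddings (all sizes are arbitrary).
rng = np.random.default_rng(0)
labelled = rng.normal(size=(20, 8))
unlabelled = rng.normal(size=(1000, 8))
anchor_idx = select_anchors(labelled, num_anchors=5, rng=rng)
subpool_idx = build_subpool(labelled, unlabelled, anchor_idx, subpool_size=50)
```

In this sketch the AL strategy (for example, uncertainty sampling) would then score only `unlabelled[subpool_idx]` rather than the full pool. Re-running `select_anchors` each iteration, as the labelled set grows, is what would shift the subpool across regions of the input space.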
Statistics
The Amazon-Agri dataset has a minority class (agriculture) with 0.09% prevalence.
The Amazon-Multi dataset has three low-prevalence minority classes: archaeology (0.09%), audio (0.56%), and philosophy (0.78%).
The WikiToxic and Agnews-Bus datasets have minority classes downsampled to 1% prevalence.
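To give a sense of scale for these prevalences, the short calculation below estimates how rarely a minority instance would turn up if instances were drawn uniformly at random from the pool. It is a back-of-the-envelope illustration only: the batch size of 100 and the uniform-sampling assumption are hypothetical and are not taken from the paper's experimental setup.

```python
def expected_minority(prevalence, batch_size):
    """Expected number of minority instances in a uniformly random batch."""
    return prevalence * batch_size


def prob_at_least_one(prevalence, batch_size):
    """Probability that a uniformly random batch has at least one minority
    instance, treating draws as independent (reasonable for a large pool)."""
    return 1.0 - (1.0 - prevalence) ** batch_size


# Prevalences quoted above, as fractions; batch size is a hypothetical choice.
prevalences = {
    "Amazon-Agri (agriculture)": 0.0009,
    "WikiToxic / Agnews-Bus (downsampled)": 0.01,
}
batch_size = 100

for name, p in prevalences.items():
    print(
        f"{name}: expected={expected_minority(p, batch_size):.2f}, "
        f"P(at least one)={prob_at_least_one(p, batch_size):.3f}"
    )
```

At 0.09% prevalence, a random batch of 100 contains 0.09 minority instances in expectation, so most such batches would contain none.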