Efficient Sparse Mixture-of-Experts Models through Expert Pruning and Top-K Adaptation
This work introduces SEER-MoE, a two-stage framework that reduces the memory footprint and compute requirements of pre-trained Mixture-of-Experts (MoE) models. The first stage prunes the total number of experts guided by heavy-hitters counting, while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy lost during pruning and to reduce the number of experts activated during inference.
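To make the first stage concrete, the sketch below illustrates heavy-hitters counting: tallying how often the router activates each expert over a calibration set, then retaining only the most frequently used experts. This is a minimal illustration, not the authors' implementation; names such as `router_logits_batches`, `top_k`, and `keep_ratio` are assumptions for the example.

```python
# Minimal sketch (not the paper's code) of heavy-hitters counting for expert pruning.
import torch

def count_heavy_hitters(router_logits_batches, top_k=2):
    """Accumulate how often each expert is activated over a calibration set.

    router_logits_batches: iterable of tensors of shape (tokens, num_experts)
    """
    counts = None
    for logits in router_logits_batches:
        # Experts actually activated for each token under the current Top-K routing.
        topk_idx = logits.topk(top_k, dim=-1).indices            # (tokens, top_k)
        one_hot = torch.zeros_like(logits).scatter_(-1, topk_idx, 1.0)
        batch_counts = one_hot.sum(dim=0)                        # (num_experts,)
        counts = batch_counts if counts is None else counts + batch_counts
    return counts

def select_experts_to_keep(counts, keep_ratio=0.5):
    """Keep the most frequently activated ("heavy hitter") experts."""
    num_keep = max(1, int(counts.numel() * keep_ratio))
    return counts.topk(num_keep).indices.sort().values

# Example with random calibration data: 8 experts, Top-2 routing, keep half of them.
batches = [torch.randn(1024, 8) for _ in range(4)]
counts = count_heavy_hitters(batches, top_k=2)
kept = select_experts_to_keep(counts, keep_ratio=0.5)
print("activation counts:", counts.tolist())
print("experts kept:", kept.tolist())
```

The second stage (regularization-based fine-tuning that lowers the effective Top-K at inference) operates on the pruned model produced by a step like this one.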