This work introduces SEER-MoE, a two-stage framework that reduces the memory footprint and compute requirements of pre-trained Mixture-of-Experts (MoE) models. The first stage prunes the total number of experts, guided by heavy-hitters counting, while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy lost to pruning and to reduce the number of experts activated during inference.
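A minimal sketch of the heavy-hitters counting idea behind the first stage: experts are ranked by how often the router selects them on a small calibration set, and the least-used experts are pruned. The function names, the `keep_ratio` parameter, and the top-k counting rule below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def heavy_hitter_counts(router_logits, top_k=2):
    """Count how often each expert appears in the top-k routing choices.

    router_logits: (num_tokens, num_experts) gating scores collected on a
    small calibration set (random values stand in for a real model here).
    """
    topk = np.argsort(-router_logits, axis=-1)[:, :top_k]
    return np.bincount(topk.ravel(), minlength=router_logits.shape[-1])

def prune_experts(counts, keep_ratio=0.75):
    """Keep only the most frequently activated ('heavy hitter') experts."""
    num_keep = max(1, int(len(counts) * keep_ratio))
    return np.argsort(-counts)[:num_keep]

# Toy calibration pass: 1,000 tokens routed over 8 experts.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 8))
kept = prune_experts(heavy_hitter_counts(logits, top_k=2), keep_ratio=0.75)
print("experts kept:", sorted(kept.tolist()))
```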
FFN-SkipLLM is a fine-grained skipping strategy that can bypass roughly 25-30% of the feed-forward network (FFN) blocks in autoregressive large language models (LLMs) with only a marginal change in performance on knowledge-intensive tasks.
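The sketch below illustrates the general idea of input-adaptive FFN skipping: an FFN block is bypassed when the hidden state has barely changed after attention. The cosine-similarity criterion and the 0.98 threshold are assumptions made for this sketch, not FFN-SkipLLM's published rule.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def decoder_layer(h, attn, ffn, skip_threshold=0.98):
    """One decoder layer with an input-adaptive FFN skip.

    If the hidden state barely changes after attention (cosine similarity
    above skip_threshold), the FFN block is bypassed for this token.
    """
    h_attn = h + attn(h)                    # residual attention
    if cosine(h, h_attn) > skip_threshold:  # representation has saturated
        return h_attn                       # skip the FFN entirely
    return h_attn + ffn(h_attn)             # residual FFN

# Toy layer: the attention barely perturbs h, so this call skips the FFN.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(64, 256)) * 0.02
W2 = rng.normal(size=(256, 64)) * 0.02
attn = lambda x: 0.01 * x
ffn = lambda x: np.maximum(x @ W1, 0) @ W2
out = decoder_layer(rng.normal(size=64), attn, ffn)
print(out.shape)
```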
Eigenpruning is a method that removes singular values from weight matrices in large language models (LLMs) to improve their performance on specific tasks. This approach is inspired by interpretability methods that aim to automatically find subnetworks of a model that can effectively solve a given task.
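A small sketch of the core operation, shown on a random matrix: factor a weight matrix with SVD, zero a subset of its singular values, and rebuild it. The paper selects which components to remove using a task-driven criterion; dropping the smallest singular values here is only a placeholder for that selection step.

```python
import numpy as np

def eigenprune(W, num_remove):
    """Zero out `num_remove` singular values of a weight matrix and rebuild it."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s[len(s) - num_remove:] = 0.0       # placeholder: drop the smallest components
    return (U * s) @ Vt                 # reconstruct U @ diag(s) @ Vt

rng = np.random.default_rng(2)
W = rng.normal(size=(128, 128))
W_pruned = eigenprune(W, num_remove=32)
print(np.linalg.matrix_rank(W_pruned))  # rank drops from 128 to 96
```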
The CMAT framework introduces a structured environment where individual agents with specialized roles and capabilities work together to process information, make decisions, and solve complex tasks, enabling more scalable and flexible training of language models.
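A toy sketch of role-specialized agents cooperating on a shared context; the role names (`planner`, `executor`, `reviewer`) and the sequential hand-off are illustrative assumptions rather than CMAT's actual agent set or training procedure.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Agent:
    """A role-specialized agent: a name plus a policy over the shared context."""
    role: str
    act: Callable[[str], str]

def run_pipeline(task: str, agents: List[Agent]) -> str:
    """Pass the task through each specialized agent in turn, accumulating context."""
    context = task
    for agent in agents:
        context = f"{context}\n[{agent.role}] {agent.act(context)}"
    return context

# Toy stand-ins for LLM-backed roles; real agents would call a language model.
planner = Agent("planner", lambda ctx: "break the task into steps")
executor = Agent("executor", lambda ctx: "carry out each step")
reviewer = Agent("reviewer", lambda ctx: "check the result and suggest fixes")

print(run_pipeline("summarize a report", [planner, executor, reviewer]))
```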
ComplexityNet is a framework that uses fine-tuned smaller models to assess task complexity and allocate each task to the most appropriate large language model, reducing computational resource usage by 90% while maintaining high code generation accuracy.
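A minimal sketch of complexity-based routing, with placeholder functions standing in for the fine-tuned complexity classifier and the LLM tiers; the complexity levels and the routing table are assumptions made for illustration.

```python
def route_by_complexity(task, classify, models):
    """Send a task to the cheapest model deemed adequate for its complexity.

    `classify` stands in for the fine-tuned small model that predicts a
    complexity level; `models` maps each level to an LLM backend.
    """
    level = classify(task)          # e.g. "low" or "high"
    return models[level](task)

# Toy complexity classifier and model tiers (placeholders for real LLM calls).
classify = lambda task: "high" if len(task.split()) > 20 else "low"
models = {
    "low":  lambda t: f"[small model] handled: {t!r}",
    "high": lambda t: f"[large model] handled: {t!r}",
}
print(route_by_complexity("write a hello world function", classify, models))
```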