Core Concepts
FFN-SkipLLM is a novel fine-grained skip strategy that can skip ~25-30% of feed-forward network (FFN) blocks in autoregressive large language models (LLMs) with only a marginal change in performance on knowledge-intensive tasks.
Summary
The paper discusses the challenges posed by the massive scale of autoregressive large language models (LLMs) and the high computational cost of their token-by-token generation. To address this, the authors propose FFN-SkipLLM, a novel fine-grained skip strategy that focuses on skipping the computationally expensive feed-forward network (FFN) blocks in LLMs.
The key observations that motivate FFN-SkipLLM are:
- Monotonically increasing cosine similarity between tensors before and after the FFN blocks across layers, indicating redundant computation (a minimal measurement sketch follows this list).
- Existence of "cold regions" (initial and final layers) where FFN blocks should not be skipped to maintain performance.
- Leveraging the "attention sink" phenomenon to stabilize the initial token generation before applying FFN skipping.
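To make the first observation concrete, here is a minimal, self-contained sketch (toy dimensions, attention omitted, not the authors' code) of how one might measure per-layer FFN saturation as the cosine similarity between the residual stream before and after each FFN block:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
dim, ffn_dim, n_layers, seq = 64, 256, 8, 16   # toy sizes, not LLaMA's

class ToyBlock(nn.Module):
    """Minimal pre-norm decoder block with only the FFN sub-block
    (attention is omitted to keep the sketch short)."""
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.SiLU(),
                                 nn.Linear(ffn_dim, dim))

    def forward(self, x):
        return x + self.ffn(self.norm(x))       # residual FFN sub-block

layers = nn.ModuleList(ToyBlock() for _ in range(n_layers))
x = torch.randn(1, seq, dim)

# Per-layer "FFN saturation": cosine similarity between the residual stream
# entering and leaving each FFN block, averaged over tokens.
with torch.no_grad():
    for i, layer in enumerate(layers):
        before, x = x, layer(x)
        sim = F.cosine_similarity(before, x, dim=-1).mean().item()
        print(f"layer {i:2d}: cos(before, after FFN) = {sim:.3f}")
```

In a real LLM, the paper's observation is that this similarity rises monotonically through the middle layers, which is what makes those FFN blocks candidates for skipping.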
FFN-SkipLLM uses an input-adaptive approach to identify the non-cold regions where FFN blocks can be skipped, based on a cosine similarity threshold. It can skip ~25-30% of FFN blocks with only a marginal change in performance on knowledge-intensive tasks like Factoid-QA, multi-turn conversations, and variable-length text summarization, while avoiding the KV cache issues associated with layer-skipping methods.
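Below is a minimal sketch of what such input-adaptive skipping could look like at decode time. It is an illustration under simplifying assumptions, not the paper's exact algorithm: the FFN-only toy blocks, the cold-region sizes, the warm-up length, and the threshold `TAU` are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
dim, n_layers = 64, 8                      # toy sizes, not LLaMA's
# Stand-in pre-norm FFN sub-blocks (attention omitted, as in the sketch above).
ffn_blocks = nn.ModuleList(
    nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.SiLU(),
                  nn.Linear(4 * dim, dim))
    for _ in range(n_layers))

# Illustrative hyper-parameters (not the paper's values): cold regions at both
# ends are never skipped, a short warm-up leans on attention-sink stability,
# and TAU is the cosine-similarity threshold for a "saturated" FFN.
COLD_FIRST, COLD_LAST, WARMUP_TOKENS, TAU = 2, 2, 4, 0.95
saturation = [0.0] * n_layers              # last measured per-layer saturation

@torch.no_grad()
def forward_one_token(x, token_index):
    """Run one token's hidden state through the stack, skipping saturated
    FFN blocks outside the cold regions once warm-up is over."""
    skipped = 0
    for i, ffn in enumerate(ffn_blocks):
        cold = i < COLD_FIRST or i >= n_layers - COLD_LAST
        if token_index >= WARMUP_TOKENS and not cold and saturation[i] >= TAU:
            skipped += 1                   # reuse x unchanged: FFN skipped
            continue
        before = x
        x = x + ffn(x)                     # residual FFN sub-block
        saturation[i] = F.cosine_similarity(before, x, dim=-1).mean().item()
    return x, skipped

h = torch.randn(1, 1, dim)                 # stand-in for one decoded token
for t in range(8):
    h, n_skipped = forward_one_token(h, t)
    print(f"token {t}: skipped {n_skipped}/{n_layers} FFN blocks")
```

The key design choice mirrored here is that skipping is decided per input from observed saturation rather than from a fixed schedule, and the first and last layers plus the earliest generated tokens always run the full model.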
The authors extensively evaluate FFN-SkipLLM against baseline methods and show that it retains the model's ability to access factual knowledge and to maintain coherence, consistency, fluency, and relevance in its generated outputs.
Stats
The GPT-175B model requires 325 GB of GPU memory and a minimum of five A100 (80GB) GPUs.
FFN blocks hold approximately two-thirds of the parameter budget of a LLaMa layer.
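As a rough sanity check of the two-thirds figure, the snippet below counts per-layer weights using LLaMA-7B's published dimensions (hidden size 4096, FFN intermediate size 11008) and the standard LLaMA layer layout, ignoring the small normalization parameters:

```python
# Rough per-layer parameter count for LLaMA-7B (hidden=4096, ffn=11008).
d, d_ffn = 4096, 11008
attn = 4 * d * d            # q, k, v, o projections
ffn = 3 * d * d_ffn         # gate, up, down projections
print(f"FFN share of layer parameters: {ffn / (attn + ffn):.2f}")  # ~0.67
```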
Quotes
"Unlike prior layer-skipping methods, we focus on only skipping computationally expensive FFN blocks based on our observation of their monotonically increasing saturation within the middle layers of LLMs."
"Our proposed FFN-SkipLLM uses a simple cosine similarity metric across tensors to capture the trend of FFN saturation and decide an input-adaptive skipping of FFN blocks."
"Our extensive knowledge-intensive experiments such as Factoid-QA, Multi-turn conversations and Variable-length in-context text summarization, reveal that FFN-SkipLLM can skip ~25-30% of FFN blocks of LLMs with a marginal change in performance and reduce hallucination and token collapse."