CoCA: Integrating Position Embedding with Collinear Constrained Attention for Long Context Window Extending
Core Concepts
CoCA seamlessly integrates position embedding with self-attention to enhance long context window extrapolation.
Abstract
CoCA addresses anomalous behavior arising between Rotary Position Embedding (RoPE) and self-attention, enhancing long context window extrapolation. It enforces a collinear constraint between the query and key vectors (Q and K), improving extrapolation performance without adding significant computational complexity. CoCA-based models show strong results in extending context windows, outperforming existing methods. A slack (relaxed) version of the constraint performs comparably to the strict version and offers advantages on certain tasks such as passkey retrieval.
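To make the collinear constraint concrete, below is a minimal numpy sketch under simplifying assumptions, not the paper's exact (or efficient) formulation: if each key k_j is forced to be collinear with the query q_i inside every 2D rotary plane (same direction, nonnegative magnitude), then after RoPE rotates q_i and k_j the per-plane dot product reduces to |q_i| * |k_j| * cos((i - j) * theta), so the attention score depends on relative position only through a cosine. The function name collinear_scores and the nonnegative key magnitudes k_mag are illustrative assumptions.

```python
import numpy as np

def collinear_scores(q, k_mag, base=10000.0):
    """
    Toy O(n^2 * d) sketch of collinear constrained attention scores.

    q:     [n, d]   query vectors (d even; split into d/2 rotary planes)
    k_mag: [n, d/2] nonnegative key magnitudes per rotary plane (assumed)

    Constraint: in every 2D rotary plane, key k_j shares the direction of
    query q_i and contributes only a nonnegative magnitude. After RoPE
    rotates q_i by i*theta and k_j by j*theta, the per-plane dot product is
    |q_i| * |k_j| * cos((i - j) * theta).
    """
    n, d = q.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)            # standard RoPE frequencies
    q_mag = np.sqrt(q[:, 0::2] ** 2 + q[:, 1::2] ** 2)   # |q_i| per plane, [n, half]

    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            rel_angle = (i - j) * freqs                  # relative rotation per plane
            scores[i, j] = np.sum(q_mag[i] * k_mag[j] * np.cos(rel_angle))
    return scores / np.sqrt(d)

# Toy usage with random queries and nonnegative key magnitudes.
rng = np.random.default_rng(0)
q = rng.standard_normal((6, 16))
k_mag = np.abs(rng.standard_normal((6, 8)))              # nonnegativity is the key assumption
print(collinear_scores(q, k_mag).shape)                  # (6, 6)
```

Because the magnitudes are nonnegative, each plane's contribution is largest at zero relative rotation and varies smoothly with distance, which is the intuition behind removing the anomalous Q-K interactions that hurt extrapolation.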
Stats
A CoCA-based GPT model extends the context window up to 32K (60x) without fine-tuning.
Dropping CoCA into LLaMA-7B achieves extrapolation up to 32K with only a 2K training length.
Quotes
"Extensive experiments show that incorporating CoCA into existing models significantly enhances performance in both long sequence language modeling and long context retrieval tasks."
"Combining CoCA with other extended RoPE methods effectively mitigates rotation boundary issues, achieving robust long-context extrapolation capabilities."