Core Concepts
The authors propose Orthogonal Butterfly (BOFT), a novel parameter-efficient finetuning method that leverages butterfly factorization to construct dense orthogonal matrices as products of sparse factors, enabling large foundation models to be adapted with substantially fewer trainable parameters than the original orthogonal finetuning.
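A minimal sketch of the core construction, assuming NumPy and using 2x2 Givens rotations as stand-ins for the paper's Cayley-parametrized orthogonal blocks (the paper allows larger b x b blocks; function names here are illustrative, not from the paper's code):

```python
import numpy as np

def butterfly_factor(d, stride, angles):
    """One butterfly factor: a d x d orthogonal matrix built from 2x2
    rotations acting on index pairs (j, j + stride)."""
    B = np.zeros((d, d))
    k = 0
    for start in range(0, d, 2 * stride):
        for j in range(start, start + stride):
            c, s = np.cos(angles[k]), np.sin(angles[k])
            B[j, j], B[j, j + stride] = c, -s
            B[j + stride, j], B[j + stride, j + stride] = s, c
            k += 1
    return B

def butterfly_orthogonal(d, rng):
    """Compose log2(d) butterfly factors (strides 1, 2, 4, ...) into a
    dense d x d orthogonal matrix; d must be a power of two."""
    R = np.eye(d)
    for i in range(int(np.log2(d))):
        angles = rng.uniform(0.0, 2.0 * np.pi, size=d // 2)
        R = butterfly_factor(d, 2 ** i, angles) @ R
    return R

rng = np.random.default_rng(0)
R = butterfly_orthogonal(8, rng)
assert np.allclose(R @ R.T, np.eye(8))  # orthogonal by construction
# Generically every entry of R is nonzero: the sparse factors compose
# into a dense rotation, using (d/2)*log2(d) = 12 angles for d = 8
# instead of the d*(d-1)/2 = 28 a full orthogonal matrix would need.
```

The point of the factorization is exactly this gap: a dense orthogonal matrix from O(d log d) parameters rather than O(d^2).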
Abstract
The paper studies parameter efficiency in orthogonal finetuning, a principled approach to downstream task adaptation. The authors propose an information transmission framework that recasts the task of crafting a parameter-efficient dense orthogonal matrix as an information transmission problem on a grid-structured graph.
Inspired by the butterfly structures in the Cooley-Tukey fast Fourier transform algorithm, the authors introduce Orthogonal Butterfly (BOFT), a parameter-efficient orthogonal finetuning method. BOFT subsumes the original orthogonal finetuning (OFT) as a special case and provides a generalized orthogonal finetuning framework.
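To make the OFT connection concrete: orthogonal finetuning adapts a frozen weight W0 by learning a rotation, W' = R W0, which preserves the pairwise angles between neurons; with a single block-diagonal factor, the butterfly product reduces to the original OFT. A small check of that invariance, reusing butterfly_orthogonal from the sketch above (W0 here is a random stand-in, not a real pretrained weight):

```python
import numpy as np

d = 8
rng = np.random.default_rng(1)
W0 = rng.normal(size=(d, d))      # stand-in for a frozen pretrained weight
R = butterfly_orthogonal(d, rng)  # from the sketch above

W = R @ W0  # orthogonal finetuning: W' = R @ W0, only R is trained

# Since R^T R = I, the Gram matrix of W0's columns is unchanged, so the
# pairwise angles and norms of neurons are preserved under adaptation.
assert np.allclose(W.T @ W, W0.T @ W0)
```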
The authors provide theoretical insights on the expressivity and inductive bias of BOFT, showing that it can significantly reduce the number of trainable parameters while retaining strong adaptation performance. BOFT also comes with an intriguing weight interpolation property.
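One plausible reading of that interpolation property, sketched under the same assumptions as above: each butterfly factor collapses to the identity when its angles are zero, so scaling every learned angle by t in [0, 1] traces a smooth path of orthogonal matrices from the pretrained weights (t = 0) to the finetuned ones (t = 1). The helper below reuses butterfly_factor from the first sketch and is illustrative, not the paper's implementation:

```python
import numpy as np

def interpolated_weight(W0, angles_per_factor, t):
    """Scale every learned angle by t: t = 0 recovers W0 exactly (all
    factors collapse to the identity), t = 1 gives the fully finetuned
    weight, and intermediate t yields a smooth blend of the two."""
    d = W0.shape[0]
    R = np.eye(d)
    for i, angles in enumerate(angles_per_factor):
        R = butterfly_factor(d, 2 ** i, t * angles) @ R
    return R @ W0
```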
The authors extensively evaluate BOFT on a wide range of adaptation tasks, including natural language understanding, mathematical reasoning, high-quality image segmentation, and controllable text-to-image generation. BOFT outperforms prior state-of-the-art methods by a considerable margin, validating its superior parameter efficiency and generalization ability.
Stats
The GPT-3 model has around 175 billion parameters.
The Llama-2-7B model has 7 billion parameters.
The DINOv2-large model has more than 300 million parameters.
Quotes
"Recent models such as ChatGPT and Stable Diffusion demonstrate the remarkable generalization ability of large foundation models. The rapid increase in the performance of such models is paired with a dramatic increase in the number of parameters."
"Broad progress in the field therefore requires the ability to adapt such models without retraining them from scratch. That is, we must be able to efficiently adapt existing foundation models to downstream tasks."