Key Concepts
CATP, a cross-attention-based token pruning method, maintains high accuracy in multimodal model inference while significantly reducing computational cost.
Summary
The content discusses the development of Cross-Attention Token Pruning (CATP), a novel token pruning technique for large multimodal models like BLIP-2. The key insights are:
BLIP-2 is a state-of-the-art multimodal model that combines a frozen image encoder, a frozen language model, and a Querying Transformer (Q-Former) to enable vision-language tasks. However, the language model dominates the inference time, posing a challenge for computational efficiency.
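To make this pipeline concrete, here is a minimal, runnable sketch of the BLIP-2 inference flow. The module, its dimensions, and the toy feature shapes are hypothetical stand-ins for illustration, not the real implementation.

```python
import torch
import torch.nn as nn

class ToyQFormer(nn.Module):
    """Illustrative Q-Former: learned query tokens cross-attend to image features."""
    def __init__(self, num_queries=32, dim=64, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats):
        # Expand the shared learned queries across the batch.
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, attn = self.cross_attn(q, image_feats, image_feats)
        return out, attn  # attn: (batch, num_queries, num_image_tokens)

# Stand-in for the frozen image encoder's patch features (shapes are toy values).
image_feats = torch.randn(1, 257, 64)
query_out, cross_attn_probs = ToyQFormer()(image_feats)
# In BLIP-2, query_out is projected and fed to the frozen language model,
# which dominates inference time; fewer query tokens means a shorter LM input.
```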
Existing pruning methods like magnitude-based and self-attention probability pruning often result in significant accuracy degradation when applied to BLIP-2. To address this, the authors propose CATP, which leverages the cross-attention layers in the Q-Former to determine the importance of each query token.
CATP employs a refined voting strategy across model heads and layers to compute the importance score of each query token. Tokens with the lowest scores are then pruned, preserving model accuracy.
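A minimal sketch of how such a score could be computed follows. The specific voting rule here, where each image token in every head and layer votes for the query token that attends to it most strongly, is an assumption for illustration and may differ from the paper's exact formulation.

```python
import torch

def catp_importance(cross_attn_per_layer):
    """Aggregate Q-Former cross-attention maps into per-query importance scores.

    cross_attn_per_layer: list of tensors, one per cross-attention layer,
    each shaped (num_heads, num_queries, num_image_tokens). Assumed rule:
    each image token votes for the query that attends to it most; votes
    are tallied across all heads and layers.
    """
    num_queries = cross_attn_per_layer[0].shape[1]
    votes = torch.zeros(num_queries)
    for attn in cross_attn_per_layer:
        winners = attn.argmax(dim=1)  # (num_heads, num_image_tokens)
        votes += torch.bincount(winners.flatten(), minlength=num_queries).float()
    return votes

def prune_queries(query_tokens, votes, keep_ratio=0.5):
    """Keep the top-scoring fraction of query tokens, preserving their order."""
    k = max(1, int(query_tokens.shape[0] * keep_ratio))
    keep = votes.topk(k).indices.sort().values
    return query_tokens[keep]

# Demo with random maps: 2 layers, 4 heads, 32 queries, 257 image tokens.
attn_maps = [torch.rand(4, 32, 257).softmax(dim=-1) for _ in range(2)]
scores = catp_importance(attn_maps)
pruned = prune_queries(torch.randn(32, 64), scores, keep_ratio=0.5)
print(pruned.shape)  # torch.Size([16, 64])
```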
Experiments on the VQA dataset show that CATP achieves up to 12.1X higher accuracy compared to existing pruning methods, demonstrating its effectiveness in balancing computational efficiency and model precision.
The authors also explore further refinements, such as incorporating image token importance weighting and analyzing the layer-wise contribution of cross-attention information, to enhance the pruning strategy.
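The image-token-weighting refinement could, for example, take the form of the soft-vote variant below; both the per-token weights and the aggregation rule are assumptions for illustration, not the authors' exact formulation.

```python
import torch

def weighted_importance(cross_attn_per_layer, image_token_weights):
    """Soft-vote variant: each image token's contribution to a query's score
    is scaled by that image token's importance weight (hypothetical, e.g.
    derived from how much total attention the token receives)."""
    num_queries = cross_attn_per_layer[0].shape[1]
    scores = torch.zeros(num_queries)
    for attn in cross_attn_per_layer:  # (heads, queries, image_tokens)
        scores += (attn * image_token_weights).sum(dim=(0, 2))
    return scores
```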
Statistics
The content presents the following key figures:
BLIP-2 has a total of 3.1 billion parameters, with the language model decoder accounting for over 87% of the total.
CATP achieves up to 12.1X higher accuracy compared to existing pruning methods on the VQA dataset.
Pruning ratios of 1/2, 1/4, and 1/8 are evaluated for different pruning methods.
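For scale: BLIP-2's Q-Former uses 32 learned query tokens, so those ratios map to the retained-token budgets computed below (the mapping is simple arithmetic, not a figure from the source).

```python
NUM_QUERY_TOKENS = 32  # BLIP-2's Q-Former query count
for ratio in (1/2, 1/4, 1/8):
    print(f"ratio {ratio:.3f}: keep {int(NUM_QUERY_TOKENS * ratio)} query tokens")
```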
Quotes
The content does not include any direct quotes.