MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
The paper introduces MADTP, a framework that accelerates Vision-Language Transformers in two ways: it aligns features across the vision and language modalities to guide which tokens are pruned, and it dynamically adjusts the token compression ratio according to the complexity of each input.
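To make the idea of an input-dependent compression ratio concrete, the following is a minimal sketch and not the MADTP implementation. It assumes a simplified scoring rule (each image token is scored by its best cosine similarity to any text token) and a fixed threshold; the names `cosine`, `prune_tokens`, and `threshold` are hypothetical and chosen only for illustration. Because tokens are kept by threshold rather than by a fixed top-k, the number of surviving tokens varies from input to input.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def prune_tokens(image_tokens, text_tokens, threshold=0.5):
    """Return the indices of image tokens to keep.

    Assumed scoring rule (illustrative only): a token's importance is its
    highest cosine similarity to any text token. Tokens scoring below
    `threshold` are pruned, so the kept count depends on the input.
    """
    kept = []
    for i, tok in enumerate(image_tokens):
        score = max((cosine(tok, t) for t in text_tokens), default=0.0)
        if score >= threshold:
            kept.append(i)
    return kept


# Toy 2-D embeddings, made up for this example.
text = [[1.0, 0.0], [0.0, 1.0]]
simple_image = [[1.0, 0.0], [-1.0, 0.0], [-1.0, -1.0]]   # one token matches the text
complex_image = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]     # every token matches the text

print(prune_tokens(simple_image, text))   # [0]
print(prune_tokens(complex_image, text))  # [0, 1, 2]
```

In this toy setting the first input retains one of three tokens while the second retains all three, which is the behavior a fixed compression ratio cannot express. A real system would learn both the alignment and the threshold rather than fix them by hand.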