CATP, a cross-attention-based token pruning method, can maintain high accuracy in multimodal model inference while significantly reducing computational cost.
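
To make the general idea concrete, here is a minimal sketch of pruning tokens by cross-attention scores. It is not CATP's actual procedure, which this summary does not spell out. It assumes a hypothetical scoring rule (the attention mass each key token receives, summed over heads and query positions) and a hypothetical `keep_ratio` parameter; the function name and array layout are illustrative only.

```python
import numpy as np


def prune_tokens_by_cross_attention(attn, keep_ratio=0.5):
    """Pick which key tokens to keep, given cross-attention probabilities.

    attn: array of shape (num_heads, num_queries, num_keys).
    Returns the indices of the kept key tokens, in their original order.
    Illustrative only: the scoring rule is an assumption, not CATP's.
    """
    attn = np.asarray(attn, dtype=float)
    num_keys = attn.shape[-1]

    # Assumed importance score: total attention each key token receives,
    # summed over all heads and all query positions.
    scores = attn.sum(axis=(0, 1))

    # Keep at least one token, otherwise roughly keep_ratio of them.
    k = max(1, int(round(keep_ratio * num_keys)))

    # Stable sort on negated scores so ties go to the earlier token.
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)


# One head, two queries, four key tokens.
attn = np.array([[[0.1, 0.6, 0.1, 0.2],
                  [0.4, 0.3, 0.1, 0.2]]])
kept = prune_tokens_by_cross_attention(attn, keep_ratio=0.5)
print(kept)
```

In a sketch like this, the compute saving comes from later layers attending over only the kept tokens, and the accuracy depends on how well the scores track the tokens that actually matter.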