Core Concepts
Verbose images can induce high energy-latency cost in VLMs by increasing the length of generated sequences.
Overview
This paper studies how verbose images can induce high energy-latency cost in large vision-language models (VLMs), exploiting the relationship between energy consumption, latency, and generated-sequence length during inference. The proposed method delays the end-of-sequence (EOS) token, enhances output uncertainty, and improves token diversity, using a temporal weight adjustment algorithm to balance these objectives while crafting imperceptible perturbations. Extensive experiments show that verbose images significantly increase the length of generated sequences compared to original images on the MS-COCO and ImageNet datasets.
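Imperceptible perturbations of this kind are commonly crafted with a projected signed-gradient update under an L-infinity budget. A minimal sketch of one such step (the function name, step size `alpha`, and budget `epsilon` are illustrative assumptions, not the paper's exact recipe):

```python
def pgd_step(adv, grad, clean, epsilon=8/255, alpha=1/255):
    """One signed-gradient descent step with an L-infinity projection.
    Illustrative imperceptible-perturbation update, not the paper's
    exact optimization procedure. Pixels are floats in [0, 1]."""
    out = []
    for a, g, c in zip(adv, grad, clean):
        sign = 1 if g > 0 else -1 if g < 0 else 0
        x = a - alpha * sign                       # descend the loss
        x = max(c - epsilon, min(c + epsilon, x))  # stay within the budget
        out.append(max(0.0, min(1.0, x)))          # keep a valid pixel value
    return out

clean = [0.2, 0.5, 0.8]                # toy 3-pixel "image"
adv = pgd_step(clean, grad=[0.3, -0.1, 0.0], clean=clean)
print([round(v, 4) for v in adv])
```

Each pixel moves by at most `alpha` per step and can never drift more than `epsilon` from the clean image, which is what keeps the perturbation visually imperceptible.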
Abstract:
Large vision-language models (VLMs) require substantial computational resources for deployment.
Attackers can induce high energy consumption and latency time during inference to exhaust resources.
Verbose images are proposed to manipulate VLMs into generating longer sequences.
Three loss objectives delay EOS occurrence, enhance output uncertainty, and improve token diversity.
A temporal weight adjustment algorithm balances these losses for effective optimization.
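The bullets above can be sketched as a weighted sum of three toy losses over per-step token distributions. All function names, the probability-overlap stand-in for the diversity term, and the example numbers below are assumptions for illustration, not the paper's exact formulation:

```python
import math

# Toy per-step softmax outputs over a 4-token vocabulary;
# index 0 plays the role of the EOS token. Numbers are made up.
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.20, 0.20, 0.20],
]
EOS = 0

def eos_delay_loss(step_probs):
    # Penalize probability mass on EOS so generation runs longer.
    return sum(p[EOS] for p in step_probs) / len(step_probs)

def uncertainty_loss(step_probs):
    # Negative mean entropy: minimizing this *raises* output uncertainty.
    def entropy(p):
        return -sum(x * math.log(x) for x in p if x > 0)
    return -sum(entropy(p) for p in step_probs) / len(step_probs)

def diversity_loss(step_probs):
    # Encourage different distributions across steps by penalizing
    # pairwise overlap (a simple stand-in for the token-diversity idea).
    return sum(
        sum(a * b for a, b in zip(p, q))
        for i, p in enumerate(step_probs)
        for q in step_probs[i + 1:]
    )

def total_loss(step_probs, weights):
    # Weighted combination; the weights are what the temporal
    # weight adjustment algorithm balances during optimization.
    w1, w2, w3 = weights
    return (w1 * eos_delay_loss(step_probs)
            + w2 * uncertainty_loss(step_probs)
            + w3 * diversity_loss(step_probs))

print(round(total_loss(probs, (1.0, 0.5, 0.5)), 4))
```

Minimizing this total loss over the image perturbation pushes the model away from emitting EOS while spreading probability mass across many different tokens.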
Introduction:
VLMs have achieved remarkable performance but require significant computational resources.
Malicious attacks inducing high energy-latency cost can deplete computational resources.
The proposed verbose images aim to increase sequence length by manipulating VLM outputs.
Methodology:
Three loss objectives are designed: delaying EOS occurrence, enhancing output uncertainty, and improving token diversity.
A temporal weight adjustment algorithm is introduced to balance these losses during optimization.
GradCAM visualization shows dispersed attention in verbose images compared to original images.
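One way to balance several losses over time is to rebalance their weights each iteration based on recent progress. The scheme below (boost any loss that stalled, then renormalize) is a hypothetical illustration of the idea, not the paper's exact temporal weight adjustment rule:

```python
def adjust_weights(weights, prev_losses, curr_losses, step=0.1):
    """Hypothetical temporal weight adjustment: boost the weight of any
    loss that stopped improving this iteration, then renormalize so the
    weights still sum to 1. Illustrative only, not the paper's rule."""
    boosted = [
        w * (1 + step) if curr >= prev else w
        for w, prev, curr in zip(weights, prev_losses, curr_losses)
    ]
    total = sum(boosted)
    return [w / total for w in boosted]

# One optimization step: the first and third losses stalled or worsened,
# so they receive proportionally more weight for the next iteration.
weights = adjust_weights([1/3, 1/3, 1/3],
                         prev_losses=[0.9, 0.5, 0.4],
                         curr_losses=[0.9, 0.3, 0.5])
print([round(w, 3) for w in weights])
```

Keeping the weights normalized prevents any single objective from dominating the perturbation update as the optimization proceeds.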
Results:
Extensive experiments demonstrate that verbose images significantly increase the length of generated sequences in VLMs.
Visual interpretation using GradCAM highlights dispersed attention in verbose images.
Textual interpretation shows increased object hallucination in sequences generated from verbose images.
Stats
"Extensive experiments demonstrate that our verbose images can increase the length of generated sequences by 7.87× and 8.56× compared to original images on MS-COCO and ImageNet datasets."
Quotes
"Our contribution can be outlined as follows."
"Our code is available at https://github.com/KuofengGao/Verbose_Images."