Core Concepts
Crafting verbose samples, including verbose images and verbose videos, that maximize the length of generated sequences can induce high energy consumption and latency time in multi-modal large language models (MLLMs).
Abstract
The paper investigates the vulnerability of MLLMs, particularly image-based and video-based ones, to energy-latency manipulation. It is observed that energy consumption and latency time exhibit an approximately positive linear relationship with the length of generated sequences in MLLMs.
To induce high energy-latency cost, the authors propose verbose samples, including verbose images and verbose videos. For both modalities, two modality-non-specific losses are designed: 1) a Delayed EOS loss, which delays the occurrence of the end-of-sequence (EOS) token, and 2) an Uncertainty loss, which enhances output uncertainty and breaks the original output dependency.
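The summary does not reproduce the loss formulas. A minimal NumPy sketch of one plausible instantiation, assuming the Delayed EOS loss penalizes the EOS token's probability at each generation step and the Uncertainty loss is the negative mean entropy of the next-token distributions (the paper's exact formulations may differ):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def delayed_eos_loss(logits, eos_id):
    # Mean probability of the EOS token across all positions.
    # Minimizing this pushes EOS probability down at every step,
    # delaying sequence termination.
    probs = softmax(logits)  # shape: (seq_len, vocab_size)
    return probs[:, eos_id].mean()

def uncertainty_loss(logits):
    # Negative mean entropy of the next-token distributions.
    # Minimizing this maximizes entropy, i.e., output uncertainty.
    probs = softmax(logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return -entropy.mean()
```

In an actual attack these losses would be backpropagated through the MLLM to the input image or video pixels; the sketch only shows the objective side.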
Additionally, modality-specific losses are proposed. For verbose images, a Token Diversity loss is introduced to promote diverse hidden states among all generated tokens. For verbose videos, a Frame Feature Diversity loss is proposed to increase the diversity of frame features, introducing inconsistent semantics.
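Both modality-specific losses reward spread-out feature vectors, whether token hidden states (images) or frame features (videos). A hypothetical NumPy sketch, assuming diversity is measured as mean pairwise cosine similarity to be minimized (the paper may use a different diversity measure):

```python
import numpy as np

def feature_diversity_loss(features):
    # features: (n, d) array of token hidden states or frame features.
    # Returns the mean off-diagonal pairwise cosine similarity;
    # minimizing it pushes the vectors apart, increasing diversity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T                  # (n, n) cosine similarities
    n = features.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]   # drop self-similarities
    return off_diag.mean()
```

For verbose videos, applying this to per-frame features would encourage inconsistent semantics across frames, matching the stated goal of the Frame Feature Diversity loss.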
A temporal weight adjustment algorithm is used to balance these losses during optimization. Experiments demonstrate that the proposed verbose samples can significantly extend the length of generated sequences, thereby inducing high energy-latency cost in MLLMs.
Stats
Energy consumption and latency time exhibit an approximately positive linear relationship with the length of generated sequences in MLLMs.
Verbose images can increase the maximum length of generated sequences by 7.87x and 8.56x on the MS-COCO and ImageNet datasets, respectively.
Verbose videos can increase the maximum length of generated sequences by 4.04x and 4.14x on the MSVD and TGIF datasets, respectively.
Quotes
"Once attackers maliciously induce high energy consumption and latency time (energy-latency cost) during the inference stage, it can exhaust computational resources and reduce the availability of MLLMs."
"The energy consumption is the amount of energy used on hardware during an inference, while the latency time represents the response time taken for the inference."