The paper investigates the vulnerability of MLLMs, particularly image-based and video-based ones, to energy-latency manipulation. The authors observe that energy consumption and inference latency exhibit an approximately linear positive relationship with the length of the sequences generated by MLLMs.
To induce high energy-latency cost, the authors propose verbose samples, comprising verbose images and verbose videos. For both modalities, two modality-agnostic losses are designed: 1) a Delayed EOS loss to postpone the occurrence of the end-of-sequence (EOS) token, and 2) an Uncertainty loss to enhance output uncertainty and break the original output dependency.
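The two modality-agnostic objectives can be sketched as follows. This is a minimal illustration based only on the summary's descriptions, not the paper's exact formulations: the Delayed EOS loss is approximated here as the mean EOS probability across decoding steps (to be minimized), and the Uncertainty loss as negative mean token-distribution entropy (so minimizing it raises uncertainty). The function names and the averaging choices are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def delayed_eos_loss(logits, eos_id):
    # logits: (seq_len, vocab_size). Minimizing the mean EOS probability
    # at every step discourages early termination of generation.
    probs = softmax(logits, axis=-1)
    return probs[:, eos_id].mean()

def uncertainty_loss(logits):
    # Negative mean entropy of the per-step token distributions:
    # minimizing this loss maximizes output uncertainty.
    probs = softmax(logits, axis=-1)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return -entropy.mean()
```

Both quantities are differentiable in the logits, so in the actual attack they would be back-propagated to the input-image (or video) perturbation rather than to model weights.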
Additionally, modality-specific losses are proposed. For verbose images, a Token Diversity loss is introduced to promote diverse hidden states among all generated tokens. For verbose videos, a Frame Feature Diversity loss is proposed to increase the diversity of frame features, introducing inconsistent semantics.
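Both modality-specific losses reward pairwise dissimilarity among a set of feature vectors (token hidden states for images, frame features for videos), so a single sketch can illustrate the idea. Assuming, hypothetically, that diversity is measured via mean pairwise cosine similarity (the paper's concrete metric may differ), the loss to minimize is:

```python
import numpy as np

def diversity_loss(feats):
    # feats: (n, d) array of feature vectors, e.g. token hidden states
    # (verbose images) or per-frame features (verbose videos).
    # Lower loss = lower mean pairwise cosine similarity = more diversity.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                      # (n, n) cosine similarity matrix
    n = feats.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)]   # exclude self-similarity
    return off_diag.mean()
```

Minimizing this value over the adversarial perturbation pushes the features apart, which for videos corresponds to the inconsistent frame semantics the summary describes.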
A temporal weight adjustment algorithm is used to balance these losses during optimization. Experiments demonstrate that the proposed verbose samples can significantly extend the length of generated sequences, thereby inducing high energy-latency cost in MLLMs.
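The summary does not specify how the temporal weight adjustment works, so the following is purely an illustrative assumption: a linear schedule that re-weights the three losses (EOS delay, uncertainty, diversity) across optimization steps, shifting emphasis from one objective to another as the attack progresses.

```python
def temporal_weights(t, T, w_init=(1.0, 0.5, 0.5), w_final=(0.5, 1.0, 1.0)):
    # Hypothetical schedule: linearly interpolate per-loss weights
    # (w_eos, w_uncertainty, w_diversity) from w_init at step 0
    # to w_final at step T-1. The actual algorithm in the paper may differ.
    alpha = t / max(T - 1, 1)
    return tuple((1 - alpha) * a + alpha * b for a, b in zip(w_init, w_final))

def total_loss(losses, t, T):
    # Weighted sum of the three component losses at optimization step t.
    return sum(w * l for w, l in zip(temporal_weights(t, T), losses))
```

Whatever the precise schedule, the stated purpose is the same: no single loss dominates the optimization, so the perturbation lengthens output without collapsing into a degenerate solution.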
Core insights extracted from the paper by Kuofeng Gao,... published on arxiv.org, 04-26-2024.
https://arxiv.org/pdf/2404.16557.pdf