This research paper introduces Audio-Agent, a multimodal framework for generating, editing, and composing audio from text and video inputs. Its key innovation is the use of large language models (LLMs), specifically GPT-4 and Gemma2-2B-it, to guide a pre-trained text-to-audio diffusion model (Auffusion).
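At a high level, the framework wires two components together: an LLM that plans (for text input) or produces conditioning tokens (for video input), and a frozen diffusion generator that synthesizes the audio. The sketch below only illustrates that division of labor; the `AudioAgent` class, its callable fields, and their signatures are invented for this summary and do not reflect the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AudioAgent:
    """High-level wiring only: an LLM plans or conditions, a frozen
    diffusion model synthesizes. The callables stand in for GPT-4 /
    Gemma2-2B-it and Auffusion; their real interfaces are not
    specified here."""
    llm_plan: Callable[[str], list[str]]          # text -> atomic prompts
    video_to_tokens: Callable[[object], object]   # video -> semantic tokens
    diffusion_generate: Callable[..., object]     # prompt/tokens -> audio

    def text_to_audio(self, caption: str):
        # Text path: decompose the caption, generate one clip per event.
        prompts = self.llm_plan(caption)
        return [self.diffusion_generate(prompt=p) for p in prompts]

    def video_to_audio(self, video):
        # Video path: condition generation directly on semantic tokens.
        tokens = self.video_to_tokens(video)
        return self.diffusion_generate(condition=tokens)
```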
For text-to-audio (TTA) tasks, Audio-Agent employs GPT-4 to decompose complex text descriptions into simpler, atomic instructions for Auffusion. This allows the system to handle long, multi-event descriptions that pose challenges for conventional single-pass TTA models.
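The following Python sketch illustrates the decompose-then-generate idea under loose assumptions: `decompose_caption` stands in for the GPT-4 call (reduced here to a trivial string split so the example runs offline), `generate_atomic_audio` stands in for a call to the frozen Auffusion model, and plain concatenation is only one possible way to compose events. The paper's actual prompting and composition logic may differ.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate for this illustration


def decompose_caption(caption: str) -> list[str]:
    """Placeholder for the GPT-4 call that splits a long, multi-event
    caption into simple, atomic sound descriptions."""
    return [part.strip() for part in caption.split(", followed by")]


def generate_atomic_audio(prompt: str, seconds: float = 2.0) -> np.ndarray:
    """Placeholder for one call to the frozen text-to-audio diffusion
    model (Auffusion in the paper). Returns silence of the right shape
    so the pipeline can be exercised end to end."""
    return np.zeros(int(seconds * SAMPLE_RATE), dtype=np.float32)


def text_to_audio(caption: str) -> np.ndarray:
    """Decompose the caption, generate each event, concatenate in order."""
    clips = [generate_atomic_audio(p) for p in decompose_caption(caption)]
    return np.concatenate(clips)


if __name__ == "__main__":
    audio = text_to_audio("A dog barks, followed by a car engine starting")
    print(audio.shape)  # (64000,): two 2-second clips at 16 kHz
```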
For video-to-audio (VTA) tasks, Audio-Agent fine-tunes Gemma2-2B-it to generate semantic tokens from video inputs. These tokens, which inherently contain temporal information, guide Auffusion to generate audio synchronized with the video content. This eliminates the need for a separate timestamp detection module, simplifying the VTA process.
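A minimal PyTorch sketch of that idea, with invented dimensions and a tiny transformer standing in for the fine-tuned Gemma2-2B-it branch: per-frame video features are mapped to an ordered sequence of semantic-token embeddings that a diffusion model could attend to via cross-attention. Nothing here matches the paper's real architecture or token vocabulary.

```python
import torch
import torch.nn as nn


class SemanticTokenBridge(nn.Module):
    """Stand-in for the fine-tuned LLM branch: maps a sequence of
    per-frame video features to "semantic token" embeddings for the
    diffusion model. Dimensions and depth are illustrative only."""

    def __init__(self, video_dim: int = 512, token_dim: int = 768,
                 n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        self.proj_in = nn.Linear(video_dim, token_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, n_frames, video_dim).
        # Tokens stay in frame order, so they carry the video's temporal
        # structure into the diffusion model's conditioning.
        return self.backbone(self.proj_in(frame_feats))


if __name__ == "__main__":
    bridge = SemanticTokenBridge()
    tokens = bridge(torch.randn(1, 32, 512))  # 32 frames of features
    print(tokens.shape)                       # torch.Size([1, 32, 768])
```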
Evaluations on the AudioCaps and AVSync15 datasets demonstrate Audio-Agent's capabilities. In TTA tasks, it outperforms baseline models, especially on complex, multi-event descriptions. In VTA tasks, it achieves competitive performance in audio fidelity and semantic alignment while demonstrating superior temporal synchronization compared to other methods.
The paper highlights the significance of Audio-Agent as a step towards multi-event, long-condition TTA generation, a relatively unexplored area with significant potential. It also acknowledges limitations, such as the need for improved handling of complex text conditions, and suggests future research directions, including exploring LLM applications in audio and video captioning tasks.
Key insights distilled from: Zixuan Wang et al., arXiv, 10-07-2024. https://arxiv.org/pdf/2410.03335.pdf