Core Concepts
The authors introduce T2AV-BENCH, a benchmark for text-to-audio generation aligned with videos, and propose the T2AV model, which integrates visual-aligned text embeddings for improved audio synthesis.
Abstract
The abstract discusses the challenge of keeping generated audio synchronized with video frames. It introduces the T2AV-BENCH benchmark and the T2AV model to address this, and reports extensive experiments and ablation studies that validate the approach's visual alignment and temporal consistency in text-to-audio generation.
Key Points:
- Introduction of T2AV-BENCH for video-aligned text-to-audio generation.
- Proposal of the T2AV model integrating visual-aligned text embeddings.
- Empirical experiments demonstrating state-of-the-art performance.
- Ablation studies highlighting the importance of visual-aligned CLAP and Audio-Visual ControlNet.
- Exploration of training data scale and latent diffusion tuning effects.
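The second key point, integrating visual-aligned text embeddings, can be illustrated with a minimal sketch. The function below is hypothetical: it approximates the idea by interpolating a text embedding toward a video embedding, whereas the paper's visual-aligned CLAP presumably learns this alignment contrastively during training. The names `visual_align`, `alpha`, and the 512-dimensional embeddings are assumptions for illustration only.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def visual_align(text_emb, video_emb, alpha=0.5):
    """Hypothetical stand-in for visual-aligned text embedding:
    blend the text embedding toward the video embedding and
    re-normalize, so the conditioning signal carries visual cues."""
    mixed = (1 - alpha) * text_emb + alpha * video_emb
    return mixed / np.linalg.norm(mixed)

# Toy embeddings standing in for CLAP text / video encoder outputs.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)
video_emb = rng.normal(size=512)

aligned = visual_align(text_emb, video_emb)
# The aligned embedding sits closer to the video embedding than
# the raw text embedding did, which is the intuition behind
# conditioning the audio generator on visually grounded text.
assert cosine(aligned, video_emb) > cosine(text_emb, video_emb)
```

In the full model, an embedding like `aligned` would serve as the conditioning input to the latent diffusion generator, with the Audio-Visual ControlNet providing the frame-level temporal signal.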
Stats
Our method significantly outperforms previous baselines on all reported metrics (lower is better for each).
Extensive evaluations on AudioCaps and T2AV-BENCH demonstrate that T2AV sets a new standard for video-aligned text-to-audio (TTA) generation.
Quotes
"Our contributions can be summarized as presenting a novel benchmark for Text-to-Audio generation aligned with Video, introducing a simple yet effective approach called T2AV, and demonstrating state-of-the-art superiority over previous baselines."
"We achieve significant performance gains compared to MMDiffusion, DiffSound, AudioGen, and AudioLDM in video-aligned text-to-audio generation."