Text-to-Audio Generation Aligned with Videos: T2AV-BENCH
This paper introduces T2AV-BENCH, a benchmark for text-to-audio generation aligned with videos, and proposes T2AV, a model that integrates visual-aligned text embeddings to improve audio synthesis.