Core Concepts
AudioComposer is a text-to-audio generation framework that uses natural language descriptions alone to give precise control over content and style, without requiring additional conditions or complex network structures.
Summary
The paper presents AudioComposer, a fine-grained audio generation framework that relies solely on natural language descriptions (NLDs) to enable precise control over content and style. The key highlights are:
- Automatic Data Simulation Pipeline:
  - The authors introduce an innovative online data simulation pipeline to generate fine-grained audio-text pairs with annotations on timestamps, pitch, and energy.
  - This approach effectively tackles the issue of data scarcity in controllable text-to-audio (TTA) systems.
- Natural Language-based Control:
  - AudioComposer utilizes NLDs to provide both content specification and style control information, eliminating the need for additional conditions or complex control networks.
  - This simplifies the system design and improves efficiency compared to previous approaches that require extra frame-level conditions.
- Flow-based Diffusion Transformers:
  - The authors employ flow-based diffusion transformers with cross-attention mechanisms to effectively incorporate text representations into the audio generation process.
  - This architecture not only accelerates the generation process but also enhances the audio generative performance and controllability.
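The data simulation idea above can be illustrated with a minimal sketch: place one sound event on a silent canvas at a chosen start time, scale it according to an energy label, and emit the matching description. Everything here is an assumption for illustration and is not taken from the paper: the 16 kHz sample rate, the 10-second canvas, the `ENERGY_GAIN` table, the helper names `place_event` and `describe`, and the sine tone that stands in for a real recording. The pitch label is simply passed through; a real pipeline would presumably measure or modify pitch.

```python
import math

SAMPLE_RATE = 16_000   # assumed sample rate, not stated in the summary
CLIP_SECONDS = 10.0    # assumed canvas length, not stated in the summary

# Hypothetical mapping from energy label to peak amplitude.
ENERGY_GAIN = {"Low": 0.25, "Normal": 0.5, "High": 1.0}


def place_event(canvas, event, start_s, energy):
    """Mix `event` into `canvas` at `start_s`, peak-normalised to the energy label.

    Returns the (start, end) times in seconds actually covered by the event.
    """
    start = round(start_s * SAMPLE_RATE)
    end = min(start + len(event), len(canvas))  # clip events that run past the canvas
    peak = max((abs(x) for x in event), default=0.0) or 1.0  # avoid division by zero
    gain = ENERGY_GAIN[energy] / peak
    for i in range(start, end):
        canvas[i] += gain * event[i - start]
    return start / SAMPLE_RATE, end / SAMPLE_RATE


def describe(label, start_s, end_s, pitch, energy):
    """Format the annotation in the style of the example descriptions on this page."""
    return (f"{label}, Start at {start_s:g}s and End at {end_s:g}s, "
            f"it has {pitch} Pitch and {energy} Energy.")


# Stand-in event: a 3.8-second 440 Hz tone instead of a real dog-bark recording.
canvas = [0.0] * round(CLIP_SECONDS * SAMPLE_RATE)
bark = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
        for n in range(round(3.8 * SAMPLE_RATE))]

start_s, end_s = place_event(canvas, bark, 3.6, "Low")
caption = describe("Dog bark", start_s, end_s, "Normal", "Low")
print(caption)
# Dog bark, Start at 3.6s and End at 7.4s, it has Normal Pitch and Low Energy.
```

Because the caption is derived from the same values used to build the waveform, the text and audio stay consistent by construction, which is what makes simulated pairs usable as supervision for timing and energy control.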
Extensive experiments demonstrate that AudioComposer outperforms state-of-the-art TTA models in terms of generation quality, temporal controllability, pitch control, and energy control, even with a smaller model size.
Statistics
"Dog bark, Start at 3.6s and End at 7.4s, it has Normal Pitch and Low Energy."
"Speech, Start at 0s and End at 3s, it has High Pitch, and Normal Energy."
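For evaluation or data checks it can help to read such descriptions back into structured fields. The sketch below parses the two example sentences above with a regular expression. The pattern is inferred only from these two examples (including the optional comma before "and"), and the names `EventSpec`, `NLD_PATTERN`, and `parse_nld` are hypothetical; descriptions phrased differently would need a different pattern.

```python
import re
from typing import NamedTuple


class EventSpec(NamedTuple):
    """Structured view of one description (hypothetical helper type)."""
    label: str
    start_s: float
    end_s: float
    pitch: str
    energy: str


# Pattern inferred from the two example sentences only; the comma before
# "and" is optional because the examples differ on that point.
NLD_PATTERN = re.compile(
    r"^(?P<label>.+?), Start at (?P<start>\d+(?:\.\d+)?)s "
    r"and End at (?P<end>\d+(?:\.\d+)?)s, "
    r"it has (?P<pitch>\w+) Pitch,? and (?P<energy>\w+) Energy\.$"
)


def parse_nld(text):
    """Parse one description into an EventSpec, or raise ValueError."""
    match = NLD_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised description: {text!r}")
    return EventSpec(
        label=match["label"],
        start_s=float(match["start"]),
        end_s=float(match["end"]),
        pitch=match["pitch"],
        energy=match["energy"],
    )


spec = parse_nld("Speech, Start at 0s and End at 3s, it has High Pitch, and Normal Energy.")
print(spec)
# EventSpec(label='Speech', start_s=0.0, end_s=3.0, pitch='High', energy='Normal')
```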