Core Concept
Mimosa, a human-AI collaborative tool, enables amateur video creators to computationally generate and customize immersive spatial audio effects for videos with only monaural or stereo audio.
Summary
The paper introduces Mimosa, a human-AI collaborative tool that helps amateur video creators generate and manipulate spatial audio effects for videos with conventional monaural or stereo audio.
Mimosa employs a multi-step audiovisual pipeline that produces useful intermediate results: separate soundtracks for the different sounding objects, the type of each object, and its estimated 3D position over time. These results are presented through an interactive direct manipulation interface that lets users validate them, fix errors, and further customize the spatial audio effects.
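To make the idea of editable intermediate results concrete, here is a minimal sketch of how one sounding object and its position estimates over time might be represented. The `SoundingObject` class, its fields, and the linear interpolation between keyframes are all assumptions made for illustration; the summary does not describe Mimosa's internal data model.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

Vec3 = tuple[float, float, float]


@dataclass
class SoundingObject:
    """Hypothetical record for one separated sound source."""
    label: str  # object type, e.g. "guitar"
    # (time in seconds, estimated 3D position), kept sorted by time
    keyframes: list[tuple[float, Vec3]] = field(default_factory=list)

    def set_position(self, t: float, pos: Vec3) -> None:
        """Record a user correction: add or overwrite the position at time t."""
        self.keyframes = [(kt, kp) for kt, kp in self.keyframes if kt != t]
        self.keyframes.append((t, pos))
        self.keyframes.sort(key=lambda kf: kf[0])

    def position_at(self, t: float) -> Vec3:
        """Linearly interpolate the position at time t, clamping at both ends."""
        if not self.keyframes:
            raise ValueError("no position estimates available")
        times = [kt for kt, _ in self.keyframes]
        i = bisect_right(times, t)
        if i == 0:
            return self.keyframes[0][1]
        if i == len(times):
            return self.keyframes[-1][1]
        (t0, p0), (t1, p1) = self.keyframes[i - 1], self.keyframes[i]
        w = (t - t0) / (t1 - t0)
        return tuple(a + w * (b - a) for a, b in zip(p0, p1))
```

In this sketch, pipeline estimates and manual fixes share one keyframe list, so a user correction simply overrides the automatic estimate at that time and affects interpolation around it.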
The key features of Mimosa include:
- 2D and 3D direct manipulation panels for users to adjust the spatial positions of sound sources
- Audio properties display and control panel for users to verify and manually specify the correspondence between soundtracks and visual objects
- Real-time spatial audio rendering that mixes the separated soundtracks based on the 3D positions of sounding objects
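The rendering step in the last item can be illustrated with a deliberately simplified sketch: each separated mono track is weighted by gains derived from its 3D position, and the weighted tracks are summed into a stereo mix. This uses constant-power stereo panning and inverse-distance attenuation as a stand-in; it is not Mimosa's actual renderer, which the summary does not specify. The function names, the coordinate convention (listener at the origin, x to the right, z forward), and the `ref_distance` parameter are assumptions.

```python
import math


def stereo_gains(pos, ref_distance=1.0):
    """Return (left, right) gains for a source at pos = (x, y, z).

    Assumed convention: listener at the origin, x to the right, z forward.
    """
    x, y, z = pos
    dist = math.sqrt(x * x + y * y + z * z)
    # Azimuth: 0 is straight ahead, +pi/2 is fully to the right.
    azimuth = math.atan2(x, z)
    # Plain stereo cannot express front/back, so sources behind the
    # listener are clamped to the nearest side.
    azimuth = max(-math.pi / 2, min(math.pi / 2, azimuth))
    pan = (azimuth + math.pi / 2) / 2  # maps to the range [0, pi/2]
    # Inverse-distance attenuation, with no boost inside ref_distance.
    atten = ref_distance / max(dist, ref_distance)
    # Constant-power law: left^2 + right^2 == atten^2.
    return atten * math.cos(pan), atten * math.sin(pan)


def mix(tracks, positions):
    """Mix mono tracks (lists of samples) into (left, right) sample lists."""
    n = max((len(t) for t in tracks), default=0)
    left, right = [0.0] * n, [0.0] * n
    for samples, pos in zip(tracks, positions):
        gain_l, gain_r = stereo_gains(pos)
        for i, s in enumerate(samples):
            left[i] += gain_l * s
            right[i] += gain_r * s
    return left, right
```

Each track has a single static position here to keep the sketch short. A renderer that follows moving objects would recompute the gains per audio block from the time-varying positions, and a binaural renderer would replace the panning law with head-related filtering.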
A subjective evaluation with 8 external evaluators shows that the spatial audio effects generated by Mimosa were more immersive than the original video sound while maintaining a high degree of realism. A user study with 15 participants further demonstrates Mimosa's usability, usefulness, and capability in supporting users to create customized spatial audio effects.
Statistics
"Spatial audio offers more immersive video consumption experiences to viewers."
"Creating and editing spatial audio is often expensive and requires specialized hardware equipment and skills, posing a high barrier for amateur video creators."
Quotes
"Mimosa, a human-AI collaborative tool, enables amateur video creators to computationally generate and customize immersive spatial audio effects for videos with only monaural or stereo audio."
"The design of Mimosa exemplifies a human-AI collaboration approach that, instead of utilizing state-of-the-art end-to-end 'black-box' ML models, uses a multistep pipeline that aligns its interpretable intermediate results with the user's workflow."