The paper introduces Mimosa, a human-AI collaborative tool that helps amateur video creators generate and manipulate spatial audio effects for videos with conventional monaural or stereo audio.
Mimosa employs a multi-step audiovisual pipeline to produce useful intermediate results, such as a separate soundtrack for each sounding object, the object's type, and its estimated 3D position over time. These results are presented through an interactive direct manipulation interface, allowing users to easily validate the results, fix errors, and further customize the spatial audio effects.
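The summary does not describe how Mimosa renders audio from these intermediate results. As a rough illustration of how a per-object soundtrack and a position track could drive a spatial effect, the sketch below uses constant-power stereo panning, a much simpler stand-in technique and not the paper's method. The function name `pan_mono_to_stereo` and the azimuth convention (-90 for hard left, +90 for hard right) are assumptions made for this example.

```python
import math


def pan_mono_to_stereo(samples, azimuths_deg):
    """Constant-power pan of a mono track using one azimuth per sample.

    Hypothetical helper for illustration only.
    azimuths_deg: -90 (hard left) to +90 (hard right); values outside
    that range are clamped.
    """
    if len(samples) != len(azimuths_deg):
        raise ValueError("samples and azimuths_deg must have the same length")
    left, right = [], []
    for sample, azimuth in zip(samples, azimuths_deg):
        azimuth = max(-90.0, min(90.0, azimuth))
        # Map azimuth [-90, 90] degrees onto a pan angle in [0, pi/2].
        theta = (azimuth + 90.0) / 180.0 * (math.pi / 2.0)
        # cos/sin gains keep left^2 + right^2 equal to sample^2.
        left.append(sample * math.cos(theta))
        right.append(sample * math.sin(theta))
    return left, right


# A source moving from hard left, through center, to hard right.
left, right = pan_mono_to_stereo([1.0, 1.0, 1.0], [-90.0, 0.0, 90.0])
```

Because the position is a per-sample input, changing an object's trajectory only means editing the azimuth track; the separated soundtrack itself stays untouched.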
The key features of Mimosa include:
A subjective evaluation with 8 external evaluators shows that the spatial audio effects generated by Mimosa were more immersive than the original video sound while maintaining a high degree of realism. A user study with 15 participants further demonstrates Mimosa's usability, usefulness, and ability to help users create customized spatial audio effects.
Key insights distilled from the paper by Zheng Ning,Z... at arxiv.org, 04-24-2024
https://arxiv.org/pdf/2404.15107.pdf