The authors introduce a new task, "singing style captioning", which aims to describe the vocal and musical characteristics of a given audio clip. To support this task, they present S2Cap, a dataset annotated with a diverse set of attributes related to singing voices.
The S2Cap dataset is built with a pipeline based on a large language model (LLM), drawing on an existing audio dataset (the Melon Playlist Dataset) and additional metadata obtained through web scraping. The authors apply demixing and speaker-diarization models to the audio tracks to extract vocal segments and their associated attributes, such as gender, timbre, mood, and tempo. These attributes are then used to generate the final captions, which reflect the singer's style.
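As a rough illustration of the last step, the sketch below turns a set of extracted attributes into a text prompt that could be handed to an LLM for caption generation. It is not the authors' pipeline: the `VocalAttributes` class, the `build_caption_prompt` function, and the prompt wording are all hypothetical, and only the four attribute names come from the summary above.

```python
from dataclasses import dataclass


@dataclass
class VocalAttributes:
    """Hypothetical container for attributes extracted from one vocal segment."""
    gender: str
    timbre: str
    mood: str
    tempo: str


def build_caption_prompt(attrs: VocalAttributes) -> str:
    """Format the attributes as a plain-text prompt (wording is an assumption)."""
    return (
        "Describe the singer's style in one sentence.\n"
        f"Gender: {attrs.gender}\n"
        f"Timbre: {attrs.timbre}\n"
        f"Mood: {attrs.mood}\n"
        f"Tempo: {attrs.tempo}\n"
    )


if __name__ == "__main__":
    example = VocalAttributes(gender="female", timbre="husky", mood="calm", tempo="slow")
    print(build_caption_prompt(example))
```

Keeping the attributes in a typed record makes it easy to check that every segment carries the same fields before any caption is generated.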
The authors also propose a baseline framework for singing style captioning that combines a pretrained audio encoder (AST) with a text decoder (BART). To address the potential misalignment between the audio encoder and the text decoder, they introduce a technique called CRESCENDO, which performs positive-pair similarity learning to align the two embedding spaces. Additionally, they use vocal demixing supervision to encourage the model to focus on the vocal track rather than the musical accompaniment.
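To make the idea of positive-pair similarity learning concrete, the following is a minimal sketch of one common formulation: mean-pool the audio-side and text-side embeddings of each matched pair and penalize `1 - cosine similarity`, with no negative pairs. The summary does not give the exact CRESCENDO objective, so the loss form, the pooling step, and all function names here are assumptions, and plain Python lists stand in for model outputs.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(frames: Sequence[Vector]) -> List[float]:
    """Average a sequence of frame (or token) vectors into a single vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def positive_pair_loss(
    audio_batch: Sequence[Sequence[Vector]],
    text_batch: Sequence[Sequence[Vector]],
) -> float:
    """Mean of (1 - cosine similarity) over matched audio/caption pairs."""
    losses = [
        1.0 - cosine_similarity(mean_pool(audio), mean_pool(text))
        for audio, text in zip(audio_batch, text_batch)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    audio = [[[1.0, 0.0], [1.0, 0.0]]]   # one clip, two frames
    text = [[[0.0, 1.0]]]                # one caption, one token
    print(positive_pair_loss(audio, text))
```

The loss approaches 0 when the pooled audio and text embeddings of a pair point in the same direction and grows as they diverge, which is the sense in which such an objective pulls the two embedding spaces together.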
The authors evaluate their proposed methods on the S2Cap dataset and report that their approach outperforms baselines built on various alternative audio encoders. The S2Cap dataset and the code are publicly available to facilitate further research in this emerging field.
Key insights distilled from:
by Hyunjong Ok,... at arxiv.org, 09-17-2024
https://arxiv.org/pdf/2409.09866.pdf