SongTrans is a unified model that can directly transcribe and align song lyrics and musical notes without requiring pre-processing or separate tools.
The authors introduce S2Cap, a novel dataset for singing style captioning, a task that aims to generate textual descriptions of the vocal and musical characteristics of singing voices. The dataset covers a diverse set of attributes: pitch, volume, tempo, mood, the singer's gender and age, musical genre, and emotional expression.
CONTUNER, a diffusion-based model, can efficiently beautify amateur singing voices by correcting pitch and enhancing expressiveness without requiring paired professional-amateur data.