Core Concepts
Mel-RoFormer, a spectrogram-based model with a novel Mel-band Projection module and interleaved Rotary Position Embedding (RoPE) Transformers, achieves state-of-the-art performance in vocal separation and vocal melody transcription.
Abstract
The paper introduces Mel-RoFormer, a deep neural network model designed for music information retrieval (MIR) tasks, with a focus on vocal separation and vocal melody transcription.
Key highlights:
- Mel-RoFormer features two key innovations: a Mel-band Projection module at the front end, which improves the model's ability to capture informative features across multiple frequency bands, and interleaved RoPE Transformers that explicitly model the frequency and time dimensions as two separate sequences.
- For vocal separation, Mel-RoFormer is trained on various datasets, including MUSDB18HQ, MoisesDB, and an in-house dataset. It outperforms the baseline BS-RoFormer model and other state-of-the-art approaches, achieving the highest signal-to-distortion ratio (SDR) on the MUSDB18HQ test set.
- For vocal melody transcription, the authors propose a two-step approach instead of a unified model: they first pretrain a vocal separation model, then fine-tune it for melody transcription. This yields state-of-the-art performance on the MIR-ST500 and POP909 datasets, particularly in accurately predicting note onsets, pitches, and offsets.
- The authors emphasize the importance of explicitly modeling the frequency dimension with Transformers and suggest that the separation task can serve as a valuable pre-training objective for a foundation model.
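To make the Mel-band Projection idea concrete, the sketch below partitions linear STFT frequency bins into bands whose edges are equally spaced on the mel scale, so low-frequency bands are narrow and high-frequency bands are wide. This is an illustrative simplification, not the paper's implementation: the actual module uses a learned projection (and the paper's band layout, overlap, and count are not reproduced here).

```python
import numpy as np

def hz_to_mel(f_hz):
    # Standard HTK-style mel scale.
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_slices(n_freq_bins, sample_rate, n_bands):
    """Half-open [lo, hi) bin ranges for n_bands mel-spaced bands.

    Band edges are equally spaced on the mel scale, so bands covering
    low frequencies span few STFT bins and high bands span many.
    """
    nyquist = sample_rate / 2.0
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), n_bands + 1))
    edges_bin = np.round(edges_hz / nyquist * n_freq_bins).astype(int)
    slices = []
    for lo, hi in zip(edges_bin[:-1], edges_bin[1:]):
        hi = max(hi, lo + 1)  # guarantee each band keeps at least one bin
        slices.append((lo, hi))
    return slices
```

Each band's bins would then be projected to a shared embedding dimension before the interleaved Transformers alternate between attending along the frequency axis and along the time axis.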
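The SDR metric used to compare the separation models above can be sketched as a simple energy ratio in decibels. Note that published MUSDB18HQ numbers are typically aggregated per song or per chunk (e.g. median over songs), and BSS-eval variants differ in details, so this plain "global SDR" is only an assumption about the evaluation protocol:

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-8):
    """Global signal-to-distortion ratio in dB (higher is better).

    An exact estimate gives a very large SDR; a silent estimate
    gives roughly 0 dB.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))
```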
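The two-step transfer for melody transcription can be illustrated with a toy weight dictionary: the separation-pretrained "backbone" is carried over unchanged, while the output head is replaced with one sized for note targets. All names and shapes here are hypothetical stand-ins for the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_separation_model(d_feat=64, d_hidden=128):
    # Toy stand-in for the pretrained separator: shared "backbone"
    # weights plus a separation-specific output "head" (a mask predictor).
    return {
        "backbone": rng.normal(size=(d_feat, d_hidden)),
        "head": rng.normal(size=(d_hidden, d_feat)),
    }

def to_transcription_model(sep_model, n_note_targets=3 * 128):
    # Step 2: keep the pretrained backbone and attach a fresh head sized
    # for note targets (e.g. onset/offset/pitch activations), then
    # fine-tune the whole model on transcription data.
    d_hidden = sep_model["backbone"].shape[1]
    return {
        "backbone": sep_model["backbone"].copy(),  # transferred weights
        "head": rng.normal(size=(d_hidden, n_note_targets)) * 0.01,
    }
```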
Stats
| Dataset   | Songs | Train | Val | Test |
|-----------|-------|-------|-----|------|
| MUSDB18HQ | 150   | 100   | –   | 50   |
| MoisesDB  | 240   | 200   | 40  | –    |
| MIR-ST500 | 500   | 330   | 37  | 98   |
| POP909    | 909   | 750   | 50  | 109  |
Quotes
"Mel-RoFormer demonstrates superior performance compared to BS-RoFormer and other MSS models in experiments."
"For vocal melody transcription, we propose a two-step approach instead of training a unified model. We first pretrain a vocal separation model and then fine-tune it for vocal melody transcription."