Mel-RoFormer, a spectrogram-based model with a novel Mel-band Projection module and interleaved RoPE Transformers, achieves state-of-the-art performance in vocal separation and vocal melody transcription tasks.
Integrating audio and textual lyrics data can enhance the performance of music sentiment analysis systems compared to using a single modality.
A customized musical word embedding that incorporates both general and music-specific vocabulary can improve the performance of audio-word joint representation for music tagging and retrieval tasks.
A compact fingerprint based on chord and melody progressions enables efficient retrieval of cover versions of classical music works, offering a practical engineering approach to version identification.
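As a minimal sketch of the fingerprinting idea, one common design is to hash n-grams of a chord sequence into a compact set and compare sets by Jaccard overlap; the chord sequences, n-gram length, and hashing scheme below are illustrative assumptions, not the paper's actual method.

```python
import hashlib

def chord_fingerprint(chords, n=4):
    """Toy compact fingerprint: a set of hashed chord n-grams.

    (Illustrative sketch; the paper's actual fingerprint design
    may differ in features, hashing, and indexing.)
    """
    grams = {" ".join(chords[i:i + n]) for i in range(len(chords) - n + 1)}
    # Truncated MD5 gives small integer hashes for a compact fingerprint.
    return {int(hashlib.md5(g.encode()).hexdigest()[:8], 16) for g in grams}

def similarity(fp_a, fp_b):
    # Jaccard overlap between two fingerprints.
    return len(fp_a & fp_b) / max(len(fp_a | fp_b), 1)

# Hypothetical progressions: a cover shares most of the original's harmony.
original = ["C", "G", "Am", "F", "C", "G", "F", "C"]
cover    = ["C", "G", "Am", "F", "C", "G", "C", "F"]
score = similarity(chord_fingerprint(original), chord_fingerprint(cover))
print(round(score, 2))
```

Because the fingerprint is a small set of integers, candidate retrieval can be done with an inverted index over hash values before any detailed alignment.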
The proposed method learns a single similarity embedding space with disentangled dimensions, where each subspace represents the similarity focusing on a particular instrument, enabling flexible music similarity assessment.
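A disentangled similarity space of this kind can be sketched as one embedding vector partitioned into per-instrument subspaces, with similarity for a given instrument computed only on its slice; the dimensionality, instrument names, and slice layout below are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

# Hypothetical layout: a 12-dim embedding split into three 4-dim
# instrument subspaces (names and sizes are assumptions).
SUBSPACES = {"vocals": slice(0, 4), "drums": slice(4, 8), "bass": slice(8, 12)}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def instrument_similarity(emb_a, emb_b, instrument):
    """Similarity restricted to one instrument's subspace."""
    s = SUBSPACES[instrument]
    return cosine(emb_a[s], emb_b[s])

rng = np.random.default_rng(0)
a, b = rng.normal(size=12), rng.normal(size=12)
print(instrument_similarity(a, b, "vocals"))
```

The appeal of this design is flexibility at query time: the same embedding supports full-mix similarity (all dimensions) or instrument-focused similarity (one subspace) without retraining.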
Genuine music outliers exhibit unique characteristics that deviate from an artist's predominant style, providing valuable insights for music discovery and recommendation systems.
Computational models based on semantic, stylistic, and phonetic similarities are predictive of human perceptions of lyric similarity, indicating that all three factors shape how people judge song lyrics.
MusiLingo is a novel system that effectively bridges the gap between music and text domains, delivering competitive performance in music captioning and question-answering tasks.
This study explores deep vector quantization (VQ)-based audio representations, as used in the Jukebox model, for music genre identification and compares their performance with the well-established Mel spectrogram approach.
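For reference, the Mel spectrogram baseline mentioned above can be sketched in plain NumPy: frame the signal, window it, take the power spectrum, project through a triangular mel filterbank, and log-compress. The parameter values (sample rate, FFT size, hop, 64 mel bands) are common defaults chosen for illustration, not those of the study.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_spectrogram(y, sr=22050, n_fft=1024, hop=512, n_mels=64):
    # Frame, window, FFT -> power spectrum -> mel projection -> log.
    frames = [y[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(y) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log1p(power @ mel_filterbank(n_mels, n_fft, sr).T)

sr = 22050
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)  # 1 s sine at A4
S = log_mel_spectrogram(y, sr)
print(S.shape)  # (n_frames, 64)
```

In practice a library such as librosa provides an equivalent (and better-optimized) `melspectrogram`; the point here is only that the baseline representation is a fixed, hand-designed transform, in contrast to the learned VQ codes.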