Core Concept
Large language models (LLMs), particularly WavLM, show promise for music genre classification (MGC) in a zero-shot setting, but the audio spectrogram transformer (AST) still outperforms all of them, highlighting the strength of transformer architectures in music information retrieval.
Summary
Bibliographic Information:
Meguenani, M.E.A., Britto Jr., A.S., & Koerich, A.L. (2024). Music Genre Classification using Large Language Models. arXiv preprint arXiv:2410.08321v1.
Research Objective:
This paper investigates the efficacy of pre-trained large language models (LLMs) for music genre classification (MGC) in a zero-shot setting, comparing their performance to traditional deep learning architectures.
Methodology:
The researchers extracted feature vectors from various layers of three pre-trained audio LLMs (WavLM, HuBERT, and wav2vec 2.0) and used them to train a classification head. They compared the performance of these models against 1D and 2D convolutional neural networks (CNNs) and the audio spectrogram transformer (AST) on the GTZAN dataset using 3-fold cross-validation.
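The layer-wise feature extraction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the audio LLM exposes per-layer hidden states of shape `(time_steps, dim)` (as HuggingFace-style models do via `output_hidden_states=True`), and that each layer's activations are mean-pooled over time into one fixed-size vector for the classification head. The pooling choice and toy dimensions here are assumptions for illustration.

```python
import numpy as np

def layer_feature(hidden_states, layer, pool="mean"):
    """Pool one transformer layer's (time_steps, dim) activations
    into a single fixed-size feature vector for a classifier head."""
    h = np.asarray(hidden_states[layer])        # shape: (time_steps, dim)
    return h.mean(axis=0) if pool == "mean" else h.max(axis=0)

# Toy stand-in for an audio LLM's per-layer outputs:
# 13 "layers" (embedding + 12 blocks), 50 time steps, 768 dims (assumed sizes).
rng = np.random.default_rng(0)
hidden_states = [rng.standard_normal((50, 768)) for _ in range(13)]

feat = layer_feature(hidden_states, layer=5)
print(feat.shape)  # (768,)
```

In practice one such vector is extracted per audio segment and per layer, so each layer can be evaluated separately, which is how the paper compares earlier versus later layers.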
Key Findings:
- The AST model achieved the highest overall accuracy (85.5%), surpassing all other models tested.
- Among the LLMs, WavLM Large performed the best, achieving 84.6% accuracy after aggregation.
- The earlier layers of the LLMs generally captured the audio and musical features most useful for genre classification.
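The jump from segment-level to aggregated accuracy reported above comes from combining the predictions of a track's segments into one track-level decision. A common scheme, sketched below, is to average the per-segment class probabilities and take the argmax; the paper's exact aggregation rule may differ (e.g., majority voting), so treat this as an illustrative assumption.

```python
import numpy as np

def aggregate_segments(segment_probs):
    """Average per-segment class probabilities and return the
    track-level predicted class plus the averaged distribution."""
    probs = np.asarray(segment_probs)       # shape: (n_segments, n_classes)
    track_probs = probs.mean(axis=0)
    return int(track_probs.argmax()), track_probs

# Three segments of one track, three hypothetical genre classes.
segment_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
]
genre, probs = aggregate_segments(segment_probs)
print(genre)  # 0
```

Averaging lets confident segments outvote ambiguous ones, which is why aggregated accuracy (e.g., 84.6% for WavLM Large) exceeds raw segment-level accuracy.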
Main Conclusions:
- LLMs, particularly WavLM, show potential for MGC in a zero-shot setting.
- Transformer-based architectures, like AST, demonstrate superior performance in capturing complex temporal and frequency features from audio signals for MGC.
- The initial layers of audio LLMs are particularly adept at identifying low-level audio characteristics relevant to genre classification.
Significance:
This research contributes to the field of music information retrieval (MIR) by demonstrating the potential of LLMs and transformer-based models for MGC, paving the way for their application in other music-related tasks.
Limitations and Future Research:
The study acknowledges limitations stemming from the GTZAN dataset's known integrity issues and limited genre diversity. Future research could fine-tune these models on larger, more diverse datasets and investigate their effectiveness in related tasks such as music recommendation or mood classification.
Key Statistics
The WavLM Large model achieved 75.5% segment-level accuracy at the 5th layer and 84.6% accuracy after aggregation at the 11th layer.
The AST model achieved 79.75% accuracy on audio segments and 85.50% after aggregation.