This study explores deep vector-quantized (VQ) audio representations, as learned by the Jukebox model, for music genre classification, and compares their performance with the well-established Mel spectrogram approach.
Large pretrained audio models show promise for zero-shot music genre classification (MGC), WavLM in particular, but the audio spectrogram transformer (AST) outperforms all other models evaluated, underscoring the strength of transformer architectures for music information retrieval.
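As context for the Mel spectrogram baseline mentioned above, the sketch below shows a minimal, numpy-only Mel spectrogram computation: frame the waveform, window it, take the magnitude STFT, and project onto a bank of triangular filters spaced evenly on the mel scale. All parameter values (sample rate, FFT size, hop length, number of mel bands) are illustrative assumptions, not the settings used in this study; practical pipelines typically use a tuned library implementation such as librosa.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK-style mel scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising slope
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def mel_spectrogram(y, sr=22050, n_fft=1024, hop=512, n_mels=64):
    # Windowed frames -> power spectrum -> mel projection.
    window = np.hanning(n_fft)
    frames = np.array([y[s:s + n_fft] * window
                       for s in range(0, len(y) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return mel_filterbank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)

# Illustrative input: one second of a 440 Hz sine tone.
sr = 22050
t = np.arange(sr) / sr
S = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
```

The resulting `(n_mels, n_frames)` matrix (usually log-compressed first) is the image-like input that spectrogram-based classifiers, including AST, consume.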