Key Concepts
Music generation models possess impressive generative capabilities, but the extent to which they encode fundamental Western music theory concepts within their internal representations remains unclear. This study introduces a synthetic dataset, SynTheory, to systematically probe the encoding of music theory concepts, including tempo, time signatures, notes, intervals, scales, chords, and chord progressions, in state-of-the-art music generation models such as Jukebox and MusicGen.
Summary
This study investigates whether and to what extent state-of-the-art music generation models, such as Jukebox and MusicGen, encode fundamental Western music theory concepts within their internal representations. The authors introduce a synthetic dataset called SynTheory, which consists of seven datasets that each capture an isolated music theory concept, including tempo, time signatures, notes, intervals, scales, chords, and chord progressions.
The authors use a probing approach to assess the degree to which these music theory concepts are discernible in the internal representations of the music generation models. They train probing classifiers on the embeddings extracted from different layers and components of the models, including the audio codecs and decoder language models. The probing results suggest that music theory concepts are indeed encoded within these models, with the degree of encoding varying across different concepts, model sizes, and model layers.
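To make the probing setup concrete, here is a minimal sketch of a linear probe trained on frozen embeddings. The `embeddings` and `labels` arrays are hypothetical placeholders for features extracted from one layer of a music generation model and the corresponding concept labels; the probe architecture shown is an illustrative choice, not necessarily the paper's exact configuration.

```python
# Minimal probing sketch: fit a linear classifier on frozen embeddings.
# `embeddings` and `labels` are hypothetical stand-ins for features
# extracted from one layer of a music generation model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))   # (num_clips, embedding_dim)
labels = rng.integers(0, 12, size=1000)     # e.g. 12 pitch classes

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```

The probe's test accuracy is then read as a measure of how linearly discernible the concept is at that layer; repeating this per layer and per concept yields the comparison grid the authors report.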
The authors find that the Jukebox model performs consistently well across all SynTheory tasks, while the MusicGen Decoder Language Models also exhibit competitive performance. Interestingly, the smaller MusicGen model outperforms its larger counterparts, suggesting that the smaller model may have developed a more efficient encoding of music theory concepts within its representations.
The authors also benchmark the music generation models against handcrafted audio features, such as mel spectrograms, MFCC, and chroma. The results show that the pretrained music decoder language models generally outperform the individual handcrafted features, although the aggregate of all handcrafted features performs comparably to the MusicGen Decoder Language Models.
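For reference, here is a sketch of how such baseline features might be computed with librosa. The time-averaging aggregation and the parameter choices are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the handcrafted baseline features (mel spectrogram, MFCC,
# chroma); parameters here are illustrative, not the paper's exact setup.
import numpy as np
import librosa

def handcrafted_features(path: str, sr: int = 22050) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)      # (n_mels, T)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)    # (20, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # (12, T)
    # Aggregate each feature over time, then concatenate into one vector
    # that can be fed to the same probing classifier as the embeddings.
    return np.concatenate([f.mean(axis=1) for f in (mel, mfcc, chroma)])
```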
The insights from this study can inform future efforts towards more detailed and lower-level control in music generation, as well as the development of more challenging probing datasets to further understand the relationship between symbolic and audio-based music generation.
Stats
The tempo dataset covers integer tempi from 50 to 210 BPM in 4/4 time, with 5 random offset times per sample.
The time signature dataset covers 8 time signatures (2/2, 2/4, 3/4, 3/8, 4/4, 6/8, 9/8, 12/8) with 3 levels of reverb and 10 random offset times per sample.
The notes dataset covers all 12 Western pitch classes in 9 octaves, voiced through 92 instruments.
The intervals dataset covers all 12 pitch classes, 12 half-step intervals, 92 instruments, and 3 play styles (unison, up, and down).
The scales dataset covers 7 Western modes (Ionian, Dorian, Phrygian, Lydian, Mixolydian, Aeolian, Locrian) in all 12 root notes, 92 instruments, and 2 play styles (ascending and descending).
The chords dataset covers all 12 root notes, 4 chord qualities (major, minor, augmented, diminished), 92 instruments, and 3 inversions (root position, first inversion, second inversion).
The chord progressions dataset covers 19 four-chord progressions (10 in major mode, 9 in natural minor mode) in all 12 root notes and 92 instruments.
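Taken together, these factor counts determine each dataset's size. The short sketch below computes the sample counts implied by the specifications above, assuming each dataset is the full Cartesian product of its listed factors (an assumption; the paper may not render every combination).

```python
# Sample counts implied by the dataset specifications, assuming each
# dataset is the full Cartesian product of its listed factors
# (an assumption; some combinations may be excluded in practice).
from math import prod

specs = {
    "tempo":              [161, 5],          # 50..210 BPM x offsets
    "time_signatures":    [8, 3, 10],        # signatures x reverb x offsets
    "notes":              [12, 9, 92],       # pitch classes x octaves x instruments
    "intervals":          [12, 12, 92, 3],   # roots x intervals x instruments x styles
    "scales":             [7, 12, 92, 2],    # modes x roots x instruments x directions
    "chords":             [12, 4, 92, 3],    # roots x qualities x instruments x inversions
    "chord_progressions": [19, 12, 92],      # progressions x roots x instruments
}

for name, factors in specs.items():
    print(f"{name}: {prod(factors)} samples")
```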
Quotes
"Music foundation models possess impressive music generation capabilities. When people compose music, they may infuse their understanding of music into their work, by using notes and intervals to craft melodies, chords to build progressions, and tempo to create a rhythmic feel. To what extent is this true of music generation models?"
"Recent work proposed leveraging latent audio representations from music generation models towards music information retrieval tasks (e.g. genre classification, emotion recognition), which suggests that high-level musical characteristics are encoded within these models. However, probing individual music theory concepts (e.g. tempo, pitch class, chord quality) remains under-explored."