Core Concepts
COCOLA is a contrastive learning method that captures the harmonic and rhythmic coherence between musical audio samples, enabling the objective evaluation of compositional models for music accompaniment generation.
Abstract
The paper presents COCOLA (Coherence-Oriented Contrastive Learning for Audio), a novel contrastive learning method for musical audio representations. COCOLA operates at the level of stems (or their combinations) composing music tracks and allows the objective evaluation of compositional models for music accompaniment generation.
The key highlights are:
COCOLA maximizes the agreement between disjoint sub-components (sub-mixtures of stems) drawn from the same audio window, and minimizes the agreement between sub-components drawn from different windows. This allows the model to capture the harmonic and rhythmic coherence between audio samples.
The authors introduce a new baseline for compositional music generation called CompoNet, based on ControlNet. CompoNet can perform a wide range of compositional tasks, including accompaniment generation, source separation, and editing, in a unified framework.
The authors release all models trained on public datasets containing separate stems (MUSDB18-HQ, MoisesDB, Slakh2100, and CocoChorales), enabling further research in this area.
The COCOLA score is proposed as a metric to quantify the coherence between conditioning tracks and generated accompaniments. Experiments show that CompoNet outperforms the previous MSDM model in accompaniment generation, as measured by the COCOLA score.
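The contrastive objective and the coherence score described in the highlights above can be illustrated with a minimal sketch. It assumes an InfoNCE-style loss with cosine similarity and a temperature, which is a common choice for contrastive learning; this summary does not specify the paper's actual similarity function, encoder, or hyperparameters. The function names contrastive_loss and coherence_score, and the temperature value, are hypothetical and chosen only for illustration.

```python
import numpy as np


def _normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in, not the paper's exact loss).

    anchors, positives: (batch, dim) embeddings of two disjoint sub-mixtures.
    Row i of both arrays comes from the same audio window (positive pair);
    every other row in the batch comes from a different window (negative).
    """
    a = _normalize(anchors)
    p = _normalize(positives)
    logits = a @ p.T / temperature                # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching pairs sit on the diagonal; maximize their log-probability.
    return float(-np.mean(np.diag(log_probs)))


def coherence_score(cond_emb, gen_emb):
    """Hypothetical score: mean cosine similarity between the embedding of each
    conditioning track and the embedding of its generated accompaniment."""
    return float(np.mean(np.sum(_normalize(cond_emb) * _normalize(gen_emb), axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))                  # stand-in embeddings
    y = x + 0.01 * rng.normal(size=(8, 16))       # coherent partners
    print("loss, aligned pairs: ", contrastive_loss(x, y))
    print("loss, shuffled pairs:", contrastive_loss(x, np.roll(y, 1, axis=0)))
    print("score, aligned pairs:", coherence_score(x, y))
```

In this toy setup the loss is lower when row i of both arrays comes from the same window than when the pairing is shuffled, which is the behavior the training objective rewards. In practice the embeddings would come from a trained audio encoder rather than random vectors.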
Stats
The paper uses the following public datasets for training and evaluation:
MUSDB18-HQ: 150 tracks with 4 stems (Bass, Drums, Vocals, Other)
MoisesDB: 240 tracks with 11 stems
Slakh2100: 2100 tracks with 34 stems
CocoChorales: 4000 tracks with 13 stems
Quotes
"COCOLA (Coherence-Oriented Contrastive Learning for Audio), a contrastive learning method for musical audio representations that captures the harmonic and rhythmic coherence between samples."
"We also introduce a new baseline for compositional music generation called CompoNet, based on ControlNet [1], generalizing the tasks of MSDM, and quantify it against the latter using COCOLA."