Karchkhadze, T., Izadi, M. R., Chen, K., Assayag, G., & Dubnov, S. (2024). Multi-Track MusicLDM: Towards Versatile Music Generation with Latent Diffusion Model. arXiv preprint arXiv:2409.02845.
This paper introduces MT-MusicLDM, a multi-track music generation model designed to address a limitation of existing text-to-music models: they fail to capture the multi-layered nature and arrangement complexities of music composition.
The researchers developed MT-MusicLDM by extending MusicLDM, a latent diffusion model for music generation. They incorporated a 3D UNet architecture, in which the individual tracks (stems) occupy an additional tensor dimension, and employed CLAP encoders for text and audio conditioning. The model was trained on the Slakh2100 dataset and evaluated with the Fréchet Audio Distance (FAD) metric.
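To make the setup concrete, the following minimal PyTorch sketch shows how a 3D UNet-style denoiser could consume multi-track latents, with the stem axis treated as an extra spatial dimension and a CLAP embedding injected as conditioning. All shapes, layer widths, and names here are illustrative assumptions rather than the paper's actual architecture; only the 512-dimensional CLAP embedding matches the standard CLAP encoder.

```python
import torch
import torch.nn as nn

# Assumed shapes: B batch, C latent channels, S stems (tracks), F, T latent freq/time.
# Placing stems on the depth axis lets 3D convolutions model inter-track structure.

class TinyUNet3D(nn.Module):
    """Toy stand-in for a 3D UNet denoiser (not the paper's actual architecture)."""
    def __init__(self, latent_ch=8, cond_dim=512, hidden=64):
        super().__init__()
        self.inp = nn.Conv3d(latent_ch, hidden, kernel_size=3, padding=1)
        self.cond_proj = nn.Linear(cond_dim, hidden)  # CLAP embedding -> feature bias
        self.mid = nn.Conv3d(hidden, hidden, kernel_size=3, padding=1)
        self.out = nn.Conv3d(hidden, latent_ch, kernel_size=3, padding=1)

    def forward(self, z_t, clap_emb):
        # z_t: (B, C, S, F, T) noisy multi-track latents; stems sit on the depth axis.
        h = torch.relu(self.inp(z_t))
        h = h + self.cond_proj(clap_emb)[:, :, None, None, None]  # broadcast conditioning
        h = torch.relu(self.mid(h))
        return self.out(h)  # predicted noise, same shape as z_t

B, C, S, F, T = 2, 8, 4, 16, 64
z_t = torch.randn(B, C, S, F, T)
clap_emb = torch.randn(B, 512)  # CLAP text/audio embedding (512-d in standard CLAP)
eps_hat = TinyUNet3D()(z_t, clap_emb)
print(eps_hat.shape)  # torch.Size([2, 8, 4, 16, 64])
```

Stacking stems along the depth axis is what allows the 3D convolutions to model dependencies across tracks as well as across time and frequency, which is the intuition behind moving from a per-mixture 2D UNet to a 3D one.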
MT-MusicLDM demonstrates superior performance in generating coherent multi-track music compared to the baseline MSDM model, evidenced by significantly lower FAD scores in both unconditional and conditional generation tasks. The model effectively leverages audio and text conditioning to guide the generation process towards desired musical styles and arrangements.
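For reference, FAD compares Gaussian fits of embedding distributions for reference and generated audio, so lower values indicate distributions that are closer. Below is a minimal sketch of the computation with placeholder embeddings; the standard FAD backbone is VGGish, though the paper's exact evaluation pipeline is not reproduced here.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_real, emb_gen):
    """Frechet distance between Gaussian fits of two embedding sets.

    emb_*: (N, D) arrays of audio embeddings (e.g. from VGGish, the usual
    FAD backbone; the paper's embedding model may differ).
    """
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    # FAD = ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 (cov_r cov_g)^{1/2})
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)

# Toy usage with random embeddings (real evaluation embeds actual audio sets).
rng = np.random.default_rng(0)
print(frechet_audio_distance(rng.normal(size=(200, 128)), rng.normal(size=(200, 128))))
```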
MT-MusicLDM represents a significant advance in music generation, producing realistic and coherent multi-track music that surpasses the compared baseline in quality and versatility. Its ability to generate arrangements conditioned on existing tracks or textual prompts opens new possibilities for musicians and composers.
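A generic way to implement such track-conditional ("arrangement") generation with a latent diffusion model is inpainting-style sampling: at every denoising step, the latents of the given stems are clamped to appropriately re-noised versions of their ground truth, while the remaining stems are generated jointly. The sketch below illustrates this idea with a simplified deterministic (DDIM-style) update; it is a common diffusion-imputation pattern under assumed shapes, not necessarily the paper's exact procedure, and `denoiser`, `alpha_bar`, and all other names are hypothetical.

```python
import torch

@torch.no_grad()
def inpaint_stems(denoiser, z_given, mask, alpha_bar, clap_emb):
    """Generate missing stems while clamping given stems at each step (DDIM, eta=0).

    z_given:   (B, C, S, F, T) latents of the conditioning stems (zeros elsewhere).
    mask:      broadcastable binary mask, 1 where a stem is given.
    alpha_bar: (N,) cumulative noise-schedule products, descending from ~1 to ~0.
    """
    z = torch.randn_like(z_given)  # start from pure noise
    for i in range(len(alpha_bar) - 1, 0, -1):
        ab_t, ab_prev = alpha_bar[i], alpha_bar[i - 1]
        # Re-noise the known stems to the current noise level and clamp them in.
        noised = ab_t.sqrt() * z_given + (1 - ab_t).sqrt() * torch.randn_like(z_given)
        z = mask * noised + (1 - mask) * z
        eps = denoiser(z, clap_emb)  # jointly predicts noise for all stems
        x0_hat = (z - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        z = ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps  # deterministic step
    return mask * z_given + (1 - mask) * z

# Toy usage: generate 3 missing stems given stem 0, with a dummy denoiser.
B, C, S, F, T = 1, 8, 4, 16, 64
mask = torch.zeros(1, 1, S, 1, 1); mask[:, :, 0] = 1.0
z_given = torch.randn(B, C, S, F, T) * mask
ab = torch.linspace(0.999, 0.01, 50)             # toy schedule, ~1 -> ~0
dummy = lambda z, c: torch.zeros_like(z)         # stand-in denoiser
print(inpaint_stems(dummy, z_given, mask, ab, clap_emb=None).shape)
```

In practice, such a `denoiser` would be the conditioned 3D UNet sketched earlier, and the mask selects which stems the user supplies.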
By capturing the complexities of multi-track composition and arrangement in a single versatile model, MT-MusicLDM has the potential to become a valuable tool for musicians, composers, and educators, fostering creativity and streamlining music production workflows.
The study acknowledges limitations stemming from the use of pre-trained components not specifically optimized for multi-track processing and the relatively small size of the Slakh2100 dataset. Future research directions include exploring higher-fidelity audio representations, incorporating larger and more diverse datasets, and enhancing the model's rhythmic coherence and instrument versatility.