Key Concepts
MusiLingo is a novel system that effectively bridges the gap between music and text domains, delivering competitive performance in music captioning and question-answering tasks.
Summary
The paper introduces MusiLingo, a novel music-language model that leverages large language model (LLM) capabilities to enhance music comprehension. The key innovation lies in the use of a simple adapter network that projects music embeddings into the text embedding space, allowing the frozen pre-trained music encoder and LLM to be effectively combined.
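The adapter idea can be illustrated with a minimal sketch. It assumes the adapter is a single linear layer and that the projected music embeddings are simply placed in front of the text prompt embeddings before being passed to the LLM; the summary above confirms neither detail. All names and dimensions (make_linear_adapter, project, build_llm_input, MUSIC_DIM, TEXT_DIM) are hypothetical and chosen only for illustration.

```python
import random


def make_linear_adapter(music_dim, text_dim, seed=0):
    """Create toy adapter parameters: a (text_dim x music_dim) weight matrix and a bias."""
    rng = random.Random(seed)
    scale = 1.0 / music_dim ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(music_dim)]
              for _ in range(text_dim)]
    bias = [0.0] * text_dim
    return weight, bias


def project(adapter, music_frames):
    """Map each music embedding (length music_dim) to a vector of length text_dim."""
    weight, bias = adapter
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in music_frames
    ]


def build_llm_input(adapter, music_frames, text_token_embeddings):
    """Assumed layout: projected music vectors first, then the text prompt embeddings."""
    return project(adapter, music_frames) + text_token_embeddings


# Toy sizes; a real encoder and LLM use far larger dimensions.
MUSIC_DIM, TEXT_DIM = 8, 16
adapter = make_linear_adapter(MUSIC_DIM, TEXT_DIM)

# Stand-ins for the outputs of the frozen music encoder and the LLM's embedding table.
music_frames = [[0.1 * i] * MUSIC_DIM for i in range(3)]
prompt_embeddings = [[0.0] * TEXT_DIM for _ in range(5)]

llm_input = build_llm_input(adapter, music_frames, prompt_embeddings)
print(len(llm_input), len(llm_input[0]))
```

Because the encoder and the LLM stay frozen, the adapter weights are the only trainable parameters in this sketch, which is what keeps the approach lightweight.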
The model is trained in two stages:
1. Pre-training: The adapter network is trained on a large music captioning dataset (LP-MusicCaps-MSD) to align music and text representations.
2. Instruction Tuning: The model is further fine-tuned on the MusicInstruct (MI) dataset, which contains high-quality music-related question-answer pairs, to equip the model with the ability to respond to various music-related queries.
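The two-stage schedule can be sketched with a deliberately tiny toy. It assumes that only the adapter weights are updated in both stages while the encoder stays frozen, and it uses squared error on scalar targets as a stand-in for the language-modelling loss an actual system would use. The data, function names, and hyperparameters are all hypothetical and are not taken from the paper.

```python
def frozen_encoder(clip):
    """Stand-in for the frozen music encoder: a fixed mapping that is never updated."""
    return [2.0 * v for v in clip]


def adapter_forward(weights, features):
    """Toy adapter: a dot product producing one scalar output."""
    return sum(w * f for w, f in zip(weights, features))


def train_stage(weights, dataset, lr, epochs):
    """Run one training stage with gradient descent on squared error.

    Only the adapter weights change; the encoder is called but never modified.
    """
    for _ in range(epochs):
        for clip, target in dataset:
            feats = frozen_encoder(clip)
            err = adapter_forward(weights, feats) - target
            weights = [w - lr * 2.0 * err * f for w, f in zip(weights, feats)]
    return weights


def mean_squared_error(weights, dataset):
    errors = [(adapter_forward(weights, frozen_encoder(c)) - t) ** 2 for c, t in dataset]
    return sum(errors) / len(errors)


# Hypothetical toy data standing in for the captioning and instruction datasets.
caption_data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
instruct_data = [([1.0, 1.0], 0.5), ([1.0, -1.0], 2.0)]

weights = [0.0, 0.0]

# Stage 1: pre-training on the captioning stand-in.
weights = train_stage(weights, caption_data, lr=0.05, epochs=50)
loss_after_stage1 = mean_squared_error(weights, caption_data)

# Stage 2: instruction tuning continues from the stage-1 weights.
instruct_loss_before = mean_squared_error(weights, instruct_data)
weights = train_stage(weights, instruct_data, lr=0.05, epochs=50)
instruct_loss_after = mean_squared_error(weights, instruct_data)

print(loss_after_stage1, instruct_loss_before, instruct_loss_after)
```

The point of the sketch is the ordering: stage 2 starts from the weights produced by stage 1 rather than from scratch, so the alignment learned during pre-training serves as the initialization for instruction tuning.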
Experiments demonstrate that MusiLingo achieves state-of-the-art performance on music question-answering tasks, outperforming existing models on various evaluation metrics. The model also shows competitive results on music captioning, though there is still room for improvement compared to specialized captioning models.
The paper also presents an ablation study that investigates the impact of different fine-tuning datasets on the model's performance, highlighting the importance of carefully selecting the training data to optimize the model's capabilities.
Statistics
The paper does not provide any specific numerical data or statistics in the main text. The focus is on the model architecture and training process, as well as the evaluation of the model's performance on music captioning and question-answering tasks.
Quotes
The paper does not contain any direct quotes that are particularly striking or support the key arguments.