Core Concepts
MINT is a novel audio-language pre-training (ALP) framework that strengthens audio-language models through multi-target pre-training and instruction tuning.
Abstract
MINT introduces a novel ALP framework for boosting audio-language models through multi-target pre-training and instruction tuning.
Bridge-Net enhances cross-modality alignment and the model's ability to follow instructions across various audio-text tasks.
Introduction:
Large language models (LLMs) are utilized to enrich ALP capabilities.
MINT aims to bridge the modality gap and develop audio-language models that can effectively follow instructions.
Proposed methods:
MINT leverages frozen pre-trained models and introduces Bridge-Net to narrow the modality gap.
Model architecture includes an audio transformer, text transformer, and learnable query embeddings in Bridge-Net.
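The role of the learnable query embeddings can be sketched as cross-attention that pools a variable-length audio feature sequence into a fixed set of vectors. This is a minimal single-head sketch; the function name, dimensions, and number of queries are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bridge_net_queries(audio_feats, queries, d_k=64):
    """Cross-attention sketch: learnable queries attend over audio
    transformer outputs, yielding a fixed number of pooled vectors
    regardless of the audio sequence length."""
    scores = queries @ audio_feats.T / np.sqrt(d_k)  # (Q, T)
    attn = softmax(scores, axis=-1)                  # rows sum to 1
    return attn @ audio_feats                        # (Q, d)

rng = np.random.default_rng(0)
audio = rng.standard_normal((50, 64))    # 50 audio frames, dim 64 (assumed)
queries = rng.standard_normal((8, 64))   # 8 learnable query embeddings (assumed)
out = bridge_net_queries(audio, queries)
print(out.shape)  # (8, 64)
```

Whatever the audio length, the output is one vector per query, which is what lets Bridge-Net hand a fixed-size audio summary to the text side.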
Experiments:
Training data collected from multiple publicly available audio datasets.
MINT evaluated on discriminative tasks like audio classification and generative tasks like audio captioning.
Results:
MINT outperforms Pengi in various audio classification tasks across different datasets.
In generative tasks like audio captioning, MINT exhibits significant superiority over traditional supervised approaches.
Ablation study:
Combining all three components (ALC, ALM, ATG) maximizes performance in the loss function.
Optimal results achieved upon completion of both training stages in MINT.
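The finding that all three objectives help can be sketched as a weighted sum of the loss terms. The weights and the exact combination rule here are assumptions for illustration; the paper's abbreviations (ALC, ALM, ATG) are kept as argument names.

```python
def mint_total_loss(loss_alc, loss_alm, loss_atg,
                    w_alc=1.0, w_alm=1.0, w_atg=1.0):
    """Illustrative combination of the audio-language contrastive (ALC),
    audio-language matching (ALM), and audio-to-text generation (ATG)
    losses; equal weighting is an assumption, not the paper's setting."""
    return w_alc * loss_alc + w_alm * loss_alm + w_atg * loss_atg

total = mint_total_loss(1.0, 2.0, 3.0)
print(total)  # 6.0
```

Dropping a term (e.g. `w_atg=0.0`) mimics the ablation rows where one objective is removed and performance degrades.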
Stats
MINT achieves 68.26% accuracy on Nsynth and 49.66% on GTZAN (ZS).