MM-TTS is a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
Cooperative Sentiment Agents (Co-SA) is a novel Multimodal Representation Learning (MRL) method that facilitates adaptive interaction between modalities to learn a joint representation for multimodal sentiment analysis.
IISAN is a novel Intra- and Inter-modal Side Adapted Network that follows a decoupled parameter-efficient fine-tuning (DPEFT) paradigm to efficiently adapt pre-trained large-scale multimodal foundation models to downstream sequential recommendation tasks. It significantly reduces GPU memory usage and training time compared to full fine-tuning and existing embedded PEFT methods, while maintaining comparable recommendation performance.
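The memory savings of a decoupled side network come from keeping the backbone frozen and routing all trainable parameters into a small parallel path, so backpropagation never traverses the large model. A minimal NumPy sketch of this idea (all dimensions, the gating scheme, and the layer shapes here are illustrative assumptions, not the paper's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen for illustration only
d_backbone, d_side, n_layers = 64, 8, 4

# Frozen backbone: fixed linear layers standing in for a pre-trained
# multimodal transformer; these weights are never updated.
backbone_W = [rng.standard_normal((d_backbone, d_backbone)) * 0.1
              for _ in range(n_layers)]

# Trainable side network: small down-projections that tap each backbone
# layer's hidden state. Only these (and the gate) would receive gradients.
side_down = [rng.standard_normal((d_backbone, d_side)) * 0.1
             for _ in range(n_layers)]
gate = 0.5  # a learnable per-layer gate in a real implementation

def forward(x):
    h = x
    s = np.zeros(d_side)
    for W, D in zip(backbone_W, side_down):
        h = np.tanh(h @ W)                   # frozen backbone layer
        s = gate * s + (1 - gate) * (h @ D)  # lightweight side path
    return s  # side output feeding a downstream recommendation head

x = rng.standard_normal(d_backbone)
out = forward(x)
print(out.shape)  # (8,)
```

Because the side state `s` only reads from the backbone's activations and never writes back, gradients for the trainable parameters can be computed without storing the backbone's backward graph, which is the source of the reported GPU-memory reduction.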
A pipeline for contrastive language-audio pretraining that enhances audio representations by pairing audio data with natural language descriptions.
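Contrastive language-audio pretraining typically optimizes a CLIP-style symmetric objective: matched audio-caption pairs are pulled together in a shared embedding space while mismatched pairs within the batch are pushed apart. A minimal NumPy sketch of that loss, with random vectors standing in for encoder outputs (the batch size, dimension, and temperature are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for encoder outputs: row i of each matrix is an
# audio/text embedding for the same underlying clip-caption pair.
batch, dim = 4, 16
audio_emb = l2_normalize(rng.standard_normal((batch, dim)))
text_emb = l2_normalize(rng.standard_normal((batch, dim)))

def symmetric_contrastive_loss(a, t, temperature=0.07):
    # Pairwise cosine similarities, scaled by temperature
    logits = (a @ t.T) / temperature
    labels = np.arange(len(a))  # matched pairs lie on the diagonal

    def xent(l):
        # Numerically stable cross-entropy against the diagonal labels
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the audio-to-text and text-to-audio directions
    return 0.5 * (xent(logits) + xent(logits.T))

loss = symmetric_contrastive_loss(audio_emb, text_emb)
print(float(loss))
```

In a full pipeline the random matrices would be replaced by an audio encoder over spectrograms and a text encoder over the natural-language descriptions, trained jointly under this loss.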