The paper introduces VITA, an open-source multimodal large language model (MLLM) that can process and analyze video, image, text, and audio modalities. VITA is developed through a three-stage training pipeline:
Bilingual Instruction Tuning: The base Mixtral 8x7B model is enhanced by expanding its vocabulary with Chinese tokens and by further instruction tuning on a high-quality bilingual text corpus, enabling proficiency in both Chinese and English (a vocabulary-expansion sketch appears below).
Multimodal Alignment: Individual encoders are trained to process the different modalities (video, image, audio) and are aligned with the language model, enabling robust multimodal understanding (see the connector sketch below).
Multimodal Instruction Tuning: The model is trained to follow text or audio instructions in order to understand and respond to image or video inputs. State tokens are introduced to distinguish the different types of input query (audio, text, noisy audio), as illustrated in the state-token sketch below.
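To make the first stage concrete, the sketch below shows one common way to extend a tokenizer with Chinese tokens and resize the embedding matrix before bilingual instruction tuning. The Hugging Face calls, model ID, and example tokens are assumptions for illustration, not VITA's published recipe.

```python
# Hypothetical sketch of vocabulary expansion before bilingual instruction
# tuning; the model ID and example tokens are placeholders, not VITA's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Add Chinese tokens that the original vocabulary would split into many byte pieces.
new_tokens = ["模型", "训练", "多模态"]  # illustrative only
num_added = tokenizer.add_tokens(new_tokens)

# Grow the input/output embeddings so the new token IDs become trainable.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

# Instruction tuning on a high-quality Chinese/English corpus would follow here.
```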
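For the alignment stage, a widely used pattern is a small trainable connector that projects encoder features into the language model's token-embedding space. The MLP below follows that generic pattern; the dimensions are chosen for illustration rather than taken from VITA.

```python
import torch
from torch import nn

class ModalityConnector(nn.Module):
    """Hypothetical two-layer MLP that projects encoder features into the
    language model's token-embedding space."""

    def __init__(self, encoder_dim: int, llm_dim: int) -> None:
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, sequence, encoder_dim) from a vision or audio encoder
        return self.proj(features)

# Illustrative dimensions only; VITA's actual encoder and LLM widths may differ.
vision_connector = ModalityConnector(encoder_dim=1024, llm_dim=4096)
patch_features = torch.randn(1, 256, 1024)     # e.g. patch features for one image
llm_inputs = vision_connector(patch_features)  # (1, 256, 4096), concatenated with text embeddings downstream
```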
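The state tokens can be pictured as special prefix tokens that tell the model what kind of query follows. The token strings and prompt layout in this sketch are placeholders, not the exact tokens VITA defines.

```python
# Hypothetical state tokens marking the query type; the literal token strings
# and prompt template used by VITA may differ.
STATE_TOKENS = {
    "audio": "<audio_query>",        # effective spoken query
    "noisy_audio": "<noisy_audio>",  # background speech or noise, no answer expected
    "text": "<text_query>",          # typed query
}

def build_prompt(query: str, query_type: str) -> str:
    """Prepend the state token so the model can condition on the query type."""
    return f"{STATE_TOKENS[query_type]} {query}"

print(build_prompt("What is happening in this video?", "text"))
print(build_prompt("<audio features or transcript>", "noisy_audio"))
```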
Beyond the foundational multimodal capabilities, VITA also features advanced interactive functionalities:
Non-awakening Interaction: VITA can automatically identify and respond to user audio queries without the need for a wake-up word or button, by filtering out background noise (a minimal filtering sketch appears below).
Audio Interrupt Interaction: VITA employs a duplex deployment scheme in which one model handles the current user query while another continuously monitors the environment. If the user interrupts with a new query, the monitoring model takes over and responds to the latest question (see the duplex sketch below).
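A minimal way to picture non-awakening interaction is a loop that inspects every incoming audio chunk and stays silent on background noise. In the sketch below a toy energy threshold stands in for the model's own noisy-audio classification, and both helper functions are hypothetical.

```python
# Minimal sketch of non-awakening interaction. The classifier and response
# calls are hypothetical stand-ins: a toy energy threshold replaces the
# state-token-based noisy-audio detection a deployed model would perform.

def classify_audio_state(samples: list[float]) -> str:
    """Placeholder classifier: treat low-energy chunks as background noise."""
    energy = sum(s * s for s in samples) / max(len(samples), 1)
    return "audio_query" if energy > 0.01 else "noisy_audio"

def generate_response(samples: list[float]) -> str:
    """Stand-in for running the MLLM on the audio plus any visual context."""
    return "<model answer>"

def handle_chunk(samples: list[float]) -> str | None:
    # No wake-up word: every chunk is inspected, but noise is silently ignored.
    if classify_audio_state(samples) == "noisy_audio":
        return None
    return generate_response(samples)

print(handle_chunk([0.0] * 1600))       # near-silence -> treated as noise, no reply
print(handle_chunk([0.5, -0.4] * 800))  # speech-like energy -> answered
```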
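The duplex scheme can be sketched as two cooperating workers: one streams the answer to the current query while the other listens, and a detected interruption stops generation so the listener can take over. The threading layout and names below are illustrative assumptions, not VITA's deployment code.

```python
import queue
import threading

# Illustrative duplex sketch: one worker answers the current query while a
# second worker monitors for a new one; a new effective query interrupts
# generation and the roles swap. All names here are assumptions.

incoming_queries: "queue.Queue[str]" = queue.Queue()
interrupt = threading.Event()

def generation_worker(query: str) -> None:
    # Stand-in for streamed decoding of the answer to `query`.
    for token in ["Answering", query, "now", "..."]:
        if interrupt.is_set():          # the user cut in with a new query
            return
        print(token, end=" ", flush=True)

def monitoring_worker() -> None:
    # Stand-in for the second model instance listening to the environment.
    new_query = incoming_queries.get()  # blocks until an effective query arrives
    interrupt.set()                     # signal the generating model to stop
    print(f"\n[interrupt: now answering '{new_query}']")
    # ...the monitoring worker would then become the generator for the new query.

gen = threading.Thread(target=generation_worker, args=("the first question",))
mon = threading.Thread(target=monitoring_worker)
gen.start()
mon.start()
incoming_queries.put("the second question")  # simulated audio interruption
gen.join()
mon.join()
```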
The paper demonstrates VITA's strong performance on a range of unimodal and multimodal benchmarks, while acknowledging the remaining gap compared to proprietary models. The open-sourcing of VITA aims to promote further advancements in the field of multimodal large language models and human-computer interaction.