VITA: An Open-Source Multimodal Large Language Model with Advanced Interactive Capabilities
VITA is an open-source multimodal large language model that can simultaneously process and analyze video, image, text, and audio inputs, and that supports advanced multimodal human-computer interaction.