Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Core Concepts
Chat-UniVi empowers large language models to comprehend and engage in conversations involving images and videos through a unified visual representation.
Summary
Large language models show broad general-purpose capabilities but struggle with image and video understanding.
Chat-UniVi uses a set of dynamic visual tokens to capture both the spatial details of images and the temporal relationships in videos.
A multi-scale representation lets the model perceive both high-level semantic concepts and low-level visual details.
Trained on a mixed dataset of images and videos, Chat-UniVi outperforms methods designed exclusively for either images or videos.
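
The idea of dynamic visual tokens combined with a multi-scale representation can be illustrated with a toy sketch. This is not the Chat-UniVi implementation: plain k-means stands in for whatever merging procedure the model actually uses, and the names merge_tokens and multi_scale_tokens, the token counts, and the feature size are all hypothetical. The sketch only shows how a grid of visual tokens might be merged progressively into fewer tokens, with every scale kept and concatenated.

```python
import numpy as np

def merge_tokens(tokens, num_clusters, num_iters=10):
    """Merge (N, D) tokens into (num_clusters, D) by averaging k-means clusters.

    Assumption: k-means is only a stand-in for the real merging step.
    """
    n = tokens.shape[0]
    # Deterministic init: pick evenly spaced tokens as the initial centers.
    idx = np.linspace(0, n - 1, num_clusters).astype(int)
    centers = tokens[idx].copy()
    for _ in range(num_iters):
        # Squared distance from each token to each center, shape (N, K).
        dists = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = tokens[assign == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers

def multi_scale_tokens(tokens, scales=(64, 32, 16)):
    """Progressively merge tokens and concatenate the result of every scale."""
    levels = []
    current = tokens
    for k in scales:
        current = merge_tokens(current, k)
        levels.append(current)
    return np.concatenate(levels, axis=0)

# Hypothetical input: 256 patch tokens with 8 features each.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(256, 8))
merged = multi_scale_tokens(image_tokens)
print(merged.shape)  # (112, 8): 64 + 32 + 16 tokens
```

For a video, the same merging could be applied to tokens pooled from several frames, so that one set of tokens summarizes a group of similar frames; that extension is likewise an assumption rather than a description of the model.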