Ferret-v2 is a substantial upgrade to Ferret that adds any-resolution referring and grounding, multi-granularity visual encoding, and a three-stage training pipeline, enabling it to process and understand images at higher resolution and in finer detail.
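The any-resolution, multi-granularity idea can be illustrated with a short sketch: encode a downsampled global view of the image plus its full-resolution tiles, then concatenate the resulting token sequences. This is a minimal illustration under assumptions, not Ferret-v2's actual implementation; `MultiGranularityEncoder` is hypothetical, and the wrapped encoder is assumed to return a `(batch, tokens, dim)` tensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityEncoder(nn.Module):
    """Hypothetical sketch: fuse a low-res global view with high-res local tiles."""

    def __init__(self, encoder: nn.Module, tile: int = 224):
        super().__init__()
        self.encoder = encoder  # any ViT-style encoder returning (B, N, D) tokens
        self.tile = tile

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Coarse granularity: downsample the whole image to the encoder's native size.
        global_view = F.interpolate(image, size=(self.tile, self.tile), mode="bilinear")
        global_tokens = self.encoder(global_view)
        # Fine granularity: cut the full-resolution image into tiles and encode each,
        # so local detail survives regardless of input resolution.
        # (Assumes height and width are multiples of `tile`.)
        b, c, h, w = image.shape
        tiles = image.unfold(2, self.tile, self.tile).unfold(3, self.tile, self.tile)
        tiles = tiles.reshape(b, c, -1, self.tile, self.tile).permute(0, 2, 1, 3, 4)
        local_tokens = torch.cat(
            [self.encoder(tiles[:, i]) for i in range(tiles.shape[1])], dim=1
        )
        # Concatenate both granularities into one visual token sequence for the LLM.
        return torch.cat([global_tokens, local_tokens], dim=1)
```

A production model would pad or resize inputs whose dimensions are not tile multiples; the sketch leaves that out for brevity.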
TinyGPT-V is an open-source multimodal large language model built for efficient training and inference on vision-language tasks; its compact architecture pairs the Phi-2 language model with pre-trained vision encoders.
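A minimal sketch of this kind of composition, assuming a frozen vision tower bridged to the language model by a small trainable projection; `TinyVLM` and its parameter names are illustrative, not TinyGPT-V's actual modules.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Hypothetical sketch: frozen vision encoder + small LM, joined by a projection."""

    def __init__(self, vision: nn.Module, lm: nn.Module, v_dim: int, t_dim: int):
        super().__init__()
        self.vision = vision.eval()
        for p in self.vision.parameters():
            p.requires_grad = False          # keep the pre-trained vision tower frozen
        self.proj = nn.Linear(v_dim, t_dim)  # trainable adapter into the LM's space
        self.lm = lm                         # assumed to accept input embeddings

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor):
        with torch.no_grad():
            v_tokens = self.vision(image)    # (B, N, v_dim) visual tokens
        v_tokens = self.proj(v_tokens)       # map into the LM embedding space
        inputs = torch.cat([v_tokens, text_embeds], dim=1)
        return self.lm(inputs)
```

Freezing the vision tower and training only a light adapter (plus the small LM) is what keeps this style of model cheap to train.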
VOLCANO is a multimodal model that revises its own outputs using self-generated feedback, effectively reducing hallucination and achieving state-of-the-art performance on multimodal hallucination benchmarks.
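The self-feedback revision loop can be sketched roughly as follows; `model.generate` is an assumed text-generation interface and the prompts are illustrative, not VOLCANO's actual API or templates.

```python
def self_revise(model, image, question, max_iters=3):
    """Hypothetical critique-and-revise loop: the same model drafts an answer,
    writes natural-language feedback on it, and revises until a self-comparison
    stops preferring the revision."""
    answer = model.generate(question, image=image)
    for _ in range(max_iters):
        # Step 1: the model critiques its own answer against the image.
        feedback = model.generate(
            f"Question: {question}\nAnswer: {answer}\n"
            "Point out anything in the answer not grounded in the image.",
            image=image,
        )
        # Step 2: the model rewrites the answer to follow its own feedback.
        revised = model.generate(
            f"Question: {question}\nAnswer: {answer}\nFeedback: {feedback}\n"
            "Rewrite the answer so it follows the feedback.",
            image=image,
        )
        # Step 3: the model judges whether the revision actually improved grounding.
        verdict = model.generate(
            f"Question: {question}\nA: {answer}\nB: {revised}\n"
            "Which answer is better grounded in the image, A or B?",
            image=image,
        )
        if "B" not in verdict:
            break            # revision didn't help; keep the current answer
        answer = revised
    return answer
```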
AnyGPT is an any-to-any language model that uses discrete representations to process different modalities (speech, text, images, and music) in a unified way, demonstrating that it can be trained stably without modifying existing LLM architectures or training methods.
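One common way to realize such discrete unification, sketched under assumptions: each modality's tokenizer emits integer codes, which are shifted into disjoint ranges of one shared vocabulary so an unmodified LLM can model the interleaved sequence. The function, offsets, and vocabulary sizes below are illustrative, not AnyGPT's actual layout.

```python
def build_multimodal_sequence(text_ids, speech_codes, image_codes,
                              text_vocab, speech_vocab):
    """Hypothetical sketch: map per-modality discrete codes into disjoint
    ID ranges of a single shared vocabulary."""
    speech_offset = text_vocab                # speech tokens start after text IDs
    image_offset = text_vocab + speech_vocab  # image tokens start after speech IDs
    seq = list(text_ids)
    seq += [speech_offset + c for c in speech_codes]
    seq += [image_offset + c for c in image_codes]
    return seq

# Illustrative usage: two text tokens, two speech codes, one image code.
seq = build_multimodal_sequence([5, 9], [3, 1], [7],
                                text_vocab=32000, speech_vocab=1024)
print(seq)  # [5, 9, 32003, 32001, 33031]
```

Because everything becomes ordinary token IDs, the LLM's architecture, loss, and training loop stay untouched; only the embedding table and output head grow to cover the enlarged vocabulary.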