Core Concepts
OMG-LLaVA is a new and elegant framework that combines powerful pixel-level vision understanding with reasoning abilities, enabling it to accept various visual and text prompts for flexible user interaction.
Summary
The paper presents OMG-LLaVA, a new and elegant framework that bridges image-level, object-level, and pixel-level reasoning and understanding tasks in a single model.
Key highlights:
OMG-LLaVA consists of a universal perception module (OMG-Seg) and a large language model (LLM). The OMG-Seg module encodes images and visual prompts into pixel-centric and object-centric visual tokens, which are then input to the LLM (see the first sketch after this list).
The LLM accepts text instructions and visual tokens as input, and outputs text responses and segmentation tokens; the segmentation tokens are then decoded into segmentation masks.
OMG-LLaVA can handle a variety of tasks, including image captioning, image-based conversation, region captioning, visual prompt-based conversation, referring segmentation, reasoning segmentation, and grounded conversation generation.
The authors propose a perception prior embedding strategy to better integrate OMG-Seg's perception priors with the LLM (see the second sketch after this list).
Extensive experiments show that OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding, matching or surpassing the performance of specialized methods on multiple benchmarks.
The authors release the code and model for further research.
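The highlights above describe a two-stage data flow: a frozen perception module produces visual tokens, which are projected into the LLM's embedding space and processed alongside text, and the LLM emits both text and segmentation tokens. The sketch below illustrates that flow with toy modules; all module names, tensor shapes, and the dot-product mask head are assumptions made for illustration, not the authors' implementation (their released code is authoritative).

```python
import torch
import torch.nn as nn

D, NO = 64, 8  # toy hidden size and number of object queries

class ToyOMGSeg(nn.Module):
    """Stand-in for the frozen OMG-Seg perception module."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, D, kernel_size=32, stride=32)  # 224px image -> 7x7 grid
        self.queries = nn.Parameter(torch.randn(NO, D))             # object queries

    def forward(self, image):
        feat = self.backbone(image)                                 # (B, D, 7, 7)
        pixel = feat.flatten(2).transpose(1, 2)                     # (B, 49, D) pixel-centric tokens
        obj = self.queries.unsqueeze(0).expand(image.size(0), -1, -1)  # (B, NO, D) object-centric tokens
        return pixel, obj

class OMGLLaVASketch(nn.Module):
    def __init__(self, vocab=1000):
        super().__init__()
        self.omg_seg = ToyOMGSeg()
        self.proj = nn.Linear(D, D)          # map visual tokens into the LLM embedding space
        self.embed = nn.Embedding(vocab, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)  # toy stand-in for the LLM
        self.lm_head = nn.Linear(D, vocab)

    def forward(self, image, text_ids):
        pixel, obj = self.omg_seg(image)
        vis = self.proj(torch.cat([pixel, obj], dim=1))   # visual token sequence
        txt = self.embed(text_ids)
        hidden = self.llm(torch.cat([vis, txt], dim=1))   # LLM runs over visual + text tokens
        logits = self.lm_head(hidden[:, vis.size(1):])    # text responses
        # The hidden state at a segmentation-token position is decoded into a
        # mask; a dot product against pixel tokens stands in for the frozen
        # OMG-Seg decoder here.
        seg_state = hidden[:, -1:]                        # pretend the last token is [SEG]
        mask = torch.einsum('bqd,bpd->bqp', seg_state, pixel).view(-1, 1, 7, 7)
        return logits, mask

model = OMGLLaVASketch()
text_logits, mask = model(torch.randn(1, 3, 224, 224), torch.randint(0, 1000, (1, 12)))
print(text_logits.shape, mask.shape)  # torch.Size([1, 12, 1000]) torch.Size([1, 1, 7, 7])
```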
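The perception prior embedding from the fourth highlight can be pictured as follows. This is a minimal sketch of one plausible reading: the mask scores predicted by OMG-Seg are softened into a per-pixel assignment over objects and used to mix object queries into the pixel tokens before they reach the LLM. The function name, shapes, and the exact fusion rule are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def perception_prior_embedding(pixel_tokens, object_queries, mask_scores):
    """
    pixel_tokens:   (B, Np, D) pixel-centric features
    object_queries: (B, No, D) object-centric features
    mask_scores:    (B, No, Np) per-object mask logits over pixels
    """
    # Soft assignment of each pixel to the objects covering it.
    assign = F.softmax(mask_scores, dim=1)                     # (B, No, Np)
    # Mask-weighted average of object queries for each pixel.
    prior = torch.einsum('bnp,bnd->bpd', assign, object_queries)  # (B, Np, D)
    # Each pixel token now carries its perception prior into the LLM.
    return pixel_tokens + prior

B, Np, No, D = 1, 49, 8, 64
fused = perception_prior_embedding(
    torch.randn(B, Np, D), torch.randn(B, No, D), torch.randn(B, No, Np))
print(fused.shape)  # torch.Size([1, 49, 64])
```

Under this reading, every pixel token entering the LLM already encodes which object it likely belongs to, which is how the frozen perception module's priors can inform the language model's reasoning.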
Statistics
The image features a white pickup truck parked on a street. The truck is a large, full-size vehicle, and it is parked in front of a residential home. The truck is positioned on the street, and it appears to be parked in front of the home's driveway. The truck is also parked next to a curb, which is a common feature on streets in many cities.
Quotes
"OMG-LLaVA is a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities."
"OMG-LLaVA can accept various visual and text prompts for flexible user interaction."