Core Concepts
OMG-LLaVA is a new and elegant framework that combines powerful pixel-level vision understanding with reasoning abilities, enabling it to accept various visual and text prompts for flexible user interaction.
Summary
The paper presents OMG-LLaVA, a new and elegant framework that bridges image-level, object-level, and pixel-level reasoning and understanding tasks in a single model.
Key highlights:
- OMG-LLaVA consists of a universal perception module (OMG-Seg) and a large language model (LLM). The OMG-Seg module encodes images and visual prompts into pixel-centric and object-centric visual tokens, which are then fed to the LLM (see the first sketch after this list).
- The LLM accepts text instructions and visual tokens as input and can output text responses, segmentation tokens, and segmentation masks.
- OMG-LLaVA can handle a variety of tasks, including image captioning, image-based conversation, region captioning, visual prompt-based conversation, referring segmentation, reasoning segmentation, and grounded conversation generation.
- The authors propose a perception prior embedding strategy to better integrate the perception priors from the OMG-Seg module with the LLM (see the second sketch after this list).
- Extensive experiments show that OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding, matching or surpassing the performance of specialized methods on multiple benchmarks.
- The authors release the code and model for further research.
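The bullets above describe the data flow only at a high level. The following is a minimal PyTorch-style sketch of that flow; the class and method names (`OMGSegEncoder`-style encoder, `visual_projector`, `mask_decoder`, `generate_with_hidden_states`, `seg_token_id`) are illustrative assumptions for this summary, not the authors' released API.

```python
import torch
import torch.nn as nn

class OMGLLaVASketch(nn.Module):
    """Hypothetical sketch of the OMG-LLaVA data flow described above."""

    def __init__(self, omg_seg_encoder, visual_projector, llm, mask_decoder, seg_token_id):
        super().__init__()
        self.encoder = omg_seg_encoder    # frozen universal perception module (OMG-Seg)
        self.projector = visual_projector # maps visual tokens into the LLM embedding space
        self.llm = llm                    # decoder-only language model
        self.mask_decoder = mask_decoder  # decodes segmentation-token states into masks
        self.seg_token_id = seg_token_id  # id of the special segmentation token

    def forward(self, image, visual_prompts, text_tokens):
        # 1) OMG-Seg encodes the image (and optional point/box/mask prompts)
        #    into pixel-centric and object-centric visual tokens.
        pixel_tokens, object_tokens = self.encoder(image, visual_prompts)

        # 2) Project the visual tokens and prepend them to the text instruction.
        vis = self.projector(torch.cat([pixel_tokens, object_tokens], dim=1))
        inputs = torch.cat([vis, self.llm.embed(text_tokens)], dim=1)

        # 3) The LLM produces a text response that may contain segmentation tokens.
        hidden, output_ids = self.llm.generate_with_hidden_states(inputs)

        # 4) Hidden states at segmentation-token positions are decoded into masks.
        seg_hidden = hidden[output_ids == self.seg_token_id]
        masks = self.mask_decoder(seg_hidden, pixel_tokens)
        return output_ids, masks
```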
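The perception prior embedding bullet is brief in this summary; the sketch below shows one plausible reading of it, assuming the object queries from the OMG-Seg decoder are fused into the pixel features via their predicted mask scores (a soft, mask-weighted pooling). The function name and the exact weighting are assumptions, not the paper's verbatim formulation.

```python
import torch
import torch.nn.functional as F

def perception_prior_embedding(pixel_feats, object_queries, mask_scores):
    """
    Hypothetical sketch of a perception prior embedding step.

    pixel_feats:    (B, N_pixels, C)      image features from the frozen encoder
    object_queries: (B, N_obj, C)         object-centric queries from the OMG-Seg decoder
    mask_scores:    (B, N_obj, N_pixels)  per-object mask logits over pixels
    """
    # Normalize mask logits over objects so each pixel gets a soft assignment.
    weights = F.softmax(mask_scores, dim=1)                        # (B, N_obj, N_pixels)

    # For each pixel, pool the object queries weighted by its soft assignment.
    prior = torch.einsum("bop,boc->bpc", weights, object_queries)  # (B, N_pixels, C)

    # Fuse the pooled segmentation prior with the original pixel features.
    return pixel_feats + prior
```

Under this reading, the pixel-centric tokens handed to the LLM already encode which pixels belong to which object, so the language model does not have to re-derive segmentation structure from raw patch features.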
Statistics
The image features a white pickup truck parked on a street.
The truck is a large, full-size vehicle, and it is parked in front of a residential home.
The truck is positioned on the street, and it appears to be parked in front of the home's driveway.
The truck is also parked next to a curb, which is a common feature on streets in many cities.
Quotes
"OMG-LLaVA is a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities."
"OMG-LLaVA can accept various visual and text prompts for flexible user interaction."