Core Concepts
FlexCap is a versatile vision-language model that generates region-specific descriptions of varying lengths, producing rich, localized captions whose length and detail can be controlled.
Abstract
FlexCap introduces a flexible captioning model for generating region-specific descriptions.
The model unifies image captioning, object detection, and dense captioning in a single framework.
FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset.
The model supports spatially controlled querying of any bounding box in an image, with the level of textual detail set by a desired word count.
Training datasets are generated from existing image-text paired datasets using open-vocabulary object detectors.
FlexCap achieves competitive performance in visual question answering and dense captioning tasks.
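The conditioning described above can be sketched as building a prompt prefix from a bounding box and a target length. This is a hypothetical illustration, not FlexCap's actual API: the function name, token formats (`<loc_…>`, `<len_…>`), and quantization scheme are all assumptions for the sake of the sketch.

```python
# Hypothetical sketch of FlexCap-style conditioning: the model is prompted
# with a bounding box plus a desired caption length. Token names and the
# quantization scheme below are illustrative assumptions, not the paper's API.

def build_caption_prompt(bbox, num_words):
    """Build a conditioning prefix for length-controlled region captioning.

    bbox: (x1, y1, x2, y2), coordinates normalized to [0, 1].
    num_words: desired caption length in words.
    """
    if not all(0.0 <= v <= 1.0 for v in bbox):
        raise ValueError("bbox coordinates must be normalized to [0, 1]")
    # Quantize each coordinate to a small vocabulary of location tokens,
    # and express the target length as a dedicated length token.
    coord_tokens = [f"<loc_{round(v * 100)}>" for v in bbox]
    length_token = f"<len_{num_words}>"
    return " ".join(coord_tokens + [length_token])

# Example: request a 5-word caption for the top-left quadrant of the image.
prompt = build_caption_prompt((0.0, 0.0, 0.5, 0.5), 5)
```

A longer length token would steer the model toward a more detailed description of the same region, which is the controllability the abstract refers to.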