FlexCap: Generating Rich, Localized, and Flexible Captions in Images
FlexCap is a versatile vision-language model that generates region-specific descriptions of varying length and demonstrates superior performance on dense captioning and visual question answering tasks.