FlexCap is a versatile vision-language model that generates region-specific descriptions of varying lengths, enabling controllable captioning that is both rich and localized.