The author proposes a one-stage, weakly supervised grounded captioner that operates directly on RGB images, performing captioning and grounding top-down at the image level, and incorporates relation semantics to improve both caption quality and grounding performance.
The paper proposes a one-stage weakly supervised grounded captioner with a relation module for accurate captioning and grounding.