Enhancing Dense Video Captioning with Pseudo Boundary Generation and Online Refinement
DIBS is a novel pretraining framework for dense video captioning that improves the quality of pseudo event boundaries and captions derived from large-scale unlabeled videos. It leverages diverse language models to generate candidate captions and optimizes the pseudo boundaries for diversity, event-centricity, temporal ordering, and coherence. DIBS also introduces an online boundary refinement strategy that iteratively improves the pseudo boundaries during training.
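To make the online refinement idea concrete, the following is a minimal sketch of one plausible scheme: each pseudo boundary is matched to its best-overlapping model prediction by temporal IoU and nudged toward it with an exponential moving average. The function names, the EMA update, and the thresholds are illustrative assumptions, not the exact DIBS procedure.

```python
# Hedged sketch of online pseudo-boundary refinement.
# The matching rule and EMA blend are assumptions for illustration,
# not the paper's exact algorithm.

def temporal_iou(a, b):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def refine_boundaries(pseudo, predicted, alpha=0.7, iou_thresh=0.5):
    """Blend each pseudo boundary with its best-matching model
    prediction; keep it unchanged when no prediction overlaps enough."""
    refined = []
    for seg in pseudo:
        best = max(predicted, key=lambda p: temporal_iou(seg, p), default=None)
        if best is not None and temporal_iou(seg, best) >= iou_thresh:
            start = alpha * seg[0] + (1 - alpha) * best[0]
            end = alpha * seg[1] + (1 - alpha) * best[1]
            refined.append((start, end))
        else:
            refined.append(seg)  # no confident match: keep pseudo boundary
    return refined

pseudo = [(0.0, 5.0), (10.0, 20.0)]
predicted = [(1.0, 6.0), (30.0, 35.0)]
print(refine_boundaries(pseudo, predicted))
```

Run once per training round, this lets the boundaries drift toward the model's increasingly accurate predictions while the IoU threshold guards against overwriting a pseudo label with an unrelated segment.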