Enhancing Robustness of Multimodal Video Paragraph Captioning Models to Missing Modalities
MR-VPC is a multimodal video paragraph captioning framework that makes effective use of available auxiliary inputs (speech and event boundaries) and remains robust when some of these modalities are missing.