Core Concepts
A novel framework that utilizes external memory to incorporate prior knowledge and retrieve relevant text features to improve the quality of event localization and caption generation in dense video captioning.
Summary
The paper proposes a new dense video captioning framework, named Cross-Modal Memory-based dense video captioning (CM2), that leverages external memory to retrieve relevant text features and improve the quality of event localization and caption generation.
Key highlights:
- Inspired by the human cognitive process of cued recall, the proposed method utilizes an external memory bank to store semantic information extracted from the training data.
- The model performs segment-level video-to-text retrieval to obtain relevant text features from the memory, which are then incorporated into the visual features through a versatile encoder-decoder structure (see the sketch after this list).
- The versatile encoder-decoder architecture with visual and textual cross-attention modules allows the model to effectively learn the inter-task interactions between event localization and caption generation.
- Comprehensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate the effectiveness of the proposed memory retrieval approach, achieving state-of-the-art performance without extensive pretraining on large video datasets.
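Below is a minimal PyTorch sketch of how segment-level video-to-text memory retrieval and textual cross-attention fusion could fit together. The module name, feature dimensions, top-k value, pooling, and memory construction are illustrative assumptions for clarity, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F


class MemoryRetrievalFusion(torch.nn.Module):
    """Illustrative module: retrieve text features from an external memory bank
    for each video segment, then fuse them into visual features via cross-attention.
    Names and hyperparameters are assumptions, not the paper's implementation."""

    def __init__(self, dim=512, num_heads=8, top_k=5):
        super().__init__()
        self.top_k = top_k
        # Textual cross-attention: visual tokens (queries) attend to retrieved text features.
        self.text_cross_attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def retrieve(self, segment_feats, memory_bank):
        # segment_feats: (num_segments, dim) pooled visual feature per video segment
        # memory_bank:   (memory_size, dim)  text features stored from training captions
        sims = F.normalize(segment_feats, dim=-1) @ F.normalize(memory_bank, dim=-1).T
        top_idx = sims.topk(self.top_k, dim=-1).indices   # (num_segments, top_k)
        return memory_bank[top_idx]                       # (num_segments, top_k, dim)

    def forward(self, visual_tokens, segment_feats, memory_bank):
        # visual_tokens: (1, num_frames, dim) frame-level features fed to the encoder
        retrieved = self.retrieve(segment_feats, memory_bank)
        text_ctx = retrieved.reshape(1, -1, retrieved.size(-1))  # flatten segments x top_k
        fused, _ = self.text_cross_attn(visual_tokens, text_ctx, text_ctx)
        return fused                                      # text-enriched visual tokens


# Toy usage with random features
dim, num_frames, num_segments, memory_size = 512, 100, 4, 1000
module = MemoryRetrievalFusion(dim=dim)
visual_tokens = torch.randn(1, num_frames, dim)
segment_feats = torch.randn(num_segments, dim)   # e.g. mean-pooled features per segment window
memory_bank = torch.randn(memory_size, dim)      # e.g. embeddings of training-set caption sentences
out = module(visual_tokens, segment_feats, memory_bank)
print(out.shape)  # torch.Size([1, 100, 512])
```

In this sketch the retrieved text features act as keys and values while the visual tokens act as queries, so each frame feature is enriched with semantic cues recalled from the memory before event localization and caption generation.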
Statistics
The man connects a light inside a pumpkin and plugs it in.
A weight lifting tutorial is given.
People are dancing on the grass.
Quotes
"Inspired by the human cognitive process of cued recall, the proposed method utilizes an external memory bank to store semantic information extracted from the training data."
"The versatile encoder-decoder architecture with visual and textual cross-attention modules allows the model to effectively learn the inter-task interactions between event localization and caption generation."