Cross-Modal Memory Retrieval for Efficient Dense Video Captioning
A novel framework that uses an external memory to incorporate prior knowledge, retrieving relevant text features from it to improve both event localization and caption generation in dense video captioning.
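
The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity for scoring, the top-k selection, and the mean pooling of retrieved entries are all assumptions made for illustration, and the memory bank is a toy list of hand-written vectors standing in for pre-computed text features.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return a copy unchanged if its norm is zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def retrieve_text_features(video_query, memory_bank, top_k=2):
    """Hypothetical retrieval: rank memory entries by similarity to the video query.

    Returns the indices of the top_k most similar text features and their
    mean (mean pooling is an assumed aggregation, chosen for simplicity).
    """
    if not memory_bank:
        raise ValueError("memory_bank must contain at least one entry")
    scores = [cosine(video_query, entry) for entry in memory_bank]
    ranked = sorted(range(len(memory_bank)), key=lambda i: scores[i], reverse=True)
    ranked = ranked[:top_k]
    dim = len(memory_bank[0])
    aggregated = [
        sum(memory_bank[i][d] for i in ranked) / len(ranked) for d in range(dim)
    ]
    return ranked, aggregated


# Toy external memory: each row stands in for one stored text feature.
memory_bank = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Toy video segment feature, assumed to live in the same embedding space.
video_query = [1.0, 0.05, 0.0]

indices, retrieved = retrieve_text_features(video_query, memory_bank, top_k=2)
print(indices, retrieved)
```

In a setup like this, the aggregated vector would be passed alongside the video features to the localization and captioning heads; how the two are fused is left open here because the summary does not specify it.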