Multi-Modal Spiking Saliency Mamba for Efficient Temporal Video Grounding
We propose a novel multi-modal spiking saliency Mamba model that captures fine-grained relationships between video and language, leverages relevant slots to strengthen memory of long-term dependencies, and uses spiking neural networks to accurately identify salient moments, yielding significant improvements in temporal video grounding.
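To make the spiking saliency idea concrete, the sketch below shows how a leaky integrate-and-fire (LIF) neuron can turn per-frame saliency scores into discrete spikes that mark salient moments. This is an illustrative assumption, not the paper's actual mechanism: the function name `lif_spikes` and the parameters `tau` (leak factor) and `threshold` are hypothetical.

```python
def lif_spikes(saliency, tau=0.9, threshold=0.5):
    """Hypothetical LIF neuron over per-frame saliency scores.

    The membrane potential leaks by `tau`, accumulates each score,
    and emits a spike (1) when it crosses `threshold`, after which
    it is hard-reset to zero.
    """
    v = 0.0
    spikes = []
    for s in saliency:
        v = tau * v + s              # leaky integration of evidence
        spike = int(v >= threshold)  # fire when potential crosses threshold
        spikes.append(spike)
        v = v * (1 - spike)          # hard reset after a spike
    return spikes

# Frames with sustained or high saliency trigger spikes:
scores = [0.1, 0.2, 0.6, 0.1, 0.05, 0.7, 0.8]
print(lif_spikes(scores))  # → [0, 0, 1, 0, 0, 1, 1]
```

In a grounding model, such binary spikes could gate which frame features are kept for moment localization, keeping computation sparse over long videos.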