A novel multi-modal spiking saliency Mamba model that captures fine-grained relationships between video and language, leverages relevant slots to strengthen memory of long-term dependencies, and uses spiking neural networks to accurately identify salient moments, yielding significant improvements in temporal video grounding.
SnAG is a simple and efficient model that achieves state-of-the-art performance on video grounding benchmarks, particularly for long videos with many queries, by combining late fusion with video-centric training.