T-DEED addresses several challenges in Precise Event Spotting, including the need for discriminable per-frame representations, high output temporal resolution, and information captured at different temporal scales. Its architecture is designed around these needs: an encoder-decoder leverages multiple temporal scales while preserving high output temporal resolution, and dedicated temporal modules increase token discriminability.
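The encoder-decoder idea described above can be illustrated with a minimal sketch. It is not T-DEED's actual implementation: the pair-averaging downsampler, the frame-repeating upsampler, the additive skip connections, and every function name here are assumptions chosen for illustration, and the discriminability-oriented temporal modules are omitted entirely. The sketch only shows how an encoder can build coarser temporal scales while a decoder restores one output per input frame.

```python
from typing import List

Frames = List[List[float]]  # one feature vector per frame


def downsample(frames: Frames) -> Frames:
    """Halve the temporal resolution by averaging adjacent frame pairs."""
    return [
        [(a + b) / 2.0 for a, b in zip(frames[i], frames[i + 1])]
        for i in range(0, len(frames) - 1, 2)
    ]


def upsample(frames: Frames) -> Frames:
    """Double the temporal resolution by repeating each frame."""
    return [list(f) for f in frames for _ in range(2)]


def add(x: Frames, y: Frames) -> Frames:
    """Element-wise sum of two equally shaped frame sequences."""
    return [[a + b for a, b in zip(fx, fy)] for fx, fy in zip(x, y)]


def encoder_decoder(frames: Frames, depth: int = 2) -> Frames:
    """Hypothetical temporal encoder-decoder with skip connections.

    The encoder produces `depth` progressively coarser temporal scales;
    the decoder upsamples back and merges each scale through a skip
    connection, so the output has one vector per input frame.
    """
    if len(frames) % (2 ** depth) != 0:
        raise ValueError("clip length must be divisible by 2 ** depth")

    skips = []
    x = frames
    for _ in range(depth):          # encoder: fine -> coarse
        skips.append(x)
        x = downsample(x)

    for skip in reversed(skips):    # decoder: coarse -> fine
        x = add(upsample(x), skip)
    return x


# Usage: an 8-frame clip with 1-dimensional features.
clip = [[float(t)] for t in range(8)]
out = encoder_decoder(clip, depth=2)
print(len(out))  # same number of frames as the input clip
```

The skip connections are what let the output keep frame-level detail: without them, the decoder would only see the coarsest scale and adjacent output frames would be identical.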
This article proposes a new video visual relation detection task focused on understanding complex human-human interactions in multi-person sports videos, and introduces the SportsHHI dataset as a benchmark for it.