Key Concepts
T-DEED addresses several challenges in Precise Event Spotting: the need for discriminative frame representations, high output temporal resolution, and capturing information at different temporal scales. Its architecture has two parts that tackle these challenges. An encoder-decoder leverages multiple temporal scales while still producing a high-resolution output, and temporal modules are designed to increase token discriminability.
Summary
The paper introduces T-DEED, a Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in sports videos. It addresses three main challenges in this task:
- The need for discriminative per-frame representations to differentiate between adjacent frames with high spatial similarity.
- The necessity of high output temporal resolution to avoid losing prediction precision.
- The variability in the amount of temporal context required for different events, influenced by dataset characteristics and event dynamics.
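One simple way to quantify the first challenge is the cosine similarity between consecutive per-frame tokens. Values close to 1 mean adjacent frames are hard to tell apart. This measure is an illustrative assumption, not a metric taken from the paper. The function name `adjacent_frame_similarity` and the synthetic features below are hypothetical.

```python
import numpy as np

def adjacent_frame_similarity(tokens: np.ndarray) -> np.ndarray:
    """Cosine similarity between each pair of consecutive per-frame tokens.

    Hypothetical helper (not from the paper). tokens has shape (T, D),
    one D-dimensional feature per frame; the result has shape (T - 1,).
    """
    # Normalize each frame token to unit length (epsilon avoids division by zero).
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.maximum(norms, 1e-8)
    # Row-wise dot product of frame t with frame t + 1.
    return np.sum(unit[:-1] * unit[1:], axis=1)

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 64))
# Synthetic, nearly identical frames: poor discriminability (similarity close to 1).
similar = base + 0.01 * rng.normal(size=(8, 64))
# Synthetic, independent frames: better discriminability (similarity near 0).
distinct = rng.normal(size=(8, 64))

print(adjacent_frame_similarity(similar).mean())
print(adjacent_frame_similarity(distinct).mean())
```

Under this view, a model with more discriminative tokens would lower the similarity between neighbouring frames where events occur, even when the raw frames look alike.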
To tackle these challenges, the paper makes the following contributions:
- An encoder-decoder architecture that operates across multiple temporal scales. It captures actions that require diverse temporal contexts while restoring the original temporal resolution at the output.
- The SGP-Mixer layer, which extends the SGP layer to increase the temporal resolution (sequence length) while integrating information from skip connections, enhancing token discriminability.
- Extensive ablations highlighting the advantages of the encoder-decoder architecture with SGP-based layers to enhance token discriminability.
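The encoder-decoder idea in the first two items can be sketched as follows. This is a minimal NumPy illustration, not T-DEED's implementation, and it rests on several assumptions:

- A fixed moving-average temporal convolution stands in for the learned SGP-based layers.
- Max pooling and nearest-neighbour repetition are used for down- and upsampling.
- Skip connections are fused by simple addition rather than by the SGP-Mixer.

All function names are hypothetical.

```python
import numpy as np

def temporal_conv(x: np.ndarray, k: int = 3) -> np.ndarray:
    """Depthwise moving average over time plus a residual connection.

    Hypothetical stand-in for a learned temporal layer; x has shape (T, D).
    """
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    # Average each frame with its temporal neighbours (same weights for every channel).
    smoothed = np.stack([xp[t:t + k].mean(axis=0) for t in range(x.shape[0])])
    return x + smoothed

def downsample(x: np.ndarray) -> np.ndarray:
    """Halve the temporal length by max-pooling consecutive frame pairs."""
    T, D = x.shape
    return x.reshape(T // 2, 2, D).max(axis=1)

def upsample(x: np.ndarray) -> np.ndarray:
    """Double the temporal length by repeating each token (nearest neighbour)."""
    return np.repeat(x, 2, axis=0)

def encoder_decoder(x: np.ndarray, levels: int = 2) -> np.ndarray:
    """Process x at `levels` coarser temporal scales, then restore the original length."""
    assert x.shape[0] % (2 ** levels) == 0, "T must be divisible by 2**levels"
    skips = []
    h = x
    for _ in range(levels):
        h = temporal_conv(h)
        skips.append(h)      # keep finer-scale tokens for the decoder
        h = downsample(h)    # move to a coarser temporal scale
    h = temporal_conv(h)     # bottleneck: widest effective temporal context
    for skip in reversed(skips):
        # Hypothetical fusion (not the SGP-Mixer): upsample, add skip tokens, mix.
        h = temporal_conv(upsample(h) + skip)
    return h

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 32))  # 16 synthetic frames with 32-dim features
out = encoder_decoder(frames, levels=2)
print(frames.shape, out.shape)  # → (16, 32) (16, 32)
```

Each downsampling step doubles the number of original frames that a fixed 3-token window covers, which is how the coarser levels see longer temporal context. The decoder and skip connections return one token per input frame, matching the goal of high output temporal resolution.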
T-DEED achieves state-of-the-art performance on the FigureSkating and FineDiving datasets. It improves mean Average Precision (mAP) over previous methods by up to 3.07 and 4.83 points, respectively.
Statistics
Apart from the mAP improvements reported above, no further numerical data or statistics are given in support of the key claims. The focus is on the model architecture and its components.
Quotes
No notable quotes supporting the key claims were identified.