Time Interval Machine: A Transformer-based Approach for Recognizing Audio-Visual Actions in Long Videos
The Time Interval Machine (TIM) is a transformer-based model for recognizing actions in long videos. It explicitly models the temporal extent of each audio and visual event, and it attends to the surrounding context in both modalities when classifying an event.
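To make the idea of querying by time interval concrete, the following toy sketch scores audio and visual tokens against a query interval and pools their features with softmax attention. It is an illustration under stated assumptions, not the authors' implementation: the fixed sinusoidal `encode_interval`, the single unlearned attention step in `attend`, and the token layout `(modality, start, end, features)` are all hypothetical stand-ins for whatever learned components the model uses.

```python
import math


def encode_interval(start, end, dim=8, scale=10.0):
    """Toy sinusoidal encoding of a (start, end) interval in seconds.

    Assumption: a fixed encoding stands in for a learned one. dim must be
    divisible by 4 (sin and cos, for start and end, at dim // 4 frequencies).
    """
    half = dim // 4
    enc = []
    for t in (start, end):
        for i in range(half):
            freq = 1.0 / (scale ** (i / half))
            enc.append(math.sin(t * freq))
            enc.append(math.cos(t * freq))
    return enc


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query_interval, tokens, dim=8):
    """Attend from one query interval over tokens of both modalities.

    tokens: list of (modality, start, end, feature_vector) tuples
    (a hypothetical layout chosen for this sketch).
    Returns the attention weights and the weighted sum of the features.
    """
    q = encode_interval(*query_interval, dim=dim)
    scores = []
    for _, start, end, _ in tokens:
        k = encode_interval(start, end, dim=dim)
        scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(dim))
    weights = softmax(scores)
    feat_dim = len(tokens[0][3])
    pooled = [
        sum(w * tok[3][j] for w, tok in zip(weights, tokens))
        for j in range(feat_dim)
    ]
    return weights, pooled


# Four one-second windows per modality, each with made-up 2-d features.
tokens = [("visual", float(i), float(i + 1), [1.0, float(i)]) for i in range(4)]
tokens += [("audio", float(i), float(i + 1), [0.0, float(i)]) for i in range(4)]

# Query the interval 2.0 s to 3.0 s; context from both modalities is pooled.
weights, pooled = attend((2.0, 3.0), tokens)
```

In this sketch the visual and audio tokens whose windows coincide with the query interval receive the largest weights, while neighbouring windows still contribute as context. A real model would learn the interval encoding and the attention projections rather than fixing them.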