LoSA, a memory-and-parameter-efficient backbone adapter, enables end-to-end training of large video foundation models for improved temporal action localization in untrimmed videos.
The video self-stitching graph network (VSGN), a multi-level cross-scale solution, is proposed to tackle the large variation in action scale, particularly for short actions, in temporal action localization.
A novel approach to temporal action localization in videos combines multimodal and unimodal transformers, achieving state-of-the-art results in the Perception Test Challenge 2024.
This paper systematically demonstrates the applicability of temporal action localization (TAL) models to human activity recognition from inertial sensor data, showing that they outperform conventional classification approaches based on fixed time windows.
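To make that contrast concrete, here is a minimal, hypothetical sketch (not taken from the paper) of the two output formats: fixed-window classification forces one majority label per equal-length window, while a TAL-style model predicts labeled (start, end) segments whose boundaries are not tied to a window grid. The toy data, sampling rate, and label track below are all assumptions for illustration.

```python
import numpy as np

# Toy 1-D "IMU" label track: 0 = idle, 1 = walking, sampled at 50 Hz.
# Assumed ground truth: walking from t=2.3 s to t=5.1 s in a 10 s recording.
fs = 50
labels = np.zeros(10 * fs, dtype=int)
labels[int(2.3 * fs):int(5.1 * fs)] = 1

# --- Fixed-window baseline: one majority label per 2 s window ---
win = 2 * fs
window_preds = [
    (i / fs, (i + win) / fs, int(np.bincount(labels[i:i + win]).argmax()))
    for i in range(0, len(labels), win)
]

# --- TAL-style output: labeled (start, end) segments ---
# (A real TAL model regresses these boundaries; here we read them off
# the ground-truth track just to show the output format.)
change = np.flatnonzero(np.diff(labels)) + 1
bounds = np.concatenate(([0], change, [len(labels)]))
segments = [
    (s / fs, e / fs, int(labels[s]))
    for s, e in zip(bounds[:-1], bounds[1:])
    if labels[s] == 1
]

print("fixed windows:", window_preds)  # boundaries snap to the 2 s grid
print("TAL segments:", segments)       # boundaries at 2.3 s and 5.1 s
```

In this toy run, the fixed-window baseline reports walking from 2.0 s to 6.0 s because its boundaries snap to the window grid, while the segment output recovers the true 2.3 s to 5.1 s extent; this boundary precision is the kind of gain the paper attributes to TAL over fixed-window classification.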