LoSA, a memory- and parameter-efficient backbone adapter, enables end-to-end training of large video foundation models for improved temporal action localization in untrimmed videos.
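To make the adapter idea concrete, here is a minimal PyTorch sketch of the general pattern: the large backbone is frozen and only small per-layer adapters are trained, which keeps both memory use and trainable parameter count low. The `Adapter` and `AdaptedBackbone` classes, the bottleneck width, and the mean fusion are illustrative assumptions, not LoSA's actual architecture.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Small residual bottleneck attached to one intermediate backbone layer.

    Only these parameters receive gradients; the backbone stays frozen,
    which is the general idea behind memory- and parameter-efficient adapters.
    """

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual adapter


class AdaptedBackbone(nn.Module):
    """Frozen backbone plus trainable per-layer adapters (illustrative only)."""

    def __init__(self, backbone: nn.Module, feature_dims: list[int]):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)  # freeze the large foundation model
        self.adapters = nn.ModuleList(Adapter(d) for d in feature_dims)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # Assumption: the backbone returns a list of same-shaped
        # intermediate feature tensors, one per adapted layer.
        feats = self.backbone(video)
        adapted = [a(f) for a, f in zip(self.adapters, feats)]
        return torch.stack(adapted).mean(0)  # fused features for a TAL head
```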
VSGN, a multi-level cross-scale video self-stitching graph network, tackles the large variation in action temporal scales in temporal action localization, with a particular focus on short actions.
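As a rough illustration of the self-stitching idea, the sketch below temporally up-scales a clip's feature sequence and stitches it after the original, so that short actions span more positions within a single cross-scale input. The function name, the choice to up-scale the whole sequence, and the use of linear interpolation are assumptions; VSGN's actual stitching policy and its graph-based cross-scale reasoning are not reproduced here.

```python
import torch
import torch.nn.functional as F


def self_stitch(features: torch.Tensor, up_factor: int = 2) -> torch.Tensor:
    """Stitch a temporally up-scaled copy of a feature sequence to the original.

    features: (C, T) per-frame feature sequence for one clip.
    Returns a (C, T + T * up_factor) cross-scale sequence in which the
    up-scaled half gives short actions a larger temporal footprint.
    """
    upscaled = F.interpolate(
        features.unsqueeze(0),       # (1, C, T): interpolate expects a batch dim
        scale_factor=up_factor,
        mode="linear",
        align_corners=False,
    ).squeeze(0)                     # (C, T * up_factor)
    return torch.cat([features, upscaled], dim=1)


# Example: a 100-frame clip with 256-dim features becomes a 300-step input.
cross_scale = self_stitch(torch.randn(256, 100))
print(cross_scale.shape)  # torch.Size([256, 300])
```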