Core Concept
This article proposes a new video visual relation detection task focused on understanding complex human-human interactions in multi-person sports videos, and introduces the SportsHHI dataset as a benchmark for it.
Summary
The article proposes a new video visual relation detection task called "video human-human interaction detection", which aims to detect and recognize high-level interactions between humans in complex multi-person sports videos. The authors develop a new dataset named SportsHHI to support this task.
Key highlights:
- Existing video visual relation detection datasets rarely cover complex human-human interactions in multi-person scenarios, and the relation types they define carry relatively low-level semantics.
- SportsHHI is built on basketball and volleyball sports videos, containing 34 high-level interaction classes such as technical actions, tactical cooperation, and confrontation.
- SportsHHI provides 118,075 human bounding boxes and 50,649 interaction instances annotated on 11,398 keyframes, which is comparable in scale to existing video scene graph generation datasets.
- The authors propose a two-stage baseline method for the human-human interaction detection task and conduct extensive experiments to reveal key factors for a successful interaction detector, such as motion features, context information, relative position encoding, and information exchange among proposals.
- The authors hope SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.
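The highlights mention a two-stage baseline and relative position encoding but do not spell out either. The sketch below is only an illustration of the general idea, not the authors' implementation: it assumes person boxes from a first-stage detector are given as `(x1, y1, x2, y2)` tuples, enumerates ordered candidate pairs, and encodes each pair with a common center-offset and log-size-ratio scheme. The function names `relative_position` and `candidate_pairs` are hypothetical.

```python
import math
from itertools import permutations


def relative_position(box_a, box_b):
    """Encode box_b relative to box_a (assumed encoding, not from the paper).

    Boxes are (x1, y1, x2, y2). Returns the center offset normalized by
    box_a's size, followed by the log ratios of width and height.
    """
    aw, ah = box_a[2] - box_a[0], box_a[3] - box_a[1]
    bw, bh = box_b[2] - box_b[0], box_b[3] - box_b[1]
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return (
        (bx - ax) / aw,
        (by - ay) / ah,
        math.log(bw / aw),
        math.log(bh / ah),
    )


def candidate_pairs(boxes):
    """List every ordered (subject, object) pair with its position encoding."""
    return [
        (i, j, relative_position(boxes[i], boxes[j]))
        for i, j in permutations(range(len(boxes)), 2)
    ]


# Three hypothetical person boxes on one keyframe.
boxes = [(0, 0, 10, 20), (10, 0, 20, 20), (30, 5, 40, 25)]
pairs = candidate_pairs(boxes)
```

With N detected people this produces N * (N - 1) ordered candidates, which is why a second stage that classifies pairs has to cope with many non-interacting proposals in crowded sports scenes.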
Statistics
SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports.
118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes.
The dataset is split into 38,527 training instances from 8,719 keyframes and 12,122 validation instances from 2,679 keyframes.
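As a quick sanity check on the figures above, the following sketch confirms that the training and validation splits sum to the reported totals and derives a few per-keyframe averages. All numbers are taken from this summary; the variable names are illustrative only.

```python
# Figures quoted in the statistics above.
total_boxes = 118_075
total_instances = 50_649
total_keyframes = 11_398

train_instances, train_keyframes = 38_527, 8_719
val_instances, val_keyframes = 12_122, 2_679

# The two splits should add up to the reported totals.
assert train_instances + val_instances == total_instances
assert train_keyframes + val_keyframes == total_keyframes

# Derived averages over all annotated keyframes.
boxes_per_keyframe = total_boxes / total_keyframes
instances_per_keyframe = total_instances / total_keyframes
train_fraction = train_instances / total_instances

print(f"boxes per keyframe:     {boxes_per_keyframe:.2f}")
print(f"instances per keyframe: {instances_per_keyframe:.2f}")
print(f"training share:         {train_fraction:.1%}")
```

The averages come out to roughly ten boxes and between four and five interaction instances per keyframe, with about three quarters of the instances in the training split.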
Quotes
"SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports."
"118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes."