The paper presents Language-guided Interpretable Action Recognition (LaIAR), a framework that aims to improve both the performance and the interpretability of video action recognition models. The key ideas are:
Redefine the problem of understanding video model decisions as a task of aligning video and language models. The language model captures logical reasoning that can guide the training of the video model.
Develop a decoupled cross-modal architecture in which the language model guides the video model during training, while only the video model is used at inference (see the sketch after this list).
Introduce a learning scheme with three components for jointly training the two models.
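To make the decoupled design concrete, below is a minimal PyTorch-style sketch of the training/inference split. It is an illustration under assumed details, not the paper's actual implementation: the encoders are stand-in linear layers, the cosine-alignment loss and its unit weighting are assumptions, and all names (VideoEncoder, LanguageEncoder, training_step) are hypothetical. Charades is multi-label in practice; the sketch uses single-label cross-entropy for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Hypothetical video branch: frame features -> embedding -> action logits."""
    def __init__(self, dim=512, num_actions=157):  # 157 = Charades action classes
        super().__init__()
        self.backbone = nn.Linear(2048, dim)  # stand-in for a real video backbone
        self.head = nn.Linear(dim, num_actions)

    def forward(self, video_feats):
        z = self.backbone(video_feats)
        return z, self.head(z)

class LanguageEncoder(nn.Module):
    """Hypothetical language branch: text features -> embedding -> action logits."""
    def __init__(self, dim=512, num_actions=157):
        super().__init__()
        self.backbone = nn.Linear(768, dim)  # stand-in for a text encoder
        self.head = nn.Linear(dim, num_actions)

    def forward(self, text_feats):
        z = self.backbone(text_feats)
        return z, self.head(z)

def training_step(video_model, lang_model, video_feats, text_feats, labels):
    """Both branches run at training time; the language branch guides the video branch."""
    zv, logits_v = video_model(video_feats)
    zt, logits_t = lang_model(text_feats)
    loss = (
        F.cross_entropy(logits_v, labels)                   # video task loss
        + F.cross_entropy(logits_t, labels)                 # language task loss
        + 1.0 - F.cosine_similarity(zv, zt, dim=-1).mean()  # cross-modal alignment
    )
    return loss

@torch.no_grad()
def inference(video_model, video_feats):
    """Only the video branch is needed at inference time."""
    _, logits_v = video_model(video_feats)
    return logits_v.argmax(dim=-1)
```

The point the sketch illustrates is that the alignment term is the only place the two branches interact, so the language branch can be dropped entirely at inference without changing the video branch's parameters or interface.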
Extensive experiments on the Charades and CAD-120 datasets show that the proposed LaIAR framework achieves state-of-the-art performance in video action recognition while providing interpretable explanations.