The paper presents a new framework called Language-guided Interpretable Action Recognition (LaIAR) that aims to improve the performance and interpretability of video action recognition models. The key ideas are:
Redefine the problem of understanding video model decisions as a task of aligning video and language models. The language model captures logical reasoning that can guide the training of the video model.
Develop a decoupled cross-modal architecture that allows the language model to guide the video model during training, while only the video model is used for inference.
Introduce a learning scheme with three components for jointly training the two models.
Extensive experiments on the Charades and CAD-120 datasets show that the proposed LaIAR framework achieves state-of-the-art performance in video action recognition while providing interpretable explanations.
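The training-time guidance described above, where the language model steers the video model but only the video model runs at inference, resembles a distillation-style alignment objective. The sketch below is a minimal illustration of that idea using a KL divergence between the two branches' predictions; the function name and loss form are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def alignment_loss(video_logits, lang_logits):
    """KL(p_lang || p_video): pushes the video branch's predictions
    toward the language branch's (teacher) predictions during training.
    Hypothetical sketch of cross-modal guidance, not the paper's loss."""
    p = softmax(lang_logits)   # teacher: language-model predictions
    q = softmax(video_logits)  # student: video-model predictions
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())
```

At inference time the language branch is simply dropped, so only the video model's forward pass is needed, which matches the decoupled architecture described above.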
Key insights distilled from the paper by Ning Wang, Gu... at arxiv.org, 04-03-2024
https://arxiv.org/pdf/2404.01591.pdf