The paper presents a new framework called Language-guided Interpretable Action Recognition (LaIAR) that aims to improve the performance and interpretability of video action recognition models. The key ideas are:
Redefine the problem of understanding video model decisions as one of aligning video and language models: the language model captures logical reasoning that can guide the training of the video model.
Develop a decoupled cross-modal architecture in which the language model guides the video model during training, while only the video model is used at inference (see the sketch after this list).
Introduce a learning scheme with three components for transferring the language model's reasoning to the video model.
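To make the decoupled design concrete, here is a minimal PyTorch-style sketch of language-guided training with video-only inference. The VideoModel and LanguageModel classes, the feature dimensions, the KL-based alignment term, and the align_weight hyperparameter are illustrative assumptions for exposition, not the paper's actual architecture or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 157  # e.g., Charades defines 157 action classes


class VideoModel(nn.Module):
    """Toy stand-in for the video branch (the only branch kept at inference)."""

    def __init__(self, feat_dim=512, num_classes=NUM_CLASSES):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, video_feats):
        return self.head(video_feats)  # class logits


class LanguageModel(nn.Module):
    """Toy stand-in for the language branch (a training-time guide only)."""

    def __init__(self, feat_dim=768, num_classes=NUM_CLASSES):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, text_feats):
        return self.head(text_feats)  # class logits


def training_step(video_model, language_model, video_feats, text_feats,
                  labels, align_weight=1.0):
    """Language-guided training: supervise both branches and align the
    video model's predictive distribution with the language model's."""
    v_logits = video_model(video_feats)
    t_logits = language_model(text_feats)
    cls_loss = F.cross_entropy(v_logits, labels) + F.cross_entropy(t_logits, labels)
    # KL alignment: pull video predictions toward the (detached) language
    # predictions, so guidance flows only from language to video.
    align_loss = F.kl_div(
        F.log_softmax(v_logits, dim=-1),
        F.softmax(t_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    return cls_loss + align_weight * align_loss


@torch.no_grad()
def predict(video_model, video_feats):
    """Decoupled inference: the language branch is discarded entirely."""
    return video_model(video_feats).argmax(dim=-1)
```

Detaching the language logits makes the alignment one-directional: gradients from the alignment term shape only the video branch, mirroring the idea that the language model guides the video model rather than the reverse.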
Extensive experiments on the Charades and CAD-120 datasets show that the proposed LaIAR framework achieves state-of-the-art performance in video action recognition while providing interpretable explanations.
Key takeaways from the source paper by Ning Wang, Gu... on arxiv.org, 04-03-2024: https://arxiv.org/pdf/2404.01591.pdf