The paper presents a new framework called Language-guided Interpretable Action Recognition (LaIAR) that aims to improve the performance and interpretability of video action recognition models. The key ideas are:
Redefine the problem of understanding video model decisions as a task of aligning video and language models. The language model captures logical reasoning that can guide the training of the video model.
Develop a decoupled cross-modal architecture that allows the language model to guide the video model during training, while only the video model is used for inference.
Introduce a learning scheme with three components for transferring the language model's reasoning to the video model.
Extensive experiments on the Charades and CAD-120 datasets show that the proposed LaIAR framework achieves state-of-the-art performance in video action recognition while providing interpretable explanations.
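The decoupled training idea described above (the language model guides the video model during training, while inference uses the video model alone) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation; all function names, the KL-based alignment term, and the `alpha` weight are hypothetical assumptions:

```python
# Hypothetical sketch of LaIAR-style decoupled cross-modal training:
# the language branch guides the video branch through an alignment
# loss at training time, but prediction consults only the video branch.
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    # KL(p || q); used here to pull video predictions toward the
    # language model's distribution (an assumed choice of alignment loss).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def training_loss(video_logits, language_logits, label, alpha=0.5):
    """Task loss on the video branch plus a cross-modal alignment term."""
    p_video = softmax(video_logits)
    p_lang = softmax(language_logits)
    task = -math.log(p_video[label])        # cross-entropy on video branch
    align = kl_divergence(p_lang, p_video)  # language guides video
    return task + alpha * align

def predict(video_logits):
    # Inference is decoupled: only the video branch is used.
    p = softmax(video_logits)
    return max(range(len(p)), key=lambda i: p[i])

loss = training_loss([2.0, 0.5, -1.0], [1.5, 0.2, -0.5], label=0)
print(predict([2.0, 0.5, -1.0]))  # → 0
```

At inference the language branch is dropped entirely, so the deployed model pays no runtime cost for the language guidance it received during training.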
Key insights from the paper by Ning Wang, Gu... on arxiv.org, 04-03-2024.
https://arxiv.org/pdf/2404.01591.pdf