The paper addresses open-vocabulary action recognition (OVAR), where a model must classify videos into a broad range of action categories, including novel classes unseen during training. Existing OVAR methods rely on vision-language alignment: they treat class names as textual descriptions and classify by measuring the similarity between visual and textual embeddings. However, these methods assume that user-provided class descriptions are always clean, which is unrealistic in practice, since misspellings and typos are common.
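The alignment-based classification shared by these methods can be sketched as follows. The 2-D embeddings below are toy stand-ins (real systems use CLIP-style video and text encoders):

```python
import numpy as np

def classify(video_emb: np.ndarray, class_embs: np.ndarray) -> int:
    """Return the index of the class text whose embedding has the highest
    cosine similarity with the video embedding."""
    v = video_emb / np.linalg.norm(video_emb)
    t = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return int(np.argmax(t @ v))

# Toy 2-D embeddings stand in for real encoder outputs.
video = np.array([1.0, 0.0])
class_texts = np.array([[0.9, 0.1],   # e.g. "running"
                        [0.0, 1.0]])  # e.g. "swimming"
print(classify(video, class_texts))  # → 0
```

A noisy class name (e.g. "runnning") shifts its text embedding, which is why the similarity scores, and hence the predictions, degrade.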
To address this issue, the paper first evaluates the robustness of existing OVAR methods under simulated noise in class descriptions and finds that their performance degrades sharply. It then proposes the DENOISER framework, which consists of two main components:
Generative Step: This step treats denoising of class texts as a generative task. It proposes multiple candidate texts based on spelling similarity, and then uses both inter-modal (visual-textual) and intra-modal (textual-only) information to determine the best denoised text.
Discriminative Step: This step uses the existing OVAR models to assign class labels to visual samples. The assigned visual samples are then used to help denoise the corresponding class texts in the generative step.
The generative and discriminative steps are optimized in an alternating manner, where the denoised text classes help improve the OVAR model, and the classified visual samples in turn help better denoise the text classes.
Extensive experiments on multiple datasets and OVAR models show that the proposed DENOISER framework significantly improves the robustness against noisy class descriptions, outperforming various baselines. Detailed ablation studies further validate the effectiveness of the key components.
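The simulated noise used in the robustness evaluation can be mimicked with a simple typo generator. This is one plausible noise model (random character drops, duplications, and swaps); the paper's exact noise protocol may differ:

```python
import random

def corrupt(name: str, p: float = 0.3, seed: int = 0) -> str:
    """Simulate user typos: with probability p per character, drop it,
    duplicate it, or swap it with its right neighbor."""
    rng = random.Random(seed)
    chars = list(name)
    out = []
    i = 0
    while i < len(chars):
        if rng.random() < p:
            op = rng.choice(["drop", "dup", "swap"])
            if op == "drop":
                i += 1
                continue
            if op == "dup":
                out.append(chars[i])
                out.append(chars[i])
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                out.append(chars[i + 1])
                out.append(chars[i])
                i += 2
                continue
        out.append(chars[i])
        i += 1
    return "".join(out)

print(corrupt("playing guitar"))
```

Fixing the seed makes the corruption reproducible, so the same noisy class names can be reused across the methods being compared.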
Source: Haozhe Cheng... arxiv.org, 04-24-2024
https://arxiv.org/pdf/2404.14890.pdf