Core Concepts
The authors argue that integrating textual information into the framework of Generalized Category Discovery (GCD) significantly improves accuracy on visual category discovery tasks.
Abstract
In the paper "Textual Knowledge Matters: Cross-Modality Co-Teaching for Generalized Visual Class Discovery," the authors propose a two-phase TextGCD framework to enhance Generalized Category Discovery by incorporating powerful Visual-Language Models. The approach involves a retrieval-based text generation phase and a cross-modality co-teaching phase. By leveraging disparities between textual and visual modalities, mutual learning is fostered, leading to improved visual GCD. Experiments on eight datasets demonstrate the superiority of their approach over existing methods, with significant performance gains on ImageNet-1k and CUB datasets.
The study addresses a limitation of current GCD methods, which rely solely on visual cues, by introducing additional textual information through a customized retrieval-based text generation (RTG) procedure built on large-scale VLMs. The proposed co-teaching strategy between the textual and visual modalities, together with inter-modal information fusion, further enhances category discovery. Comprehensive experiments demonstrate the effectiveness of the method across a variety of datasets, where it significantly outperforms state-of-the-art approaches.
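To make the retrieval idea concrete, below is a minimal sketch of retrieval-based text generation, assuming CLIP-style image and lexicon-tag embeddings are already available. The lexicon tags, the retrieve_descriptions helper, and the comma-joined pseudo-caption format are illustrative assumptions, not the authors' exact pipeline; in the paper, retrieved tags are further expanded into descriptive texts with a large language model.

```python
# Minimal sketch of retrieval-based text generation (RTG).
# Assumption: image and lexicon-tag embeddings come from a CLIP-style encoder pair;
# random tensors stand in for them here. The helper name and the comma-joined
# pseudo-caption are illustrative, not the authors' exact pipeline.
import torch
import torch.nn.functional as F

def retrieve_descriptions(image_feat: torch.Tensor,
                          tag_feats: torch.Tensor,
                          tags: list[str],
                          top_k: int = 3) -> str:
    """Pick the top-k lexicon tags most similar to the image embedding
    and join them into a pseudo-caption for the text branch."""
    image_feat = F.normalize(image_feat, dim=-1)   # (d,)
    tag_feats = F.normalize(tag_feats, dim=-1)     # (n_tags, d)
    sims = tag_feats @ image_feat                  # cosine similarities, (n_tags,)
    top = sims.topk(top_k).indices.tolist()
    return "a photo of " + ", ".join(tags[i] for i in top)

# Toy usage with random features standing in for CLIP embeddings.
tags = ["sparrow", "warbler", "finch", "sports car", "sailboat"]
tag_feats = torch.randn(len(tags), 512)
image_feat = torch.randn(512)
print(retrieve_descriptions(image_feat, tag_feats, tags))
```

Because text generation forms a separate first phase, the generated descriptions can be cached once and reused throughout the later co-teaching phase.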
Key components of the TextGCD framework include constructing a visual lexicon from diverse datasets, generating descriptive texts using large language models, and implementing cross-modality co-teaching to leverage both visual and textual cues for enhanced category discovery. The study highlights the importance of aligning class perceptions between text and image models to facilitate comprehensive learning from both modalities.
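The co-teaching idea can be sketched as follows, assuming each modality already provides features and a linear classifier head. The hard pseudo-label exchange and the simple averaged fusion at inference are illustrative simplifications; the paper's exact losses, warm-up schedule, and fusion weighting may differ.

```python
# Minimal sketch of cross-modality co-teaching.
# Assumption: img_feats / txt_feats come from frozen image and text backbones;
# random tensors stand in for them here. Losses, scheduling, and fusion weights
# are illustrative simplifications rather than the authors' exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, dim = 10, 128
visual_head = nn.Linear(dim, num_classes)   # classifies image features
text_head = nn.Linear(dim, num_classes)     # classifies text (caption) features
opt = torch.optim.SGD(
    list(visual_head.parameters()) + list(text_head.parameters()), lr=0.1
)

def co_teaching_step(img_feats, txt_feats):
    """One update in which each head learns from the other modality's pseudo-labels."""
    logits_v = visual_head(img_feats)
    logits_t = text_head(txt_feats)

    # Each modality acts as a teacher for the other; teachers give no gradient.
    pseudo_from_v = logits_v.detach().argmax(dim=-1)
    pseudo_from_t = logits_t.detach().argmax(dim=-1)

    loss = F.cross_entropy(logits_v, pseudo_from_t) + F.cross_entropy(logits_t, pseudo_from_v)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def fused_prediction(img_feats, txt_feats):
    """Inter-modal fusion at inference: average the two softmax outputs."""
    with torch.no_grad():
        probs = 0.5 * (visual_head(img_feats).softmax(-1) + text_head(txt_feats).softmax(-1))
    return probs.argmax(dim=-1)

# Toy usage with random features standing in for backbone outputs.
img_feats, txt_feats = torch.randn(32, dim), torch.randn(32, dim)
co_teaching_step(img_feats, txt_feats)
print(fused_prediction(img_feats, txt_feats)[:5])
```

The point the sketch tries to capture is that each head is supervised by the other modality's detached predictions, so modality-specific errors are less likely to be reinforced, which is where the disparity between the two modalities pays off.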
Stats
Experiments on eight datasets show an increase of 7.7% and 10.8% in "All" accuracy on ImageNet-1k and CUB, respectively.
Current GCD methods rely solely on visual cues.
The proposed TextGCD framework integrates powerful Vision-Language Models for multi-modal GCD.
The study introduces a two-phase approach involving retrieval-based text generation and cross-modality co-teaching.
Experiments demonstrate a clear advantage of the TextGCD approach over state-of-the-art methods.
Quotes
"The proposed TextGCD framework comprises two main phases: Retrieval-based Text Generation (RTG) and Cross-modality Co-Teaching (CCT)."
"Our contributions are summarized as follows: identifying limitations in existing GCD methods relying solely on visual cues; proposing a co-teaching strategy between textual and visual modalities; conducting comprehensive experiments showcasing method effectiveness."