The paper introduces a novel task setting called Unified Language-driven Zero-shot Domain Adaptation (ULDA), which aims to enable a single model to adapt to diverse target domains without explicit domain-ID knowledge. This contrasts with previous approaches such as PØDA, which require a separate model per target domain and knowledge of the domain ID at test time.
To address the challenges posed by ULDA, the authors propose a new framework with three key components:
Hierarchical Context Alignment (HCA): This aligns simulated features with target text at multiple visual levels (scene, region, pixel) to mitigate semantic loss from vanilla scene-text alignment.
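To make the multi-level alignment idea concrete, here is a minimal, hypothetical sketch in plain Python. The paper's actual HCA operates on CLIP-style visual and text embeddings inside a trained network; the function names, the averaging over regions/pixels, and the (1 − cosine) form below are illustrative assumptions, not the paper's exact loss.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def hca_loss(scene_feat, region_feats, pixel_feats, text_emb):
    # Sum of (1 - cosine) alignment terms at the scene, region,
    # and pixel levels, averaging over regions and pixels so each
    # level contributes one term. Purely illustrative of "aligning
    # simulated features with target text at multiple visual levels".
    loss = 1.0 - cosine(scene_feat, text_emb)
    loss += sum(1.0 - cosine(r, text_emb) for r in region_feats) / len(region_feats)
    loss += sum(1.0 - cosine(p, text_emb) for p in pixel_feats) / len(pixel_feats)
    return loss
```

If every level is already perfectly aligned with the text embedding, the loss is zero, and each misaligned level adds its own penalty rather than being washed out by a single scene-level term.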
Domain Consistent Representation Learning (DCRL): This retains the semantic correlations between different regional representations and their corresponding text embeddings across diverse domains, ensuring structural consistency.
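One simple way to read "retaining semantic correlations across domains" is as a relational-consistency constraint: the pairwise similarity structure of regional visual features should mirror that of their text embeddings. The sketch below is a hypothetical plain-Python rendering of that reading; the paper's actual DCRL formulation may differ.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def dcrl_loss(region_feats, text_embs):
    # Penalize mismatch between the visual and textual pairwise
    # similarity matrices (illustrative; averaged over ordered pairs).
    n = len(region_feats)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                diff = cosine(region_feats[i], region_feats[j]) - cosine(text_embs[i], text_embs[j])
                loss += diff ** 2
    return loss / (n * (n - 1))
```

When the visual features reproduce the similarity structure of the text embeddings exactly, the loss vanishes, which is the "structural consistency" the summary describes.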
Text-Driven Rectifier (TDR): This rectifies the simulated features during fine-tuning, mitigating the bias between the simulated and real target visual features.
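A minimal sketch of what "rectifying simulated features with text" could look like: blend each simulated feature toward the target text embedding by a factor alpha. This is an assumption for illustration only; the paper's TDR presumably uses a learned rectification rather than a fixed linear blend.

```python
def rectify(sim_feat, text_emb, alpha=0.5):
    # Illustrative rectifier: move the simulated feature toward the
    # text embedding, reducing the bias between simulated and real
    # target features. alpha=0 keeps the feature; alpha=1 replaces it.
    return [(1.0 - alpha) * f + alpha * t for f, t in zip(sim_feat, text_emb)]
```

Because the rectifier is applied only while fine-tuning on simulated features, nothing extra runs at inference time, consistent with the summary's claim of no added inference cost.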
The authors validate their method through extensive empirical evaluations on both the previous classic setting and the new ULDA setting. The results show that the approach performs strongly in both, demonstrating its effectiveness and generalization ability. Importantly, the method introduces no additional computational cost at inference time, which supports its practicality.