The paper proposes a Prompt-Conditioned Quality Assessment (PCQA) method for evaluating the quality of AI-generated images and videos. The key aspects of the approach are:
Hybrid Text Encoder: The method uses a frozen hybrid CLIP text encoder to encode the prompt information, which is then used as a condition for the visual quality assessment.
Feature Adapter and Mixer: Trainable feature adapters are used to align the visual and textual features, and a feature mixer module blends these features to capture the correlation between the generated content and the prompts.
Ensemble Method: An ensemble of multiple vision backbones (ConvNeXt-Small, EfficientViT-L, and EVA-02 Transformer-B) is used to mitigate bias in the quality assessment and improve robustness.
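The three components above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not say how the mixer combines features or how the ensemble aggregates scores, so the element-wise product in `mix` and the plain average in `ensemble_score` are assumptions. The helper names, the tiny feature dimensions, and the random vectors standing in for frozen CLIP text features and backbone outputs are all hypothetical.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a plain list of floats."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def make_linear(in_dim, out_dim, rng):
    """Hypothetical trainable layer: small random weights, zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def mix(visual, text):
    """Assumed mixer: element-wise product, so the prompt gates the visual features."""
    return [v * t for v, t in zip(visual, text)]


def predict_score(visual_feat, text_feat, branch):
    """One branch: adapt both modalities to a shared width, mix, regress a score."""
    v = linear(visual_feat, *branch["visual_adapter"])
    t = linear(text_feat, *branch["text_adapter"])
    return linear(mix(v, t), *branch["head"])[0]


def ensemble_score(visual_feats, text_feat, branches):
    """Assumed aggregation: unweighted mean of the per-backbone scores."""
    scores = [predict_score(v, text_feat, b) for v, b in zip(visual_feats, branches)]
    return sum(scores) / len(scores)


rng = random.Random(0)
TEXT_DIM, SHARED_DIM = 5, 4
# Placeholder widths for the three backbones; the real models are far wider.
backbone_dims = {"ConvNeXt-Small": 8, "EfficientViT-L": 6, "EVA-02-B": 7}

branches = [
    {
        "visual_adapter": make_linear(dim, SHARED_DIM, rng),
        "text_adapter": make_linear(TEXT_DIM, SHARED_DIM, rng),
        "head": make_linear(SHARED_DIM, 1, rng),
    }
    for dim in backbone_dims.values()
]

# Random vectors stand in for the frozen text encoder and backbone outputs.
text_feat = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]
visual_feats = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for dim in backbone_dims.values()]

score = ensemble_score(visual_feats, text_feat, branches)
print(f"ensemble quality score: {score:.6f}")
```

Each backbone gets its own adapters and regression head, while the same prompt feature conditions every branch. Averaging the branch outputs is the simplest way to damp the bias of any single backbone; a learned or weighted combination would fit the same interface.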
The proposed framework is evaluated on two novel datasets for AI-generated image (AIGIQA-20K) and video (T2VQA-DB) quality assessment. The results demonstrate that the PCQA method significantly outperforms baseline approaches, establishing a strong benchmark for the task.
Source: arxiv.org