Core Concepts
The proposed AMFF-Net comprehensively evaluates the quality of AI-generated images (AGIs) from three dimensions: visual quality, authenticity, and content consistency. It utilizes a multi-scale input strategy and an adaptive feature fusion block to capture both local and global image details, and compares the semantic features between the text prompt and generated image to assess content consistency.
Summary
The paper proposes a novel blind image quality assessment (IQA) network, named AMFF-Net, for evaluating the quality of AI-generated images (AGIs). AMFF-Net assesses the quality of AGIs from three dimensions: visual quality, authenticity, and content consistency.
Key highlights:
- Multi-scale input strategy: AMFF-Net scales the AGI up and down and feeds the scaled images and the original-sized image into the image encoder to capture image details at different levels of granularity.
- Adaptive feature fusion (AFF) block: The AFF block adaptively fuses the multi-scale features, reducing the risk of information masking caused by direct concatenation or addition.
- Content consistency evaluation: AMFF-Net compares the semantic features from the text encoder and image encoder to evaluate the alignment between the text prompt and generated image.
- Extensive experiments on three AGI quality assessment databases show that AMFF-Net outperforms nine state-of-the-art blind IQA methods. Ablation studies further demonstrate the effectiveness of the multi-scale input strategy and AFF block.
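The multi-scale input and AFF ideas above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scale-scoring projection `w_proj` is a hypothetical stand-in for the AFF block's learned weighting mechanism, and the toy features stand in for CLIP image-encoder outputs at the 1.5x, 1.0x, and 0.5x scales.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adaptive_feature_fusion(features, w_proj):
    """Fuse multi-scale features with adaptively computed weights.

    features : (num_scales, dim) array -- one row per scaled input
               (e.g. image features for the 1.5x, 1.0x, 0.5x AGIs).
    w_proj   : (dim,) array -- hypothetical learned projection that
               scores each scale; softmax turns the scores into fusion
               weights, so no single scale masks the others the way
               plain addition or concatenation can.
    """
    scores = features @ w_proj    # one scalar score per scale
    weights = softmax(scores)     # non-negative weights summing to 1
    return weights @ features     # weighted sum -> fused (dim,) feature

# Toy example: three scales, 4-dimensional features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 4))
proj = rng.standard_normal(4)
fused = adaptive_feature_fusion(feats, proj)
print(fused.shape)  # (4,)
```

The key design point is that the fused vector is a convex combination of the per-scale features, with weights computed from the features themselves rather than fixed in advance.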
Statistics
Both the visual quality score and the authenticity score are regressed with a mean squared error (MSE) loss.
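As a quick reference, the MSE loss used for the quality and authenticity predictions is simply the mean of squared prediction errors; the sketch below assumes scalar scores from hypothetical regression heads.

```python
import numpy as np

def mse_loss(pred, target):
    # Mean squared error between predicted and ground-truth scores,
    # as used to train the quality/authenticity regression heads.
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

# Two predicted scores vs. two subjective ratings.
print(mse_loss([3.5, 4.0], [3.0, 4.5]))  # 0.25
```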
The content consistency score is computed as the cosine similarity between the text and image features.
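The content consistency score is a cosine similarity, which can be sketched directly. The short vectors below are toy stand-ins; in AMFF-Net the inputs would be the CLIP text-encoder feature of the prompt and the fused multi-scale image feature.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    # Cosine of the angle between the text-prompt feature and the
    # fused image feature; values near 1 indicate strong alignment.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Toy stand-ins for CLIP embeddings (real features are higher-dimensional).
text_feat = np.array([0.2, 0.9, 0.1])
image_feat = np.array([0.25, 0.85, 0.05])
score = cosine_similarity(text_feat, image_feat)
```

Because the score depends only on the angle between the two feature vectors, it measures semantic alignment independently of feature magnitude.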
Quotes
"Considering that both local and global details affect the subjective ratings of visual quality and authenticity, AMFF-Net inputs the scaled AGIs, i.e., I1.5×, I1.0×, and I0.5×, into an image encoder of the pre-trained CLIP model [33] to obtain multi-scale semantic representations."
"An AFF block is proposed to fuse multi-scale features. Different from current works that directly concatenate or add multi-scale features, the proposed block adaptively calculates the weights for different features, reducing the risk of information masking caused by concatenation and addition."
"For content consistency prediction, AMFF-Net uses the text encoder in the pre-trained CLIP to encode the text prompt and computes the similarity between the obtained textual features and the fused multi-scale features."