The paper introduces VQCT, a novel approach that transfers a well-trained codebook from pretrained language models to Vector-Quantized Image Modeling (VQIM). By exploiting part-of-speech knowledge and semantic relationships from the language models, the framework aims to alleviate the codebook collapse problem. The method constructs a vision-related codebook, designs a codebook transfer network, and achieves cooperative optimization among codes. Experiments on four datasets demonstrate superior performance over existing methods.
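The core recipe (generate the vision codebook by passing frozen word embeddings through a learnable transfer network, then quantize encoder features against it) lends itself to a short sketch. The snippet below is a minimal illustration of that idea, not the paper's implementation: the MLP transfer network, class and tensor names, shapes, and loss weights are all assumptions, and VQCT's actual transfer module built on part-of-speech relations is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransferredCodebook(nn.Module):
    """Vector quantizer whose codebook is generated from frozen
    pretrained word embeddings by a small learnable transfer network.
    Hypothetical sketch: the MLP stands in for VQCT's transfer module."""

    def __init__(self, word_embeddings: torch.Tensor, code_dim: int, beta: float = 0.25):
        super().__init__()
        # Frozen language-side codebook: (num_codes, word_dim).
        self.register_buffer("word_codes", word_embeddings)
        self.beta = beta
        word_dim = word_embeddings.shape[1]
        # Learnable mapping from word-embedding space to visual code space.
        self.transfer = nn.Sequential(
            nn.Linear(word_dim, code_dim),
            nn.ReLU(),
            nn.Linear(code_dim, code_dim),
        )

    def forward(self, z_e: torch.Tensor):
        # z_e: encoder features of shape (B, H, W, C) with C == code_dim.
        codebook = self.transfer(self.word_codes)      # (K, code_dim)
        flat = z_e.reshape(-1, z_e.shape[-1])          # (N, code_dim)
        indices = torch.cdist(flat, codebook).argmin(dim=1)
        z_q = codebook[indices].view_as(z_e)

        # Codebook loss trains the transfer net; because every code comes
        # from the same shared network, a gradient on any selected code
        # also moves unselected ones (cooperative optimization among codes).
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator so gradients reach the encoder.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, vq_loss, indices

# Usage with stand-in embeddings (real ones would come from a language model):
word_emb = torch.randn(1024, 300)
quantizer = TransferredCodebook(word_emb, code_dim=64)
z_e = torch.randn(2, 16, 16, 64)
z_q, vq_loss, idx = quantizer(z_e)
```

Because the codebook is a function of frozen, semantically related embeddings rather than a free embedding table, unused codes are not left stranded when their neighbors update, which is one plausible reading of how the approach counters codebook collapse.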
Key insights distilled from the paper by Baoquan Zhan... at arxiv.org, 03-18-2024: https://arxiv.org/pdf/2403.10071.pdf