Key Concepts
Transformers modified to generate real-valued vectors outperform VQ-based models in image generation and representation learning.
Summary
The paper introduces Generative Infinite-Vocabulary Transformers (GIVT), which generate sequences of real-valued vectors and outperform VQ-based models in image generation. Two small modifications to standard transformer decoders enable direct generation of unquantized vectors, improving both sample quality and representation learning. GIVT achieves strong results on tasks such as class-conditional image generation, panoptic segmentation, and depth estimation.
Introduction:
Transformers dominate natural language processing and are gaining popularity in computer vision.
Image classification, detection, and segmentation benefit from transformer encoders.
Quantized Transformer vs. GIVT:
Comparison between standard discrete-token generative transformers and GIVT.
GIVT linearly embeds real-valued vectors at the input and predicts continuous distributions at the output.
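The continuous output distribution described above can be illustrated with a minimal sketch: instead of a softmax over a finite vocabulary, the head maps a hidden state to the parameters of a small Gaussian mixture and samples a real-valued vector from it. The function and weight names below are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

def gmm_head(hidden, w_pi, w_mu, w_sigma, k):
    """Hypothetical continuous output head: maps a transformer hidden
    state to the parameters of a k-component Gaussian mixture, then
    samples one real-valued vector (no finite vocabulary)."""
    d = w_mu.shape[1] // k                           # latent dimensionality
    logits = hidden @ w_pi                           # (k,) mixture logits
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                                   # mixture weights
    mu = (hidden @ w_mu).reshape(k, d)               # component means
    sigma = np.exp(hidden @ w_sigma).reshape(k, d)   # positive std devs
    c = np.random.choice(k, p=pi)                    # pick a component
    return mu[c] + sigma[c] * np.random.randn(d)     # sample from it

# toy usage: 8-dim hidden state, 3 components over 4-dim latents
rng = np.random.default_rng(0)
hdim, k, d = 8, 3, 4
h = rng.standard_normal(hdim)
sample = gmm_head(h,
                  rng.standard_normal((hdim, k)),
                  rng.standard_normal((hdim, k * d)) * 0.1,
                  rng.standard_normal((hdim, k * d)) * 0.1,
                  k)
print(sample.shape)  # (4,)
```

Sampling from a mixture rather than a single Gaussian lets the model express multimodal predictions per position, analogous to how a softmax spreads mass over many discrete tokens.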
Training and Inference:
During training, latent vectors are sampled from a VAE encoder and serve as the real-valued input sequences for GIVT.
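The latent-sampling step can be sketched with generic VAE reparameterization: the encoder outputs a mean and log-variance per latent dimension, and the real-valued sequence fed to the generative model is drawn as mu + sigma * eps. This is a standard VAE sketch under assumed shapes, not the paper's exact code.

```python
import numpy as np

def sample_vae_latents(mu, logvar, rng):
    """Reparameterization-style sampling: draw real-valued latents
    from the per-dimension Gaussian the VAE encoder predicts."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

# toy usage: a 32-token sequence of 16-dim latents (shapes assumed)
rng = np.random.default_rng(0)
mu = np.zeros((32, 16))
logvar = np.full((32, 16), -2.0)   # std = exp(-1) ~ 0.37
z = sample_vae_latents(mu, logvar, rng)
print(z.shape)  # (32, 16)
```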
Inference uses either sequential (causal) sampling or MaskGIT-like masked parallel decoding to generate images.
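The MaskGIT-like inference mode can be sketched as iterative parallel decoding: start with all positions masked, and at each step commit the predictions the model is most confident about, re-predicting the rest. The predictor below is a random stand-in for the model; the scheduling details are assumptions, not the paper's exact procedure.

```python
import numpy as np

def maskgit_style_decode(predict_fn, seq_len, dim, steps=4, seed=0):
    """Iteratively reveal the most confident positions until the whole
    real-valued sequence is committed. `predict_fn` returns, for every
    position, a predicted value and a confidence score."""
    rng = np.random.default_rng(seed)
    tokens = np.zeros((seq_len, dim))
    known = np.zeros(seq_len, dtype=bool)
    for step in range(steps):
        values, conf = predict_fn(tokens, known, rng)
        conf = np.where(known, -np.inf, conf)   # only rank unknown slots
        # simple schedule: reveal a growing fraction at each step
        n_reveal = max(1, int(seq_len * (step + 1) / steps) - known.sum())
        for idx in np.argsort(conf)[::-1][:n_reveal]:
            tokens[idx] = values[idx]
            known[idx] = True
    return tokens, known

# toy predictor: random values/confidences standing in for the model
def toy_predict(tokens, known, rng):
    return rng.standard_normal(tokens.shape), rng.random(len(tokens))

out, known = maskgit_style_decode(toy_predict, seq_len=16, dim=4)
print(known.all())  # True: every position committed after the schedule
```

Parallel decoding trades the strict left-to-right order of causal sampling for far fewer model forward passes.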
Experiments:
Evaluation on ImageNet datasets for class-conditional image generation.
GIVT outperforms VQGAN and MaskGIT, and remains competitive at high resolution.
Panoptic Segmentation and Depth Estimation:
Application of GIVT to UViM framework for dense prediction tasks like panoptic segmentation and depth estimation.
Results:
Sampling FID metrics demonstrate the superior performance of GIVT variants over existing models.
Representation Learning:
Linear probing accuracy on ImageNet shows GIVT-Causal performing comparably to state-of-the-art models.
Statistics
In image generation, GIVT outperforms VQ-based models.
Quotes
"We call such transformers Generative Infinite-Vocabulary Transformer (GIVT)."
"Our main contributions can be summarized as follows..."