Key Concepts
Through straightforward supervised learning, the decoder-only Transformer architecture used in large language models (LLMs) can be adapted to process visual input effectively, achieving performance competitive with encoder-only Vision Transformers.
Abstract
The authors examine whether decoder-only Transformers such as LLaMA, originally designed for large language models (LLMs), can be adapted to the computer vision field. They first "LLaMAfy" a standard Vision Transformer (ViT) step by step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue that makes network training fail.
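To make the attention-collapse issue concrete, here is a minimal PyTorch-style sketch (not the authors' code) of causal self-attention over a token sequence. With the class token prepended as in a standard ViT, the causal mask lets it attend only to itself, so it never receives information from the image tokens; projection weights are omitted for brevity.

```python
import torch

def causal_self_attention(x):
    """Single-head causal self-attention over a token sequence.

    x: (batch, seq_len, dim). With a prepended [cls] token at position 0,
    the causal mask allows [cls] to attend only to itself, so it never sees
    the image tokens -- the intuition behind the reported attention collapse.
    """
    B, N, D = x.shape
    q, k, v = x, x, x                                   # projection weights omitted
    attn = (q @ k.transpose(-2, -1)) / D ** 0.5         # (B, N, N) attention logits
    causal = torch.tril(torch.ones(N, N, dtype=torch.bool, device=x.device))
    attn = attn.masked_fill(~causal, float("-inf"))     # block attention to future tokens
    attn = attn.softmax(dim=-1)
    return attn @ v
```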
To overcome this challenge, the authors propose a post-sequence class token (PS [cls]) technique, which repositions the class token to the end of the image token sequence, enabling causal self-attention to efficiently capture the entire image's information. They also develop a soft mask strategy that gradually introduces the causal mask into the self-attention at the start of training to ease optimization.
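A rough sketch of how these two ideas could be combined in PyTorch follows. The function names, the linear warm-up schedule, and the `attn_mask` argument on the blocks are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def build_soft_causal_mask(num_tokens, progress):
    """Additive attention mask that fades in causality during early training.

    progress: float in [0, 1]; 0 = no mask (bidirectional), 1 = full causal mask.
    The linear schedule is an illustrative choice, not necessarily the paper's.
    """
    causal = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1)  # 1 above diagonal
    return causal * (-1e9 * progress)  # scale the penalty on "future" positions

def forward_with_ps_cls(image_tokens, cls_token, attn_blocks, step, warmup_steps):
    """Post-sequence [cls]: append the class token *after* the image tokens,
    so that under a causal mask it can attend to every image token.

    image_tokens: (B, N, D); cls_token: (1, 1, D). attn_blocks are assumed to
    accept an additive attention mask -- a sketch, not the authors' API.
    """
    B = image_tokens.shape[0]
    x = torch.cat([image_tokens, cls_token.expand(B, -1, -1)], dim=1)  # [img..., cls]
    progress = min(step / max(warmup_steps, 1), 1.0)
    mask = build_soft_causal_mask(x.shape[1], progress).to(x.device)
    for block in attn_blocks:
        x = block(x, attn_mask=mask)
    return x[:, -1]  # representation of the post-sequence class token
```

Reading out from the last position mirrors how LLMs predict from the final token: under a fully causal mask, that position is the only one that can see the whole sequence.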
The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and supports direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating the rank of attention maps. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further raises the accuracy to 86.0%.
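The rank argument can be illustrated with a toy example: when all attention logits coincide, unmasked softmax attention collapses to a rank-1 uniform map, whereas the causal mask forces a lower-triangular map with a positive diagonal, which is always full rank. The snippet below is an illustrative check, not the paper's experiment.

```python
import torch

# With identical logits, bidirectional softmax attention is a rank-1 uniform map,
# while the causally masked version is lower triangular with a positive diagonal.
N = 8
logits = torch.zeros(N, N)

bidir = logits.softmax(dim=-1)                                    # uniform rows
mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
causal = logits.masked_fill(mask, float("-inf")).softmax(dim=-1)  # lower triangular

print(torch.linalg.matrix_rank(bidir))   # 1
print(torch.linalg.matrix_rank(causal))  # 8 (full rank)
```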
The authors conduct extensive experiments to demonstrate iLLaMA's reliable properties, including calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. They hope their study can provide fresh insights for the architectural unification of vision and text models in the era of LLMs.
Statistics
73.8% ImageNet top-1 accuracy for ViT-T/16 baseline
74.3% ImageNet top-1 accuracy after replacing MLP with SwiGLU
74.5% ImageNet top-1 accuracy after replacing LN with RMSNorm
71.9% ImageNet top-1 accuracy with post-sequence class token (PS [cls])
72.5% ImageNet top-1 accuracy with modified causal mask
Quotes
"Through straightforward supervised learning, decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field."
"We suggest to reposition the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information."
"iLLaMA rivals the performance with its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M and pre-training on ImageNet-21K further enhances the accuracy to 86.0%."