The paper introduces TroL (Traversal of Layers), a new efficient family of large language and vision models (LLVMs) with 1.8B, 3.8B, and 7B parameters. TroL employs a layer-traversing technique that reuses layers in a token-wise manner, simulating the effect of retracing the answering stream. This approach aims to enhance the learning capability of smaller LLVMs without directly scaling up the model size or adding extra modules.
The key aspects of TroL are:
Model Architecture: TroL consists of a vision encoder, a vision projector, and a backbone multimodal large language model (MLLM) built on a pre-trained LLM.
Visual Instruction Tuning Dataset: TroL is trained on a diverse dataset of 2.3M visual instruction samples covering capabilities such as image understanding, common-sense knowledge, math problems, and their combinations.
Layer Traversing: The layer-traversing technique reuses a layer by mixing the output of that layer, L(x), with the output obtained by passing that result through the same layer again, L(L(x)). The mixing is implemented with a TroL-Mixer module, whose gating mechanism determines the token-wise mixing ratio (see the sketch after this list).
Two-step Training: TroL is trained in two steps: (1) training the vision projector and the TroL-Mixers, and (2) further training these components together with the backbone multimodal LLM. This staged schedule facilitates effective use of the layer-traversing technique.
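To make the mixing concrete, here is a minimal PyTorch sketch of one traversed layer. This is not the authors' implementation: the gate (a single linear head producing one scalar per token) and the toy block standing in for a transformer layer are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TroLMixer(nn.Module):
    """Minimal sketch of token-wise layer traversing (names and shapes are illustrative).

    The same layer is applied twice; a learned gate decides, per token, how much
    of the re-traversed output L(L(x)) to mix with the single-pass output L(x).
    """

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Hypothetical gating head: one scalar mixing weight per token.
        self.gate = nn.Linear(hidden_dim, 1)

    def forward(self, layer: nn.Module, x: torch.Tensor) -> torch.Tensor:
        once = layer(x)       # L(x): ordinary forward pass of the layer
        twice = layer(once)   # L(L(x)): the same layer reused (traversed)
        # Token-wise mixing ratio in [0, 1], derived from the layer input.
        w = torch.sigmoid(self.gate(x))        # shape (batch, seq, 1)
        return w * twice + (1.0 - w) * once    # convex combination per token


# Toy usage: a simple feed-forward block stands in for a transformer layer.
if __name__ == "__main__":
    hidden = 64
    layer = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
    mixer = TroLMixer(hidden)
    tokens = torch.randn(2, 16, hidden)        # (batch, seq, hidden)
    print(mixer(layer, tokens).shape)          # torch.Size([2, 16, 64])
```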
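The two-step schedule can likewise be sketched as a simple freeze/unfreeze routine. The submodule names (`vision_projector`, `trol_mixers`, `backbone_llm`) are assumed for illustration, and details of the actual training setup (e.g., any parameter-efficient fine-tuning of the backbone) are omitted.

```python
import torch.nn as nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag


def configure_training_step(model: nn.Module, step: int) -> None:
    """Step 1 updates only the vision projector and TroL-Mixers;
    step 2 additionally unfreezes the backbone multimodal LLM."""
    set_trainable(model.vision_projector, True)    # trained in both steps
    set_trainable(model.trol_mixers, True)         # trained in both steps
    set_trainable(model.backbone_llm, step == 2)   # unfrozen only in step 2
```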
The experiments demonstrate that TroL outperforms open-source LLVMs with larger model sizes (e.g., 26B, 34B, 72B, and 110B parameters) and rivals the performance of closed-source LLVMs with substantially more parameters, despite its smaller model sizes.
Source: Byung-Kwan L... et al., arxiv.org, 09-26-2024, https://arxiv.org/pdf/2406.12246.pdf