The paper introduces XC-CACHE, a novel approach for efficient conditional generation using large language models (LLMs). The key ideas are:
Cacheability: The authors provide evidence that encoder-decoder-style architectures are well suited for conditional generation because the context can be encoded once, cached compactly, and cross-attended to at inference time, avoiding the cost of re-processing the context that prompt-based decoder-only approaches incur.
Parameter Efficiency: The authors show that training only a small number of cross-attention layers, while keeping the pre-trained weights frozen, is sufficient to convert a decoder-only model into an encoder-decoder architecture capable of context-conditional generation (sketched below, after the key ideas).
Decoder-as-Encoder: The authors propose two variants of XC-CACHE: one that reuses the pre-trained decoder itself as the encoder, and another that trains a small bidirectional encoder. Both variants substantially reduce the memory footprint required to cache context compared with standard in-context learning (ICL).
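The adapter-style wiring behind the parameter-efficiency and decoder-as-encoder ideas can be illustrated as follows. This is a minimal PyTorch-style sketch, not the authors' implementation: the module names (CrossAttentionAdapter, XCStyleDecoder), the every_k spacing of adapters, and the assumption that each frozen decoder layer is a plain tensor-to-tensor callable are all illustrative choices.

```python
# Sketch: a frozen decoder-only LM augmented with a few trainable
# cross-attention layers that attend to pre-computed, cacheable context states.
import torch
import torch.nn as nn


class CrossAttentionAdapter(nn.Module):
    """Trainable cross-attention block inserted between frozen decoder layers."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Queries come from the decoder's hidden states; keys/values come from
        # the cached context representation.
        attended, _ = self.attn(self.norm(hidden), context, context)
        return hidden + attended  # residual keeps the frozen base model's behavior


class XCStyleDecoder(nn.Module):
    """Frozen decoder layers interleaved with trainable cross-attention adapters."""

    def __init__(self, frozen_layers: nn.ModuleList, d_model: int, every_k: int = 4):
        super().__init__()
        self.layers = frozen_layers
        for p in self.layers.parameters():
            p.requires_grad = False  # only the adapters are trained
        self.adapters = nn.ModuleDict({
            str(i): CrossAttentionAdapter(d_model)
            for i in range(len(frozen_layers)) if i % every_k == 0
        })

    def forward(self, hidden: torch.Tensor, cached_context: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            hidden = layer(hidden)  # assumes layer: [batch, seq, d_model] -> same shape
            if str(i) in self.adapters:
                hidden = self.adapters[str(i)](hidden, cached_context)
        return hidden
```

In this sketch, cached_context would be produced offline either by running the same frozen decoder over the context (the decoder-as-encoder variant) or by a small trained bidirectional encoder, then stored and reused across queries; real decoder blocks would additionally require attention masks and positional information.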
The authors evaluate the XC-CACHE models on question-answering tasks, where they outperform ICL baselines and achieve performance comparable to prompt-based fine-tuned models, while reducing the cache memory footprint by roughly 98%. They also discuss limitations of the approach, including possible issues with generalization to out-of-distribution data.
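To see where a reduction of this order can come from, the back-of-the-envelope comparison below contrasts a full per-layer key/value cache with caching a single hidden-state vector per context token. The model dimensions (Llama-2-7B-like: 32 layers, hidden size 4096, fp16 storage) are an assumption for illustration, not figures taken from the paper's tables.

```python
# Rough cache-size comparison under assumed Llama-2-7B-like dimensions.
LAYERS, D_MODEL, BYTES = 32, 4096, 2  # 32 layers, hidden size 4096, fp16

# Standard ICL caching: keys AND values for every layer, per context token.
kv_cache_per_token = 2 * LAYERS * D_MODEL * BYTES      # 524,288 bytes

# XC-Cache-style caching: a single hidden-state vector per context token.
hidden_cache_per_token = D_MODEL * BYTES                # 8,192 bytes

reduction = 1 - hidden_cache_per_token / kv_cache_per_token
print(f"KV cache:     {kv_cache_per_token / 1024:.0f} KiB per token")   # 512 KiB
print(f"Hidden cache: {hidden_cache_per_token / 1024:.0f} KiB per token")  # 8 KiB
print(f"Reduction:    {reduction:.1%}")                 # ~98.4%
```

Under these assumed dimensions, caching one hidden vector per context token instead of per-layer keys and values yields a reduction of about 98%, consistent in magnitude with the figure the summary reports.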