Key Concepts
Binary token representations can significantly improve the inference speed and reduce the storage footprint of retrieval-augmented language models while maintaining high task performance.
Summary
The paper introduces Binary Token Representations (BTR), a technique to improve the efficiency of retrieval-augmented language models. Retrieval-augmented language models use a retrieve-and-read pipeline, where a retriever finds relevant passages and a reader model generates the output. The reader model is the computational bottleneck, as it needs to process a large number of retrieved passages.
BTR addresses this by precomputing binary token representations for the retrieved passages. The key ideas are:
- Calibrated binarization: BTR binarizes the token representations in the reader encoder layers, using a calibration technique to preserve representation quality (see the sketch after this list).
- Offline compression: BTR further compresses the binary token representations by merging similar representations, reducing the storage footprint (a sketch follows the results summary below).
- Training objectives: BTR introduces two training objectives, passage representation recovery and query-aware passage token distillation, to mitigate the accuracy loss from binarization and from decomposition (i.e., computing passage representations independently of the query so they can be precomputed offline).
- Runtime compression: BTR applies additional compression techniques during inference to further improve the speed of the reader encoder and decoder.
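To make the calibrated-binarization idea concrete, here is a minimal PyTorch sketch of the general approach: token representations are centered and scaled with calibration statistics, then mapped to 1-bit values through a sign step. The function names, the choice of mean/absolute-mean statistics, and the straight-through estimator are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of calibrated binarization for precomputed passage token
# representations. Assumptions (not taken from the paper's code): calibration
# statistics are a per-dimension mean and average magnitude estimated from a
# sample of encoder outputs, and a straight-through estimator lets gradients
# pass through the sign step during training.
import torch


def fit_calibration(token_reps: torch.Tensor):
    """Estimate per-dimension centering and scale from sample encoder outputs.

    token_reps: float tensor of shape (num_tokens, hidden_dim).
    """
    mean = token_reps.mean(dim=0)                  # center of each hidden dimension
    scale = (token_reps - mean).abs().mean(dim=0)  # average magnitude per dimension
    return mean, scale


def calibrated_binarize(token_reps, mean, scale):
    """Map float token representations to {-1, +1} codes (plus the float scale)."""
    centered = token_reps - mean
    hard = torch.where(centered >= 0, torch.ones_like(centered), -torch.ones_like(centered))
    # Straight-through estimator: forward pass uses the hard codes,
    # backward pass treats the binarizer as the identity on `centered`.
    binary = centered + (hard - centered).detach()
    return binary, scale


def dequantize(binary, mean, scale):
    """Approximate reconstruction the reader would consume at inference time."""
    return binary * scale + mean
```

At indexing time only the hard codes (bit-packed) and the calibration statistics need to be stored per passage token.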
The experiments show that BTR accelerates state-of-the-art retrieval-augmented language models by 2-4x and reduces storage by over 100x while maintaining over 95% of the original models' performance on five knowledge-intensive NLP tasks.
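The "over 100x" storage reduction comes from storing 1 bit per dimension instead of 16 or 32 float bits, plus the offline merging of similar token representations described above. Below is a hypothetical NumPy sketch that bit-packs the codes and merges exact duplicates; BTR's actual similarity criterion for merging is not reproduced here.

```python
# Sketch of the offline compression step: pack each token's {-1, +1} code into
# bits and keep only unique codes, with an index from tokens to codes.
# Merging exact duplicates is an assumption for illustration; the paper merges
# "similar" representations, whose precise criterion is not shown here.
import numpy as np


def pack_and_merge(binary_codes: np.ndarray):
    """binary_codes: (num_tokens, hidden_dim) array with values in {-1, +1}."""
    bits = (binary_codes > 0).astype(np.uint8)   # {-1, +1} -> {0, 1}
    packed = np.packbits(bits, axis=1)           # 8 dimensions per stored byte
    unique_codes, token_to_code = np.unique(packed, axis=0, return_inverse=True)
    return unique_codes, token_to_code


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codes = rng.choice([-1.0, 1.0], size=(10_000, 768))   # toy corpus of token codes
    table, index = pack_and_merge(codes)
    fp32_bytes = codes.shape[0] * codes.shape[1] * 4      # storing floats instead
    packed_bytes = table.nbytes + index.nbytes
    # Random codes have no duplicates, so the saving here comes from bit-packing
    # alone; real encoder outputs contain many similar tokens that merging removes.
    print(f"fp32: {fp32_bytes:,} bytes  packed+merged: {packed_bytes:,} bytes")
```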
Statistics
The paper reports the following key metrics:
- BTR-Atlas base achieves 49.5% EM on NaturalQuestions, 66.7% EM on TriviaQA, 43.8% EM on WebQuestions, 70.2% accuracy on FEVER, and 35.4% accuracy on MMLU.
- BTR-Atlas base achieves a 3.1x inference-throughput speedup over the Atlas base model.
- BTR-Atlas large achieves 56.1% EM on NaturalQuestions, 70.8% EM on TriviaQA, 49.1% EM on WebQuestions, 75.9% accuracy on FEVER, and 39.2% accuracy on MMLU.
- BTR-Atlas large achieves a 4.0x inference-throughput speedup over the Atlas large model.
- The storage footprint of BTR-Atlas base is 127 GB, compared to 12,804 GB for the baseline DeFormer model.
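For reference, the two reported storage figures imply the overall reduction quoted in the summary:

```python
# The reported footprints imply roughly a 100x reduction, consistent with the
# "over 100x" claim in the summary above.
full_gb, btr_gb = 12_804, 127
print(f"storage reduction: {full_gb / btr_gb:.1f}x")  # ~100.8x
```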
Quotes
"BTR reduces the storage footprint and improves the runtime speed since the representations are 1-bit vectors, and the reader uses the cached representations."
"BTR tackles the challenge by building compact binary token representations for the passages."
"BTR presents better efficiency versus accuracy trade-offs by maintaining high accuracy and inference throughput with a smaller storage footprint."