The study demonstrates that language models can effectively transfer knowledge across diverse languages, with the transfer being largely independent of language proximity. This suggests the presence of language-agnostic representations that enable cross-lingual generalization.
HGRN2 introduces a simple outer-product-based state expansion mechanism that significantly increases the recurrent state size of HGRN without adding parameters, leading to improved performance in language modeling, image classification, and long-range tasks.
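A minimal sketch of what such an outer-product state expansion can look like, assuming a gated linear recurrence in the HGRN style; the projections `Wq`, `Wv`, `Wf` and the exact gating form are illustrative, not the paper's implementation:

```python
import numpy as np

def hgrn2_layer(x, Wq, Wv, Wf):
    # x: (T, d) inputs; Wq, Wv, Wf: (d, d) projections.
    # The recurrent state S is a (d, d) matrix rather than a (d,) vector:
    # it is expanded via an outer product, so the state grows d-fold while
    # the parameter count stays the same.
    T, d = x.shape
    S = np.zeros((d, d))
    outputs = []
    for t in range(T):
        q = x[t] @ Wq                                # query for readout
        v = x[t] @ Wv                                # value to be stored
        f = 1.0 / (1.0 + np.exp(-(x[t] @ Wf)))      # forget gate in (0, 1)
        # The gate's complement (1 - f) doubles as the "key", so the
        # expansion introduces no new parameters.
        S = S * f[None, :] + np.outer(v, 1.0 - f)
        outputs.append(S @ q)                        # read the matrix state
    return np.stack(outputs)                         # (T, d)
```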
This work introduces an efficient attention mechanism called Infini-attention that enables Transformer-based Large Language Models (LLMs) to effectively process infinitely long inputs with bounded memory and computation.
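A single-head, hedged sketch of the segment-level computation: local softmax attention over the current segment is blended with a retrieval from a fixed-size linear-attention memory, which is then updated in place. The scalar `beta`, the ELU+1 feature map, and all shapes are simplifications of the paper's formulation:

```python
import numpy as np

def elu1(x):
    # ELU(x) + 1: a positive feature map commonly used for linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def infini_segment(Q, K, V, M, z, beta=0.0):
    # Q, K, V: (seg_len, d) for the current segment; M: (d, d) and z: (d,)
    # form the compressive memory. Their size is fixed, so total memory is
    # bounded no matter how many segments are streamed through.
    d = Q.shape[-1]
    local = softmax(Q @ K.T / np.sqrt(d)) @ V         # in-segment attention
    # (causal masking omitted for brevity)
    qf = elu1(Q)
    mem = (qf @ M) / ((qf @ z)[:, None] + 1e-6)       # read from memory
    g = 1.0 / (1.0 + np.exp(-beta))                   # scalar blend gate
    out = g * mem + (1.0 - g) * local                 # (learned in the paper)
    kf = elu1(K)
    M = M + kf.T @ V                                  # write new associations
    z = z + kf.sum(axis=0)                            # update normalizer
    return out, M, z
```

Streaming works by carrying `M` and `z` across successive calls: their footprint stays at d*d + d floats regardless of total input length.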
MiniCPM, a series of small language models with 1.2B and 2.4B non-embedding parameters, demonstrates capabilities on par with 7B-13B large language models through meticulous model wind tunnel experiments, a novel Warmup-Stable-Decay learning rate scheduler, and a two-stage pre-training strategy.
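The Warmup-Stable-Decay schedule itself is straightforward to sketch; the phase fractions and the linear decay shape below are illustrative defaults rather than MiniCPM's tuned settings:

```python
def wsd_lr(step, total_steps, peak_lr,
           warmup_frac=0.1, decay_frac=0.1, min_lr=0.0):
    # Three phases: linear warmup, a long constant "stable" plateau at the
    # peak rate, and a short final decay (linear here for simplicity).
    warmup_steps = max(int(total_steps * warmup_frac), 1)
    decay_steps = max(int(total_steps * decay_frac), 1)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps            # warmup
    if step < stable_end:
        return peak_lr                                  # stable plateau
    frac = min((step - stable_end) / decay_steps, 1.0)  # final decay
    return peak_lr + (min_lr - peak_lr) * frac
```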
StreamingLLM is an efficient framework that enables large language models trained with a finite attention window to handle text of effectively infinite length, without fine-tuning, by leveraging attention sinks.
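A hedged sketch of the cache-eviction rule this implies, assuming the KV cache is held as per-token lists; `n_sink` and `window` are illustrative sizes:

```python
def evict_kv_cache(keys, values, n_sink=4, window=1024):
    # Keep the first n_sink tokens (the "attention sinks") plus a sliding
    # window of the most recent tokens, and drop everything in between.
    # The sink tokens absorb the large attention mass that would otherwise
    # be misallocated once early positions are evicted.
    T = len(keys)
    if T <= n_sink + window:
        return keys, values                  # nothing to evict yet
    keep = list(range(n_sink)) + list(range(T - window, T))
    return [keys[i] for i in keep], [values[i] for i in keep]
```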
This paper proposes the use of "multicalibration" to yield interpretable and reliable confidence scores for outputs generated by large language models (LLMs), which can help detect hallucinations.
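Multicalibration itself can be sketched as an iterative "patching" loop over group-by-confidence-bin cells, following the generic algorithm rather than this paper's exact procedure; `groups` is assumed to be a list of boolean masks over examples, and the bin count and tolerance are illustrative:

```python
import numpy as np

def multicalibrate(scores, labels, groups, n_bins=10, tol=0.02, max_iters=50):
    # Within each (group, confidence-bin) cell, if the mean score differs
    # from the empirical accuracy by more than tol, shift the cell's scores
    # toward the empirical rate; repeat until no cell violates.
    s = np.clip(scores.astype(float).copy(), 0.0, 1.0)
    for _ in range(max_iters):
        violated = False
        for g in groups:
            bins = np.minimum((s * n_bins).astype(int), n_bins - 1)
            for b in range(n_bins):
                cell = g & (bins == b)
                if cell.sum() < 2:
                    continue
                gap = labels[cell].mean() - s[cell].mean()
                if abs(gap) > tol:
                    s[cell] = np.clip(s[cell] + gap, 0.0, 1.0)
                    violated = True
        if not violated:
            break
    return s
```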
This work employs dense training and sparse inference to improve the parameter efficiency of Mixture-of-Experts (MoE) language models while maintaining performance comparable to dense models.
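A sketch of the dense-training / sparse-inference switch for a single token, assuming a softmax router over callable expert FFNs; the top-k selection rule is a common choice and a stand-in for the paper's exact recipe:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def moe_layer(x, router_W, experts, top_k=2, training=True):
    # x: (d,) token activation; router_W: (d, n_experts); experts: list of
    # callables (the expert FFNs). Training runs every expert, weighted by
    # its routing probability (dense); inference runs only the top-k
    # experts and renormalizes their weights (sparse), saving compute.
    gates = softmax(x @ router_W)
    if training:
        return sum(g * e(x) for g, e in zip(gates, experts))  # dense pass
    idx = np.argsort(gates)[-top_k:]                          # sparse pass
    w = gates[idx] / gates[idx].sum()
    return sum(wi * experts[i](x) for wi, i in zip(w, idx))
```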
Auto-regressive large language models (LLMs) trained on sentences of the form "A is B" fail to generalize to the reverse direction "B is A".
This work proposes a model-aware approach that leverages the language model's token embeddings to efficiently determine when retrieval augmentation is necessary, without requiring access to sensitive pre-training data.
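One plausible shape for such a gate, assuming a mean-pooled token-embedding feature and a small logistic probe; the probe parameters `clf_w`, `clf_b` and the threshold are hypothetical stand-ins for the paper's method:

```python
import numpy as np

def needs_retrieval(token_ids, embedding_matrix, clf_w, clf_b, threshold=0.5):
    # Mean-pool the LM's own token embeddings for the query and score them
    # with a tiny logistic probe that predicts whether the model likely
    # lacks the relevant knowledge; retrieval is triggered only then.
    # No pre-training data is consulted at decision time.
    feats = embedding_matrix[token_ids].mean(axis=0)     # (d,) pooled query
    p = 1.0 / (1.0 + np.exp(-(feats @ clf_w + clf_b)))   # P(needs retrieval)
    return p > threshold
```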
Language models sometimes use a simple vector arithmetic mechanism to solve relational tasks by leveraging regularities encoded in their hidden representations.
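The mechanism is the classic word2vec-style offset applied inside the model's hidden space; a toy illustration, with a dict of vectors standing in for hidden representations:

```python
import numpy as np

def apply_relation(emb, a, b, x):
    # Approximate a relation (e.g. country -> capital) by a single offset
    # vector estimated from one example pair (a, b), then apply it to a
    # new argument x and decode by nearest-neighbor (cosine) lookup.
    relation = emb[b] - emb[a]            # e.g. emb["Warsaw"] - emb["Poland"]
    target = emb[x] + relation            # move x along the same direction
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)
    return max((w for w in emb if w not in (a, b, x)),
               key=lambda w: cos(emb[w], target))
```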