Language imbalance during training can boost cross-lingual generalization in multilingual language models, leading to better performance on less frequent languages.
A mathematical theory is developed to explain the emergence of learned skills in large language models once the number of model parameters and the size of the training data surpass certain thresholds.
N-gram language models are modernized by scaling the training data to 5 trillion tokens and allowing n to be unbounded, enabling novel analyses of human-written and machine-generated text and improving the performance of large neural language models.
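A minimal Python sketch of the unbounded-n idea: back off from the longest suffix of the context that occurs in the training corpus and read off the counts of the tokens that follow it. The brute-force scan below is purely illustrative; a trillion-token system would rely on suffix-array indexes, and the function and variable names here are invented.

```python
from collections import Counter

def next_token_counts(corpus_tokens, context):
    """Back off from the longest suffix of `context` found in the corpus and
    return the distribution of tokens that follow it (toy, brute-force)."""
    for i in range(len(context)):            # longest suffix first
        suffix = context[i:]
        k = len(suffix)
        counts = Counter(
            corpus_tokens[j + k]
            for j in range(len(corpus_tokens) - k)
            if corpus_tokens[j:j + k] == suffix
        )
        if counts:
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}
    return {}

corpus = "the cat sat on the mat the cat sat on the rug".split()
print(next_token_counts(corpus, "the cat sat on the".split()))
# {'mat': 0.5, 'rug': 0.5}
```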
Prefixing each number with its digit count, a technique called NumeroLogic, enhances the numerical capabilities of large language models by helping them track place value and reason about a number's magnitude before generating its digits.
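A small sketch of the digit-count prefixing idea. The "<count>:<number>" format and the `add_digit_counts` helper are illustrative assumptions, not necessarily the exact encoding or tokenization used in the paper.

```python
import re

def add_digit_counts(text):
    """Prefix every integer in `text` with its digit count.
    The "<count>:<number>" format is an illustrative choice."""
    return re.sub(r"\d+", lambda m: f"{len(m.group())}:{m.group()}", text)

print(add_digit_counts("The invoice total is 1250 dollars, tax is 87."))
# The invoice total is 4:1250 dollars, tax is 2:87.
```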
This paper demonstrates how Large Language Models (LLMs) can be effectively used to perform reference resolution, a crucial task for conversational agents, by converting it into a language modeling problem. The authors propose a novel approach to encode on-screen entities as text, enabling the LLM to handle both conversational and on-screen references.
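A hedged sketch of how on-screen entities might be serialized into plain text for the model. The entity fields, the ordering by screen position, and the numbered-line format are assumptions for illustration, not the authors' exact encoding.

```python
def encode_screen_entities(entities):
    """Serialize on-screen entities into numbered text lines so an LLM can
    resolve references such as "the second one" or "the number at the top".
    Sorting by vertical then horizontal position approximates reading order."""
    ordered = sorted(entities, key=lambda e: (e["y"], e["x"]))
    return "\n".join(
        f'[{i}] {e["type"]}: "{e["text"]}"' for i, e in enumerate(ordered, 1)
    )

screen = [
    {"type": "phone_number", "text": "555-0142", "x": 10, "y": 120},
    {"type": "address", "text": "1 Infinite Loop", "x": 10, "y": 40},
]
print(encode_screen_entities(screen))
# [1] address: "1 Infinite Loop"
# [2] phone_number: "555-0142"
```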
Jamba is a novel hybrid language model architecture that combines Transformer and Mamba (state-space) layers, along with a mixture-of-experts (MoE) component, to achieve improved performance and efficiency compared to pure Transformer models.
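A toy schedule generator illustrating the hybrid idea: mostly state-space (Mamba) layers with periodic attention layers, and a mixture-of-experts feed-forward block on a subset of layers. The ratios and defaults below are illustrative, not Jamba's published configuration.

```python
def hybrid_layer_schedule(n_layers, attn_every=8, moe_every=2):
    """Sketch of a Jamba-style hybrid stack: each layer gets a token mixer
    (attention or Mamba) and an MLP type (dense or mixture-of-experts)."""
    schedule = []
    for i in range(n_layers):
        mixer = "attention" if i % attn_every == attn_every - 1 else "mamba"
        mlp = "moe" if i % moe_every == moe_every - 1 else "dense"
        schedule.append((mixer, mlp))
    return schedule

for i, layer in enumerate(hybrid_layer_schedule(8)):
    print(i, layer)
```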
LongLoRA presents an efficient fine-tuning approach to extend the context of large language models, reducing computational costs while maintaining performance.
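A minimal NumPy sketch of the LoRA component involved: a frozen pretrained weight plus a trainable low-rank update, so only a small fraction of parameters is tuned. LongLoRA pairs adapters of this kind with a sparse attention pattern during context-extension training; only the low-rank part is sketched here, and the class name and hyperparameters are illustrative.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W augmented with a trainable low-rank update B @ A,
    so only r * (d_in + d_out) parameters are fine-tuned."""
    def __init__(self, W, r=8, alpha=16):
        self.W = W                                   # frozen pretrained weight
        d_out, d_in = W.shape
        self.A = np.random.randn(r, d_in) * 0.01     # trainable
        self.B = np.zeros((d_out, r))                # trainable, zero-initialized
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(np.random.randn(16, 32))
print(layer(np.random.randn(4, 32)).shape)  # (4, 16)
```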
High-level semantic concepts are encoded linearly in large language models due to the next token prediction objective and the implicit bias of gradient descent.
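A toy NumPy illustration of what "encoded linearly" means: if a binary concept shifts hidden states along a single direction, the difference of class means recovers that direction and a simple projection separates the classes. The data here is synthetic, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 200
concept_dir = rng.standard_normal(d)          # hypothetical "concept" direction
labels = rng.integers(0, 2, n)                # e.g. two values of a binary concept
# Hidden states: isotropic noise plus a shift along the concept direction.
hidden = rng.standard_normal((n, d)) + np.outer(labels, concept_dir)

# If the concept is linearly encoded, the difference of class means recovers
# a direction whose projection separates the two classes.
direction = hidden[labels == 1].mean(0) - hidden[labels == 0].mean(0)
scores = hidden @ direction
accuracy = ((scores > scores.mean()) == labels).mean()
print(f"linear probe accuracy: {accuracy:.2f}")   # close to 1.0 on this toy data
```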
The release of the Invalsi dataset provides a challenging benchmark for evaluating language models in Italian, paving the way for future improvements in mathematical and language understanding.
Data mixing laws allow model performance to be predicted quantitatively from the proportions of the training data mixture before running the full training run.
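A hedged sketch of fitting such a law: assume validation loss is an exponential function of one mixture proportion, fit it on a few small-scale mixture runs, and extrapolate. The functional form, parameter names, and toy numbers below are assumptions for illustration and may differ from the laws proposed in the paper.

```python
import numpy as np

def mixing_law(r_web, a, b, t):
    """Assumed form: predicted validation loss as an exponential function of
    the web-data proportion r_web (illustrative, single-domain case)."""
    return a + b * np.exp(t * r_web)

# Toy "observed" losses from a few small-scale mixture runs.
r = np.array([0.2, 0.4, 0.6, 0.8])
loss = np.array([3.10, 2.95, 2.86, 2.81])

# Fit the three parameters by least squares over a coarse grid; in practice
# one would use a proper optimizer such as scipy.optimize.curve_fit.
best = min(
    ((a, b, t) for a in np.linspace(2.0, 3.0, 21)
               for b in np.linspace(0.1, 1.0, 19)
               for t in np.linspace(-5.0, -0.5, 19)),
    key=lambda p: np.sum((mixing_law(r, *p) - loss) ** 2),
)
print("fitted params:", best)
print("predicted loss at r_web=1.0:", mixing_law(1.0, *best))
```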