Variance-reduced zeroth-order optimization methods can effectively fine-tune large language models while significantly reducing memory requirements compared to first-order methods.
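The memory savings come from needing only forward passes: the gradient is estimated by finite differences along random directions rather than by backpropagation. Below is a minimal PyTorch sketch of one such step; the two-point SPSA estimator, the averaging over n_perturb directions as the variance-reduction device, and the loss_fn interface are illustrative assumptions rather than the exact published algorithm.

```python
import torch

def zo_sgd_step(params, loss_fn, lr=1e-6, eps=1e-3, n_perturb=4):
    """One zeroth-order SGD step using SPSA-style two-point estimates.

    Averaging n_perturb independent direction estimates is a simple
    variance-reduction device; only forward evaluations of loss_fn are
    needed, so no activation memory for backprop is required.
    """
    grads = [torch.zeros_like(p) for p in params]
    for _ in range(n_perturb):
        # Sample a random direction and probe the loss at +eps and -eps.
        zs = [torch.randn_like(p) for p in params]
        for p, z in zip(params, zs):
            p.data.add_(eps * z)
        loss_plus = float(loss_fn())
        for p, z in zip(params, zs):
            p.data.sub_(2 * eps * z)
        loss_minus = float(loss_fn())
        for p, z in zip(params, zs):
            p.data.add_(eps * z)  # restore the original parameters
        # Projected finite-difference gradient along the sampled direction.
        coeff = (loss_plus - loss_minus) / (2 * eps * n_perturb)
        for g, z in zip(grads, zs):
            g.add_(coeff * z)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.data.sub_(lr * g)
```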
Although the continuous embedding space is more expressive than the discrete token space, soft prompting and prefix-tuning can be less expressive than full fine-tuning, even with the same number of learnable parameters: prefix-tuning cannot change the relative attention pattern over the content and can only bias the outputs of an attention layer in a fixed direction.
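The second claim follows from standard softmax algebra; the sketch below is a reconstruction, not a quotation from the paper. Concatenating prefix keys and values splits the attention softmax into two renormalized parts.

```latex
% Single-head attention for a query x over content tokens X with prefix P.
% Splitting the softmax over the concatenation [P; X] gives
\[
\operatorname{Attn}(x,[P;X])
  \;=\; \alpha(x)\,\operatorname{Attn}(x,X)
  \;+\; \bigl(1-\alpha(x)\bigr)\,\operatorname{Attn}(x,P),
\]
\[
\alpha(x) \;=\;
  \frac{\sum_{j\in X}\exp\!\bigl(q(x)^{\top}k_j/\sqrt{d}\bigr)}
       {\sum_{j\in X\cup P}\exp\!\bigl(q(x)^{\top}k_j/\sqrt{d}\bigr)} .
\]
% The first term preserves the relative attention distribution over the
% content (it is only rescaled by alpha), while the second term is a bias
% whose direction is fixed by the prefix values and independent of the
% content tokens -- hence prefix-tuning can only shift the layer output.
```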
Language-independent representations improve the performance of zero-shot summarization by decoupling task-specific knowledge from language-specific abilities.
The number of languages used in instruction fine-tuning of large language models can significantly impact their performance on multilingual tasks, but there is no consistent optimal number across different benchmarks and languages.
Incorporating additional Kullback-Leibler (KL) regularization and using a mixture of previous iterates as the opponent can mitigate performance instability issues in the self-play fine-tuning (SPIN) approach for aligning language models with human preferences.
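One plausible way to write the modified objective is sketched below, with notation mirroring the original SPIN loss; the logistic loss ℓ, the mixture weights α_k, the reference policy π_ref, and the coefficient β are assumptions of this sketch.

```latex
\[
\min_{\theta}\;
\mathbb{E}_{x\sim q,\; y\sim p_{\mathrm{data}}(\cdot\mid x),\; y'\sim \pi_{\mathrm{mix}}(\cdot\mid x)}
\,\ell\!\left(
  \lambda\log\frac{\pi_{\theta}(y\mid x)}{\pi_{\mathrm{mix}}(y\mid x)}
  \;-\;
  \lambda\log\frac{\pi_{\theta}(y'\mid x)}{\pi_{\mathrm{mix}}(y'\mid x)}
\right)
\;+\;
\beta\,\mathrm{KL}\!\bigl(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\bigr),
\]
\[
\pi_{\mathrm{mix}} \;=\; \sum_{k\le t}\alpha_k\,\pi_{\theta_k},
\qquad \sum_{k\le t}\alpha_k = 1 .
\]
% The opponent at iteration t is a mixture of previous iterates rather than
% only the latest one, and the KL term anchors the update to a reference
% policy, both of which damp oscillation between self-play iterations.
```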
A position-aware parameter-efficient fine-tuning approach can mitigate the inherent positional bias in pre-trained large language models.
LLaMA-Excitor is a lightweight method that improves the instruction-following ability of large language models such as LLaMA by gradually directing more attention to worthwhile information, without directly changing the intermediate hidden states during the self-attention computation.
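A minimal sketch of the general idea follows: re-weight attention over the original value vectors instead of injecting new ones. The module name, the prompt-similarity bias, and the zero-initialized gate are assumptions of this sketch, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExcitationAttention(nn.Module):
    """Self-attention whose scores are nudged by learnable prompts.

    A per-key-position bias derived from learnable prompt vectors shifts
    attention toward tokens the prompts mark as worthwhile, while the value
    vectors (and hence the hidden states being mixed) are left untouched.
    The zero-initialized gate makes the module a no-op at the start of
    fine-tuning, so behaviour departs from the frozen model only gradually.
    """

    def __init__(self, dim: int, n_prompts: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.prompt_keys = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # (B, T, T)
        # Per-position "worthwhileness" from similarity to the prompts;
        # adding it to the logits re-weights attention without new values.
        importance = (k @ self.prompt_keys.T).amax(dim=-1)      # (B, T)
        scores = scores + torch.tanh(self.gate) * importance.unsqueeze(-2)
        attn = F.softmax(scores, dim=-1)
        return attn @ v

# Usage on dummy activations: output stays a convex combination of v.
layer = ExcitationAttention(dim=64)
out = layer(torch.randn(2, 10, 64))
```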
Reinforcement Learning from Human Feedback (RLHF) is an effective approach to aligning large language models (LLMs) with human preferences, but the reward model can suffer from inaccuracy due to distribution shift. This paper proposes Reward Learning on Policy (RLP), an unsupervised framework that refines the reward model using policy samples to keep it on-distribution, improving the overall RLHF performance.
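The sketch below illustrates the on-distribution refinement idea with a toy reward model; the tiny encoder, the use of dropout noise as the two "views", random tensors standing in for features of policy-generated responses, and the InfoNCE objective are all assumptions of this sketch, standing in for the paper's unsupervised multi-view step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Toy reward model: an encoder with dropout plus a scalar reward head."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Dropout(0.1))
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        return self.head(self.encoder(x))

def multiview_refine(rm: TinyRewardModel, policy_samples: torch.Tensor,
                     steps: int = 50, tau: float = 0.1, lr: float = 1e-4):
    """Refine the reward model's representation on samples drawn from the
    current policy, so the reward model stays on-distribution as the policy
    shifts. Dropout gives two stochastic views of each sample; the InfoNCE
    loss pulls the two views together and pushes other samples apart."""
    opt = torch.optim.Adam(rm.parameters(), lr=lr)
    for _ in range(steps):
        z1 = F.normalize(rm.encoder(policy_samples), dim=-1)
        z2 = F.normalize(rm.encoder(policy_samples), dim=-1)
        logits = z1 @ z2.T / tau                     # (N, N) view similarities
        labels = torch.arange(len(policy_samples))   # positives on the diagonal
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Random features stand in for responses freshly sampled from the policy.
rm = TinyRewardModel()
multiview_refine(rm, torch.randn(256, 128))
```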