The paper presents SAFER-INSTRUCT, a framework for automatically generating large-scale preference data for training language models through reinforcement learning from human feedback (RLHF). The key components of the framework are:
Reversed Instruction Tuning: The authors fine-tune a language model (LLaMA) to perform instruction induction, i.e., to generate an instruction from a given response, inverting the typical supervised setup of generating responses from instructions.
Instruction Induction: The reverse-tuned model is then used to efficiently generate instructions targeting specific topics, such as hate speech, without relying on manually crafted prompts, which adds flexibility and diversity to the instruction data (a sketch of this step follows the list of components).
Low-quality Instruction Filtering: The generated instructions are filtered by an expert model (GPT-4), keeping only those that could plausibly elicit unsafe behavior from a language model.
Response Generation: For each retained instruction, a preferred response is generated by the expert model (GPT-4), while the dispreferred response is taken from the original harmful content datasets, yielding a preference pair (the filtering and pairing steps are sketched after the list of components).
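To make the induction step concrete, here is a minimal sketch of how a reverse-tuned LLaMA checkpoint could be prompted with unlabeled harmful content to induce an instruction. The prompt template, checkpoint name, and decoding settings are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch: prompt a reverse-tuned causal LM with a response and let it
# generate the instruction that could have produced it. Template and names are assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

REVERSED_TEMPLATE = (
    "Below is a response. Write the instruction that this response answers.\n\n"
    "### Response:\n{response}\n\n### Instruction:\n"
)

def build_reversed_example(instruction: str, response: str) -> str:
    """Format a (response -> instruction) pair for the reversed fine-tuning stage."""
    return REVERSED_TEMPLATE.format(response=response) + instruction

def induce_instruction(model, tokenizer, harmful_text: str, max_new_tokens: int = 64) -> str:
    """Given unlabeled content (e.g., a hate speech sample), generate an
    instruction that could plausibly elicit it."""
    prompt = REVERSED_TEMPLATE.format(response=harmful_text)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True, top_p=0.9)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Usage (checkpoint name is illustrative; in practice this would be the reverse-tuned model):
# tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
# model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b", device_map="auto")
# instruction = induce_instruction(model, tokenizer, hate_speech_sample)
```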
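The filtering and response-generation steps could then look like the following, with GPT-4 accessed through the OpenAI API. The judging prompt, the yes/no protocol, and the (prompt, chosen, rejected) field names are assumptions rather than the paper's exact setup.

```python
# Hypothetical sketch of the filtering and pairing steps, using GPT-4 as the expert model.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

FILTER_PROMPT = (
    "You are auditing a red-teaming dataset. Answer 'yes' if the following "
    "instruction could plausibly elicit unsafe or harmful behavior from a "
    "language model, and 'no' otherwise.\n\nInstruction: {instruction}\nAnswer:"
)

def keep_instruction(instruction: str) -> bool:
    """Ask the expert model whether an induced instruction is actually unsafe-eliciting."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": FILTER_PROMPT.format(instruction=instruction)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

def make_preference_pair(instruction: str, harmful_response: str) -> dict:
    """Pair the expert model's (preferred) response with the original harmful
    content (dispreferred) for the same induced instruction."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instruction}],
    )
    return {
        "prompt": instruction,
        "chosen": resp.choices[0].message.content,  # safe response from the expert model
        "rejected": harmful_response,               # original harmful content
    }

# pairs = [make_preference_pair(i, r) for i, r in induced_pairs if keep_instruction(i)]
```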
The authors apply this SAFER-INSTRUCT framework to construct a safety preference dataset, which they use to fine-tune an Alpaca model. The resulting model significantly outperforms other Alpaca-based models in terms of harmlessness, while maintaining competitive performance on downstream tasks.
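The final preference-tuning stage could be sketched as below, assuming Direct Preference Optimization via Hugging Face's trl library on the (prompt, chosen, rejected) records built above; the paper's actual training recipe, base checkpoint, and hyperparameters may differ, and trl's constructor argument names vary across versions.

```python
# A minimal sketch of preference tuning an Alpaca-style model, assuming DPO via trl.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "chavinlo/alpaca-native"  # illustrative Alpaca checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# `make_preference_pair` from the previous sketch would populate this list.
preference_records = [
    {"prompt": "...", "chosen": "...", "rejected": "..."},
]
train_dataset = Dataset.from_list(preference_records)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="alpaca-safer-instruct", beta=0.1),
    train_dataset=train_dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older trl versions
)
trainer.train()
```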
Key insights distilled from the source content by Taiwei Shi, K... on arxiv.org, 04-02-2024: https://arxiv.org/pdf/2311.08685.pdf