Core Concepts
Instruction-following capability is an important aspect of large language models, but existing benchmarks primarily focus on common instructions that align well with model priors. This paper proposes a novel evaluation protocol called verbalizer manipulation to systematically assess models' ability to follow instructions that may not align with their prior knowledge.
Abstract
The paper proposes a novel instruction-following evaluation protocol called verbalizer manipulation. It involves constructing instructions that align with model priors to different extents - from natural (e.g., outputting "positive" for positive sentiment), to neutral (e.g., outputting "foo" for positive sentiment), to unnatural (e.g., outputting "negative" for positive sentiment).
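To make the three levels concrete, here is a minimal Python sketch of how such instructions could be constructed for a binary sentiment task. The verbalizer mappings follow the paper's own examples; the function name, instruction wording, and example review are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of verbalizer manipulation for binary sentiment classification.
# Mappings follow the paper's examples; everything else is illustrative.
VERBALIZERS = {
    "natural":   {0: "negative", 1: "positive"},  # matches model priors
    "neutral":   {0: "bar",      1: "foo"},       # no prior association
    "unnatural": {0: "positive", 1: "negative"},  # flipped against priors
}

def build_prompt(text: str, level: str) -> str:
    """Instruct the model to answer with the chosen verbalizer pair."""
    pos_token = VERBALIZERS[level][1]  # token to emit for positive sentiment
    neg_token = VERBALIZERS[level][0]  # token to emit for negative sentiment
    return (
        f'Output "{pos_token}" if the sentiment of the review is positive '
        f'and "{neg_token}" if it is negative.\n\n'
        f"Review: {text}\nAnswer:"
    )

print(build_prompt("A delightful, sharply written film.", "unnatural"))
```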
The authors evaluate four major model families (Flan-T5, GPT-Series, Vicuna, OPT-IML) across nine datasets and twelve sets of verbalizers. They find that:
Larger models generally perform better on both natural and neutral instructions, indicating that scaling is an effective way to improve instruction-following ability.
However, model performance diverges significantly on unnatural instructions, with no clear or consistent trend across model families. Even the strongest GPT-4 model struggles to perform better than random guessing on the most challenging verbalizer (a scoring sketch follows this list).
Examining verbalizers one by one, the authors find that models are largely insensitive to the choice of verbalizer under natural instructions, but performance diverges significantly under unnatural instructions, varying with both model family and verbalizer.
Adding zero-shot chain-of-thought prompting can improve model performance on unnatural instructions, but large gaps remain relative to natural instructions, especially for weaker model families (see the prompting sketch after this list).
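As referenced above, a hedged sketch of how instruction following could be scored under a manipulated verbalizer: the model is credited only when it emits the instructed token, even when that token contradicts its priors. `query_model` is a hypothetical stand-in for any text-generation call, and the loop reuses `build_prompt` from the earlier sketch. For a binary task, random guessing sits at 50%, the baseline GPT-4 reportedly fails to clear on the most challenging verbalizer.

```python
# Hypothetical scoring loop; `query_model` is a stand-in, not a real API.
# An instruction counts as "followed" only if the model emits the
# manipulated token, even when that token contradicts its priors.
def accuracy(examples, level, query_model):
    """Fraction of (text, gold_label) pairs answered with the instructed token."""
    correct = 0
    for text, gold_label in examples:  # gold_label: 0 = negative, 1 = positive
        answer = query_model(build_prompt(text, level)).strip().lower()
        # Under "unnatural", the target for a positive review is "negative".
        if answer == VERBALIZERS[level][gold_label]:
            correct += 1
    return correct / len(examples)
```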
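And a sketch of the zero-shot chain-of-thought variant: appending a reasoning trigger lets the model emit intermediate steps before its final answer. Whether the paper uses this exact trigger phrase is an assumption here; "Let's think step by step" is the standard zero-shot CoT formulation (Kojima et al., 2022).

```python
# Zero-shot CoT variant: append a reasoning trigger after "Answer:" so the
# model can produce intermediate steps. The exact trigger wording is an
# assumption, following the standard zero-shot CoT phrasing.
def build_cot_prompt(text: str, level: str) -> str:
    return build_prompt(text, level) + " Let's think step by step."
```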
The results highlight the need for continued advancements to improve the instruction-following abilities of large language models, as they still have fundamental limitations in following instructions that contradict their prior knowledge.
Quotes
"Even the strongest GPT-4 model struggles to perform better than random guessing on the most challenging verbalizer, emphasizing the need for continued advancements to improve their instruction-following abilities."
"When model scales to larger sizes, they still have difficulty in following instructions contradicting to prior knowledge even though they are allowed to output intermediate reasoning steps."