Core Concepts
Large language models (LLMs) struggle to follow sequences of instructions, even when those instructions are logically connected, highlighting a critical area for improvement in LLM robustness.
Stats
The SIFo benchmark contains 800 samples in total, with 200 samples for each of the four tasks.
The average number of instructions per sample varies across tasks, ranging from 4.0 to 4.6.
GPT-4 achieved a sample-level accuracy of 42.50% on the Text Modification task, significantly outperforming open-source models.
Llama3-70B-Instruct, the best-performing open-source model, achieved a sample-level accuracy of 39.00% on the Question Answering task.
Closed-source models (GPT-4 and Claude-3) consistently demonstrated higher accuracy and greater instruction-following depth than open-source models across all tasks (see the metric sketch below).
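As a rough illustration of the metrics reported above, the following is a minimal Python sketch, assuming that sample-level accuracy counts a sample as correct only when every instruction in it is followed, and that per-step (instruction-level) accuracy is what degrades at later positions, as the quote below notes. The function names and data layout are hypothetical and not taken from the SIFo codebase.

```python
# Hypothetical sketch (not from the SIFo repo): computing sample-level and
# per-step accuracy from per-instruction correctness judgments.
# `results` maps each sample to a list of booleans, one per instruction,
# ordered by position in the sequence.
from collections import defaultdict

def sample_level_accuracy(results: list[list[bool]]) -> float:
    """A sample counts as correct only if every instruction in it is followed."""
    return sum(all(steps) for steps in results) / len(results)

def per_step_accuracy(results: list[list[bool]]) -> dict[int, float]:
    """Accuracy at each instruction position (1-indexed); samples may differ in length."""
    correct, total = defaultdict(int), defaultdict(int)
    for steps in results:
        for pos, ok in enumerate(steps, start=1):
            correct[pos] += ok
            total[pos] += 1
    return {pos: correct[pos] / total[pos] for pos in sorted(total)}

# Toy usage: three samples with four instructions each.
results = [
    [True, True, True, False],   # fails at step 4
    [True, True, True, True],    # fully correct
    [True, True, False, False],  # fails from step 3 onward
]
print(sample_level_accuracy(results))  # 0.333...
print(per_step_accuracy(results))      # ~{1: 1.0, 2: 1.0, 3: 0.67, 4: 0.33}
```

Returning per-step accuracy as a position-indexed mapping makes the degradation at later steps directly visible, while the stricter sample-level metric penalizes any failure anywhere in the sequence.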
Quotes
"Models exhibit different abilities to follow instructions in later sequence steps; even the most powerful models perform significantly poorer in later steps."
"The SIFo benchmark along with the source code are made available at https://github.com/shin-ee-chen/SIFo."