Core Concepts
Careful design of training samples can significantly improve the downstream performance of large language models, beyond the impact of prompt engineering.
Abstract
This paper introduces Sample Design Engineering (SDE) as a methodical approach to enhancing the downstream fine-tuning performance of large language models (LLMs). Through a series of in-domain and out-of-domain experiments on multi-aspect sentiment analysis tasks, the authors evaluate the impact of various SDE options, including input design (instruction placement, input modeling), output design (formatting of multiple predictions, handling of unmentioned targets, textual vs. numerical labels), and reasoning design (Chain-of-Thought).
The experiments reveal several intriguing patterns that hold consistently across different LLMs. Based on these insights, the authors propose an integrated SDE strategy (ES-SDE) that combines the most effective options. Extensive evaluations on three complex downstream tasks (Nested-NER, Event Detection, and Multi-Aspect Sentiment Analysis) demonstrate that ES-SDE notably outperforms weaker SDE combinations and heuristic designs. ES-SDE also remains robust to variations in training-set size, decoding randomness, and instruction content.
Additionally, the authors explore the relationship between effective prompt engineering (PE) and SDE, finding that well-crafted PE strategies do not necessarily translate to successful SDE strategies. This observation encourages further research into the mechanisms underlying SDE, which could lead to enhanced downstream applications of LLMs.
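To make the sample design options from the abstract concrete, below is a minimal Python sketch of how one fine-tuning sample might be assembled when combining the options the paper favors (instruction-first placement, Lines-format output, placeholders for unmentioned targets, textual labels). The task text, aspect names, and label wording are invented for illustration; only the layout choices reflect the paper, not its exact prompts.

```python
# Sketch: assembling one multi-aspect sentiment fine-tuning sample under
# ES-SDE-style options. All task text, aspect names, and labels below are
# hypothetical; only the sample layout follows the options described above.

ASPECTS = ["food", "service", "price"]                  # hypothetical target aspects
LABELS = {"food": "positive", "service": "negative"}    # "price" is unmentioned

def build_sample(review: str) -> dict:
    # Inst-first: the instruction precedes the task text in the prompt.
    instruction = (
        "Identify the sentiment toward each aspect of the review. "
        "Answer one aspect per line."
    )
    prompt = f"{instruction}\n\nReview: {review}\n\nAnswer:"

    # Lines format + PU (placeholder for unmentioned targets) + textual labels:
    # one "aspect: label" pair per line, with an explicit placeholder
    # ("not mentioned") instead of silently dropping the aspect.
    output_lines = [
        f"{aspect}: {LABELS.get(aspect, 'not mentioned')}" for aspect in ASPECTS
    ]
    response = "\n".join(output_lines)

    return {"prompt": prompt, "response": response}

if __name__ == "__main__":
    sample = build_sample("The food was great but the waiter ignored us.")
    print(sample["prompt"])
    print(sample["response"])
```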
Stats
Placing the instruction before the task text (Inst-first) outperforms placing it after (Inst-last) or omitting it (No-inst).
Computing the training loss over the input tokens as well as the output tokens (MI) leads to worse performance than restricting the loss to the output tokens (No-MI); a masking sketch follows this list.
The Lines format (one prediction per line) for multiple predictions outperforms both the free-text Natural format and the more rigid JSON format across various LLMs.
Providing placeholders for unmentioned targets (PU) is better than omitting them (OU).
Textual labels (TxtLabel) are more effective than numerical labels (NumLabel).
Chain-of-Thought (CoT) reasoning design brings notable improvements in out-of-domain tasks, but has a more subtle impact on in-domain tasks.
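The MI vs. No-MI comparison concerns whether the loss is computed over the prompt (input) tokens in addition to the response (output) tokens. One common way to realize a No-MI-style setup in causal-LM fine-tuning is to mask the prompt positions in the labels with -100 so the cross-entropy loss ignores them. The sketch below assumes a Hugging Face tokenizer; the model name and fields are placeholders, not the paper's implementation.

```python
# Sketch: No-MI-style label masking for causal-LM fine-tuning, assuming a
# Hugging Face tokenizer. Prompt tokens get label -100 so the loss is
# computed only on the response (output) tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model name

def encode_no_mi(prompt: str, response: str) -> dict:
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + response_ids
    # -100 is the ignore_index of PyTorch's cross-entropy loss, so prompt
    # positions contribute nothing to the training objective (No-MI).
    labels = [-100] * len(prompt_ids) + list(response_ids)

    return {"input_ids": input_ids, "labels": labels}

# Under the MI option, labels would instead copy input_ids at the prompt
# positions as well, so the model is also trained to reproduce the input.
```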
Quotes
"Careful design of training samples can significantly improve the downstream performance of large language models, beyond the impact of prompt engineering."
"A well-crafted prompt engineering strategy may not necessarily translate to a successful sample design engineering strategy."