Key Concepts
Training language models specifically for data synthesis, rather than for general question answering, significantly improves the quality and effectiveness of the generated data, especially when prompt masking and training-set size are managed carefully.
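The prompt-masking idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: loss is computed only on response tokens by assigning prompt positions a sentinel label (-100, the value ignored by common cross-entropy implementations). The token ids and the helper `build_labels` are hypothetical.

```python
IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions in the labels.

    The model still sees the full input, but gradient flows only through
    the response tokens, so training focuses on synthesis quality.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Illustrative token ids: a 3-token prompt followed by a 3-token response.
input_ids, labels = build_labels([101, 7592, 102], [2023, 2003, 102])
# Only the response positions carry real labels; the prompt is masked.
```

Frameworks that use a cross-entropy loss with an ignore index would then skip the masked positions automatically.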
Statistics
NOMAD achieves gains of more than 4% on TriviaQA and more than 2% on GSM8K with limited training data.
Even with a large pool of training data available (300K examples), training the synthetic-data generation model on a smaller subset (15K examples) outperforms training on the full set.
Synthetic data generated by NOMAD, when mixed with the original training data, consistently improves downstream student-model performance across a range of tasks.