Core Concept
IMPOSSIBLE DISTILLATION distills high-quality paraphrase datasets and models from low-quality LMs using paraphrastic proximity and critic-guided filtering.
Abstract
IMPOSSIBLE DISTILLATION introduces a novel framework for paraphrasing and sentence summarization.
The framework leverages the paraphrastic proximity intrinsic to pre-trained LMs such as GPT-2 to distill high-quality datasets and models.
By identifying pairs of generations drawn from proximal subspaces of the LM distribution (sketched below), the distilled models outperform strong baselines on multiple benchmarks.
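To make the pair-generation idea concrete, here is a minimal sketch assuming the Hugging Face transformers API: sample several continuations of one shared context from an off-the-shelf GPT-2 and treat every pair of continuations as a paraphrase candidate. The context string, decoding parameters, and sample count are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of pair generation via paraphrastic proximity: sample
# several continuations of one shared context from an off-the-shelf GPT-2,
# then pair the continuations as paraphrase *candidates* for later filtering.
# The context and decoding parameters are illustrative assumptions.
import itertools

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = "The city council announced that"  # shared prefix anchoring a proximal subspace
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,             # stochastic decoding yields diverse continuations
        top_p=0.9,
        max_new_tokens=30,
        num_return_sequences=8,     # several samples from the same neighborhood
        pad_token_id=tokenizer.eos_token_id,
    )

prompt_len = inputs["input_ids"].shape[1]
continuations = [
    tokenizer.decode(seq[prompt_len:], skip_special_tokens=True).strip()
    for seq in outputs
]

# Every unordered pair of continuations becomes a paraphrase candidate.
candidate_pairs = list(itertools.combinations(continuations, 2))
```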
The distilled dataset exhibits higher diversity and fidelity than larger datasets such as ParaBank or ChatGPT-Para.
The pipeline proceeds through pair generation, critic-guided filtering, distillation into a student model, self-distillation, controllability enhancement, domain-specific evaluation, and generalization to sentence summarization; a sketch of the critic stage follows.
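The critic stage can be sketched as follows, with assumptions labeled plainly: a bidirectional NLI entailment check stands in for a semantic-equivalence critic, and a unigram-overlap check for a dissimilarity critic. The roberta-large-mnli model, the thresholds, and the example pairs are illustrative choices, not the paper's exact critics.

```python
# Minimal sketch of critic-guided filtering (assumed critics, not the
# paper's exact ones): keep a candidate pair only if an NLI critic finds
# the two sentences mutually entailing (semantic equivalence) and a
# surface-overlap critic confirms they are not near-copies (dissimilarity).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
nli_model.eval()

ENTAILMENT = 2  # label index of the entailment class in roberta-large-mnli


def entails(premise: str, hypothesis: str, threshold: float = 0.8) -> bool:
    """Semantic-equivalence critic: P(entailment | premise, hypothesis)."""
    enc = nli_tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli_model(**enc).logits.softmax(dim=-1)
    return probs[0, ENTAILMENT].item() >= threshold


def dissimilar(a: str, b: str, max_overlap: float = 0.6) -> bool:
    """Dissimilarity critic: reject near-copies via unigram Jaccard overlap."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1) <= max_overlap


# Hypothetical candidate pairs, as produced by the generation step above.
candidate_pairs = [
    ("The council approved the new budget.",
     "The new budget was approved by the council."),
    ("The council approved the new budget.",
     "The council rejected the proposal outright."),
]

# A pair enters the distilled dataset only if it passes every critic,
# with entailment checked in both directions.
dataset = [(x, y) for x, y in candidate_pairs
           if entails(x, y) and entails(y, x) and dissimilar(x, y)]
```

In the full pipeline described above, the pairs that survive the critics supervise a student model, which is then refined through a further round of self-distillation.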
Key Statistics
IMPOSSIBLE DISTILLATION produces a high-quality dataset even from GPT-2-scale LMs.
Our model with 770M parameters consistently outperforms strong baselines on multiple benchmarks.
The dataset distilled from 1.5B-parameter LMs achieves better quality metrics than state-of-the-art datasets such as ParaBank or ChatGPT-Para.