The paper explores diffusion models for text generation, an approach whose noising paradigm and training objective differ from those of traditional language models. Recent work adapts diffusion models to the text domain by mapping discrete tokens to embeddings and then running a continuous diffusion process over those embeddings.
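To make the embedding-diffusion setup concrete, here is a minimal NumPy sketch of the general idea: tokens are looked up in an embedding table, Gaussian noise is mixed in, and continuous vectors are mapped back to tokens by nearest-neighbor rounding. The function names, the `alpha_bar_t` parameter, and the standard Gaussian forward process are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def embed_and_noise(token_ids, embedding, alpha_bar_t, rng):
    """Look up token embeddings, then apply one Gaussian forward-noising step.

    alpha_bar_t is the (assumed) cumulative signal level in [0, 1]:
    1.0 means no noise, 0.0 means pure noise.
    """
    z0 = embedding[token_ids]                     # shape (seq_len, dim)
    eps = rng.standard_normal(z0.shape)           # standard Gaussian noise
    zt = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return z0, zt

def round_to_tokens(z, embedding):
    """Map each continuous vector to the index of its nearest embedding."""
    # Squared Euclidean distances, shape (seq_len, vocab_size)
    dists = ((z[:, None, :] - embedding[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
embedding = rng.standard_normal((10, 8))          # hypothetical 10-token vocabulary
token_ids = np.array([3, 1, 4, 1, 5])
z0, zt = embed_and_noise(token_ids, embedding, alpha_bar_t=0.9, rng=rng)
recovered = round_to_tokens(z0, embedding)        # rounding the clean embeddings
```

Rounding the clean embeddings `z0` returns the original tokens; how reliably the noised `zt` rounds back depends on the noise level and on how well separated the embeddings are.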
The authors identify two key challenges in optimizing embedding diffusion models:
Embedding Space Collapse: The embedding space is learnable for textual data, unlike the stationary data distributions in image and audio domains. This can lead to the collapse of the embedding space and unstable training. To address this, the authors propose an "anchor loss" that effectively regularizes the embeddings and stabilizes the training process.
Denoising Model Degeneration: The authors find that the noise levels introduced by conventional schedules are insufficient for training a desirable denoising model, leading to model degeneration. To mitigate this, they propose a "noise rescaling" framework that adaptively adjusts the noise schedule to prevent degeneration.
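The two remedies can be illustrated with a rough NumPy sketch. The summary does not give the paper's formulas, so both functions below are assumptions: `anchor_loss` is written as a cross-entropy between the ground-truth tokens and logits obtained by comparing the model's predicted embeddings against the embedding table, and `rescaled_noising` uses a fixed, hypothetical `noise_factor` multiplier as a simplified stand-in for the adaptive schedule adjustment described above.

```python
import numpy as np

def anchor_loss(pred_z0, embedding, token_ids):
    """Cross-entropy of the true tokens under logits from predicted embeddings.

    Assumed form: because the logits depend on the embedding table, the loss
    gives embeddings a training signal to stay distinguishable from each other.
    """
    logits = pred_z0 @ embedding.T                          # (seq_len, vocab_size)
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(token_ids)), token_ids].mean()

def rescaled_noising(z0, alpha_bar_t, noise_factor, rng):
    """Forward noising with the noise term scaled by a hypothetical factor.

    noise_factor > 1 injects more noise than the unscaled schedule would.
    """
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar_t) * z0 + noise_factor * np.sqrt(1.0 - alpha_bar_t) * eps

embedding = np.eye(4) * 10.0                  # toy, well-separated embedding table
token_ids = np.array([0, 2, 3])
good = anchor_loss(embedding[token_ids], embedding, token_ids)   # close to zero
bad = anchor_loss(embedding[[1, 1, 1]], embedding, token_ids)    # large
noisy = rescaled_noising(embedding[token_ids], 0.5, 2.0, np.random.default_rng(0))
```

In this toy setup the loss is near zero when the predicted embeddings match the correct tokens and large when they point at the wrong ones, which is the kind of signal that would discourage embeddings from collapsing together.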
Building on these two techniques, the authors introduce Difformer, a denoising diffusion Transformer model for text generation. Experiments on machine translation, summarization, and paraphrasing demonstrate the effectiveness of the proposed techniques and show that Difformer outperforms previous state-of-the-art embedding diffusion models.
The paper also analyzes the inference speed and diversity of Difformer, showing its potential for efficient and high-quality text generation.
Source: arxiv.org