Core Concepts
Retrieval-based text generation introduces a paradigm shift in language modeling by directly selecting context-aware phrases from supporting documents. The method addresses challenges in constructing training oracles through heuristic initialization and iterative self-reinforcement.
Summary
The paper presents a novel approach to text generation in which context-aware phrases are retrieved from supporting documents rather than generated token by token. By addressing the challenge of constructing training oracles, the method outperforms standard language models on various tasks and demonstrates improved quality in open-ended text generation.
Standard language models generate text using fixed vocabularies, while the proposed method selects context-aware phrases from supporting documents. Training oracles are initialized using linguistic heuristics and refined through self-reinforcement. Extensive experiments show superior performance over baselines, with increased accuracy and improved generation quality.
The approach transitions from token generation to phrase retrieval, enhancing interpretability and factuality of language models. A balanced design is emphasized, with source and target encoders for prefixes and phrases respectively. Efficient maximum inner product search algorithms are used for phrase retrieval.
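The retrieval step described above can be sketched as a brute-force maximum inner product search over a phrase index: the source encoder embeds the prefix, the target encoder embeds candidate phrases, and the phrase with the largest inner product is selected. A minimal sketch (the function name `mips` and the toy embeddings are illustrative assumptions, not from the paper; real systems would use an approximate MIPS library over millions of phrases):

```python
import numpy as np

def mips(prefix_vec, phrase_matrix, k=3):
    """Brute-force maximum inner product search.

    prefix_vec:    (dim,) embedding of the current prefix (source encoder).
    phrase_matrix: (num_phrases, dim) embeddings of candidate phrases
                   (target encoder).
    Returns the indices of the k phrases with the largest inner product.
    """
    scores = phrase_matrix @ prefix_vec          # one score per phrase
    return np.argsort(-scores)[:k]               # descending by score

# Toy index: three phrase embeddings in a 2-d space (hypothetical values).
phrases = np.array([
    [1.0, 0.0],    # phrase 0
    [0.0, 1.0],    # phrase 1
    [0.7, 0.7],    # phrase 2
])
prefix = np.array([1.0, 0.1])

top2 = mips(prefix, phrases, k=2)   # phrase 0 scores 1.0, phrase 2 scores 0.77
```

In practice the phrase index is far too large for exhaustive scoring, which is why the paper relies on efficient approximate MIPS algorithms; the brute-force version above only illustrates the scoring rule.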
Training objectives include InfoNCE loss for phrase retrieval and next-token prediction loss for token-level generation. Negative sampling techniques improve model differentiation ability, incorporating in-batch negatives and hard negatives to enhance discriminative representations.
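The contrastive part of this objective can be sketched as follows: each prefix embedding is scored against every phrase embedding in the batch, the matching phrase on the diagonal is the positive, and the remaining rows serve as in-batch negatives (hard negatives would simply be appended as extra candidate rows). A minimal numpy sketch, assuming a temperature hyperparameter; the function name and values are illustrative, not the paper's implementation:

```python
import numpy as np

def info_nce(prefix_emb, phrase_emb, temperature=0.1):
    """InfoNCE loss with in-batch negatives.

    prefix_emb, phrase_emb: (batch, dim) arrays; row i of phrase_emb is the
    positive phrase for prefix i, and all other rows act as negatives.
    """
    logits = prefix_emb @ phrase_emb.T / temperature   # (batch, batch) scores
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the true prefix-phrase pairs) as targets.
    return -np.mean(np.diag(log_probs))

# Perfectly aligned pairs yield a near-zero loss; mismatched pairs do not.
emb = np.eye(4)
low = info_nce(emb, emb)                      # positives dominate the scores
high = info_nce(emb, np.roll(emb, 1, axis=0)) # positives no longer match
```

Pulling the diagonal scores up while pushing the off-diagonal (negative) scores down is what gives the encoders their discriminative representations; harder negatives tighten this contrast further.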
Experiments demonstrate the effectiveness of the method on knowledge-intensive tasks like question answering and open-ended text generation. Results show significant improvements over standard LMs and state-of-the-art methods across various datasets.
Human evaluation scores indicate better performance of the proposed method in coherence, informativeness, fluency, and grammar compared to baseline models. The model also achieves high MAUVE scores with balanced coherence and diversity metrics.
Ablation studies highlight the importance of self-reinforcement mechanisms in enhancing text generation quality over multiple rounds of training iterations. The proposed approach offers scalability with domain-specific indices for improved performance without additional training.
Statistics
Our model raises accuracy from 23.47% to 36.27% on OpenbookQA.
MAUVE score improves from 42.61% to 81.58% in open-ended text generation.
The model exhibits the fastest generation speed among retrieval-augmented baselines.
Proposed model outperforms various baseline models across all datasets.
An enlarged phrase index boosts the model's performance across different datasets.
Domain-specific index enhances performance on medical QA datasets.
Quotes
"Our model not only outperforms standard language models but also demonstrates improved quality in open-ended text generation."
"Our study can inspire future research to build more efficient and accurate LMs that harness retrieval-based approaches."