Core Concepts
Neologisms significantly degrade LLM performance, motivating a new benchmark for evaluation.
1. Abstract:
- Neologisms cause temporal drift in LLMs due to data misalignment.
- NEO-BENCH evaluates LLMs' ability to handle neologisms.
2. Introduction:
- Humans adapt easily to language changes, but LLMs struggle.
- Prior work on temporal language change lacks analysis of neologism robustness.
3. Data Collection Methods:
- Three methods used to collect 2,505 neologisms from various sources.
- Semantic neologisms are infrequent but crucial for understanding language evolution.
4. Benchmark Tasks:
- NEO-BENCH includes Machine Translation, Cloze QA, Definition Generation tasks.
- Older models perform worse on neologisms than newer models.
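To make the Cloze QA task concrete, here is a minimal sketch of how such an item might be scored. The `score` function is a stand-in for a real language-model log-likelihood (the actual NEO-BENCH setup is not reproduced here); it naively rewards candidates whose surrounding words overlap with a toy corpus, and the item data is invented for illustration.

```python
# Hypothetical Cloze QA item: the model must fill the blank with the
# correct neologism chosen from a small set of candidates.

TOY_CORPUS = "people who doomscroll late at night sleep poorly"

def score(sentence: str) -> float:
    # Placeholder for an LM log-probability: count how many words of the
    # filled-in sentence also appear in the toy corpus.
    corpus_words = set(TOY_CORPUS.split())
    return sum(w.lower().strip(".") in corpus_words for w in sentence.split())

def answer_cloze(template: str, candidates: list[str]) -> str:
    # Fill the blank with each candidate and keep the highest-scoring one.
    return max(candidates, key=lambda c: score(template.replace("____", c)))

item = {
    "template": "I stayed up all night and began to ____ through bad news.",
    "candidates": ["doomscroll", "jog", "cook"],
}
print(answer_cloze(item["template"], item["candidates"]))  # doomscroll
```

In a real evaluation, `score` would be a model's conditional log-probability over the filled-in sentence, and accuracy would be averaged over the benchmark's neologism items.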
5. Key Findings:
- Automatic metrics cannot accurately evaluate MT models handling neologisms.
- GPT-4's knowledge of neologisms is task-specific.
- Models perform worse on neologisms compared to pre-existing words.
6. Related Work:
- Previous studies focus on temporal drift and named entities in LLMs.
- Prior neologism-collection methods lack semantic diversity.
7. Conclusion:
- NEO-BENCH provides insights into the impact of neologisms on LLMs.
Stats
A single neologism in a source sentence decreases machine translation quality by 43% in human evaluation (§2).
Adding a neologism to a sentence decreases model performance in machine translation by 44% (§7).