Core Concepts
Language models struggle to perform human-like planning tasks, revealing the limitations of current models.
Summary
1. Abstract:
- Interest in language models' planning abilities is growing.
- Current studies lack linguistic complexity and domain diversity.
- The PARADISE dataset introduces abductive reasoning tasks (warning and tip inference).
- Fine-tuned small models outperform large language models in most scenarios.
- Models fall short of human performance.
2. Introduction:
- Breakthroughs in large language models as planners.
- Most existing studies rely on toy simulation environments.
- Planning tasks are mostly generation problems.
- Evaluating open-domain planning abilities remains a challenge.
3. Task Formulation:
- Warning and tip inference are framed as multiple-choice questions.
- Goals serve as the questions; warnings and tips are the answer choices.
- Example tasks are provided for both warning and tip inference (a data-structure sketch follows this item).
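A minimal sketch of how one such multiple-choice instance could be represented in code. The class, field names, and example text below are illustrative assumptions, not the dataset's actual schema or content.

```python
# Hypothetical representation of one PARADISE-style multiple-choice example:
# the goal acts as the question, and candidate warnings/tips are the choices.
# Field names and example text are illustrative, not the dataset's schema.
from dataclasses import dataclass
from typing import List

@dataclass
class MultipleChoiceExample:
    goal: str           # wikiHow-style goal, used as the question
    choices: List[str]  # candidate warnings or tips (one positive, rest negatives)
    label: int          # index of the correct (positive) candidate
    task: str           # "warning" or "tip" inference

example = MultipleChoiceExample(
    goal="Store Fresh Herbs",
    choices=[
        "Avoid washing the herbs before refrigerating them, or they will spoil faster.",
        "Never leave a campfire unattended.",
        "Do not overtighten the bolts on the frame.",
        "Avoid using a dull blade when carving wood.",
    ],
    label=0,
    task="warning",
)
```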
4. Candidate Sampling:
- Acquiring goals and positive candidates is straightforward.
- The negative candidate sampling strategy is enhanced with noun embeddings.
- Negative candidates are randomly reassigned to avoid bias (see the sampling sketch after this item).
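A minimal sketch of similarity-based negative sampling, assuming precomputed noun embeddings for goals and candidate warnings/tips. It illustrates the general idea of using embedding similarity to pick related-but-incorrect distractors; it is not the paper's exact procedure, and the function, names, and toy vectors are hypothetical.

```python
# Sketch of noun-embedding-based negative candidate sampling (illustrative,
# not the paper's exact procedure): for a given goal, pick warnings/tips from
# other goals whose embeddings are closest to the goal's embedding, so the
# distractors are topically related rather than trivially wrong.
import numpy as np

def sample_negatives(goal_vec, candidate_vecs, candidate_texts, k=3, exclude_idx=None):
    """Return the k candidates most similar (cosine) to the goal embedding,
    optionally skipping the goal's own positive candidate."""
    sims = candidate_vecs @ goal_vec / (
        np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(goal_vec) + 1e-9
    )
    if exclude_idx is not None:
        sims[exclude_idx] = -np.inf   # never sample the positive as a negative
    top = np.argsort(-sims)[:k]
    return [candidate_texts[i] for i in top]

# Toy usage with random vectors (purely illustrative):
rng = np.random.default_rng(0)
goal_vec = rng.normal(size=50)
candidate_vecs = rng.normal(size=(100, 50))
candidate_texts = [f"warning {i}" for i in range(100)]
negatives = sample_negatives(goal_vec, candidate_vecs, candidate_texts, k=3, exclude_idx=7)
```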
5. Test Set Construction:
- Expert annotation process to validate test splits.
- Annotation process ensures examples are relevant and appropriate.
- Dataset statistics provided in a table.
6. Experimental Setup:
- Two setups for evaluating language models: finetuning and zero-shot.
- A fine-tuning setup for BERT-family models.
- A zero-shot setup for large language models such as GPT-4 (see the sketch after this item).
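A minimal sketch of the two setups, using the standard Hugging Face multiple-choice pattern for fine-tuning and an illustrative prompt for the zero-shot case. The checkpoint name, example text, and prompt wording are assumptions, not the paper's exact configuration.

```python
# Fine-tuning setup (sketch): a BERT-family encoder scores each (goal, candidate)
# pair through a multiple-choice head. The head is randomly initialized here and
# would need to be fine-tuned on the training split before evaluation.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

model_name = "microsoft/deberta-v3-base"   # any BERT-family checkpoint (assumed)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)

goal = "Store Fresh Herbs"                 # illustrative example, not from the dataset
choices = [
    "Avoid washing the herbs before refrigerating them.",
    "Never leave a campfire unattended.",
]

# Pair the goal with every candidate; inputs have shape (1, num_choices, seq_len).
enc = tokenizer([goal] * len(choices), choices,
                return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
with torch.no_grad():
    predicted = model(**inputs).logits.argmax(dim=-1)   # index of the chosen candidate

# Zero-shot setup (sketch): an instruction-tuned LLM such as GPT-4 is prompted
# with the goal and lettered choices, and asked to answer with a single letter.
prompt = (
    f"Goal: {goal}\nWhich warning is most relevant to this goal?\n"
    + "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    + "\nAnswer with a single letter."
)
```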
7. Experiments and Results:
- Fine-tuned models perform better than zero-shot models.
- DeBERTa performs best among fine-tuned models.
- Models fall short of human performance.
- Further insights on model behaviors provided through research questions.
8. Related Work:
- Common sense reasoning in various subdomains.
- Existing abductive reasoning tasks focus on different domains.
- WikiHow corpus extensively used for a range of tasks.
Stats
Recently, interest in the planning abilities of language models has grown within the community.
Small models outperform large language models in most scenarios.
Models fall short of human performance.
Quotes
"Despite advancements, all models fall short of human performance."
"Small models outperform large language models in most scenarios."
"Models struggle with tangible, physical, and craft-related goals."