CogBench introduces a novel approach to evaluating large language models (LLMs): rather than scoring task accuracy alone, it measures behavioral metrics derived from cognitive psychology experiments. The benchmark comprises seven tasks: probabilistic reasoning, the horizon task, a restless bandit task, instrumental learning, the two-step task, temporal discounting, and the Balloon Analog Risk Task (BART). The study highlights model size and reinforcement learning from human feedback as key drivers of LLMs' performance and alignment with human behavior; larger models generally perform better and are more model-based than smaller ones. Open-source models are found to be less risk-prone than proprietary models, and fine-tuning on code does not necessarily enhance LLMs' behavior. Prompt-engineering techniques such as chain-of-thought and take-a-step-back prompting are shown to influence probabilistic reasoning and model-based behaviors. Finally, hypothesis-driven experiments explore how specific features of LLMs shape their performance and behavior.
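To make the behavioral-metric idea concrete, the sketch below simulates a simplified Balloon Analog Risk Task (BART), the task CogBench uses to probe risk propensity. This is a minimal illustration, not CogBench's actual implementation: the agent, its fixed pumping policy, and the uniform pop-threshold are all assumptions made here for demonstration. In the real BART, the behavioral metric of interest is the mean number of pumps on balloons that did not pop ("adjusted pumps"); more pumps indicate a more risk-prone agent.

```python
import random


def run_bart(policy_pumps, n_balloons=30, max_pumps=32, seed=0):
    """Simulate a simplified Balloon Analog Risk Task.

    A hypothetical agent pumps each balloon `policy_pumps` times.
    Each pump is worth 1 point; if the balloon pops first, the
    points for that balloon are forfeited. Returns total earnings
    and the list of pump counts on surviving balloons, whose mean
    is the "adjusted pumps" risk-propensity metric.
    """
    rng = random.Random(seed)
    earned = 0
    adjusted = []  # pumps on balloons that did not pop
    for _ in range(n_balloons):
        # Assumed pop threshold: drawn uniformly from 1..max_pumps.
        pop_at = rng.randint(1, max_pumps)
        if policy_pumps >= pop_at:
            continue  # balloon popped, no points, not counted
        earned += policy_pumps
        adjusted.append(policy_pumps)
    return earned, adjusted


if __name__ == "__main__":
    for label, pumps in [("cautious", 8), ("risky", 24)]:
        earned, adjusted = run_bart(pumps)
        mean_adj = sum(adjusted) / len(adjusted) if adjusted else 0.0
        print(f"{label}: earned={earned}, adjusted pumps={mean_adj:.1f}")
```

A fixed-policy agent is the simplest possible baseline; an LLM would instead be prompted turn by turn ("pump" or "stop"), and its transcript would be scored the same way.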
Key insights distilled from source content by Julian Coda-... at arxiv.org, 02-29-2024: https://arxiv.org/pdf/2402.18225.pdf