Core Concepts
Instruction-tuned language models are more robust when evaluated via their text-based multiple choice answers than via first token probabilities, especially when the two approaches frequently disagree.
Abstract
The paper investigates the robustness of text-based multiple choice question (MCQ) answering by instruction-tuned language models, comparing it to the traditional first token probability-based approach.
Key highlights:
Instruction-tuned models like Llama2, Gemma, and Mistral show a mismatch between their first token probabilities and the actual text answers, with the mismatch rate ranging from 10.2% to 56.8%.
The text answers from these models are more robust to various prompt perturbations (e.g., typos, word swapping, option order changes) compared to the first token probabilities.
The robustness discrepancy between the two approaches increases as the mismatch rate grows. When the mismatch exceeds 50%, the text answers show lower selection bias than the state-of-the-art first token debiasing method PriDe.
The authors also observe model-specific behaviors, such as Llama2 models frequently refusing to answer sensitive questions, highlighting the importance of inspecting text outputs beyond just first token probabilities.
The findings suggest that evaluating instruction-tuned language models using text-based MCQ answers provides a more reliable and comprehensive assessment of their capabilities compared to the traditional probability-based approach.
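To make the comparison concrete, the mismatch rate above can be sketched as the fraction of questions where the option with the highest first token probability differs from the option parsed out of the model's generated text. The helper names, the log-probability dictionaries, and the simple regex parser below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch: measuring the mismatch between first-token
# probability answers and parsed text answers in MCQ evaluation.
import re

def first_token_answer(option_logprobs):
    """Pick the option letter with the highest first-token log-probability."""
    return max(option_logprobs, key=option_logprobs.get)

def text_answer(generated_text):
    """Parse an option letter from the model's free-form text answer.
    Returns None when no letter is found (e.g. a refusal to answer)."""
    m = re.search(r"\b([ABCD])\b", generated_text)
    return m.group(1) if m else None

def mismatch_rate(records):
    """Fraction of questions where the two evaluation methods disagree."""
    mismatches = sum(
        1 for logprobs, text in records
        if first_token_answer(logprobs) != text_answer(text)
    )
    return mismatches / len(records)

# Toy data: one question where the methods agree, one where they disagree.
records = [
    ({"A": -0.1, "B": -2.3, "C": -3.0, "D": -4.1}, "The answer is A."),
    ({"A": -0.2, "B": -1.9, "C": -2.5, "D": -3.8}, "I would choose B."),
]
print(mismatch_rate(records))  # → 0.5
```

In practice the text parser would need to handle refusals and unconventional phrasings, which is exactly why the paper argues for inspecting text outputs directly rather than relying on the first token alone.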
Stats
The mismatch rate between the first token probabilities and text answers ranges from 10.2% for Mistral-7b to 56.8% for Gemma-7b.
Quotes
"The text answer shows small selection bias and high robustness to various sentence perturbations across all the models we examined."
"When the mismatch rate is high (over 50%), the text answer shows a smaller selection bias than the state-of-art first token debiasing method PriDe."