Text Answers from Instruction-Tuned Language Models Are More Robust than First-Token Probabilities in Multiple Choice Answering
Evaluating instruction-tuned language models by their generated text answers yields more robust multiple choice results than evaluating them by first-token probabilities, especially when the mismatch between the two evaluation approaches is high.
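The contrast between the two evaluation approaches can be sketched as follows. This is a minimal illustration with synthetic data, not the paper's evaluation code: a real study would query an instruction-tuned LM for both its first-token probability distribution over the option letters and its free-form text answer, then compare the two decisions per question.

```python
# Sketch: first-token probability scoring vs. text-answer extraction
# for multiple-choice evaluation. Data below is synthetic.
import re

OPTIONS = ["A", "B", "C", "D"]

def first_token_choice(first_token_probs):
    """Pick the option letter with the highest first-token probability."""
    return max(OPTIONS, key=lambda o: first_token_probs.get(o, 0.0))

def text_choice(generated_text):
    """Extract the chosen option letter from a free-form text answer."""
    m = re.search(r"\b([ABCD])\b", generated_text)
    return m.group(1) if m else None

# Synthetic examples: (first-token distribution, generated answer text).
examples = [
    ({"A": 0.1, "B": 0.2, "C": 0.6, "D": 0.1}, "The answer is C."),      # agree
    ({"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}, "I think B is correct."), # mismatch
    ({"A": 0.2, "B": 0.5, "C": 0.2, "D": 0.1}, "B"),                     # agree
]

mismatches = sum(first_token_choice(p) != text_choice(t) for p, t in examples)
print(f"mismatch rate: {mismatches}/{len(examples)}")  # → mismatch rate: 1/3
```

The second example shows the failure mode the abstract refers to: the first-token distribution favors "A" (e.g. because the model begins with a preamble), while the text answer clearly selects "B". On questions with high mismatch like this, scoring the extracted text answer is the more robust evaluation.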