Evaluating the Reliability of Automatic Methods for Assessing Instruction-Tuned Large Language Models
Automatic evaluation methods based on text overlap and language model judgments can approximate human ratings under specific conditions, but their reliability is highly context-dependent.
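To make the measurement concrete, the sketch below shows the kind of correlation analysis such reliability studies typically run: an automatic overlap score (a simple unigram F1, standing in here for metrics like ROUGE) is computed per example and then rank-correlated with human ratings. The example data, ratings, and function names are illustrative assumptions, not taken from the paper.

```python
from collections import Counter
from scipy.stats import spearmanr

def unigram_f1(candidate: str, reference: str) -> float:
    """Token-level F1 overlap between a model output and a reference answer."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    # Count tokens shared between candidate and reference (with multiplicity).
    common = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    precision = common / len(cand_tokens)
    recall = common / len(ref_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical evaluation data: model outputs, references, and human ratings (1-5).
outputs    = ["The capital of France is Paris.", "I am not sure.", "Paris is in Germany."]
references = ["Paris is the capital of France."] * 3
human      = [5, 2, 1]

# Rank-correlate the automatic metric with human judgments; a low rho on a
# given task is one signal that the automatic metric is unreliable there.
auto_scores = [unigram_f1(o, r) for o, r in zip(outputs, references)]
rho, p_value = spearmanr(auto_scores, human)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

Rank correlation (Spearman) is used rather than Pearson because human ratings are ordinal; the same harness can be pointed at an LLM-judge score in place of `unigram_f1` to compare the two families of automatic methods.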