Large language models exhibit substantial performance gaps across languages, with high-resource languages like English consistently outperforming low-resource languages.
LLM-as-a-Judge and reward models are widely used to evaluate large language models, yet their effectiveness in non-English contexts remains largely unexplored. This study provides a comprehensive analysis of their performance on a new Korean meta-evaluation dataset, KUDGE, uncovering key insights into the transferability and limitations of these automated evaluators.