Limitations of Using Large Language Models for Relevance Judgments in Information Retrieval Evaluation
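Using large language models (LLMs) to generate relevance judgments for information retrieval evaluation is problematic because it limits the ability to measure systems that outperform the LLM that produced the judgments: wherever such a system correctly disagrees with the LLM, the evaluation counts the disagreement as an error.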
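
The limitation can be illustrated with a small sketch. Everything in it is a hypothetical assumption made for illustration, not data or code from the source: the six document IDs, both sets of labels, the two rankings, and the choice of precision at k as the metric. It scores two systems against human labels and against labels from an assumed LLM judge that errs on two documents.

```python
def precision_at_k(ranking, qrels, k):
    """Fraction of the top-k ranked documents that qrels mark as relevant."""
    top_k = ranking[:k]
    return sum(qrels[doc] for doc in top_k) / k


# Hypothetical toy labels: 1 = relevant, 0 = not relevant.
human_qrels = {"d1": 1, "d2": 1, "d3": 1, "d4": 0, "d5": 0, "d6": 0}
# Assumed LLM judge that errs twice: it misses d3 and wrongly accepts d4.
llm_qrels = {"d1": 1, "d2": 1, "d3": 0, "d4": 1, "d5": 0, "d6": 0}

# Hypothetical system that ranks documents the way the LLM judge would.
llm_like_system = ["d1", "d2", "d4", "d3", "d5", "d6"]
# Hypothetical stronger system: its top 3 are exactly the truly relevant documents.
stronger_system = ["d1", "d2", "d3", "d4", "d5", "d6"]

k = 3
systems = [("llm_like", llm_like_system), ("stronger", stronger_system)]

# Score each system against both sets of judgments.
scores = {
    name: {
        "human": precision_at_k(ranking, human_qrels, k),
        "llm": precision_at_k(ranking, llm_qrels, k),
    }
    for name, ranking in systems
}

for name, result in scores.items():
    print(f"{name}: P@{k} human={result['human']:.2f} llm={result['llm']:.2f}")
```

Under these assumptions the stronger system is perfect against the human labels but scores below the LLM-like system against the LLM labels, so an evaluation built on the LLM's judgments would rank the two systems in the wrong order.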