Stronger Random Baselines for Evaluating In-Context Learning Performance
The standard random baseline for in-context learning classification tasks is often too easy to beat: it ignores the common practice of reusing a small validation set to compare many prompts or demonstration sets. A stronger random baseline, defined as the expected maximum accuracy across multiple random classifiers, rises with the number of comparisons and provides a more appropriate point of reference.
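The stronger baseline described above can be computed in closed form from order statistics of the binomial distribution. The sketch below is a minimal, stdlib-only illustration (the function names are our own, not from the paper): each of `t` random classifiers guesses labels independently with per-example success probability `p` on `n` validation examples, so its correct count is Binomial(n, p), and the expected maximum accuracy follows from the identity E[max] = Σ_k P(max ≥ k) with P(max ≥ k) = 1 − F(k−1)^t, where F is the binomial CDF.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def expected_max_accuracy(n: int, t: int, p: float = 0.5) -> float:
    """Expected maximum accuracy over t independent random classifiers,
    each guessing correctly with probability p on n examples.

    Uses E[max count] = sum_{k=1}^{n} P(max >= k)
                      = sum_{k=1}^{n} (1 - CDF(k-1)^t),
    then divides by n to convert the count to an accuracy.
    """
    return sum(1 - binom_cdf(k - 1, n, p) ** t for k in range(1, n + 1)) / n

# With a single classifier the baseline reduces to the usual expected
# accuracy p; with more classifiers it climbs well above p, especially
# for small validation sets n.
print(expected_max_accuracy(100, 1))   # standard baseline, ~0.5
print(expected_max_accuracy(100, 10))  # stronger baseline, > 0.5
```

For intuition: with a binary task, `p = 0.5`, and a 100-example validation set, the best of ten random classifiers is expected to score noticeably above 50%, so a prompt that merely beats 50% may not be doing better than chance once selection is accounted for.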