Stronger Random Baselines for Evaluating In-Context Learning Performance
The standard random baseline for in-context learning classification tasks is often too easy to beat: it ignores the common practice of reusing a small validation set to compare many prompts or demonstration sets. A stronger random baseline, defined as the expected maximum accuracy across multiple random classifiers, rises with the number of comparisons and provides a more appropriate point of reference.
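The stronger baseline described above can be computed in closed form from order statistics of the binomial distribution. The sketch below is a minimal, stdlib-only illustration (the function names are our own, not from the paper): each of `t` random classifiers guesses labels independently with per-example success probability `p` on `n` validation examples, so its correct count is Binomial(n, p), and the expected maximum accuracy follows from the identity E[max] = Σ_k P(max ≥ k) with P(max ≥ k) = 1 − F(k−1)^t, where F is the binomial CDF.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def expected_max_accuracy(n: int, t: int, p: float = 0.5) -> float:
    """Expected maximum accuracy over t independent random classifiers,
    each guessing correctly with probability p on n examples.

    Uses E[max count] = sum_{k=1}^{n} P(max >= k)
                      = sum_{k=1}^{n} (1 - CDF(k-1)^t),
    then divides by n to convert the count to an accuracy.
    """
    return sum(1 - binom_cdf(k - 1, n, p) ** t for k in range(1, n + 1)) / n

# With a single classifier the baseline reduces to the usual expected
# accuracy p; with more classifiers it climbs well above p, especially
# for small validation sets n.
print(expected_max_accuracy(100, 1))   # standard baseline, ~0.5
print(expected_max_accuracy(100, 10))  # stronger baseline, > 0.5
```

For intuition: with a binary task, `p = 0.5`, and a 100-example validation set, the best of ten random classifiers is expected to score noticeably above 50%, so a prompt that merely beats 50% may not be doing better than chance once selection is accounted for.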