This work introduces the Multi-Grain Stereotype (MGS) dataset and explores different machine learning approaches to establish baselines for stereotype detection. The authors fine-tune several language models to build stereotype classifiers and apply explainable AI techniques to analyze the models' decision-making. Using the proposed detectors, the study also evaluates the presence of stereotypes in text generated by popular LLMs.
Instruction tuning and reinforcement learning from human feedback (RLHF) can introduce or amplify cognitive biases in large language models, such as the decoy effect, the certainty effect, and belief bias.
IndiBias is a comprehensive benchmark dataset for measuring and quantifying social biases in language models in the Indian context, covering multiple dimensions including gender, religion, caste, age, region, physical appearance, and occupation.