By localizing and editing a small set of weights in a language model, it is possible to control and mitigate encoded gender stereotypes while preserving the model's overall performance.
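The summary above names weight localization and editing but not the mechanics. Below is a minimal sketch of one common form such an edit takes: projecting an estimated "gender direction" out of a located weight matrix so that layer can no longer write along it. The GPT-2 target, the layer index, and the random placeholder direction are all illustrative assumptions, not the paper's exact procedure.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
hidden = model.config.hidden_size

# Placeholder "gender direction" in hidden space. In practice this would be
# estimated, e.g., from mean hidden-state differences over paired he/she
# prompts; here it is random purely so the sketch runs.
v = torch.randn(hidden)
v = v / v.norm()

# Suppose probing "localized" the stereotype to the MLP output projection of
# layer 8 (an assumed index). Post-multiplying by (I - vv^T) removes the
# layer's ability to write along v while leaving the rest of the map intact.
with torch.no_grad():
    W = model.transformer.h[8].mlp.c_proj.weight  # (4*hidden, hidden) in GPT-2
    P = torch.eye(hidden) - torch.outer(v, v)     # projector onto v's complement
    W.copy_(W @ P)
```

Because only one located matrix is modified, every other parameter is untouched, which is why this style of edit can leave overall performance largely intact.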
AXOLOTL introduces a novel post-processing framework that debiases Large Language Model outputs, improving fairness while preserving downstream performance.
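As a rough illustration of what post-processing debiasing can look like, the sketch below treats the model as a black box: it first asks the model to flag stereotyped phrasing in its own output, then asks for a rewrite of only the flagged spans. The two prompts, the detect-then-rewrite loop, and the `generate` callable are hypothetical stand-ins, not AXOLOTL's published pipeline.

```python
from typing import Callable

def debias_output(text: str, generate: Callable[[str], str]) -> str:
    """Post-process generated `text` with a detect-then-rewrite pass."""
    # Step 1 (assumed): ask the model to flag stereotyped phrasing.
    findings = generate(
        "List any phrases in the following text that rely on gender or "
        f"social stereotypes, or reply 'NONE':\n\n{text}"
    )
    if findings.strip().upper() == "NONE":
        return text  # nothing flagged; keep the original output
    # Step 2 (assumed): ask for a rewrite that removes the flagged phrasing
    # while preserving meaning and fluency.
    return generate(
        "Rewrite the text below so it avoids the flagged stereotyped phrases "
        f"while keeping its meaning:\n\nFlagged: {findings}\n\nText: {text}"
    )
```

Because the loop only needs a text-in/text-out callable, it requires no access to weights or training data, which is what makes post-processing approaches attractive for API-only models.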
DAFAIR proposes a novel approach to mitigating social bias in language models without relying on demographic information, reducing measured bias while maintaining competitive task performance.
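One way to reduce bias without demographic labels on the task data is to regularize against a handful of hand-written prototype sentences instead. The sketch below adds a symmetric-KL penalty that pushes a (hypothetical) sequence classifier to score paired prototypes identically; the prototype pairs, the KL form, and the loss weighting are assumptions for illustration, not DAFAIR's exact objective.

```python
import torch
import torch.nn.functional as F

# Hand-written pairs differing only in a demographic term (assumed examples).
PROTOTYPE_PAIRS = [
    ("He is a nurse.", "She is a nurse."),
    ("He is an engineer.", "She is an engineer."),
]

def fairness_penalty(model, tokenizer, device="cpu"):
    """Symmetric KL between task predictions on paired prototype texts."""
    penalty = torch.tensor(0.0, device=device)
    for text_a, text_b in PROTOTYPE_PAIRS:
        batch = tokenizer([text_a, text_b], return_tensors="pt",
                          padding=True).to(device)
        logits = model(**batch).logits            # (2, num_labels)
        log_p, log_q = F.log_softmax(logits, dim=-1)
        penalty = penalty + 0.5 * (
            F.kl_div(log_p, log_q, log_target=True, reduction="sum")
            + F.kl_div(log_q, log_p, log_target=True, reduction="sum")
        )
    return penalty / len(PROTOTYPE_PAIRS)

# During fine-tuning (assumed usage): loss = task_loss + lam * fairness_penalty(model, tokenizer)
```

Since the penalty is computed only on the fixed prototypes, the actual training corpus never needs demographic annotations, which is the property the summary highlights.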