Core Concepts
Representation-space interventions can be leveraged to generate natural language counterfactuals that reflect minimal changes to a given text based on a specified binary property of interest.
Abstract
The paper presents a method for converting representation-space counterfactuals into natural language counterfactuals. The key steps are:
Intervene in the representation space of a language model to modify the encoding of a target concept (e.g., gender) using techniques like LEACE, MiMiC, and MiMiC+α.
Apply an inversion model to map the intervened representation back to the input space, generating a minimally different version of the original text.
The authors conduct experiments on a dataset of short biographies, analyzing the linguistic changes induced by the different interventions. They find that the counterfactuals capture subtle biases in word usage beyond just pronoun changes.
The authors further demonstrate that the generated counterfactuals can be used for data augmentation to improve fairness in a multi-class classification task. Classifiers trained on the augmented dataset exhibit lower true positive rate gaps between genders compared to baselines.
The paper highlights the potential of representation-space interventions to enable interpretable and controllable text generation, with applications in bias mitigation and causal analysis of language models.
Stats
Pronouns like "he", "his", and "him" become much more frequent in the f→m counterfactuals, and vice versa for "she", "her", and "hers".
In the m→f direction, the frequency of words like "medical", "university", "featured", "member", and "finalist" increases, while "affiliated", "dr", "surgery", and "received" decrease.
In the f→m direction, the frequency of words like "of", "the", "a", "at", and "for" increases.
Quotes
"Interventions performed in the representation space of LMs have proven effective at exerting control over the generation of the model."
"Converting representation counterfactuals into input counterfactuals serves various practical purposes. Firstly, it aids in interpreting and visualizing the effects of commonly employed intervention techniques, which are typically applied in a high-dimensional and non-interpretable representation space."
"The counterfactuals we generate have intrinsic value, serving as goals in their own right. They prove beneficial for data augmentation, and we showcase their potential to address fairness concerns in a 'real-world' multi-class classification."