Core Concepts
Adapter modules can achieve performance comparable to that of fully finetuned models while significantly reducing training time, but their impact on fairness is mixed and depends on the level of bias in the base model.
Abstract
The paper investigates the trade-off between performance, efficiency, and fairness when using adapter modules for text classification tasks. The authors conduct experiments on three datasets: Jigsaw for toxic text detection, HateXplain for hate speech detection, and BIOS for occupation classification.
Regarding performance, the authors confirm that adapter modules achieve accuracy levels roughly on par with fully finetuned models, while reducing training time by around 30%.
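To make the efficiency claim concrete, the following is a minimal, hypothetical sketch of a LoRA-style low-rank adapter wrapped around a frozen linear layer in PyTorch. The class name, rank r, and scaling alpha are illustrative choices, not the paper's exact configuration; the point is only to show why adapter finetuning trains far fewer parameters than full finetuning.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the low-rank adapter path; only A and B receive gradients.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Example: adapt a 768-dim projection; only the two small adapter matrices are trainable.
layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 adapter parameters vs. 590592 frozen base parameters
```

Because only the two small matrices receive gradients, the backward pass and optimizer state are much cheaper, which is where training-time savings of the kind reported in the paper come from.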
In terms of fairness, the impact of adapter modules is more nuanced. On the Jigsaw dataset, adapter modules tend to slightly decrease the equalized odds (EO) metric across most models and adapter types, with the most pronounced disparity observed for GPT-2+LoRA on the race group. On HateXplain, a consistent decrease in fairness is observed for the religion group, with the largest drops for RoBERTa-large+LoRA and RoBERTa-large+Adapters. However, improvements are also observed in some settings, such as GPT-2+Adapters on the race and gender groups.
On the BIOS dataset, a strong decrease in fairness, measured by the true positive rate (TPR) gender gap, is observed for BERT and RoBERTa-base with adapter modules, with RoBERTa-base+LoRA exhibiting the largest decrease.
Further analysis reveals that when the fully finetuned base model has low bias, adapter modules do not introduce additional bias. However, when the base model exhibits high bias, the impact of adapter modules becomes more variable, posing the risk of significantly amplifying the existing bias for certain groups.
The authors conclude that a case-by-case evaluation is necessary when using adapter modules, as their impact on fairness can be unpredictable, especially in the presence of high bias in the base model.
Stats
The Jigsaw dataset contains approximately 2 million public comments, while the HateXplain dataset includes around 20,000 tweets and tweet-like samples.
The BIOS dataset comprises around 400,000 biographies labeled with 28 professions and gender information.
The authors use balanced accuracy as the performance metric for the toxic text datasets and accuracy for the occupation classification task.
Fairness is measured using equalized odds (EO) for the toxic text datasets and the true positive rate (TPR) gender gap for the BIOS dataset.
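As a rough illustration of how these fairness metrics can be computed, here is a sketch in NumPy. Note that the paper reports EO as a score whose decrease indicates lower fairness; the gap-based formulation below (smaller gap = closer to equalized odds) is one common operationalization and may not match the paper's exact definition or its aggregation across identity groups.

```python
import numpy as np

def tpr(y_true, y_pred):
    """True positive rate P(pred = 1 | true = 1)."""
    pos = y_true == 1
    return (y_pred[pos] == 1).mean() if pos.any() else np.nan

def fpr(y_true, y_pred):
    """False positive rate P(pred = 1 | true = 0)."""
    neg = y_true == 0
    return (y_pred[neg] == 1).mean() if neg.any() else np.nan

def equalized_odds_gap(y_true, y_pred, group):
    """Largest difference in TPR/FPR between a protected group and everyone else.
    A smaller gap means the classifier is closer to satisfying equalized odds."""
    g, rest = group == 1, group == 0
    tpr_gap = abs(tpr(y_true[g], y_pred[g]) - tpr(y_true[rest], y_pred[rest]))
    fpr_gap = abs(fpr(y_true[g], y_pred[g]) - fpr(y_true[rest], y_pred[rest]))
    return max(tpr_gap, fpr_gap)

def tpr_gender_gap(y_true, y_pred, gender, occupation):
    """TPR difference between genders for a single occupation (BIOS-style)."""
    def group_tpr(g):
        mask = (y_true == occupation) & (gender == g)
        return (y_pred[mask] == occupation).mean() if mask.any() else np.nan
    return group_tpr("F") - group_tpr("M")

# Toy example for the EO gap on a binary toxic/non-toxic task:
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
group  = np.array([1, 1, 1, 0, 0, 0])  # 1 = comment mentions the protected identity
print(equalized_odds_gap(y_true, y_pred, group))  # 0.5
```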
Quotes
"When the fully finetuned model has low bias, using adapter modules results in lower variance and does not add more bias to an unbiased base model. Conversely, when the base model exhibits high bias, the impacts of adapter modules show greater variance."
"Our findings underscore the importance of assessing each situation individually rather than relying on a one-size-fits-all judgment."