Optimal Transport for Fairness: Archival Data Repair using Small Research Data Sets

Core Concepts

Algorithms for repairing unfairness in training data using optimal transport.

Abstract

The paper addresses the need for algorithms to repair unfairness in training data, focusing on conditional independence between protected attributes and features. It introduces a method using optimal transport (OT) for repairing archival data with a small proportion of labeled research data. Experimental results demonstrate effective repair of off-sample, labeled data.
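
As a rough illustration of the kind of repair described above, the sketch below handles a single one-dimensional feature for one value of the conditioning variable U. It is not the paper's algorithm: it swaps in plain empirical quantile matching, the simplest form of a one-dimensional Wasserstein barycenter repair, fitted on labeled research data and then applied to off-sample archival points. The names `fit_repair` and `repair`, the equal group weights, and the toy numbers are all assumptions made for the example.

```python
from bisect import bisect_right


def fit_repair(research):
    """Sort each group's research sample once.

    research: dict mapping a protected-attribute value s to a list of
    values of one feature (hypothetical input layout).
    """
    return {s: sorted(xs) for s, xs in research.items()}


def repair(x, s, model):
    """Map x from group s to the equal-weight barycenter of the groups."""
    src = model[s]
    n = len(src)
    # Rank of x within its own group's research sample, clipped to [1, n]
    # so that off-sample points below the minimum still get a valid rank.
    k = max(1, bisect_right(src, x))
    total = 0.0
    for tgt in model.values():
        m = len(tgt)
        idx = (k * m + n - 1) // n - 1  # ceil(k * m / n) - 1: same quantile level
        total += tgt[idx]
    return total / len(model)


# Toy research data: one feature, two groups (assumed values).
model = fit_repair({0: [1, 2, 3, 4], 1: [11, 12, 13, 14]})

# Archival points were not used for fitting; they are repaired off-sample.
archival = [(2.5, 0), (12.5, 1), (100.0, 1)]
repaired = [repair(x, s, model) for x, s in archival]
```

In this toy case, points at the same quantile level of their own group land on the same repaired value, which is what removes the dependence on the group label.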

Index Terms:

AI fairness
Optimal transport
Data repair
Conditional independence
Mixture modeling
Kernel density estimation

Sections:

Introduction

Importance of fairness in decision-making.

Fairness as Conditional Independence

Defining fairness and metrics for subgroup fairness.

Optimal Transport for Data Repair

Using OT to establish conditional independence between features and protected attributes.

Off-Sample Data Repair

Framework for repairing archival data using research data-trained OT repairs.

Simulation and Real-Data Studies

Validation of the method on simulated and real-world data sets.

Discussion

Considerations, assumptions, and future directions.

Stats

"n_R ≡ 500 research (on-sample) points"
"n_A ≡ 5000 archival (off-sample) points"
"n_Q = 250 to ensure high resolution in interpolated supports"

Quotes

"We define U-conditional fairness as (X ⊥⊥ S) | U."
"Disparate impact is often adopted as the proxy for quantifying the extent to which Definition 2.2 is met."

Key Insights Distilled From

Optimal Transport for Fairness

by Abigail Lang... at arxiv.org 03-22-2024

https://arxiv.org/pdf/2403.13864.pdf

Deeper Inquiries

How can the method be adapted for datasets where sensitive attributes are not available?

When sensitive attributes are not recorded, they have to be estimated before the repair can be applied, because the method conditions on the protected attribute. One approach is to use proxy variables that are correlated with the sensitive attribute without revealing it directly. For example, in a hiring dataset where race is not recorded, zip code or educational background could serve as proxies for racial demographics. Another option is to train a machine learning model to predict the missing attribute from the other features in the data, a form of imputation that yields estimated labels for the unlabeled records. Either way, the inferred labels are noisy, so errors in them carry over into the repair.
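
A minimal sketch of the imputation idea follows, assuming a small labeled sample is available to learn from. The nearest-centroid rule, the names `fit_centroids` and `impute_label`, and the toy proxy values are illustrative assumptions and are not taken from the paper.

```python
def fit_centroids(features, labels):
    """Mean proxy vector for each observed value of the sensitive attribute."""
    sums, counts = {}, {}
    for x, s in zip(features, labels):
        acc = sums.setdefault(s, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[s] = counts.get(s, 0) + 1
    return {s: [v / counts[s] for v in acc] for s, acc in sums.items()}


def impute_label(x, centroids):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(x, centroids[s])),
    )


# Small labeled sample: two proxy features per record (assumed values).
labeled_proxies = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labeled_groups = ["a", "a", "b", "b"]
centroids = fit_centroids(labeled_proxies, labeled_groups)

# Records whose sensitive attribute was never recorded.
unlabeled_proxies = [(1.0, 1.0), (5.0, 4.0)]
imputed = [impute_label(x, centroids) for x in unlabeled_proxies]  # ["a", "b"]
```

Any classifier could stand in for the centroid rule here. Whatever model is used, its error rate should be checked on held-out labeled records before the imputed labels are fed into a repair.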

What are the implications of using different cost functions in the optimal transport framework?

The cost function sets the price of moving a unit of probability mass from one point to another, so it determines which transport plan is optimal and therefore how the repair moves the data. For instance:

Squared Euclidean distance (the squared L2 norm) penalizes large displacements disproportionately, so the optimal plan favors many small moves over a few long ones. It is the ground cost behind the 2-Wasserstein distance.
Unsquared Euclidean distance charges in proportion to the distance moved. It is the ground cost behind the 1-Wasserstein distance, also known as the Earth Mover's Distance (EMD).
Wasserstein distances and the EMD are therefore not alternative cost functions. They are the optimal transport values that result once a ground cost has been chosen.
Selecting an appropriate cost function depends on the characteristics of the data and the objectives of the fairness repair. The choice affects how closely the repaired distributions align with the desired targets, and it also affects computational complexity.
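
To make the effect of the ground cost concrete, the sketch below compares the costs |x - y| and |x - y|^2 on a toy one-dimensional example. It relies on a standard fact: for two equal-size samples with uniform weights on the real line, matching the sorted values is an optimal plan for any cost |x - y|^p with p >= 1. The function name `wasserstein_1d_equal` and the sample values are assumptions for illustration.

```python
def wasserstein_1d_equal(xs, ys, p):
    """p-Wasserstein distance between two equal-size 1-D samples.

    Uses sorted matching, which is optimal in one dimension for the
    ground cost |x - y| ** p when p >= 1.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal-size samples")
    if p < 1:
        raise ValueError("sorted matching is only guaranteed optimal for p >= 1")
    pairs = zip(sorted(xs), sorted(ys))
    mean_cost = sum(abs(a - b) ** p for a, b in pairs) / len(xs)
    return mean_cost ** (1 / p)


# One point out of four has to travel a long way (assumed toy values).
source = [0.0, 0.0, 0.0, 0.0]
target = [0.0, 0.0, 0.0, 4.0]

w1 = wasserstein_1d_equal(source, target, p=1)  # 1.0: average distance moved
w2 = wasserstein_1d_equal(source, target, p=2)  # 2.0: the long move dominates
```

The same pair of samples is twice as far apart under the squared cost, because a single long displacement is penalized more heavily than it is under the unsquared cost.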

How might non-Gaussian or nonstationary features impact the performance of the repair method?

Non-Gaussian or nonstationary features can significantly affect the performance of a repair method based on optimal transport:

Non-Gaussian Features: Optimal transport itself does not require Gaussian data, but any density estimate that feeds it has to fit the data. Heavy tails or skewness may call for different modeling assumptions or kernel choices during interpolation, and a step that assumes Gaussianity may lead to suboptimal repairs.
Nonstationary Features: A repair trained on research data and then applied to archival data relies on both being drawn from the same distribution. Changes over time or across subgroups break that assumption, so repairing nonstationary features requires adaptive strategies that account for shifts in distributional properties.
These factors can affect the estimation accuracy, convergence rates, and generalization of OT-based repair algorithms.
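
One simple way to look for such a shift before reusing a trained repair is to compare the research and archival samples of a feature directly. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs, which makes no Gaussian assumption. This check is a suggestion rather than part of the paper's method, and the function name `ks_statistic` and the toy samples are assumptions.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Both empirical CDFs are step functions that only change at sample
    # points, so the largest gap is attained at one of those points.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Toy feature values (assumed): the archival sample has drifted upward.
research = [1.0, 2.0, 3.0, 4.0]
archival_same = [1.0, 2.0, 3.0, 4.0]
archival_shifted = [3.0, 4.0, 5.0, 6.0]

no_shift = ks_statistic(research, archival_same)    # 0.0
shift = ks_statistic(research, archival_shifted)    # 0.5
```

A value near 0 suggests the two samples look alike, while a value near 1 means they barely overlap. What counts as too large depends on the sample sizes and the application, so no threshold is proposed here.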
