OTClean is a framework that uses optimal transport theory to clean data under conditional independence (CI) constraints. The paper introduces CI constraints and explains why they matter for machine learning models, then discusses why it is hard to repair datasets that violate these constraints because of bias or data-quality issues. OTClean formulates the data repair problem as a Quadratically Constrained Linear Program (QCLP) and solves it with an efficient algorithm inspired by Sinkhorn's matrix scaling algorithm. In experiments, the authors report that their techniques outperform alternatives at preserving data utility while ensuring that the repaired data satisfies the desired CI constraints.
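For readers unfamiliar with Sinkhorn's matrix scaling, the sketch below shows the classic iteration for entropy-regularized optimal transport between two small discrete distributions. It is not OTClean's repair algorithm, which the summary only describes as inspired by this scheme. The function name `sinkhorn`, the regularization value, the squared-distance cost, and the toy histograms are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn(a, b, cost, reg=1.0, n_iters=1000, tol=1e-9):
    """Approximate the entropy-regularized transport plan between histograms a and b.

    Illustrative sketch of generic Sinkhorn scaling, not the OTClean algorithm.
    """
    K = np.exp(-cost / reg)      # Gibbs kernel derived from the cost matrix
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u_prev = u
        v = b / (K.T @ u)        # rescale columns to match marginal b
        u = a / (K @ v)          # rescale rows to match marginal a
        if np.max(np.abs(u - u_prev)) < tol:
            break
    # Transport plan: diag(u) @ K @ diag(v)
    return u[:, None] * K * v[None, :]

# Toy example (hypothetical data): move mass between two 3-bin histograms.
a = np.array([0.5, 0.3, 0.2])
b = np.array([0.2, 0.2, 0.6])
points = np.arange(3, dtype=float)
cost = (points[:, None] - points[None, :]) ** 2   # squared distance between bins

plan = sinkhorn(a, b, cost)
print(plan.sum(axis=1))  # row sums, which should match a
print(plan.sum(axis=0))  # column sums, which should approximately match b
```

The alternating row and column rescaling is what makes the method cheap: each iteration costs only two matrix-vector products. A smaller `reg` gives a plan closer to unregularized optimal transport but typically needs more iterations and can underflow numerically.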
Key Insights Distilled From
by Alireza Pirh... at arxiv.org 03-06-2024
https://arxiv.org/pdf/2403.02372.pdf