Core Concepts
Optimizing data selection in decentralized markets through experimental design improves prediction accuracy without labeled validation data.
Abstract
Abstract: Acquiring high-quality training data is crucial for machine learning models, especially in data-scarce domains like healthcare. Data markets incentivize data sharing, but selecting valuable data points remains a challenge.
Introduction: Existing data valuation techniques are poorly suited to decentralized markets. A federated approach based on experimental design improves prediction accuracy without requiring labeled validation data.
Challenges with Current Data Valuation Approaches: Existing valuation methods tend to overfit when the data dimensionality is high or the validation set is small.
Methods: The proposed method, based on linear experimental design, selects seller data by optimizing directly for prediction on the unlabeled test data, and it outperforms traditional valuation methods.
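As a rough illustration of what a linear experimental design objective can look like, the sketch below scores a weighting of seller points by the average predictive variance it leaves on unlabeled test inputs. The specific criterion (a V-optimal style objective with a ridge term), the function name `design_objective`, the `reg` parameter, and the synthetic data are assumptions made for illustration; the summary does not confirm the exact objective used.

```python
import numpy as np


def design_objective(X_seller, X_test, weights, reg=1e-3):
    """Average predictive variance on test inputs for a weighted linear design.

    Assumed objective (illustrative, not confirmed by the source):
        f(w) = mean_j  x_j^T (sum_i w_i x_i x_i^T + reg * I)^{-1} x_j
    Only test inputs are used; no test labels are needed.
    """
    d = X_seller.shape[1]
    # Weighted information matrix of the seller data, plus a ridge term.
    info = X_seller.T @ (weights[:, None] * X_seller) + reg * np.eye(d)
    # Solve info @ Z = X_test^T rather than forming an explicit inverse.
    Z = np.linalg.solve(info, X_test.T)
    # Column j of (X_test^T * Z) sums to x_j^T info^{-1} x_j.
    return float(np.mean(np.sum(X_test.T * Z, axis=0)))


# Synthetic stand-in data (hypothetical sizes).
rng = np.random.default_rng(0)
X_seller = rng.normal(size=(50, 5))
X_test = rng.normal(size=(10, 5))
uniform = np.full(50, 1.0 / 50)

print(design_objective(X_seller, X_test, uniform))
```

A lower value means the weighted seller data constrains a linear model more tightly in the directions the test inputs actually occupy, which is why labels for the test set are not needed to evaluate it.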
Fast and Federated Optimization: The Frank-Wolfe algorithm updates the data-selection weights iteratively and efficiently, making the method scalable and suitable for decentralized markets.
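The iterative weight update can be sketched with a generic Frank-Wolfe loop over the probability simplex. This is a minimal sketch under stated assumptions, not the authors' implementation: the objective (average predictive variance on test inputs with a ridge term), the grid-based step-size search, and the names `avg_test_variance` and `frank_wolfe_weights` are all hypothetical choices made for illustration.

```python
import numpy as np


def avg_test_variance(X_seller, X_test, w, reg):
    """Assumed objective: mean_j x_j^T (sum_i w_i x_i x_i^T + reg*I)^{-1} x_j."""
    d = X_seller.shape[1]
    info = X_seller.T @ (w[:, None] * X_seller) + reg * np.eye(d)
    Z = np.linalg.solve(info, X_test.T)  # shape (d, m)
    return float(np.mean(np.sum(X_test.T * Z, axis=0))), Z


def frank_wolfe_weights(X_seller, X_test, n_iters=50, reg=1e-3):
    """Frank-Wolfe over the simplex of seller-point weights (illustrative)."""
    n = X_seller.shape[0]
    w = np.full(n, 1.0 / n)  # start from uniform weights
    # Candidate step sizes; including 0 means the objective never increases.
    steps = np.linspace(0.0, 0.5, 11)
    for _ in range(n_iters):
        _, Z = avg_test_variance(X_seller, X_test, w, reg)
        # Gradient entry i is -mean_j (x_i^T info^{-1} x_test_j)^2.
        # It needs only seller i's own row and the shared matrix Z.
        grad = -np.mean((X_seller @ Z) ** 2, axis=1)
        # Linear minimization over the simplex picks a single vertex.
        i = int(np.argmin(grad))
        vertex = np.zeros(n)
        vertex[i] = 1.0
        # Move toward that vertex with the best step from the grid.
        candidates = [(1.0 - s) * w + s * vertex for s in steps]
        values = [avg_test_variance(X_seller, X_test, c, reg)[0] for c in candidates]
        w = candidates[int(np.argmin(values))]
    return w


# Synthetic stand-in data (hypothetical sizes).
rng = np.random.default_rng(1)
X_seller = rng.normal(size=(40, 4))
X_test = rng.normal(size=(8, 4))

w = frank_wolfe_weights(X_seller, X_test)
top5 = np.argsort(w)[::-1][:5]  # seller points with the largest weights
print(top5)
```

Because each gradient entry depends only on one seller's own features and a small shared matrix, a loop of this shape could in principle be evaluated seller by seller, which is one plausible reading of why the approach suits a federated setting.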
Experiments: The method outperforms existing valuation techniques on several real-world datasets.
Ablation Experiments: Performance varies with factors such as the number of test points and the regularization strength, suggesting avenues for further research.
Stats
Unlike prior work, our method achieves lower prediction error without requiring labeled validation data.
Our approach directly estimates the benefit of acquiring data for test set prediction in a decentralized market setting.
Quotes
"Our proposed method maintains low test error as more seller training data is selected."
"Attributing the influence of training data to validation performance is not equivalent to estimating data value for predicting unseen test data."