
Enhancing Data Quality in Federated Fine-Tuning of Foundation Models


Core Concepts
The author proposes a data quality control pipeline for federated fine-tuning of large language models to improve model performance by ensuring high-quality data selection.
Abstract
The content discusses the importance of data quality control in federated fine-tuning of foundation models. It introduces a pipeline for scoring and filtering training data to enhance model performance, and the experiments show that selecting high-quality data based on a unified standard improves model performance significantly.

The paper addresses the challenge of training large language models on high-quality data as public datasets become exhausted. It proposes collaborating with private domain data sources while preserving privacy through federated learning, with a focus on controlling the quality of diverse datasets held by multiple clients.

Key points include identifying low-quality data patterns, establishing a global threshold for data quality, and implementing a two-phase workflow for federated learning with high-quality data. Scoring methods such as perplexity, conditional probability, and influence functions are used to evaluate the quality of individual training samples.

Experiments on question-answering tasks with different datasets demonstrate the effectiveness of the proposed data quality control pipeline. Results show that selecting high-quality data based on the anchor set's average score outperforms methods based on a fixed proportion or a fixed score threshold. Overall, the paper highlights the significance of maintaining data quality in federated learning settings to achieve better model performance while preserving privacy and handling heterogeneous client datasets.
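Perplexity is one of the scoring methods the abstract mentions for rating individual training samples. A minimal sketch of how such a score could be computed and used for filtering is below; the function names (`perplexity`, `filter_by_threshold`) and the assumption that per-token log-probabilities are already available from a language model are illustrative, not taken from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one sample from its per-token log-probabilities
    (natural log): exp of the negative mean log-probability."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def filter_by_threshold(samples, threshold):
    """Keep samples whose perplexity is at or below a global threshold
    (lower perplexity suggests more fluent, likely higher-quality text).

    Each sample is a dict with a "logprobs" list, a hypothetical format
    assumed here for illustration.
    """
    return [s for s in samples if perplexity(s["logprobs"]) <= threshold]
```

In practice the log-probabilities would come from scoring each sample with the fine-tuning model (or a smaller proxy model); the filtering step itself is independent of where the scores come from.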
Stats
Higher scores indicate better performance (for more details about the metrics, see Appendix C.2).
Low-quality data consistently degrades performance on all metrics.
The global threshold is determined by calculating the average score of the anchor data points.
In the NIID-2 setting, ConPro and ICL as scoring methods outperform Oracle in terms of performance.
Applying the data quality control pipeline improves model performance in both centralized and federated settings.
Quotes
"We propose an automated data quality control pipeline for federated fine-tuning of large language models."
"Our experiments show that selecting high-quality data based on a unified standard improves model performance significantly."

Deeper Inquiries

How can differential privacy be integrated into the framework to enhance privacy protection?

Incorporating differential privacy into the framework can further enhance privacy protection by adding noise to the data during model training or aggregation. This noise ensures that individual data points cannot be distinguished, thus safeguarding the privacy of each participant's data. By integrating techniques such as local differential privacy (LDP) or secure multi-party computation (MPC), sensitive information remains protected while still allowing for collaborative model training.

What are potential implications of using automated filters like perplexity filters in collaborative environments?

Using automated filters like perplexity filters in collaborative environments can have several implications. Firstly, these filters may help reduce the volume of low-quality data and improve training efficiency by eliminating irrelevant or noisy samples. However, there could also be challenges in ensuring that these filters accurately identify low-quality data across diverse datasets from different clients. Additionally, relying solely on automated filters may overlook nuanced aspects of data quality that require human judgment.

How does the proposed approach address concerns related to varying capabilities in synthesizing high-quality training samples among clients?

The proposed approach addresses concerns about clients' varying abilities to synthesize high-quality training samples by enforcing a unified standard for data quality control based on anchor data scoring. By deriving a global threshold from a minimal set of anchor samples, the method ensures consistency in evaluating and filtering out low-quality data across heterogeneous client datasets. This mitigates disparities in dataset quality among participants and enables more effective collaboration without exposing any client's private data.