Photon: A Federated Learning System for Efficiently Pre-Training Large Language Models
Core Concepts
Photon is a novel federated learning system that enables efficient and effective pre-training of large language models (LLMs) by leveraging distributed resources and overcoming the limitations of traditional centralized training methods.
Abstract
- Bibliographic Information: Sani, L., Iacob, A., Cao, Z., Lee, R., Marino, B., Gao, Y., ... & Lane, N. D. (2024). Photon: Federated LLM Pre-Training. arXiv:2411.02908v1 [cs.LG].
- Research Objective: This paper introduces Photon, a novel federated learning system designed for pre-training large language models (LLMs) in a decentralized manner, aiming to overcome the limitations of centralized training and leverage distributed resources for improved efficiency and scalability.
- Methodology: Photon leverages a cross-silo federated learning approach, where clients, each with their own data and compute resources, collaboratively train a global LLM. The system employs adaptive local parallelism, dynamically switching between standard distributed training algorithms and low-bandwidth LocalSGD based on client connectivity. It utilizes small local batch sizes and high learning rates, enabled by the robustness of federated averaging, to achieve faster convergence and improved generalization (a minimal sketch of this round structure follows this list).
- Key Findings: The authors demonstrate that Photon can effectively pre-train LLMs up to 7B parameters, achieving lower perplexity than centralized training while significantly reducing communication costs. The system exhibits strong scalability, with training time decreasing as the number of clients increases. Photon also demonstrates robustness to data heterogeneity, converging effectively even when clients possess data from diverse sources.
- Main Conclusions: Photon presents a practical and efficient solution for pre-training LLMs in decentralized settings, offering advantages in terms of scalability, communication efficiency, and data privacy. The proposed system paves the way for collaborative LLM training, enabling broader participation and potentially leading to more robust and generalizable models.
- Significance: This research significantly contributes to the field of federated learning by demonstrating its efficacy in pre-training large language models, a task previously considered challenging due to the computational demands and communication overheads.
- Limitations and Future Research: The authors acknowledge the need for further investigation into federated hyperparameter tuning and the exploration of alternative aggregation strategies to further enhance performance, particularly under conditions of significant data heterogeneity.
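To make the round structure described in the Methodology item concrete, here is a minimal Python sketch of local training followed by federated averaging. The toy least-squares objective, the plain-SGD optimizer, the step counts, and all function names are illustrative assumptions, not Photon's actual implementation, which trains transformer LLMs with adaptive local parallelism.

```python
import numpy as np

# Minimal sketch of one Photon-style federated round: each client runs a
# few local SGD steps on its own data shard, then the server averages the
# resulting weights (FedAvg). A toy least-squares loss stands in for the
# LLM training objective; all hyperparameters here are illustrative.

def local_sgd(weights, shard, lr=0.1, local_steps=16):
    """Run `local_steps` of plain SGD on one client's (features, targets) shard."""
    x, y = shard
    w = weights.copy()
    for _ in range(local_steps):
        grad = x.T @ (x @ w - y) / len(y)   # gradient of 0.5*||x w - y||^2 / n
        w -= lr * grad                      # small local batch, relatively high LR
    return w

def federated_round(global_weights, client_shards):
    """Broadcast the global weights, train locally on each client, then average."""
    client_weights = [local_sgd(global_weights, s) for s in client_shards]
    return np.mean(client_weights, axis=0)  # weights cross the network once per round

# Toy usage: 4 clients whose shards share the same 8-dimensional linear ground truth.
rng = np.random.default_rng(0)
true_w = rng.normal(size=8)

def make_shard():
    x = rng.normal(size=(32, 8))
    return x, x @ true_w

shards = [make_shard() for _ in range(4)]
global_w = np.zeros(8)
for _ in range(10):
    global_w = federated_round(global_w, shards)
```

The property this illustrates is that model weights cross the network only once per round rather than gradients every step, which is where the communication savings reported in the Stats section come from.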
Stats
Photon achieves up to 20% higher throughput (samples/sec) than centralized distributed training.
Photon requires 64×–512× less communication than centralized distributed training (a back-of-the-envelope sketch of this saving follows the Stats list).
Photon models converge twice as fast as previous methods like DiLoCo.
3B- and 7B-parameter models trained with Photon achieve 13.8%–16.9% lower perplexity than their centrally trained counterparts.
Clients training a 125M-parameter model use a single NVIDIA H100 GPU, processing a hardware-determined local batch size of 32.
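The communication reduction above follows from synchronizing once per federated round instead of once per optimizer step. The sketch below works through that arithmetic; the model size, total step count, fp16 payload size, and 512 local steps per round are hypothetical assumptions, and only the 64×–512× range comes from the paper.

```python
# Back-of-the-envelope sketch of the communication savings reported above.
# The 64x-512x range is from the paper; the model size, total step count,
# fp16 payloads, and 512 local steps per round are illustrative assumptions.

def bytes_sent(n_params: int, n_syncs: int, bytes_per_param: int = 2) -> float:
    """Payload one worker exchanges, assuming one full-model-sized message
    (gradients or weights) per synchronization point."""
    return n_params * bytes_per_param * n_syncs

n_params = 7_000_000_000      # 7B-parameter model
total_steps = 102_400         # hypothetical optimizer steps over pre-training
local_steps = 512             # local steps between federated synchronizations

per_step_sync = bytes_sent(n_params, total_steps)             # gradients every step
federated = bytes_sent(n_params, total_steps // local_steps)  # weights once per round
print(f"reduction: ~{per_step_sync / federated:.0f}x less traffic")  # -> ~512x
```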
Quotes
"Photon, the first complete system for federated end-to-end LLM training, leveraging cross-silo FL for global-scale training with minimal communication overheads."
"Photon can train model sizes up to 7B in a federated fashion while reaching an even better perplexity than centralized pre-training."
"Photon outperforms the wall-time of baseline distributed training methods by 35% via communicating 64×–512× less."