Core Concepts
The authors propose scale-invariant modifications to LSTM and Transformer architectures for efficient training in federated learning, demonstrating improved convergence and performance. These modifications let memory-efficient optimizers such as SGD approach the quality of adaptive optimizers without their added memory cost.
Abstract
Efficient language model architectures are proposed for differentially private federated learning, centered on scale-invariant modifications to LSTM and Transformer models. The study evaluates how these modifications improve convergence speed and overall utility in large-scale experiments across several model architectures. By introducing the SI-CIFG and SI Transformer, the work demonstrates gains in training efficiency and in the privacy-utility trade-off of federated learning systems.
The study addresses the challenge of training neural language models with memory-efficient optimizers such as SGD while matching the performance of adaptive optimizers. The proposed scale-invariant modifications to standard architectures, including LSTMs and Transformers, are designed to close this gap.
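As a generic illustration of the underlying property: a function is scale invariant if its output is unchanged when its input is multiplied by any positive constant, and one simple way to obtain this is to normalize the pre-activation vector by its norm before applying the usual nonlinearity. The sketch below only demonstrates that property; the function name and the eps constant are illustrative, not necessarily the exact formulation used in the paper.

```python
import numpy as np

def scale_invariant_sigmoid(x, eps=1e-8):
    """Illustrative scale-invariant sigmoid: normalize the pre-activation
    vector by its L2 norm before applying the standard sigmoid, so that
    f(c * x) ~= f(x) for any constant c > 0 (up to the eps term).
    This is a sketch of the general idea, not the paper's definition."""
    x = np.asarray(x, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-x / (np.linalg.norm(x) + eps)))

x = np.array([0.5, -1.2, 3.0])
# Scaling the pre-activations by a positive constant leaves the output unchanged.
print(np.allclose(scale_invariant_sigmoid(x), scale_invariant_sigmoid(10.0 * x)))  # True
```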
Key highlights include a novel Scale Invariant Coupled Input Forget Gate (SI-CIFG) recurrent network that outperforms the standard CIFG in cross-device federated learning experiments. The study also shows that the same modifications improve training efficiency for larger Transformer models while remaining compatible with non-adaptive optimizers.
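For context, the CIFG is a well-known LSTM variant in which the input gate is coupled to the forget gate as i_t = 1 - f_t, removing one gate. The following sketch shows a single CIFG step in plain NumPy, with the gate and candidate activations exposed as hooks where a scale-invariant variant could be swapped in; the names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cifg_step(x, h_prev, c_prev, W, U, b, gate_fn=sigmoid, cand_fn=np.tanh):
    """One step of a Coupled Input Forget Gate (CIFG) cell: the input gate is
    tied to the forget gate as i_t = 1 - f_t, so one gate is removed relative
    to a standard LSTM. gate_fn / cand_fn are hooks where a scale-invariant
    activation could be substituted (illustrative only).

    Shapes: W is (3*hidden, input_dim), U is (3*hidden, hidden), b is (3*hidden,).
    """
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b                 # stacked pre-activations
    f = gate_fn(z[:hidden])                    # forget gate
    o = gate_fn(z[hidden:2 * hidden])          # output gate
    g = cand_fn(z[2 * hidden:])                # candidate cell update
    c = f * c_prev + (1.0 - f) * g             # coupled input gate: i_t = 1 - f_t
    h = o * cand_fn(c)                         # new hidden state
    return h, c

# Minimal usage with random weights.
rng = np.random.default_rng(0)
hidden, input_dim = 4, 3
h, c = cifg_step(rng.normal(size=input_dim), np.zeros(hidden), np.zeros(hidden),
                 rng.normal(size=(3 * hidden, input_dim)),
                 rng.normal(size=(3 * hidden, hidden)), np.zeros(3 * hidden))
```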
The research also integrates differential privacy with federated learning, obtaining meaningful formal privacy guarantees via the DP-FTRL algorithm. Combining this privacy mechanism with scale-invariant architectures improves utility without weakening the privacy guarantees.
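To make the privacy mechanism concrete, the sketch below shows the per-round clipping and Gaussian noising of aggregated client updates that such DP training relies on. It is a simplified illustration assuming a fixed clip norm and independent per-round noise; DP-FTRL itself adds correlated noise across rounds via tree aggregation, which is omitted here, and all names and default values are hypothetical.

```python
import numpy as np

def dp_aggregate(client_updates, clip_norm=1.0, noise_multiplier=0.5, rng=None):
    """Clip each client's model update to a fixed L2 norm, sum the clipped
    updates, and add Gaussian noise scaled to the clip norm before averaging.
    This is only the per-round clip-and-noise step; DP-FTRL additionally
    correlates noise across rounds via tree aggregation (omitted here).
    Names and parameter values are illustrative."""
    rng = rng or np.random.default_rng()
    clipped = [u * min(1.0, clip_norm / (np.linalg.norm(u) + 1e-12))
               for u in client_updates]
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=clipped[0].shape)
    return noisy_sum / len(client_updates)

# Example: 500 clients per round (as in the experiments), with toy 10-dim updates.
rng = np.random.default_rng(1)
updates = [rng.normal(size=10) for _ in range(500)]
avg_update = dp_aggregate(updates, rng=rng)
```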
Overall, the findings suggest that scale-invariant architectures are a promising direction for language model training in federated learning, speeding up convergence, improving model quality, and behaving robustly across the model architectures and sizes evaluated.
Stats
Number of clients per round = 500
Maximum sequence length = 20
Client batch size = 10
Final perplexity: CIFG (19M params) = 35.5
Final perplexity: SI-CIFG (19M params) = 33.6
Final perplexity: Transformer (21M params) = 34.6
Final perplexity: SI Transformer (21M params) = 33.7
Quotes
"Using Scale Invariance significantly increases the rate of convergence for both Transformer and CIFG models."
"Our proposed SI-CIFG yields the best final quality and has the fastest convergence speed by far."