DiLoCo, a variant of Federated Averaging, enables distributed training of large language models across poorly connected devices by combining an AdamW inner optimizer, a Nesterov-momentum outer optimizer, and a large number of inner optimization steps between synchronizations.
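To make the inner/outer structure concrete, the following is a minimal single-process sketch of one DiLoCo-style communication round: each worker runs many local AdamW steps, and a single Nesterov-momentum step is then applied to the averaged parameter delta. The toy model, synthetic data, step counts, and learning rates are illustrative assumptions, not the paper's actual configuration.

```python
import copy
import torch
import torch.nn as nn

def make_batch(batch_size=8, dim=32, n_classes=4):
    # Synthetic stand-in for each worker's local data shard (assumption).
    x = torch.randn(batch_size, dim)
    y = torch.randint(0, n_classes, (batch_size,))
    return x, y

def diloco_round(global_model, outer_opt, num_workers=4, inner_steps=100, inner_lr=1e-3):
    """One communication round: local AdamW phases, then one outer update."""
    initial = [p.detach().clone() for p in global_model.parameters()]

    # Inner phase: each worker trains its own replica with AdamW, no communication.
    replicas = []
    for _ in range(num_workers):
        model = copy.deepcopy(global_model)
        inner_opt = torch.optim.AdamW(model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            x, y = make_batch()
            loss = nn.functional.cross_entropy(model(x), y)
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        replicas.append(model)

    # Outer phase: the "pseudo-gradient" is each replica's averaged drift away from
    # the round's starting point; the outer optimizer treats it like a gradient.
    with torch.no_grad():
        for i, p in enumerate(global_model.parameters()):
            delta = torch.stack(
                [initial[i] - list(m.parameters())[i] for m in replicas]
            ).mean(dim=0)
            p.grad = delta
    outer_opt.step()  # Nesterov momentum applied to the global parameters
    outer_opt.zero_grad()

# Usage: a few rounds with infrequent synchronization.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 4))
outer_opt = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
for _ in range(3):
    diloco_round(model, outer_opt)
```

Because workers exchange only one averaged delta per round rather than per-step gradients, communication frequency drops by the number of inner steps, which is what makes the recipe tolerant of slow links.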
The Holmes framework enables efficient distributed training of large language models across GPU clusters whose nodes use heterogeneous network interface cards, outperforming existing frameworks in such heterogeneous NIC environments.