The Holmes framework enables efficient distributed training of large language models on GPU clusters with heterogeneous network interface cards (NICs), outperforming existing training frameworks in such environments.
DiLoCo, a variant of Federated Averaging, enables distributed training of large language models across poorly connected devices by combining AdamW as the inner optimizer with Nesterov momentum as the outer optimizer, running a large number of inner optimization steps between synchronizations.
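To make the two-level structure concrete, below is a minimal single-process sketch of a DiLoCo-style update on a toy regression model: each simulated worker runs many inner AdamW steps from the current global parameters, the averaged parameter change serves as an outer "pseudo-gradient", and an outer SGD optimizer with Nesterov momentum applies it. Names such as `num_workers`, `inner_steps`, and `outer_lr`, and all hyperparameter values, are illustrative assumptions, not the paper's actual code or settings.

```python
# Hypothetical sketch of a DiLoCo-style inner/outer optimization loop.
import copy
import torch

torch.manual_seed(0)
num_workers, inner_steps, outer_lr = 4, 50, 0.7  # illustrative values

# Global model shared by all workers at the start of each outer round.
global_model = torch.nn.Linear(10, 1)

# Outer optimizer: Nesterov momentum applied to the averaged pseudo-gradient.
outer_opt = torch.optim.SGD(global_model.parameters(), lr=outer_lr,
                            momentum=0.9, nesterov=True)

for outer_round in range(3):
    # Snapshot the global parameters before the inner phase.
    init_params = [p.detach().clone() for p in global_model.parameters()]
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]

    for _ in range(num_workers):
        # Each worker starts from the current global model...
        local_model = copy.deepcopy(global_model)
        # ...and runs many inner AdamW steps on its own (synthetic) data.
        inner_opt = torch.optim.AdamW(local_model.parameters(), lr=1e-3)
        for _ in range(inner_steps):
            x = torch.randn(32, 10)
            y = x.sum(dim=1, keepdim=True)  # synthetic regression target
            loss = torch.nn.functional.mse_loss(local_model(x), y)
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        # Accumulate this worker's parameter change (the "pseudo-gradient").
        for d, p0, p in zip(deltas, init_params, local_model.parameters()):
            d += (p0 - p.detach()) / num_workers

    # Outer step: treat the averaged delta as a gradient for Nesterov SGD.
    outer_opt.zero_grad()
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d
    outer_opt.step()
    print(f"outer round {outer_round}: last inner loss {loss.item():.4f}")
```

Because workers communicate only once per outer round (here, every `inner_steps` batches), this structure tolerates slow or unreliable links far better than per-step gradient all-reduce.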