Cyclic Data Parallelism (CDP) is a paradigm in which micro-batches are executed sequentially, with a delay between them, rather than simultaneously. This staggering reduces peaks in activation memory and balances gradient communications over the training step. CDP targets the drawbacks of existing methods such as Data Parallelism (DP), where processing all micro-batches in lockstep makes activation memory peak at the same moment on every worker and forces all gradients to be communicated at once. The approach allows a more efficient implementation of mini-batch SGD on GPUs and reduces the number of GPUs needed. The paper presents the theoretical framework, analytical comparisons with existing methods, numerical experiments on the CIFAR-10 and ImageNet datasets, and activation memory tracking. According to the reported results, CDP improves on DP in memory efficiency and reduces communication overhead.
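The memory argument can be illustrated with a toy simulation. The sketch below is not the paper's implementation: it assumes each worker stores one unit of activation memory per layer during the forward pass and frees one unit per layer during the backward pass, that forward and backward steps take equal time, and that training is in steady state. The function names, the unit cost model, and the choice of delay (one cycle divided evenly among workers) are all assumptions made for illustration.

```python
def activation_units(phase, num_layers):
    """Activation units held by one worker at a given phase of its cycle.

    Toy cost model (an assumption): the forward pass stores one unit per
    layer, and the backward pass frees one unit per layer.
    """
    return min(phase, 2 * num_layers - phase)


def total_memory_trace(num_workers, num_layers, delay):
    """Total activation units across all workers at each step of one cycle."""
    period = 2 * num_layers  # forward steps plus backward steps
    trace = []
    for t in range(period):
        total = 0
        for worker in range(num_workers):
            # Each worker starts `delay` steps after the previous one;
            # the modulo models steady-state (wrap-around) execution.
            phase = (t - worker * delay) % period
            total += activation_units(phase, num_layers)
        trace.append(total)
    return trace


num_workers, num_layers = 4, 8

# Simultaneous execution (DP-like): every worker is at the same phase.
dp_trace = total_memory_trace(num_workers, num_layers, delay=0)

# Staggered execution (CDP-like): workers are spread evenly over the cycle.
cyclic_trace = total_memory_trace(
    num_workers, num_layers, delay=2 * num_layers // num_workers
)

print("simultaneous peak:", max(dp_trace))
print("staggered peak:", max(cyclic_trace), "min:", min(cyclic_trace))
```

In this toy setting the simultaneous schedule peaks at 32 units (every worker holds all 8 layers at once), while the staggered schedule stays flat at 16 units throughout the cycle. The exact ratio depends on the cost model assumed here and should not be read as a result from the paper.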
Key Insights Distilled From
by Louis Fourni... at arxiv.org 03-15-2024
https://arxiv.org/pdf/2403.08837.pdf