Key Concepts
NeuronaBox is a flexible and high-fidelity approach to emulating distributed DNN training workloads: it executes a subset of the nodes and emulates the networked execution environment, including collective communication operations.
Summary
The paper proposes NeuronaBox, a novel approach for estimating the per-iteration time of distributed DNN training. The key idea is to execute only a subset of the nodes (N) while emulating the networked execution environment (E) from N's perspective.
The workflow of NeuronaBox is as follows:
- The user provides the training script, job configuration, and optional "what-if" conditions.
- NeuronaBox initializes the emulation environment by synthesizing the network topology and instantiating a communication model that computes delay times.
- The training script is launched, with performance metrics and communication traces collected from the real node(s) in N.
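The communication model in the second step could, for instance, be an alpha-beta cost model for collectives. The sketch below is purely illustrative: the function name, the ring-all-reduce assumption, and all parameter values are hypothetical, not taken from the paper.

```python
# Hypothetical alpha-beta cost model for a ring all-reduce collective.
# alpha: per-message latency (s); beta: per-byte transfer time (s).
# Both defaults are illustrative, not NeuronaBox's actual parameters.

def ring_allreduce_delay(msg_bytes: int, n_nodes: int,
                         alpha: float = 5e-6,
                         beta: float = 1e-10) -> float:
    """Estimate wall-clock time of a ring all-reduce over n_nodes."""
    steps = 2 * (n_nodes - 1)       # reduce-scatter + all-gather phases
    chunk = msg_bytes / n_nodes     # bytes moved per step
    return steps * (alpha + beta * chunk)
```

An emulator can feed each intercepted collective's message size into such a model and delay the calling node by the returned time, instead of performing real network transfers.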
The design of NeuronaBox adheres to two key principles: ease of use (no code modifications required) and flexibility (independence from the parallelization strategy). It achieves both by targeting the collective communication layer as the primary interface between the computation and the network stack.
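Interposing at the collective communication layer can be pictured as swapping the collective call the training code resolves for a drop-in replacement that injects a modeled delay instead of touching the network. The following is a minimal sketch of that idea; all names are hypothetical, and a real emulator must also synthesize the data the collective would have returned (omitted here).

```python
import time

def make_emulated(delay_fn):
    """Build a drop-in replacement for a collective call (e.g., an
    NCCL all-reduce) that skips the network and sleeps for the
    modeled communication time instead -- the interposition idea."""
    def emulated_all_reduce(buffer):
        time.sleep(delay_fn(len(buffer)))  # injected modeled delay
        # NOTE: a real emulator would also fill `buffer` with
        # synthesized reduced values; we return it unchanged here.
        return buffer
    return emulated_all_reduce

# Usage: rebind the symbol the training code calls; the training
# script itself needs no modification. Zero delay for the demo.
all_reduce = make_emulated(lambda nbytes: 0.0)
result = all_reduce(b"grad")
```

Because the swap happens below the framework, the same mechanism works regardless of whether the job uses data, tensor, or pipeline parallelism.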
The paper presents a proof-of-concept implementation using PyTorch and NCCL, and evaluates it through microbenchmarks, end-to-end training emulation, and a "what-if" analysis on latency variations. The results show that NeuronaBox can accurately emulate the behavior of actual systems, with an error margin of less than 1%.
Statistics
NeuronaBox incurs at most 4% overhead for NCCL collective operations compared to the baseline, with the overhead diminishing as the message size increases.
For end-to-end training, NeuronaBox achieves less than 1% error in predicting training time across various DNN models compared to actual training runs.
CPU usage in NeuronaBox is slightly lower than the baseline, due to the efficient implementation and removal of unnecessary computations.
Quotes
"NeuronaBox replicates the behavior of actual systems with high accuracy, with an error margin of less than 1% between the emulated measurements and the real system."
"The key benefit of this approach is that it allows us to faithfully execute on real hardware a portion of the training workload, which executes without overheads from instrumentation (since there is none) nor profiling N in controlled conditions."