Enabling Dynamic Parallelism for Deep Learning Jobs with Parallelizable Tensor Collections
TENPLEX, a state management library, enables deep learning jobs to dynamically change their parallelization configuration, including data, model, and pipeline parallelism, in response to changes in GPU resources during training.
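To make the idea of changing a parallelization configuration concrete, the following is a minimal sketch of re-partitioning one tensor when the number of GPUs changes. It is not the library's API: the function name `reshard`, the list-of-lists shard representation, and the even-split policy are all assumptions made for illustration.

```python
from typing import List


def reshard(shards: List[List[float]], new_degree: int) -> List[List[float]]:
    """Hypothetical helper: re-split a 1-D tensor, stored as per-GPU shards,
    into new_degree equally sized shards."""
    if new_degree <= 0:
        raise ValueError("new_degree must be positive")
    # Logical view: the full tensor is the concatenation of the current shards.
    full = [x for shard in shards for x in shard]
    if len(full) % new_degree != 0:
        raise ValueError("tensor length must divide evenly across new_degree")
    size = len(full) // new_degree
    # Slice the logical tensor into contiguous, equally sized pieces.
    return [full[i * size:(i + 1) * size] for i in range(new_degree)]


# An 8-element weight vector partitioned across 4 GPUs.
old_shards = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
# Two GPUs are withdrawn, so the same state is re-split across 2 GPUs.
new_shards = reshard(old_shards, 2)
```

The sketch gathers the whole tensor in one place before splitting it again, which is the simplest correct approach but not an efficient one; a real system would aim to move only the slices whose owner changes.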