Core Concepts
Machine learning (ML) workloads on high-performance computing (HPC) systems exhibit I/O access patterns distinct from those of traditional HPC applications, posing challenges for existing storage systems. Efficient I/O optimization techniques are needed to improve training speeds and enable rapid development of ML models.
Abstract
The paper presents a comprehensive survey of I/O in ML applications on HPC systems, covering the following key aspects:
Data Formats and Modalities:
ML applications use a variety of file formats (e.g., TFRecord, HDF5, Parquet) and dataset modalities (e.g., image, audio, video, text) that impact I/O performance.
The choice of file format and dataset organization (one sample per file vs. multiple samples per file) affects the required I/O operations and metadata management overhead.
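A minimal Python sketch of the two organizations (using numpy and h5py; the file names, shapes, and sample counts are illustrative, not taken from the paper):

```python
import numpy as np
import h5py

samples = [np.random.rand(64, 64).astype(np.float32) for _ in range(4)]

# One sample per file: simple to produce, but N samples mean N files,
# so every random read pays a file open plus PFS metadata traffic.
for i, s in enumerate(samples):
    np.savez(f"sample_{i}.npz", x=s)

# Multiple samples per file: one open serves many reads, cutting
# metadata overhead, but random access becomes seeks within the file.
with h5py.File("dataset.h5", "w") as f:
    f.create_dataset("x", data=np.stack(samples))

# Random access into the packed file reads only the requested rows.
with h5py.File("dataset.h5", "r") as f:
    batch = f["x"][[0, 2]]  # indices must be increasing for h5py
```

The same trade-off drives formats like TFRecord, which pack many serialized samples into a few large files.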
Common ML Phases and Training Distribution:
The ML lifecycle consists of data generation, dataset preparation, training, and inference phases.
During the training phase, stochastic gradient descent (SGD) is the most popular algorithm, leading to small random I/O reads that can be a bottleneck for parallel file systems (PFSs); the training-loop sketch after this list shows how the pattern arises.
Distributed training strategies, such as data parallelism and model parallelism, introduce additional I/O and synchronization challenges (see the sharding sketch after this list).
Model checkpointing is crucial for long-running training jobs, but the growing size and complexity of ML models make checkpoint writes increasingly expensive.
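The interplay of random batch reads and periodic checkpoints can be seen in an ordinary PyTorch training loop; the model and dataset below are toy stand-ins, not the paper's workloads:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
model = nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# shuffle=True draws a fresh random permutation of sample indices
# every epoch -- the small random reads described above.
loader = DataLoader(data, batch_size=32, shuffle=True)

for epoch in range(3):
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    # Synchronous checkpoint: training stalls while the model and
    # optimizer state are serialized; cost grows with model size.
    torch.save({"epoch": epoch,
                "model": model.state_dict(),
                "optim": opt.state_dict()},
               f"ckpt_{epoch}.pt")
```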
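And a sketch of the data-parallel sharding that splits each epoch's reads across ranks (the ranks are iterated in one process here purely for illustration; a real job launches one process per rank):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(168, dtype=torch.float32).unsqueeze(1))

# Each rank draws a disjoint shard of the shuffled indices, so the
# ranks issue non-overlapping read requests to the file system.
world_size = 8
for rank in range(world_size):
    sampler = DistributedSampler(dataset, num_replicas=world_size,
                                 rank=rank, shuffle=True, seed=0)
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)
    print(f"rank {rank}: {len(loader)} batches per epoch")
```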
I/O Benchmarks and Profiling:
Benchmarks like DLIO can simulate the I/O access patterns of deep learning (DL) workloads, helping identify bottlenecks.
Profiling tools like tf-Darshan provide fine-grained I/O performance analysis for ML applications.
Analysis of I/O Access Patterns:
The I/O access patterns of the Unet3D and BERT workloads, simulated using DLIO, exhibit small random reads that can be challenging for PFSs.
The number of batches read per epoch depends on the batch size, the number of processes, and the total sample count; because each process reads a disjoint subset of the samples, the resulting I/O requests do not overlap.
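A worked example of that relation, using the Unet3D sample count quoted in the Stats section below; the batch size and process count are assumed values for illustration:

```python
total_samples = 168  # Unet3D dataset (see Stats)
batch_size = 4       # assumed for illustration
n_procs = 8          # assumed for illustration

# Each iteration, every process reads one batch, so one global step
# consumes batch_size * n_procs samples. Dropping the trailing
# partial batch (drop_last semantics):
batches_per_epoch = total_samples // (batch_size * n_procs)
print(batches_per_epoch)  # -> 5
```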
I/O Optimization Techniques:
Current ML frameworks (PyTorch, TensorFlow, Scikit-Learn with Dask-ML) provide various I/O optimization features, such as dataset streaming, parallel data preparation, sample prefetching, and caching (a representative tf.data pipeline follows this list).
Recent research proposes additional techniques, including distributed sample caching, asynchronous and distributed checkpointing, and I/O-aware scheduling (an asynchronous-checkpointing sketch also follows below).
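As a representative example of the framework-level features listed above, a tf.data input pipeline combining streaming, parallel preparation, caching, and prefetching (the file glob and record schema are hypothetical):

```python
import tensorflow as tf

def parse(record):
    # Hypothetical schema: one fixed-length float feature per record.
    spec = {"x": tf.io.FixedLenFeature([16], tf.float32)}
    return tf.io.parse_single_example(record, spec)["x"]

files = tf.data.Dataset.list_files("data/part-*.tfrecord", shuffle=True)
ds = (
    tf.data.TFRecordDataset(files)                    # stream records from disk
    .map(parse, num_parallel_calls=tf.data.AUTOTUNE)  # parallel data preparation
    .cache()                                          # reuse decoded samples after epoch 1
    .shuffle(10_000)                                  # random sample order for SGD
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)                       # overlap reads with compute
)
```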
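Asynchronous checkpointing can be sketched with a plain Python thread: snapshot the parameters to host memory, then write in the background while training continues. This is a minimal illustration of the idea, not any specific system from the literature:

```python
import threading
import torch

def async_checkpoint(model, path):
    # Snapshot parameters to host memory first, so training can keep
    # mutating the live model while the write proceeds in background.
    snapshot = {k: v.detach().cpu().clone()
                for k, v in model.state_dict().items()}
    t = threading.Thread(target=torch.save, args=(snapshot, path))
    t.start()
    return t  # join() before exit to guarantee the file is complete

model = torch.nn.Linear(16, 1)
writer = async_checkpoint(model, "ckpt_async.pt")
# ... training continues here while the checkpoint is written ...
writer.join()
```

The snapshot step is what keeps the on-disk checkpoint consistent even though the live model keeps updating during the write.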
The survey identifies several gaps in current research, including the need for more realistic benchmarks, comprehensive profiling tools, and advanced I/O optimization techniques tailored to the unique requirements of ML workloads on HPC systems.
Stats
"The dataset for Unet3d consists of one sample per NPZ file where each sample is approximately 146 MiBs. There were a total of 168 samples."
"The dataset size for BERT was configured to increase training speeds, with 10 TFRecord files containing 31,353 samples each, for a total of 131,530 samples. Each sample was 2,500 bytes."
Quotes
"Due to the prevalence of SGD, random batches of samples are read into memory at each iteration during model training. Small random I/O reads can be a bottleneck for PFSs which motivates the need for I/O optimization techniques such as prefetching and caching to ensure fast training speeds."
"Efficient I/O optimization techniques are needed to improve training speeds and enable rapid development of ML models."