Core Concepts
A novel training scheme that enables efficient distributed data-parallel training on sequences of different sizes with minimal overhead, reducing padding by over 100x while improving both training time and recall.
Abstract
The paper addresses the challenge of efficiently training neural network models using sequences of varying sizes, such as videos of different durations. The authors propose a novel training scheme that enables efficient distributed data-parallel (DDP) training on sequences of different sizes with minimal overhead.
The key highlights and insights are:
Traditional DDP training schemes struggle with data sequences of varied lengths: DDP synchronizes gradients through a collective all-reduce after every step, so when one worker exhausts its (shorter) data before the others, the remaining workers block on a synchronization that never completes, deadlocking training.
Common strategies to resolve this issue have significant drawbacks: padding all sequences to match the longest one wastes substantial computation on uninformative padded frames, while breaking each sample down into smaller chunks destroys the temporal relationships in the data.
The authors' proposed method, called BLoad, builds on the padding strategy but sharply reduces wasteful computation. It constructs blocks of a fixed size Tmax by concatenating randomly sampled sequences of length Ti ≤ Tmax, padding only the leftover tail of each block, and maintains a table of the starting index of each sequence within its block so the original sequences can be recovered (see the sketch after this list).
Experiments on the Action Genome dataset show that BLoad reduces the padding amount by more than 100x compared to the naive padding solution without deleting a single frame, improving both training time and the Recall@20 metric.
The proposed approach opens up new possibilities for training models on diverse data types with varying sequence lengths, such as video, audio, and text, while preserving temporal relationships.
Stats
The paper reports the following key metrics:
Padding amount: 534,831 (naive padding), 37,712 (BLoad)
Number of frames deleted: 92,271 (sampling), 0 (BLoad)
Time per epoch: 170 minutes (naive padding), 40 minutes (BLoad)
Recall@20 performance: 41.2 (sampling), 43.3 (BLoad)
Quotes
"By using this scheme we were able to reduce the padding amount by more than 100x while not deleting a single frame, resulting in an overall increased performance on both training time and Recall in our experiments."