Efficient Distributed Training of Neural Networks on Sequential Data with Varying Lengths
A novel training scheme that enables efficient distributed data-parallel training on sequences of varying lengths with minimal overhead, reducing padding by more than 100x while preserving model performance.
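To make the padding-reduction claim concrete, the sketch below contrasts naive batching (sequences batched in arrival order, each batch padded to its longest member) with length-aware batching (sequences sorted by length before batching). This is a generic illustration of why padding shrinks when similarly-sized sequences are grouped together; it is an assumption for exposition, not necessarily the scheme proposed here. The function names `naive_padding` and `bucketed_padding` are hypothetical.

```python
import random

def naive_padding(lengths, batch_size):
    """Count padding tokens when batches are formed in arrival order.

    Each batch is padded up to the length of its longest sequence.
    """
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += max(batch) * len(batch) - sum(batch)
    return total

def bucketed_padding(lengths, batch_size):
    """Count padding tokens when sequences are first sorted by length,
    so each batch contains sequences of similar size."""
    return naive_padding(sorted(lengths), batch_size)

if __name__ == "__main__":
    random.seed(0)
    # Synthetic corpus: 4096 sequences with lengths from 1 to 512 tokens.
    lengths = [random.randint(1, 512) for _ in range(4096)]
    print("naive padding:   ", naive_padding(lengths, 32))
    print("bucketed padding:", bucketed_padding(lengths, 32))
```

On such a corpus, length-aware batching typically cuts wasted padding by one to two orders of magnitude relative to arrival-order batching; the exact factor depends on the length distribution and batch size.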