Optimizing All-to-All Collective Communication Performance on Direct-Connect Supercomputer Topologies
This paper presents a comprehensive algorithmic toolchain that generates bandwidth-optimal all-to-all collective communication schedules and lowers them to arbitrary supercomputer-scale direct-connect topologies and interconnect technologies.