Optimizing Pipelined Inference of Deep Neural Networks for Maximum Throughput
We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning the model graph into k stages and minimizing the running time of the bottleneck stage, including communication. Because a pipeline's steady-state throughput is the reciprocal of its slowest stage's latency, minimizing the bottleneck's cost (compute plus communication) directly maximizes throughput.
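To make the objective concrete, here is a minimal sketch, not the paper's algorithm: it assumes the model graph reduces to a linear chain of layers with known per-layer compute times and per-cut activation-transfer times (all names and numbers are hypothetical), and uses a classic dynamic program to find the contiguous k-way split that minimizes the bottleneck stage cost.

```python
import math

def min_bottleneck_partition(compute, comm, k):
    """Split a linear chain of layers into k contiguous stages so that the
    most expensive stage (its compute time plus the time to send its output
    activations downstream) is as cheap as possible.

    compute[i]: time to run layer i
    comm[i]:    time to ship layer i's output to the next stage
                (ignored for the final layer, which sends nothing)
    Returns (bottleneck_time, stage_boundaries) where stage_boundaries
    lists the index of the last layer in each stage.
    """
    n = len(compute)
    prefix = [0.0]
    for c in compute:
        prefix.append(prefix[-1] + c)

    def stage_cost(lo, hi):  # layers lo..hi inclusive, 0-indexed
        cost = prefix[hi + 1] - prefix[lo]
        if hi < n - 1:       # a non-final stage pays to transfer activations
            cost += comm[hi]
        return cost

    # dp[j][i] = min bottleneck when layers 0..i are split into j stages
    INF = math.inf
    dp = [[INF] * n for _ in range(k + 1)]
    cut = [[-1] * n for _ in range(k + 1)]
    for i in range(n):
        dp[1][i] = stage_cost(0, i)
    for j in range(2, k + 1):
        for i in range(j - 1, n):          # need at least j layers
            for s in range(j - 2, i):      # s = last layer of the prefix
                cand = max(dp[j - 1][s], stage_cost(s + 1, i))
                if cand < dp[j][i]:
                    dp[j][i], cut[j][i] = cand, s

    # Recover the stage boundaries by walking the cut table backwards.
    bounds, i = [], n - 1
    for j in range(k, 1, -1):
        bounds.append(cut[j][i])
        i = cut[j][i]
    return dp[k][n - 1], sorted(bounds) + [n - 1]

# Hypothetical costs: cutting after layer 1 balances the two stages best,
# giving a bottleneck of max(4+2+3, 5+3) = 9.
compute = [4, 2, 5, 3]   # per-layer compute times
comm    = [1, 3, 1, 0]   # activation-transfer time at each cut point
print(min_bottleneck_partition(compute, comm, k=2))  # (9, [1, 3])
```

This sketch runs in O(k n^2) time; for long chains, a standard alternative is to binary-search the bottleneck value and greedily check feasibility. General (non-chain) model graphs require graph partitioning rather than this simple recurrence.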