Core Concepts
The SMART pruner ranks weight importance with a separate, learnable probability mask rather than weight magnitude, using a differentiable Top-k operator and a dynamic temperature trick to reach the target sparsity and escape non-sparse local minima. This yields state-of-the-art block and output channel pruning across a range of computer vision tasks and models.
Abstract
The paper introduces the SMART pruning algorithm, a novel approach for efficient block and output channel pruning on computer vision tasks. The key highlights are:
The SMART pruner uses a separate, learnable probability mask to rank weight importance rather than relying on weight magnitude alone, enabling more precise cross-layer importance ranking.
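A minimal sketch of this idea, assuming one learnable score per output channel (the class and names below are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class MaskedConv(nn.Module):
    """Conv layer whose output channels are gated by separate learnable
    importance scores, trained jointly with the weights (sketch)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # Importance is a free parameter per output channel rather than a
        # function of weight magnitude, so scores from different layers are
        # directly comparable for cross-layer ranking.
        self.scores = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # mask: soft per-channel probabilities in [0, 1], derived from
        # self.scores by a differentiable Top-k operator (next sketch).
        return self.conv(x) * mask.view(1, -1, 1, 1)
```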
The algorithm employs a differentiable Top-k operator to iteratively adjust and redistribute the mask parameters, allowing the soft probability mask to converge gradually to a binary mask.
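The paper's exact operator is not reproduced here; the sketch below uses one common relaxation, a sigmoid centered on the k-th-largest score, to show the mechanics (assumes 1 <= k < number of scores):

```python
import torch

def soft_topk(scores: torch.Tensor, k: int, tau: float) -> torch.Tensor:
    """Differentiable relaxation of a hard Top-k indicator (sketch).

    Each mask entry is sigmoid((score - t) / tau), with the threshold t set
    midway between the k-th and (k+1)-th largest scores: roughly the top k
    entries approach 1 and the rest approach 0. As tau -> 0, the soft mask
    converges to a hard 0/1 Top-k selection.
    """
    sorted_scores, _ = torch.sort(scores, descending=True)
    t = 0.5 * (sorted_scores[k - 1] + sorted_scores[k])
    return torch.sigmoid((scores - t) / tau)
```

With a large tau the mask stays soft and gradients reach every score, letting importance be redistributed across blocks; shrinking tau sharpens the selection, which is what the temperature trick below exploits.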
To avoid convergence to non-sparse local minima, the SMART pruner utilizes a dynamic temperature parameter trick, where the temperature is gradually reduced during training to sharpen the differentiable Top-k function.
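A sketch of one plausible annealing schedule; the paper's exact decay is not specified here, so exponential decay is assumed for illustration:

```python
def temperature(step: int, total_steps: int,
                tau_start: float = 1.0, tau_end: float = 1e-3) -> float:
    """Exponentially decay tau from tau_start to tau_end over training
    (assumed schedule). Early on, a large tau keeps the Top-k mask soft so
    importance scores can reorder freely; later, a small tau sharpens the
    mask toward binary, steering optimization away from non-sparse minima.
    """
    progress = step / max(total_steps - 1, 1)
    return tau_start * (tau_end / tau_start) ** progress
```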
Theoretical analysis shows that as the temperature parameter approaches zero, the global optimum of the SMART objective coincides with that of the fundamental pruning problem, mitigating the impact of regularization bias.
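A plausible formalization of that claim, with assumed notation: w the weights, m(θ, τ) the soft mask produced by the Top-k operator at temperature τ, L the task loss, n(w) the number of weight blocks, and k the number of blocks kept:

```latex
% SMART objective (soft mask from the differentiable Top-k operator):
\min_{w,\,\theta}\; L\bigl(w \odot m(\theta, \tau)\bigr)

% Fundamental pruning problem (hard binary mask, k blocks kept):
\min_{w,\,m}\; L(w \odot m)
\quad \text{s.t.} \quad m \in \{0,1\}^{n(w)},\; \|m\|_0 = k

% Claimed limit: as \tau \to 0, the global optima of the two problems
% coincide, so no regularization bias survives in the limit.
```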
Extensive experiments show that the SMART pruner outperforms state-of-the-art methods such as PDP, PaS, AWG, and ACDC across a range of models and computer vision tasks, including classification, object detection, and image segmentation.
The SMART pruner also exhibits superior performance on Transformer-based models in the context of N:M pruning, showcasing its adaptability and robustness across different neural network architectures and pruning types.
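For context, N:M sparsity keeps N nonzero weights in every group of M consecutive weights (e.g., the 2:4 pattern accelerated by recent NVIDIA GPUs). The minimal illustration below uses magnitude scores as a stand-in; SMART would rank entries with its learned probability mask instead:

```python
import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary N:M mask (sketch): keep the n largest-magnitude entries in
    each group of m consecutive weights. Assumes weight.numel() % m == 0."""
    groups = weight.reshape(-1, m)
    keep = groups.abs().topk(n, dim=1).indices
    mask = torch.zeros_like(groups).scatter_(1, keep, 1.0)
    return mask.reshape(weight.shape)

# 2:4 example: exactly two entries survive in each group of four,
# here {0.9, 0.4} in the first group and {-0.7, 0.6} in the second.
w = torch.tensor([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.03, 0.6])
pruned = w * nm_mask(w)
```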
Stats
The total number of weight blocks is denoted by n(w).
The sparsity ratio is represented by r.
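For example, taking r as the fraction of blocks pruned (an assumed convention), a network with n(w) = 1000 blocks at r = 0.75 would have the Top-k operator keep k = (1 - r) · n(w) = 250 blocks.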