Core Concepts
Optimizing the sequence of multiple compression techniques significantly reduces computation costs with minimal accuracy loss.
Abstract
The paper explores the Chain of Compression, proposing an optimal sequence for combining multiple compression techniques to reduce the computation cost of neural networks. It analyzes interactions between compression methods, examines the effect of repeating a compression step, and evaluates end-to-end performance on popular CNN architectures across several datasets.
1. Introduction
- Deep learning models on resource-constrained systems pose challenges due to high computational costs.
- Various compression techniques have been developed to reduce computational complexity.
- These approaches operate at different granularities and at different stages: some are applied offline, others dynamically at runtime.
2. Data Compression Pipeline
- Distillation, pruning, quantization, and early exit are integrated into the compression pipeline.
- The Chain of Compression framework combines diverse compression techniques in a sequential chain.
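The chained structure above can be sketched as function composition over a model. This is a hypothetical toy illustration, not the paper's implementation: the stage functions (`distill`, `prune`, `quantize`) and the dict-of-parameter-counts model are stand-ins invented here to show how stages compose in sequence.

```python
# Toy sketch of a Chain of Compression pipeline: each stage transforms a
# "model" (here just a dict of layer name -> parameter count). The stage
# order follows the paper's distillation -> pruning -> quantization chain;
# the stage bodies are illustrative stand-ins, not real compression code.

def distill(model):
    # Stand-in: a distilled student has roughly half the parameters.
    return {name: params // 2 for name, params in model.items()}

def prune(model):
    # Stand-in: drop layers whose parameter count falls below a threshold.
    return {name: params for name, params in model.items() if params >= 10}

def quantize(model):
    # Stand-in: 8-bit instead of 32-bit storage, a 4x size reduction.
    return {name: params // 4 for name, params in model.items()}

def chain(model, stages):
    # Apply the compression stages sequentially, feeding each stage's
    # output into the next one.
    for stage in stages:
        model = stage(model)
    return model

model = {"conv1": 100, "conv2": 60, "fc": 16}
compressed = chain(model, [distill, prune, quantize])
print(compressed)  # {'conv1': 12, 'conv2': 7}
```

Because each stage has the same signature (model in, model out), inserting or reordering stages only changes the list passed to `chain`, which is what makes studying the sequence itself tractable.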
3. Interaction Between Two Approaches
- Two compression approaches show complementary effects when applied in the optimal sequence.
- Sequence impacts compression rate and inference accuracy.
- Optimal sequence transitions from large to small granularity and static to dynamic compression.
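A minimal numeric example (constructed here, not taken from the paper) shows why the order of two compressions matters. With magnitude pruning and uniform quantization as stand-ins, a weight that falls below the pruning threshold before quantization may survive it afterward, so the two orders produce different weight tensors:

```python
# Magnitude pruning: zero out weights below a threshold.
def prune(weights, threshold=0.3):
    return [0.0 if abs(w) < threshold else w for w in weights]

# Uniform quantization: snap weights to a grid with the given step.
def quantize(weights, step=0.5):
    return [round(w / step) * step for w in weights]

weights = [0.2, 0.4, -0.26, 0.9]

prune_then_quantize = quantize(prune(weights))
quantize_then_prune = prune(quantize(weights))

print(prune_then_quantize)  # [0.0, 0.5, 0.0, 1.0]
print(quantize_then_prune)  # [0.0, 0.5, -0.5, 1.0]
```

The weight -0.26 is below the pruning threshold, so prune-then-quantize removes it; quantize-then-prune first rounds it up to -0.5, which then survives pruning. The surviving sparsity pattern, and hence both compression rate and accuracy, depends on the sequence.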
4. Adding Additional Compression
- Inserting additional compression approaches maintains the established sequence.
- Distillation should be applied ahead of pruning, pruning ahead of quantization, and quantization ahead of early exit.
5. Combinational Sequence Law
- Established sequence remains unaffected by inserting more compression approaches.
- The optimal sequence law proceeds from distillation to pruning, then quantization, then early exit.
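The sequence law can be encoded as a sort over two axes, large-to-small granularity and static-to-dynamic application. The numeric ranks below are assumptions chosen only to reproduce the stated order, not values from the paper:

```python
# Hypothetical encoding of the sequence law. Each technique gets a
# (granularity rank, dynamism rank) pair: lower granularity rank means
# coarser (larger) granularity; lower dynamism rank means more static.
TECHNIQUES = {
    "distillation": (0, 0),  # whole-model scope, applied offline
    "pruning":      (1, 0),  # layer/channel scope, applied offline
    "quantization": (2, 0),  # individual-weight scope, applied offline
    "early_exit":   (3, 1),  # per-input decision, made at runtime
}

def optimal_sequence(techniques):
    # Sort from large to small granularity, breaking ties static-first.
    return sorted(techniques, key=lambda name: techniques[name])

print(optimal_sequence(TECHNIQUES))
# ['distillation', 'pruning', 'quantization', 'early_exit']
```

Inserting a new technique then amounts to assigning it a rank pair; the existing pairwise order is preserved automatically, which mirrors the claim that adding approaches does not disturb the established sequence.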
6. Repeating the Compression
- Continuous repetition of a single compression method does not significantly enhance performance.
- Repeating quantization after the optimal sequence disrupts the established sequence.
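One intuition for why repetition adds little, sketched here with uniform quantization as a stand-in (not the paper's analysis): quantizing to a fixed grid is idempotent, so a second pass with the same grid changes nothing.

```python
# Uniform quantization: snap each weight to the nearest multiple of step.
def quantize(weights, step=0.25):
    return [round(w / step) * step for w in weights]

weights = [0.13, -0.6, 0.81]
once = quantize(weights)    # [0.25, -0.5, 0.75]
twice = quantize(once)      # identical: values already lie on the grid
assert once == twice
```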
7. Evaluation
- The proposed Chain of Compression achieves remarkable compression ratios across diverse benchmarks.
- Superior performance compared to other state-of-the-art compression methods.
- Maintains high post-compression accuracy while achieving significant compression.
8. Related Work
- Various compression techniques explored in recent years to support neural networks on lightweight platforms.
- Proposed Chain of Compression demonstrates superior performance compared to other methods.
Stats
To relieve this burden, model compression has become an important research focus.
Many approaches, including quantization, pruning, early exit, and knowledge distillation, have demonstrated their effectiveness at reducing redundancy in neural networks.
The proposed Chain of Compression can reduce computation cost by 100-1000x with negligible accuracy loss relative to the baseline model.
Quotes
"Applying two compressions with the optimal sequence can achieve better compression performance compared to an individual single compression."
"The sequence of applying two compression approaches will directly impact the compression rate and inference accuracy."