The paper addresses the challenge of enabling on-device training on tiny IoT devices with limited memory resources, which is fundamentally different from cloud training.
The authors identify two unique challenges: 1) Quantized graphs of neural networks are hard to optimize due to low bit-precision and lack of normalization; 2) Limited hardware resources do not allow full back-propagation.
To cope with the optimization difficulty, the authors propose Quantization-Aware Scaling (QAS) to calibrate the gradient scales and stabilize 8-bit quantized training, matching the accuracy of floating-point training.
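To make the scaling rule concrete, here is a minimal NumPy sketch of the idea: because the int8 weight is roughly the floating-point weight divided by its quantization scale, both the weight norm and the gradient norm are distorted by that scale, and multiplying the gradient by the inverse square of the scale restores the floating-point gradient-to-weight ratio. The function name and the per-tensor-scale assumption are illustrative, not the authors' API.

```python
import numpy as np

def quantization_aware_scaling(grad_w_bar, s_w):
    """Rescale the gradient of an int8-quantized weight tensor (sketch of QAS).

    With w_bar ~= w / s_w, the chain rule gives grad_w_bar ~= s_w * grad_w,
    while ||w_bar|| ~= ||w|| / s_w, so the gradient-to-weight norm ratio is
    off by a factor of s_w**2 relative to floating-point training.
    Multiplying by s_w**-2 compensates for this mismatch.
    """
    return grad_w_bar * (s_w ** -2)

# Example: a per-tensor scale of 0.05 inflates the ratio by 400x;
# QAS-style rescaling brings it back in line with floating-point training.
g = quantization_aware_scaling(np.random.randn(16, 16).astype(np.float32), 0.05)
```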
To reduce the memory footprint, the authors propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. An automated method based on contribution analysis is used to find the best update scheme under different memory budgets.
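As an illustration of how such a contribution analysis could drive the search, below is a hedged greedy sketch: each candidate update (a layer's bias, or a fraction of a layer's weight channels) carries a measured accuracy contribution and a memory cost, and options are picked in order of contribution per kilobyte until the budget is exhausted. The paper itself searches over these contributions (e.g. with evolutionary search); the greedy ranking, field names, and candidate format here are simplifications for illustration.

```python
def plan_sparse_update(candidates, memory_budget_kb):
    """Greedy sketch of selecting which layers/sub-tensors to update.

    `candidates` is a list of dicts, one per update option, e.g.
        {"name": "block6.conv.weight[:25%]", "delta_acc": 0.8, "mem_kb": 12.0}
    where `delta_acc` is the measured accuracy contribution of enabling that
    update and `mem_kb` is the extra training memory it costs.
    """
    # Rank options by accuracy gained per KB of extra training memory.
    ranked = sorted(candidates, key=lambda c: c["delta_acc"] / c["mem_kb"], reverse=True)
    plan, used_kb = [], 0.0
    for cand in ranked:
        if used_kb + cand["mem_kb"] <= memory_budget_kb:
            plan.append(cand["name"])
            used_kb += cand["mem_kb"]
    return plan, used_kb
```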
The algorithmic innovations are implemented in a lightweight Tiny Training Engine (TTE), which offloads runtime auto-differentiation to compile time, prunes the backward computation graph, and performs operator reordering and fusion to further reduce memory usage.
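The backward-graph pruning can be pictured with the toy sketch below: once the sparse-update plan fixes which tensors are trainable, gradients only need to flow back as far as the earliest trainable layer, so weight-gradient nodes for frozen tensors and input-gradient nodes before that point can be dropped at compile time. The data structures and node naming are hypothetical and do not reflect TTE's actual intermediate representation.

```python
def prune_backward_graph(layers, trainable_params):
    """Sketch of compile-time backward-graph pruning in the spirit of TTE.

    `layers` is an ordered list of layer names (forward order) and
    `trainable_params` maps layer name -> set of updated tensors
    (e.g. {"weight", "bias"}); frozen layers are simply absent.
    """
    trainable_idx = [i for i, layer in enumerate(layers) if trainable_params.get(layer)]
    if not trainable_idx:
        return []  # nothing to train: no backward graph at all
    first = min(trainable_idx)

    backward_nodes = []
    for i in range(len(layers) - 1, first - 1, -1):
        layer = layers[i]
        # Weight/bias gradient nodes only for tensors that are actually updated.
        for tensor in sorted(trainable_params.get(layer, ())):
            backward_nodes.append(f"d{tensor}[{layer}]")
        # Input-gradient node only if an earlier layer still needs the signal.
        if i > first:
            backward_nodes.append(f"dX[{layer}]")
    return backward_nodes

plan = prune_backward_graph(
    layers=["conv1", "block1", "block2", "block3", "classifier"],
    trainable_params={"block3": {"bias"}, "classifier": {"weight", "bias"}},
)
# -> ['dbias[classifier]', 'dweight[classifier]', 'dX[classifier]', 'dbias[block3]']
```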
The proposed framework is the first solution to enable tiny on-device training of CNNs under 256KB SRAM and 1MB Flash without auxiliary memory. It achieves a 2,300× memory reduction compared to PyTorch/TensorFlow, matches the accuracy of cloud training plus edge deployment on the VWW dataset, and surpasses the common tinyML accuracy requirement by 9%.