Enabling Tiny On-Device Training with 256KB Memory: An Algorithm-System Co-Design Approach
The authors propose an algorithm-system co-design framework that enables tiny on-device training of convolutional neural networks under a tight memory budget of 256KB SRAM and 1MB Flash, without auxiliary memory. The key innovations are Quantization-Aware Scaling, which calibrates gradient scales to stabilize 8-bit quantized training; Sparse Update, which skips the gradients of less important layers and sub-tensors to shrink the training memory footprint; and a lightweight Tiny Training Engine that prunes the backward computation graph and moves auto-differentiation from runtime to compile time.
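The intuition behind Quantization-Aware Scaling can be illustrated with a toy sketch (illustrative only, not the authors' code; the `qas_update` helper, the weight/gradient values, and the scale `s` are made up for this example). When a weight tensor is stored as scaled integers, `W ≈ s * W_q`, the chain rule gives `dL/dW_q = s * dL/dW`, so the gradient-to-weight ratio in the quantized domain is distorted by a factor of `s**2` relative to the floating-point counterpart; multiplying the quantized-domain gradient by `s**-2` restores it:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def qas_update(w_q, g_q, s, lr=0.01):
    """One SGD step in the quantized domain with QAS compensation:
    scale the gradient by s**-2 to undo the s**2 distortion of the
    gradient-to-weight ratio caused by W = s * W_q."""
    return [wq - lr * gq * s ** -2 for wq, gq in zip(w_q, g_q)]

w = [0.5, -0.25, 0.125]            # fp32 weights (toy values)
g = [0.02, -0.01, 0.03]            # fp32 gradient dL/dW (toy values)
s = 0.05                           # hypothetical per-tensor quantization scale
w_q = [x / s for x in w]           # quantized-domain weights (W = s * w_q)
g_q = [x * s for x in g]           # chain rule: dL/dW_q = s * dL/dW

ratio_fp  = norm(g) / norm(w)
ratio_raw = norm(g_q) / norm(w_q)                         # distorted by s**2
ratio_qas = norm([x * s ** -2 for x in g_q]) / norm(w_q)  # QAS-corrected

print(abs(ratio_raw - ratio_fp * s ** 2) < 1e-9)  # True: off by s**2 without QAS
print(abs(ratio_qas - ratio_fp) < 1e-9)           # True: QAS restores the ratio
# The QAS step equals the fp32 SGD step expressed in the quantized domain:
stepped = qas_update(w_q, g_q, s)
print(all(abs(a - (b - 0.01 * c) / s) < 1e-9 for a, b, c in zip(stepped, w, g)))  # True
```

Because the corrected step coincides with the floating-point step rescaled into the quantized domain, plain SGD can be applied to int8 weights without the per-layer learning-rate mismatch that otherwise destabilizes 8-bit training.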