Core Concepts
PyTorch's automatic differentiation implementation can be optimized to reduce memory consumption in selective differentiation scenarios by leveraging information about parameter differentiability.
Abstract
The authors observe that PyTorch's current automatic differentiation (AD) implementation does not take the differentiability of layer parameters into account when deciding which tensors to store for the backward pass. This information can be used to discard the inputs of linear layers (dense, convolution, normalization) whose parameters are marked as non-differentiable, thereby reducing the memory footprint.
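This behavior is easy to observe in stock PyTorch. The sketch below (our illustration, not the authors' code) uses torch.autograd.graph.saved_tensors_hooks to count the bytes autograd stores for a single convolution; freezing the parameters does not change the total, because the input is saved either way.

```python
import torch

def saved_bytes(module, x):
    """Sum the sizes of all tensors autograd saves during a forward pass."""
    sizes = []

    def pack(t):
        sizes.append(t.nelement() * t.element_size())
        return t

    with torch.autograd.graph.saved_tensors_hooks(pack, lambda t: t):
        module(x)
    return sum(sizes)

conv = torch.nn.Conv2d(3, 64, kernel_size=3)
x = torch.randn(1, 3, 128, 128, requires_grad=True)

print(saved_bytes(conv, x))  # parameters differentiable
conv.weight.requires_grad_(False)
conv.bias.requires_grad_(False)
print(saved_bytes(conv, x))  # same total: the input is still stored
```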
The authors provide a drop-in implementation of various layers that is agnostic to parameter differentiability. They demonstrate that it reduces memory consumption by up to 6x in selective differentiation scenarios such as fine-tuning, adversarial example generation, and neural style transfer, without affecting runtime performance.
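To illustrate the idea behind such a layer, here is a simplified, hypothetical sketch for a dense layer (the authors' actual drop-in layers also cover convolution and normalization and may differ in detail). ctx.needs_input_grad reflects which inputs require gradients, so the forward pass can decide what to keep:

```python
import torch

class MemSaveLinearFn(torch.autograd.Function):
    """Sketch: save the input only if the weight needs a gradient,
    and the weight only if the input needs a gradient."""

    @staticmethod
    def forward(ctx, x, weight, bias):
        # needs_input_grad corresponds to the inputs (x, weight, bias)
        ctx.save_for_backward(
            x if ctx.needs_input_grad[1] else None,
            weight if ctx.needs_input_grad[0] else None,
        )
        out = x @ weight.T
        return out if bias is None else out + bias

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_x = grad_out @ weight if ctx.needs_input_grad[0] else None
        grad_w = grad_out.T @ x if ctx.needs_input_grad[1] else None
        grad_b = grad_out.sum(0) if ctx.needs_input_grad[2] else None
        return grad_x, grad_w, grad_b

# Usage: y = MemSaveLinearFn.apply(x, weight, bias)
# With weight.requires_grad == False, the input x is no longer kept alive.
```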
The key insights are:
PyTorch's AD stores the computation graph as if all parameters were differentiable, even when only a subset actually requires gradients.
The authors' implementation tracks the differentiability of layer parameters and selectively stores the layer inputs, leading to memory savings.
Interactions between layers (e.g., a ReLU following a convolution) can diminish the memory savings, because the activation itself stores a tensor for its backward pass; the authors provide a custom ReLU implementation to overcome this (a sketch follows this list).
The authors evaluate their approach on popular CNN architectures, including ResNet and EfficientNet, as well as object detection models, and consistently observe significant memory reductions without performance degradation.
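On the ReLU point, a hypothetical sketch of such a drop-in activation (the authors' implementation may differ): saving a boolean mask costs one byte per element instead of four for a float32 tensor.

```python
import torch

class MaskReLUFn(torch.autograd.Function):
    """Sketch: store a bool mask (1 byte/element) for the backward pass
    instead of a full float32 tensor (4 bytes/element)."""

    @staticmethod
    def forward(ctx, x):
        mask = x > 0
        ctx.save_for_backward(mask)
        return x * mask  # zeros out the negative entries

    @staticmethod
    def backward(ctx, grad_out):
        (mask,) = ctx.saved_tensors
        return grad_out * mask
```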
Stats
For a deep CNN consisting only of convolution layers, peak memory consumption grows linearly with the number of layers when all parameters are marked as differentiable; with the authors' implementation, it remains constant when all parameters are marked as non-differentiable (stock PyTorch stores the layer inputs either way).
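One way to measure this trend (our sketch; assumes a CUDA device, and that make_conv builds either a stock torch.nn.Conv2d or a differentiability-aware replacement such as the authors' drop-in layer):

```python
import torch

def peak_forward_mem_mb(make_conv, depth, params_differentiable):
    """Peak GPU memory (MiB) during a forward pass of a conv-only chain."""
    net = torch.nn.Sequential(*[make_conv() for _ in range(depth)]).cuda()
    for p in net.parameters():
        p.requires_grad_(params_differentiable)
    x = torch.randn(8, 32, 64, 64, device="cuda", requires_grad=True)
    torch.cuda.reset_peak_memory_stats()
    net(x)  # x requires grad, so a graph is built and inputs may be stored
    return torch.cuda.max_memory_allocated() / 2**20

def make_conv():
    return torch.nn.Conv2d(32, 32, kernel_size=3, padding=1)

# With stock Conv2d both columns grow linearly with depth; only a
# differentiability-aware layer keeps the second column roughly constant.
for depth in (4, 8, 16, 32):
    print(depth,
          peak_forward_mem_mb(make_conv, depth, True),
          peak_forward_mem_mb(make_conv, depth, False))
```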
Quotes
"PyTorch stores a layer input with requires_grad = True although the layer's parameters might be non-differentiable, and therefore not require it to be stored."
"Our implementation shares the performance of PyTorch's."