Key Concepts
LORS introduces a low-rank residual structure to reduce parameters in stacked modules while maintaining performance.
Abstract
The paper introduces LORS to address the growing parameter counts of stacked structures, focusing on Transformers. It proposes sharing the bulk of the parameters among stacked modules so that each module keeps only a small number of unique ones. Extensive experiments on object detection validate that LORS reduces parameters by up to 70% while achieving comparable or better performance.
Introduction:
Large models such as GPT-3 face training and deployment challenges as their parameter counts grow.
Various methods like knowledge distillation, pruning, quantization, and parameter sharing aim to reduce parameters.
Related Work:
Many neural networks use stacked structures, like CNN-based models and Transformers.
LoRA and its variants apply low-rank decomposition to fine-tune large language models; LORS instead uses low-rank structure to shrink the parameter count of the stacked modules themselves.
Approach:
LORS decomposes each stacked module's parameters into parameters shared across all modules and a small set of module-private ones, reducing overall parameter usage.
Each module's effective weights are the shared parameters plus its private low-rank residual, so only a few unique parameters are added per module while performance is maintained (see the sketch below).
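The following is a minimal PyTorch sketch of this shared-plus-private decomposition, assuming a stack of square linear layers; the class name LORSLinear, the rank, and the initialization choices are illustrative assumptions, not the paper's exact implementation (LORS also handles adaptive parameters, which this sketch omits).

```python
import torch
import torch.nn as nn

class LORSLinear(nn.Module):
    # Hypothetical sketch: a stack of linear layers whose l-th weight is
    # one shared full-rank matrix plus a private low-rank residual A_l @ B_l.
    def __init__(self, dim: int, num_layers: int, rank: int = 8):
        super().__init__()
        # One weight matrix shared by every stacked layer.
        self.shared_weight = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.shared_weight)
        # Private low-rank factors: each layer stores 2 * dim * rank
        # parameters instead of a full dim * dim matrix.
        self.A = nn.Parameter(torch.randn(num_layers, dim, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(num_layers, rank, dim))

    def forward(self, x: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # Effective weight of layer l: W_l = W_shared + A_l @ B_l.
        # B starts at zero, so all layers initially behave identically and
        # the per-layer residuals are learned during training.
        weight = self.shared_weight + self.A[layer_idx] @ self.B[layer_idx]
        return x @ weight
```

As a rough parameter count under these assumed sizes: with dim = 256, six layers, and rank = 8, a plain stack stores 6 × 256 × 256 ≈ 393K weights, while the shared-plus-residual form stores 256 × 256 + 6 × 2 × 256 × 8 ≈ 90K, roughly a 77% reduction, in the same ballpark as the up-to-70% figure reported for AdaMixer's decoders.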
Experiments:
LORS applied to AdaMixer's decoders shows a significant reduction in parameters while achieving competitive performance.
Longer training schedules further improve LORS's performance, and the method remains effective across different backbones and query counts.
Ablation studies confirm that both the shared and the private parameters contribute to performance.
Additional experiments on DeiT and Transformer models demonstrate LORS's versatility and effectiveness in reducing parameters.
Statistics
GPT-3 uses 175 billion parameters across 96 stacked Transformer layers.
AdaMixer's decoders saw a reduction of up to 70% in parameters while maintaining performance.
Quotes
"LORS allows stacked modules to share parameters, reducing unique parameters while maintaining performance."
"Experiments validate LORS's effectiveness in reducing parameters while achieving competitive performance."