Core Concepts
The authors propose LoRAT, which leverages Low-Rank Adaptation (LoRA) to train larger Vision Transformers efficiently for visual tracking, overcoming the challenges of transferring parameter-efficient fine-tuning (PEFT) to this domain.
Abstract
The paper introduces LoRAT, a method that scales visual tracking to larger Vision Transformers using the Low-Rank Adaptation technique. It addresses the difficulty of fine-tuning large models efficiently and reports significant performance improvements across multiple tracking benchmarks.
The essence of the work lies in adapting LoRA to the domain of visual tracking, enabling practical training on GPUs with limited memory. By decoupling position embeddings and designing an anchor-free head network, the authors achieve improved performance metrics.
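The core of LoRA is to freeze the pretrained weight matrix and learn only a low-rank additive update. The sketch below is a minimal, illustrative NumPy version of that idea (class and parameter names are my own assumptions, not the paper's code): the frozen weight `W` is augmented by `B @ A`, scaled by `alpha / r`, so only `r * (d_in + d_out)` parameters need gradients and optimizer state, which is what makes training on memory-limited GPUs practical.

```python
import numpy as np

class LoRALinear:
    """Illustrative LoRA-augmented linear layer (assumption: simplified sketch,
    not the authors' implementation)."""

    def __init__(self, d_in, d_out, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
        self.A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, r))                    # trainable up-projection, init 0
        self.scale = alpha / r

    def __call__(self, x):
        # Base path plus low-rank update. Because B starts at zero, the layer
        # initially reproduces the frozen pretrained output exactly.
        return x @ self.W.T + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(d_in=16, d_out=8)
x = np.ones((2, 16))
out = layer(x)
print(out.shape)  # (2, 8)
```

Because `B` is zero-initialized, the adapted model starts from the pretrained behavior, and after training the update `B @ A` can be merged into `W`, so inference incurs no extra cost.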
Ablation experiments show that LoRA-based adaptation substantially improves the tracker's performance over full fine-tuning. The proposed input embedding scheme and MLP-only head network further help the pretrained Transformer adapt to the tracking task.
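To make the "MLP-only, anchor-free head" concrete, here is a hedged NumPy sketch (layer sizes, output parameterization, and names are illustrative assumptions, not the paper's exact architecture): a small MLP is shared across all search-region tokens and maps each token to a foreground score plus a box estimate, with no convolutions and no predefined anchors.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class MLPHead:
    """Illustrative anchor-free, MLP-only prediction head
    (assumption: simplified sketch for exposition)."""

    def __init__(self, d_model=256, d_hidden=256, seed=0):
        rng = np.random.default_rng(seed)
        # Two-layer MLP shared across tokens; 5 outputs = 1 score + 4 box values.
        self.W1 = rng.standard_normal((d_hidden, d_model)) * 0.02
        self.W2 = rng.standard_normal((5, d_hidden)) * 0.02

    def __call__(self, tokens):
        h = relu(tokens @ self.W1.T)
        out = h @ self.W2.T
        scores = out[:, 0]   # per-token foreground logit
        boxes = out[:, 1:]   # per-token box regression targets
        return scores, boxes

head = MLPHead()
tokens = np.zeros((196, 256))  # e.g. a 14x14 grid of search-region tokens
scores, boxes = head(tokens)
print(scores.shape, boxes.shape)  # (196,) (196, 4)
```

At inference, the box from the highest-scoring token would be taken as the prediction; because the head operates per token, it needs no anchor boxes or spatial priors.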
Efficiency comparisons show that LoRAT achieves impressive results in terms of speed and resource requirements. The model outperforms state-of-the-art Transformer-based trackers while maintaining practical inference speeds on different datasets.
Stats
LaSOT SUC score improved from 0.703 to 0.743 with the L-224 variant
Training time reduced from 35.0 to 10.8 GPU hours for the L-224 variant
Inference speed increased from 52 to 119 FPS for the L-224 variant