Core Concepts
VRWKV replaces quadratic self-attention with a linear-complexity attention mechanism, making visual perception both efficient and scalable.
Abstract
Vision-RWKV (VRWKV) adapts the RWKV architecture from the NLP field to vision tasks. It handles sparse inputs efficiently and demonstrates robust global processing capabilities while scaling effectively. Its reduced spatial aggregation complexity allows seamless processing of high-resolution images without windowing operations. Evaluations show that VRWKV matches ViT's classification performance with significantly faster speeds and lower memory usage, and that in dense prediction tasks it outperforms window-based models while maintaining comparable speeds. These results highlight VRWKV's potential as a more efficient alternative for visual perception tasks.
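To make the linear-complexity claim concrete, here is a minimal NumPy sketch in the spirit of the paper's bidirectional WKV aggregation: every token attends to all others with a distance-based exponential decay, yet the result is computed with cumulative sums rather than an N x N attention map. The function name bi_wkv, the tensor shapes, and the simplified decay term are illustrative assumptions, not the paper's exact formulation or its CUDA kernel.

```python
import numpy as np

def bi_wkv(k, v, w, u):
    """Sketch of a bidirectional, linear-complexity WKV aggregation.

    k, v : (N, C) keys and values for N flattened image tokens
    w, u : (C,)   decay rate and current-token bonus (assumed shapes)

    Each output token t is a weighted average of all values, with
    weight exp(-|t - i| * w / N + k_i) for i != t and exp(u + k_t)
    for the token itself (a simplified form of the paper's Bi-WKV).
    """
    N, C = k.shape
    c = w / N                       # per-step decay, scaled by sequence length
    pos = np.arange(N)[:, None]     # (N, 1) token positions

    # exp(-(t - i) * c + k_i) = exp(-t * c) * exp(i * c + k_i) for i < t,
    # so sums over earlier/later tokens factor into cumulative sums.
    a = np.exp(pos * c + k)         # separable weights for tokens i < t
    b = np.exp(-pos * c + k)        # separable weights for tokens i > t

    # Exclusive prefix sums over earlier tokens (i < t).
    fwd_num = np.cumsum(a * v, axis=0) - a * v
    fwd_den = np.cumsum(a, axis=0) - a
    # Exclusive suffix sums over later tokens (i > t).
    bwd_num = np.cumsum((b * v)[::-1], axis=0)[::-1] - b * v
    bwd_den = np.cumsum(b[::-1], axis=0)[::-1] - b

    # Recombine with position prefactors and the current-token term.
    self_w = np.exp(u + k)
    num = np.exp(-pos * c) * fwd_num + np.exp(pos * c) * bwd_num + self_w * v
    den = np.exp(-pos * c) * fwd_den + np.exp(pos * c) * bwd_den + self_w
    return num / den

# Tiny smoke test on a 14 x 14 = 196-token grid (hypothetical sizes).
rng = np.random.default_rng(0)
N, C = 196, 64
out = bi_wkv(rng.normal(0.0, 0.1, (N, C)), rng.normal(size=(N, C)),
             np.full(C, 1.0), np.zeros(C))
assert out.shape == (N, C)
```

Because each directional sum is a single cumulative-sum pass, the cost grows linearly with the token count, which is what allows high-resolution inputs to be processed without windowing; a production implementation would additionally need a numerically stable scan to keep the exponentials from overflowing.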
Stats
VRWKV-T, trained only on ImageNet-1K, achieves 75.1% top-1 accuracy, outperforming DeiT-T by 2.9 points.
VRWKV-L, pretrained on ImageNet-22K, reaches 85.3% top-1 accuracy, slightly higher than ViT-L.
On the COCO dataset, VRWKV-L achieves 50.6% box mAP, 1.9 points higher than ViT-L.
Quotes
"VRWKV matches ViT's classification performance with significantly faster speeds and lower memory usage."
"Our evaluations highlight VRWKV’s potential as a more efficient alternative for visual perception tasks."