Vision-RWKV adapts the RWKV architecture to vision, efficiently handling sparse inputs and demonstrating robust global processing capabilities, offering a more efficient alternative to attention-based backbones for visual perception tasks.
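The efficiency claim rests on RWKV-style token mixing, which replaces quadratic softmax attention with a decay-weighted key-value aggregation. Below is a minimal sketch of a *bidirectional* variant of that idea; the function name `bi_wkv`, the scalar decay `w`, and the naive O(T²) formulation are illustrative assumptions (the actual operator learns per-channel decays and is computed in linear time).

```python
import numpy as np

def bi_wkv(k, v, w=0.5):
    """Naive O(T^2) reference for bidirectional, decay-weighted
    key-value mixing in the spirit of RWKV's WKV operator.

    k, v: (T, C) keys and values.
    w:    scalar decay rate (a simplifying assumption; real models
          learn per-channel decays and use a linear-time recurrence).
    """
    t_len, _ = k.shape
    idx = np.arange(t_len)
    # Exponential decay based on token distance |t - i|, symmetric in
    # both directions (bidirectional, unlike causal NLP RWKV).
    decay = -w * np.abs(idx[:, None] - idx[None, :])       # (T, T)
    weights = np.exp(decay[:, :, None] + k[None, :, :])    # (T, T, C)
    num = (weights * v[None, :, :]).sum(axis=1)            # (T, C)
    den = weights.sum(axis=1)                              # (T, C)
    return num / den  # each output is a convex combination of values
```

Because the weights are positive and normalized, each output token is a convex combination of the value vectors, weighted toward nearby tokens with similar keys.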
The SHViT paper proposes a Single-Head Vision Transformer that achieves a state-of-the-art speed-accuracy trade-off on various devices by addressing computational redundancy in both its macro (spatial) and micro (channel) design.
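On the micro side, the idea is that multi-head attention carries redundant heads, so a single shared head operating on only a fraction of the channels can suffice, with the remaining channels passed through untouched. A minimal sketch of that pattern follows; the function name, the random projection weights, and the channel ratio `r` are illustrative assumptions, not SHViT's exact module.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention_partial(x, r=0.25):
    """Single-head self-attention over only a fraction r of channels.

    x: (N, C) tokens. The first int(C * r) channels are mixed by one
    shared attention head; the rest pass through unchanged (the
    partial-channel pattern, with hypothetical random weights).
    """
    n, c = x.shape
    c_attn = int(c * r)                         # channels sent through attention
    x_attn, x_pass = x[:, :c_attn], x[:, c_attn:]
    rng = np.random.default_rng(0)              # stand-in for learned weights
    wq, wk, wv = (rng.standard_normal((c_attn, c_attn)) / np.sqrt(c_attn)
                  for _ in range(3))
    q, k, v = x_attn @ wq, x_attn @ wk, x_attn @ wv
    attn = softmax(q @ k.T / np.sqrt(c_attn))   # one head, (N, N)
    return np.concatenate([attn @ v, x_pass], axis=1)  # (N, C)
```

Restricting attention to a channel subset shrinks the projection and attention cost while the untouched channels keep the block's output width fixed.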