The author proposes SkelVIT, a three-level architecture that uses vision transformers (VITs) for skeleton-based action recognition. The study highlights the robustness of VITs to the pseudo-image representation of the skeleton data and the effectiveness of ensemble classifiers.
Skeleton-based action recognition benefits from vision transformers, which offer robustness and efficiency when operating on pseudo-image representations.
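To make the idea of a pseudo-image and an ensemble decision concrete, here is a minimal sketch in Python. It is not the author's implementation: the summary does not state how SkelVIT builds its pseudo-images or combines its classifiers, so the min-max encoding of joint coordinates into color channels, the optional joint reordering, and the majority vote are all assumptions, and the function names `skeleton_to_pseudo_image` and `majority_vote` are hypothetical. The classifiers themselves (for example, vision transformers) are left out; only their predicted labels are combined.

```python
import numpy as np


def skeleton_to_pseudo_image(seq, joint_order=None):
    """Encode a skeleton sequence as an image (assumed encoding, not SkelVIT's).

    seq: array of shape (T, J, 3) with T frames, J joints, and x/y/z coordinates.
    joint_order: optional permutation of joint indices; different orderings
        give different pseudo-images of the same sequence.
    Returns a uint8 array of shape (J, T, 3): rows are joints, columns are
    frames, and the three coordinates become the three color channels.
    """
    seq = np.asarray(seq, dtype=np.float64)
    if joint_order is not None:
        seq = seq[:, joint_order, :]
    # Min-max normalize each coordinate axis independently to [0, 1].
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    scale = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    norm = (seq - lo) / scale
    img = np.round(norm * 255).astype(np.uint8)
    return img.transpose(1, 0, 2)


def majority_vote(predictions):
    """Combine per-classifier labels by majority vote (assumed ensemble rule).

    predictions: integer array of shape (n_classifiers, n_samples).
    Returns an array of shape (n_samples,). Ties go to the smallest label.
    """
    preds = np.asarray(predictions, dtype=np.int64)
    n_classes = int(preds.max()) + 1
    # counts[c, s] = number of classifiers that predicted class c for sample s.
    counts = np.stack(
        [np.bincount(column, minlength=n_classes) for column in preds.T],
        axis=1,
    )
    return counts.argmax(axis=0)
```

In a setup like this, each joint ordering would yield its own pseudo-image, one classifier would be trained per representation, and `majority_vote` would merge the per-classifier labels into a final action class.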