Core Concepts
Pruning redundant pose tokens and then recovering the full-length sequence makes transformer-based 3D human pose estimation more efficient.
Abstract
The article introduces the Hourglass Tokenizer (HoT), a framework for efficient transformer-based 3D human pose estimation. It addresses the high computational cost of video pose transformers (VPTs) by pruning redundant pose tokens and then recovering full-length tokens. With these token pruning and token recovering strategies, the method achieves both high efficiency and high estimation accuracy compared to existing VPT models. Extensive experiments on benchmark datasets demonstrate the effectiveness of the HoT framework.
Directory:
- Introduction
  - Video-based 3D human pose estimation applications.
  - Transformer-based architectures for video pose estimation.
- Method
  - Token Pruning Cluster (TPC) for selecting representative tokens.
  - Token Recovering Attention (TRA) for restoring full-length tokens.
- Experiments
  - Ablation studies on block index and number of representative tokens.
  - Comparison with state-of-the-art methods on Human3.6M and MPI-INF-3DHP datasets.
- Conclusion
  - Summary of the proposed HoT framework for efficient 3D human pose estimation.
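The two Method components listed above, selecting a few representative tokens and then restoring a full-length sequence, can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on loudly stated assumptions: farthest-point sampling stands in for the clustering step of TPC, random vectors stand in for whatever queries TRA learns, and the names `prune_tokens` and `recover_tokens` are hypothetical. It only shows the hourglass shape of the token count (many, then few, then many again).

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two token vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def prune_tokens(tokens, num_keep):
    """Return sorted indices of num_keep representative frames.

    Assumption: farthest-point sampling is used here as a simple stand-in
    for a clustering-based selection; the source does not specify the algorithm.
    """
    if not 1 <= num_keep <= len(tokens):
        raise ValueError("num_keep must be between 1 and the number of tokens")
    chosen = [0]
    # Distance from every token to its nearest chosen token; -1 marks chosen ones.
    min_d = [sq_dist(t, tokens[0]) for t in tokens]
    min_d[0] = -1.0
    while len(chosen) < num_keep:
        nxt = max(range(len(tokens)), key=lambda i: min_d[i])
        chosen.append(nxt)
        min_d = [min(d, sq_dist(t, tokens[nxt])) for d, t in zip(min_d, tokens)]
        min_d[nxt] = -1.0
    return sorted(chosen)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def recover_tokens(queries, reps):
    """Cross-attention: one output per query, attending over representative tokens.

    Keys and values are both the representative tokens (no learned projections
    in this sketch), so each output is a weighted average of them.
    """
    d = len(reps[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in reps]
        w = softmax(scores)
        out.append([sum(wj * k[c] for wj, k in zip(w, reps)) for c in range(d)])
    return out


random.seed(0)
T, d, n = 16, 4, 5  # hypothetical sizes: frames, feature dim, tokens kept
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
keep = prune_tokens(tokens, n)
reps = [tokens[i] for i in keep]
# Assumption: random queries stand in for learned per-frame queries.
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
recovered = recover_tokens(queries, reps)
print(len(tokens), "->", len(reps), "->", len(recovered))
```

Without trained weights the recovered tokens are only weighted averages of the representative ones, so the sketch conveys the data flow rather than any accuracy behaviour.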
Stats
Our HoT can save nearly 50% FLOPs without sacrificing accuracy and nearly 40% FLOPs with only 0.2% accuracy drop.
Extensive experiments on Human3.6M and MPI-INF-3DHP datasets demonstrate the efficiency and accuracy of our method.
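To see why dropping tokens part-way through the network translates into FLOPs savings of this order, a rough back-of-the-envelope estimate can help. The sketch below uses the textbook cost model of a standard transformer block (linear projections, attention, feed-forward layers) and ignores the cost of the pruning and recovering modules themselves. All sizes are hypothetical illustration values, not the configurations evaluated in the paper, and the function names are made up for this example, so the printed figure is not a reproduction of the reported numbers.

```python
def block_macs(num_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulates for one standard transformer block."""
    projections = 4 * num_tokens * dim * dim       # Q, K, V and output projections
    attention = 2 * num_tokens * num_tokens * dim  # QK^T scores plus weighted sum over V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim   # two feed-forward linear layers
    return projections + attention + mlp


def hourglass_saving(full_tokens, kept_tokens, dim, depth, prune_after):
    """Fraction of block cost saved when only kept_tokens remain after prune_after blocks."""
    baseline = depth * block_macs(full_tokens, dim)
    pruned = (prune_after * block_macs(full_tokens, dim)
              + (depth - prune_after) * block_macs(kept_tokens, dim))
    return 1.0 - pruned / baseline


# Hypothetical configuration chosen only for illustration.
saving = hourglass_saving(full_tokens=240, kept_tokens=80, dim=256, depth=8, prune_after=2)
print(f"estimated saving: {saving:.1%}")
```

Under this model the saving grows when tokens are pruned at an earlier block or when fewer representative tokens are kept, which are the two hyperparameters the ablation studies vary.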
Quotes
"Our method reveals that maintaining the full pose sequence is unnecessary, and using a few pose tokens of representative frames can achieve both high efficiency and estimation accuracy."