Efficiently model global and local relationships in vision transformers through Hierarchical Multi-Head Self-Attention.