The paper introduces a two-stage approach to address the challenges of deploying deep learning-based facial landmark detection models on embedded systems with limited computational resources.
In the first stage, the authors train a Swin Transformer (SwinV2) as the teacher model using a combination of the L_AAM and L_STAR loss functions, achieving a promising score of 18.08.
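The summary does not give the exact formulation of the two teacher loss terms. A minimal sketch, assuming `aam_loss` and `star_loss` are callables that return scalar tensors and that the terms are combined as a weighted sum (the weight `lambda_star` is a hypothetical hyperparameter, not taken from the paper):

```python
def teacher_loss(pred_heatmaps, gt_heatmaps, pred_coords, gt_coords,
                 aam_loss, star_loss, lambda_star=1.0):
    """Combine the two teacher loss terms as a weighted sum.

    aam_loss and star_loss are assumed to be callables returning scalar
    tensors; the weighting scheme here is illustrative only.
    """
    l_aam = aam_loss(pred_heatmaps, gt_heatmaps)   # heatmap/attention term
    l_star = star_loss(pred_coords, gt_coords)     # coordinate regression term
    return l_aam + lambda_star * l_star
```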
In the second stage, the authors distill knowledge from the teacher model into a more lightweight MobileViT-v2 student model, which achieves a score of 15.75.
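As a rough sketch of this second stage (the function names, the assumption that both models return a prediction/feature pair, and the weighting factor `alpha` are illustrative assumptions, not the authors' exact setup), the teacher is frozen and the student is trained on a task loss plus a distillation term:

```python
import torch

def distillation_step(teacher, student, images, targets,
                      task_loss, kd_loss, optimizer, alpha=0.5):
    """One training step of the student against a frozen teacher.

    task_loss compares student outputs with ground truth; kd_loss compares
    student and teacher intermediate features (see the L_KD sketch below).
    """
    teacher.eval()
    with torch.no_grad():
        t_out, t_feat = teacher(images)      # teacher predictions and features
    s_out, s_feat = student(images)          # student predictions and features
    loss = task_loss(s_out, targets) + alpha * kd_loss(s_feat, t_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```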
The authors employ heatmap-based methods for superior accuracy and use the Anisotropic Attention Module (AAM) to enhance heatmap precision. They also design a straightforward knowledge distillation loss (L_KD) to efficiently transfer primary features from the teacher model to the student model.
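The summary only describes L_KD as a simple feature-transfer loss. A minimal sketch, assuming it matches student and teacher feature maps with an MSE after a 1x1 projection to align channel counts (the projection and resizing are common workarounds for mismatched backbones, not necessarily the authors' design):

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureKDLoss(nn.Module):
    """MSE between projected student features and detached teacher features."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 conv maps student channels onto the teacher's channel count
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        s = self.proj(student_feat)
        # resize spatially if the two backbones use different strides
        if s.shape[-2:] != teacher_feat.shape[-2:]:
            s = F.interpolate(s, size=teacher_feat.shape[-2:],
                              mode="bilinear", align_corners=False)
        return F.mse_loss(s, teacher_feat.detach())
```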
Experimental results on the validation dataset demonstrate that the proposed MobileViT-v2-0.5 student model outperforms other transformer-based and CNN-based models in terms of complexity, model size, speed, power, and accuracy. The authors also provide details on the student model architecture and the modifications made to ensure compatibility with tflite-runtime versions up to 2.11.0.
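On the deployment side, a minimal sketch of running the exported student model with tflite-runtime 2.11.0 (the model filename and input shape are placeholders, not from the paper):

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter

# Load the converted student model (path is a placeholder).
interpreter = Interpreter(model_path="mobilevit_v2_050_student.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape, e.g. (1, 256, 256, 3).
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()

heatmaps = interpreter.get_tensor(output_details[0]["index"])
print("output shape:", heatmaps.shape)
```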
Key insights extracted from the paper by Zong-Wei Hon... at arxiv.org, 04-10-2024: https://arxiv.org/pdf/2404.06029.pdf