
Efficient Multi-Resolution Video Object Detection on Ultra-Low-Power Embedded Systems


Core Concepts
This paper introduces Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a novel video object detection framework that reduces the average compute load of deep neural networks by alternating the processing of high-resolution and multiple down-sized frames, while maintaining accuracy through temporal correlation and a novel probabilistic rescoring algorithm.
Abstract
The paper proposes the MR2-ByteTrack framework for efficient video object detection on ultra-low-power embedded systems. The key highlights are:

- MR2-ByteTrack combines an off-the-shelf deep neural network (DNN) object detector with the ByteTrack Kalman-filter-based tracker and a novel Rescore algorithm that improves classification accuracy over time.
- The method reduces the average compute load by alternating the processing of high-resolution (320x320 pixel) frames with multiple down-sized (192x192 pixel) frames, trading some accuracy for lower computational cost.
- The Rescore algorithm correlates the output detections over time and corrects potential misclassifications, mitigating the accuracy degradation caused by the reduced input size.
- Evaluated on the ImageNetVID dataset using only full-resolution frames, MR2-ByteTrack improves mean Average Precision (mAP) over baseline object detectors by up to 5.17% and the F1 score by 3.58%. When interleaving two low-resolution frames for each full-resolution frame, mAP still improves by 2.16% while average MAC computational cost drops by up to 43%.
- Deployed on the GAP9 microcontroller, MR2-ByteTrack achieves up to 1.76x lower inference latency than baseline object detectors, with no increase in the DNN's parameter footprint and only a modest increase in code size.
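The abstract does not reproduce the Rescore algorithm itself; as a rough illustration of the general idea it describes (correlating detections over time and correcting misclassifications along a track), a minimal sketch might accumulate per-class confidence for each track and relabel detections with the running consensus. The class names and the accumulation rule below are assumptions for illustration, not the paper's actual formulation.

```python
from collections import defaultdict

class TrackRescorer:
    """Illustrative sketch of temporal rescoring: accumulate per-class
    confidence along each track and relabel every new detection with the
    consensus class seen so far. Not the paper's exact Rescore algorithm."""

    def __init__(self):
        # track_id -> class label -> accumulated confidence
        self.scores = defaultdict(lambda: defaultdict(float))

    def rescore(self, track_id, cls, conf):
        """Add one detection's (class, confidence) to its track and return
        the consensus class plus its share of the accumulated evidence."""
        self.scores[track_id][cls] += conf
        hist = self.scores[track_id]
        best = max(hist, key=hist.get)
        return best, hist[best] / sum(hist.values())
```

For example, a track labeled "dog" on two confident full-resolution frames would keep that label even if a single low-resolution frame briefly misclassifies it as "cat", since the accumulated evidence still favors the earlier class.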
Stats
- NanoDet-Plus: 463 MMAC operations for a 320x320 pixel input, 167 MMAC for a 192x192 pixel input.
- YOLOX-Nano: 316 MMAC operations for a 320x320 pixel input, 114 MMAC for a 192x192 pixel input.
- EfficientDet-D0: 1440 MMAC operations for a 384x384 pixel input, 640 MMAC for a 256x256 pixel input.
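These per-frame costs are consistent with the roughly 43% average MAC reduction quoted in the abstract for the 1 full-resolution + 2 low-resolution interleaving schedule, as a quick back-of-the-envelope check shows:

```python
# Average per-frame MAC cost of the interleaving schedule (1 full-res frame
# followed by n_low low-res frames), using the MMAC figures listed above.
def avg_mmac(full, low, n_low=2):
    """Mean per-frame cost over one full-res + n_low low-res frames."""
    return (full + n_low * low) / (1 + n_low)

def reduction(full, low, n_low=2):
    """Relative MAC savings versus running every frame at full resolution."""
    return 1.0 - avg_mmac(full, low, n_low) / full

# NanoDet-Plus: (463 + 2*167)/3 ~= 266 MMAC, ~43% below 463 MMAC.
print(f"NanoDet-Plus: {avg_mmac(463, 167):.0f} MMAC avg, {reduction(463, 167):.0%} saved")
# YOLOX-Nano: (316 + 2*114)/3 ~= 181 MMAC, ~43% below 316 MMAC.
print(f"YOLOX-Nano:   {avg_mmac(316, 114):.0f} MMAC avg, {reduction(316, 114):.0%} saved")
```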
Quotes
"MR2-ByteTrack combines an off-the-shelf DNN-based object detector for multi-resolution inference, the ByteTrack Kalman-based tracker, and the Rescore method to refine category assignment of tracked frames, reducing misdetections and misclassifications."

"When deployed on a multi-core MCU, this method incurs no additional memory cost for parameter storage compared to a single-resolution inference scheme, as the same object detection DNN is applied to both full and low-resolution frames."

Deeper Inquiries

How could the MR2-ByteTrack framework be extended to handle dynamic adjustment of the resolution mix (full-res vs. low-res frames) based on the scene complexity or available computational resources?

To adjust the resolution mix dynamically, the MR2-ByteTrack framework could add a resolution-selection mechanism driven by scene complexity, estimated from factors such as the number of tracked objects, their sizes, and their motion. When the scene is complex, the scheduler would insert more full-resolution frames; when it is simple, more low-resolution frames would suffice.

A complementary runtime monitor could track the computational resources available (for example, processor load or energy budget) and shift the ratio toward low-resolution frames under pressure. Together, these two signals would keep compute utilization efficient while limiting the accuracy cost of down-sized inputs.
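Such a scheduler is not part of the published method; a hypothetical sketch of the idea, in which both the complexity heuristic and all thresholds are illustrative assumptions, might look like this:

```python
# Hypothetical dynamic resolution scheduler for an MR2-ByteTrack-style pipeline.
# The scene-complexity proxy and all threshold values are assumptions for
# illustration, not taken from the paper.
from dataclasses import dataclass, field

@dataclass
class ResolutionScheduler:
    low_ratio: int = 2          # low-res frames inserted per full-res frame
    max_low_ratio: int = 4      # never run more than this many low-res frames

    def scene_complexity(self, tracks):
        """Proxy: number of active tracks weighted by their mean speed."""
        if not tracks:
            return 0.0
        return len(tracks) * sum(t["speed"] for t in tracks) / len(tracks)

    def update(self, tracks, load):
        """Spend more full-res frames on complex scenes when compute headroom
        exists; fall back to more low-res frames on simple scenes or high load."""
        c = self.scene_complexity(tracks)
        if c > 10.0 and load < 0.7:
            self.low_ratio = max(0, self.low_ratio - 1)
        elif c < 3.0 or load > 0.9:
            self.low_ratio = min(self.max_low_ratio, self.low_ratio + 1)
        return self.low_ratio
```

Calling `update()` once per full-resolution frame would let the interleaving ratio drift between all-full-resolution (busy scene, idle CPU) and the maximum low-resolution ratio (static scene or saturated CPU).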

How could the MR2-ByteTrack framework be applied to other computer vision tasks beyond object detection, such as semantic segmentation or pose estimation, to enable efficient real-time processing on embedded systems?

The MR2-ByteTrack framework could be extended to other computer vision tasks by adapting its tracking and rescore components to task-appropriate prediction units.

For semantic segmentation, the framework could track and refine pixel-wise (or mask-level) predictions over time, much as it tracks and refines bounding-box detections: a region-level tracking mechanism paired with a rescore step that accumulates class evidence per region and considers spatial relationships between pixels could correct temporally inconsistent labels.

For pose estimation, keypoints or joints could be tracked across consecutive frames, with the rescore step smoothing or relabeling estimated poses using temporal context.

In both cases the multi-resolution interleaving carries over directly: full-resolution frames anchor the predictions, and cheaper low-resolution frames are corrected through temporal correlation, enabling efficient real-time processing of a wider range of computer vision tasks on embedded systems.

What other techniques could be explored to further improve the accuracy-throughput tradeoff of the MR2-ByteTrack approach, such as adaptive thresholding or model scaling?

To further enhance the accuracy-throughput tradeoff of the MR2-ByteTrack approach, several techniques could be explored:

- Adaptive thresholding: dynamically adjust the confidence thresholds for detections based on scene complexity or the tracker's confidence level, filtering out false positives and low-confidence detections.
- Model scaling: apply model pruning, quantization, or distillation to reduce the computational complexity of the DNN object detector, raising throughput without compromising accuracy.
- Temporal fusion: combine information from multiple frames to improve the robustness of object tracking and classification, leveraging temporal context in the video stream.
- Reinforcement learning: train the tracking and rescore processes to make decisions based on feedback from the environment, adapting the accuracy-throughput tradeoff at runtime.

Integrating these techniques into the MR2-ByteTrack framework could achieve a better balance between accuracy and throughput for video object detection on embedded systems.
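The adaptive-thresholding idea above can be sketched concretely. The heuristic below, which relaxes the detection threshold for objects with a long, stable track history, is a hypothetical example with made-up constants, not a mechanism from the paper:

```python
# Hypothetical adaptive confidence threshold: trust detections more when the
# tracker has confirmed the same object across many frames. The decay rate k
# and the floor value are illustrative assumptions.
def adaptive_threshold(base, track_hits, k=0.05, floor=0.2):
    """Lower the confidence threshold by k per confirmed track hit,
    never dropping below `floor`."""
    return max(floor, base - k * track_hits)

# A brand-new detection must clear the full base threshold; an object
# tracked for several frames is accepted at a lower confidence, which
# helps keep tracks alive on the cheaper low-resolution frames.
```

Such a rule would pair naturally with multi-resolution interleaving, since low-resolution frames tend to produce lower-confidence detections for objects that are nonetheless already well established in the tracker.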