Core Concepts
A scalable framework for crafting efficient, lightweight video object detection models using self-training and knowledge distillation, with a focus on camera clustering to improve model accuracy and reduce training complexity.
Abstract
The paper presents a scalable framework for video object detection that leverages self-training and knowledge distillation techniques. The key contributions are:
- Validation of active distillation in a multi-camera setup, including an analysis of the potential biases of model-based pseudo-annotations.
- Introduction of a novel camera-clustering approach based on model cross-performance, and an in-depth analysis of its impact on model accuracy (a clustering sketch follows this list).
- Release of a novel dataset and codebase to support further research.
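As a rough illustration of the cross-performance idea behind the clustering contribution, the sketch below turns a cross-performance matrix (entry (i, j) being the accuracy of the student specialized on camera i when evaluated on camera j) into a camera-to-camera distance and groups cameras with agglomerative clustering. The similarity-to-distance conversion and the scikit-learn call (assuming a recent scikit-learn version) are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of cross-performance-based camera clustering, assuming the
# cross-performance matrix has already been measured.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_cameras(cross, n_clusters):
    """Group cameras whose specialized student models transfer well to each other.

    cross[i, j] is the accuracy (e.g. mAP) of the student specialized on
    camera i when evaluated on held-out data from camera j.
    """
    similarity = (cross + cross.T) / 2.0            # symmetrize
    distance = 1.0 - similarity / similarity.max()  # map to a [0, 1] distance
    np.fill_diagonal(distance, 0.0)
    clustering = AgglomerativeClustering(
        n_clusters=n_clusters, metric="precomputed", linkage="average"
    )
    return clustering.fit_predict(distance)         # one cluster id per camera

# Toy usage: 6 cameras grouped into 3 clusters from a random matrix.
rng = np.random.default_rng(0)
print(cluster_cameras(rng.uniform(0.3, 0.9, size=(6, 6)), n_clusters=3))
```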
The framework operates as a two-level system, with local student models on camera nodes and a central teaching server. The teaching server collects images from the cameras, pseudo-labels them using a general-purpose teacher model, and trains specialized student models for groups of similar cameras identified by a clustering approach. The updated student models are then deployed back to their associated cameras.
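The sketch below outlines one teaching-server round under this two-level design. All helpers (`sample_frames`, `pseudo_label`, `finetune`, `deploy`) are hypothetical placeholders standing in for components the summary only describes in prose.

```python
# Minimal sketch of one teaching-server round: collect, pseudo-label,
# fine-tune per cluster, redeploy. Helper names are hypothetical.
def teaching_round(teacher, students, cameras, cluster_of, budget_per_stream):
    """Pseudo-label collected frames and update one student model per cluster."""
    # 1. Collect a fixed budget of frames from every camera node.
    frames = {cam: sample_frames(cam, budget_per_stream) for cam in cameras}

    # 2. Pseudo-label the frames with the general-purpose teacher model.
    labeled = {cam: pseudo_label(teacher, imgs) for cam, imgs in frames.items()}

    # 3. Aggregate pseudo-labeled data per cluster and fine-tune its student.
    for cluster_id, student in students.items():
        cluster_data = [example
                        for cam, examples in labeled.items()
                        if cluster_of[cam] == cluster_id
                        for example in examples]
        finetune(student, cluster_data)

    # 4. Deploy each updated student back to its associated cameras.
    for cam in cameras:
        deploy(students[cluster_of[cam]], cam)
```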
The paper explores the impact of various sampling strategies (e.g., Top-Confidence, Least-Confidence) on student model accuracy, and investigates the correlation between the complexity of the teacher model and its tendency to induce confirmation bias. It also analyzes the trade-offs involved in choosing the number of clusters, considering training complexity, model performance, and the balance between specificity and universality.
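As one concrete reading of the Top-Confidence and Least-Confidence strategies, the sketch below scores each frame by the mean confidence of the teacher's detections and keeps either the highest- or lowest-scoring frames up to a per-stream budget. Scoring a frame by its mean box confidence is an illustrative assumption; the paper may define the frame-level score differently.

```python
# Minimal sketch of confidence-based frame sampling for pseudo-labeling.
import numpy as np

def frame_score(detections):
    """Mean confidence of the teacher's detections for one frame."""
    scores = [det["score"] for det in detections]
    return float(np.mean(scores)) if scores else 0.0

def select_frames(frames, detections_per_frame, budget, strategy="top_confidence"):
    """Keep `budget` frames per stream for pseudo-labeling and student training."""
    scored = sorted(
        zip(frames, detections_per_frame),
        key=lambda pair: frame_score(pair[1]),
        reverse=(strategy == "top_confidence"),  # "least_confidence" sorts ascending
    )
    return [frame for frame, _ in scored[:budget]]
```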
The results demonstrate that the proposed camera clustering approach notably improves the accuracy of distilled models, outperforming both training a separate model for each camera and training a single universal model on the aggregated data from all cameras.
Stats
The number of training samples per stream (B) has a significant impact on model performance, with gains plateauing once B ≥ 1500.
Increasing the number of epochs benefits smaller cluster configurations (K = 3, 5, 9), enabling them to achieve comparable or superior performance to more universal configurations (K = 1, 2).
Compact fine-tuned student models can outperform a large general-purpose teacher model, demonstrating the effectiveness of distillation.
Quotes
"Our method facilitates localized consistent updates, which are crucial for maintaining local DNN performance in individual cameras and promoting cost-efficient training and scalability."
"Stream aggregation not only reduces the requirement for numerous student models but also improves their prediction accuracy compared with training a separate model for each stream with only its specific images."
"Employing multiple clusters leads to superior accuracy than training a universal student model with images from all streams, indicating that the proposed camera clustering approach can notably improve the accuracy of distilled models."