Core Concepts
Using audio to generate regions of interest in images can reduce computational load in face detection algorithms.
Abstract
Efficient face detection is crucial for natural human-robot interactions, but traditional methods incur heavy computational loads because large amounts of pixel data must be processed quickly. This paper proposes using audio to generate regions of interest in optical images, reducing the number of pixels processed by computer vision. By localizing a speech source through an attention mechanism, the proposed pipeline offers a trade-off between speed and accuracy. The system includes modules for voice activity detection, denoising, sound source localization, and ROI selection based on speech dominance scores. Experimental results show significant improvements in runtime and computational load compared to baseline methods.
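The abstract's pipeline (localize a speaker from audio, then crop the image to a region of interest before running face detection) can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the threshold-based VAD, the azimuth range, and the `roi_fraction` parameter are all assumptions made for the example.

```python
def voice_activity(frame_energy, threshold=0.01):
    """Hypothetical VAD stub: flag a frame as speech if its energy
    exceeds a fixed threshold (the paper uses a learned module)."""
    return frame_energy > threshold

def select_roi(azimuth_deg, image_width, image_height, roi_fraction=0.25):
    """Map a localized speech azimuth (assumed to span [-90, 90] degrees
    across the camera's field of view) to a horizontal image ROI.
    Returns (x0, y0, x1, y1); the crop keeps full height, narrowed width."""
    cx = int((azimuth_deg + 90.0) / 180.0 * image_width)  # azimuth -> pixel column
    half = int(image_width * roi_fraction / 2)
    x0 = max(0, cx - half)
    x1 = min(image_width, cx + half)
    return (x0, 0, x1, image_height)

# Example: a speaker localized 30 degrees to the right in a 640x480 frame
roi = select_roi(30.0, 640, 480)
```

Face detection would then run only on the cropped region, which is where the pixel (and FLOP) savings come from.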
Stats
Time complexity of convolution operations: O(K²HW)
Accuracy score of VAD module: 93.4%
Accuracy score of denoising module: 85.3%
Average distance reduction by denoising module: substantial, even at low SNR.
Reduction in FLOPs by proposed pipeline: Factor of 1.88 at 0 dB SNR.
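The stats above can be tied together with a quick back-of-the-envelope calculation. For a K×K kernel sliding over an H×W feature map, the multiply-accumulate count scales as O(K²HW); since that cost is linear in the pixel count, cropping the image to an ROI covering roughly 1/1.88 of the pixels yields roughly the reported 1.88× FLOP reduction. The channel counts and layer shape below are illustrative assumptions, not figures from the paper.

```python
def conv_flops(k, h, w, c_in, c_out):
    """Approximate multiply-accumulate count of one convolution layer:
    O(K^2 * H * W), scaled by input/output channel counts (sketch only)."""
    return k * k * h * w * c_in * c_out

# Hypothetical first layer: 3x3 kernel, 640x480 RGB input, 16 output channels
full_frame = conv_flops(3, 480, 640, 3, 16)

# Crop width to ~1/1.88 of the frame; FLOPs shrink by the same factor
roi_frame = conv_flops(3, 480, int(640 / 1.88), 3, 16)
```

Because the cost is linear in H and W, the FLOP reduction tracks the fraction of pixels the ROI discards, which is the mechanism behind the reported speedup.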
Quotes
"Audio signals have the advantage of being represented by fewer data points than optical images while providing information redundancy at the semantic and spatial levels."
"Our results show that the attention mechanism reduces the computational load and offers an interesting trade-off between speed and accuracy."
"The proposed pipeline has potential to improve speed while maintaining accuracy."