The paper presents a multimodal approach for detecting the state of pedestrian traffic lights (PTLs) from the perspective of an autonomous quadruped robot navigating urban environments. The system combines vision-based object detection and color analysis with audio feature extraction to classify PTLs as red or green.
The vision-based detection uses a YOLO object detector to localize PTL bounding boxes, followed by pixel counting in the HSV color space to determine the light state. The audio-based detection extracts Mel-Frequency Cepstral Coefficients (MFCCs) from the audio signal and classifies them with a Random Forest.
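A minimal Python sketch of the two single-modality pipelines, assuming OpenCV for the HSV color analysis and librosa for MFCC extraction; the HSV thresholds, the 13-coefficient MFCC setting with mean pooling, and the forest size are illustrative assumptions rather than values reported in the paper.

```python
import cv2
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def classify_ptl_color(bgr_crop):
    """Classify a YOLO-cropped PTL patch as 'red' or 'green' by HSV pixel counting."""
    hsv = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2HSV)
    # Red hue wraps around 0 in OpenCV's 0-179 hue scale, so two masks are combined.
    red_mask = cv2.inRange(hsv, (0, 80, 80), (10, 255, 255)) | \
               cv2.inRange(hsv, (170, 80, 80), (179, 255, 255))
    green_mask = cv2.inRange(hsv, (40, 80, 80), (90, 255, 255))
    red_px, green_px = cv2.countNonZero(red_mask), cv2.countNonZero(green_mask)
    return "red" if red_px >= green_px else "green"

def extract_mfcc_features(audio_path, n_mfcc=13):
    """Summarize an audio clip as the mean of its MFCC frames (assumed pooling)."""
    y, sr = librosa.load(audio_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)                                 # fixed-length feature vector

# Audio classifier: a Random Forest trained on MFCC feature vectors.
audio_clf = RandomForestClassifier(n_estimators=100, random_state=0)
# audio_clf.fit(train_features, train_labels)  # labels: 0 = red, 1 = green
```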
The authors propose two fusion strategies, feature-level fusion and decision-level fusion, to integrate the audio and visual inputs. Feature-level fusion concatenates the visual and audio feature vectors, while decision-level fusion combines the confidence scores of the individual modalities.
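A minimal sketch of the two fusion strategies, assuming the concatenated features also feed a Random Forest and that decision-level fusion is a weighted average of per-modality confidences; the classifier choice, the weights, and the helper names (`feature_level_fuse`, `decision_level_fuse`) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature-level fusion: concatenate the visual and audio feature vectors and
# train a single classifier on the joint representation.
def feature_level_fuse(visual_feat, audio_feat):
    return np.concatenate([visual_feat, audio_feat])

fusion_clf = RandomForestClassifier(n_estimators=100, random_state=0)
# fusion_clf.fit([feature_level_fuse(v, a) for v, a in train_pairs], train_labels)

# Decision-level fusion: combine per-modality confidence scores, here with a
# simple weighted average (the weighting scheme is an assumption).
def decision_level_fuse(vision_conf, audio_conf, w_vision=0.6, w_audio=0.4):
    """vision_conf / audio_conf: P(green) from each modality, in [0, 1]."""
    p_green = w_vision * vision_conf + w_audio * audio_conf
    return "green" if p_green >= 0.5 else "red"
```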
The system is evaluated on a dataset captured by the Unitree Go1 quadruped robot, including scenarios with varying degrees of visual occlusion and robot motion. The results demonstrate that the audio-visual fusion approaches significantly outperform single-modality solutions, especially in challenging conditions. The feature-level fusion method achieves an accuracy of over 98% when the robot is in motion, while the decision-level fusion performs best (97.4%) when the robot's view is occluded.
The authors also deploy the proposed system on the Unitree Go1 robot, enabling autonomous road crossing when a green light is detected. The system's fast inference (242 ms on average) and robust performance under diverse urban conditions highlight its potential for practical deployment in autonomous urban robotics.
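A heavily simplified sketch of the on-robot crossing logic, assuming a hypothetical `robot.cross()` command and a `predict_light` callable standing in for the fused classifier; the consecutive-confirmation check and the loop timing are assumptions, not design details reported in the paper.

```python
import time

def crossing_loop(robot, predict_light, get_frame, get_audio, confirm_n=3):
    """Cross only after several consecutive 'green' predictions, to avoid
    acting on a single misclassification (the confirmation count is an
    assumption). `robot`, `get_frame`, and `get_audio` are hypothetical hooks."""
    consecutive_green = 0
    while True:
        state = predict_light(get_frame(), get_audio())  # fused 'red'/'green' decision
        consecutive_green = consecutive_green + 1 if state == "green" else 0
        if consecutive_green >= confirm_n:
            robot.cross()        # hypothetical command to start walking across
            return
        time.sleep(0.25)         # roughly matches the ~242 ms average inference time
```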
Source: Sagar Gupta et al., arxiv.org, 05-01-2024, https://arxiv.org/pdf/2404.19281.pdf