Combining speech and visual information significantly improves co-speech gesture detection performance compared with using visual information alone.