Core Concepts
Leveraging the Bridge-Prompt framework, a prompt-based video encoder can effectively recognize surgical gestures, even in zero-shot scenarios involving gestures unseen during training.
Abstract
The paper presents a method for surgical gesture recognition using a prompt-based video encoder called Bridge-Prompt. The key highlights are:
Bridge-Prompt is a training protocol that fine-tunes a pre-trained vision-language model (CLIP) for gesture recognition in surgical videos. This allows the use of extensive outside video data and label metadata, as well as weakly supervised contrastive losses.
Experiments on the JIGSAWS and RARP-45 datasets show that the prompt-based video encoder outperforms standard encoders like 3DResNet and I3D in surgical gesture recognition tasks.
The prompt-based encoder displays strong zero-shot performance: it can recognize gestures that were not seen during the encoder training phase. This is crucial for surgical robotics, where the vocabulary of gestures is too large to learn purely from annotated data.
The authors find that the text descriptions of gesture labels do not significantly improve the performance of the prompt-based encoder, suggesting that the categorical information alone is sufficient.
The ability of the prompt-based encoder to generalize to unseen gestures and tasks makes it a valuable tool for surgical robotics applications, where the diversity of surgical procedures requires flexible and adaptable visual representations.
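The zero-shot recognition described above reduces to comparing a video clip embedding against text embeddings of candidate gesture prompts. The following is a minimal sketch of that matching step, not the paper's implementation: `encode_text` is a toy deterministic stand-in for a CLIP-style text encoder, and the video embedding is simulated rather than produced by a real video encoder.

```python
import hashlib
import numpy as np

EMBED_DIM = 512

def encode_text(prompt: str) -> np.ndarray:
    # Toy stand-in for a CLIP-style text encoder: a deterministic
    # pseudo-random unit vector seeded by the prompt text (not a real model).
    seed = int(hashlib.sha256(prompt.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(EMBED_DIM)
    return v / np.linalg.norm(v)

def classify_zero_shot(video_emb: np.ndarray, prompts: dict) -> str:
    # Cosine similarity (embeddings are unit-norm) against every candidate
    # gesture prompt; the most similar prompt gives the predicted gesture.
    sims = {name: float(video_emb @ encode_text(p))
            for name, p in prompts.items()}
    return max(sims, key=sims.get)

# Candidate gestures; in the zero-shot setting these may include gestures
# never seen during encoder fine-tuning. Prompt wording is illustrative.
prompts = {
    "suturing": "the person is performing suturing",
    "knot_tying": "the person is performing knot tying",
    "needle_passing": "the person is performing needle passing",
}

# Simulate a clip whose embedding lies near the "knot_tying" prompt.
rng = np.random.default_rng(42)
video_emb = encode_text(prompts["knot_tying"]) + 0.05 * rng.standard_normal(EMBED_DIM)
video_emb /= np.linalg.norm(video_emb)

pred = classify_zero_shot(video_emb, prompts)
```

Because classification only requires a text embedding for each candidate label, adding a new gesture means adding a new prompt, with no retraining of the encoder.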
Stats
"this video contains K actions in total"
"this is the ith action in the video"
"Firstly, the person is performing {Gesture 1 text description}"
"Secondly, the person is performing {Gesture 2 text description}"
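The quoted templates can be assembled programmatically from a list of gesture descriptions. This sketch follows the template wording quoted above; the function name and the ordinal-word tables are illustrative assumptions and cover up to five gestures.

```python
# Ordinal words for the semantic prompts ("Firstly, ...") and positional
# phrases ("the first action ..."); extend as needed for longer videos.
ORDINAL_WORDS = {1: "Firstly", 2: "Secondly", 3: "Thirdly",
                 4: "Fourthly", 5: "Fifthly"}
ORDINAL_POS = {1: "first", 2: "second", 3: "third",
               4: "fourth", 5: "fifth"}

def build_prompts(gesture_descriptions: list[str]):
    """Build count, position, and semantic prompts for a clip (sketch)."""
    k = len(gesture_descriptions)
    count_prompt = f"this video contains {k} actions in total"
    position_prompts = [
        f"this is the {ORDINAL_POS[i]} action in the video"
        for i in range(1, k + 1)
    ]
    semantic_prompts = [
        f"{ORDINAL_WORDS[i]}, the person is performing {desc}"
        for i, desc in enumerate(gesture_descriptions, start=1)
    ]
    return count_prompt, position_prompts, semantic_prompts

count, positions, semantics = build_prompts(["suturing", "knot tying"])
```

Each prompt string would then be passed through the text encoder, giving the targets for the weakly supervised contrastive losses mentioned above.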