Key Concepts
A novel self-supervised learning approach, Frame Interpolation Masked Autoencoder (FIMAE), achieves state-of-the-art performance in robust and efficient tracking of interventional devices like catheters in X-ray image sequences.
Summary
The paper presents a self-supervised learning approach, Frame Interpolation Masked Autoencoder (FIMAE), for learning spatio-temporal features from a large dataset of over 16 million interventional X-ray frames. The key highlights are:
The FIMAE pretraining strategy overcomes the limitations of previous masked image modeling approaches by learning fine inter-frame correspondences through a novel frame interpolation-based masking technique.
The pretrained spatio-temporal features are then used to build a lightweight Vision Transformer-based model for the downstream task of device tracking. This eliminates the need for complex multi-stage feature extraction and fusion modules used in prior work.
Comprehensive experiments demonstrate that the proposed approach achieves state-of-the-art accuracy, robustness, and inference speed for catheter tip tracking in coronary X-ray sequences. It outperforms highly optimized reference solutions by a significant margin, including a 66.31% reduction in maximum tracking error.
The model exhibits superior stability and consistency in performance across diverse scenarios, including angiography, fluoroscopy, and cases with additional device occlusions. It maintains a high tracking success rate of 97.95% at the frame level.
The results highlight the effectiveness of the self-supervised pretraining strategy in learning generalizable spatio-temporal features that can be efficiently leveraged for interventional image analytics tasks, eliminating the need for complex tracking-specific modules.
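The frame interpolation-based masking mentioned in the highlights can be sketched roughly as follows. This summary does not give the exact scheme, so the sketch rests on stated assumptions: even-indexed frames serve as lightly masked anchors, odd-indexed frames are treated as intermediate frames and hidden entirely, and the function name, parameters, and ratios are all hypothetical. It only illustrates why such a mask would force a model to reconstruct a frame from its temporal neighbours.

```python
import random


def frame_interpolation_mask(num_frames, patches_per_frame,
                             anchor_ratio=0.75, intermediate_ratio=1.0, seed=0):
    """Return one boolean mask per frame (True = patch hidden from the encoder).

    Assumption (not from the paper summary): even-indexed frames are anchors
    masked at `anchor_ratio`; odd-indexed frames are intermediate frames masked
    at `intermediate_ratio`, so they must be inferred from their neighbours.
    """
    rng = random.Random(seed)
    masks = []
    for t in range(num_frames):
        ratio = anchor_ratio if t % 2 == 0 else intermediate_ratio
        num_masked = round(ratio * patches_per_frame)
        # Pick which patch indices of this frame are hidden.
        hidden = set(rng.sample(range(patches_per_frame), num_masked))
        masks.append([p in hidden for p in range(patches_per_frame)])
    return masks


# Five frames of 16 patches each: anchors keep 4 visible patches,
# intermediate frames keep none.
masks = frame_interpolation_mask(num_frames=5, patches_per_frame=16)
visible_per_frame = [row.count(False) for row in masks]
```

With the default ratios, the anchors expose only a quarter of their patches and the intermediate frames expose nothing, so any reconstruction of an intermediate frame has to rely on correspondences with adjacent frames.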
Statistics
The maximum tracking error is reduced by 66.31% compared to reference solutions.
The proposed method achieves a tracking success score of 97.95% at the frame level.
The inference speed of the proposed method is 42 frames per second on a single Tesla V100 GPU.
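To put these figures in context, the short sketch below converts the reported 42 frames per second into a per-frame time budget and shows how a relative error reduction is computed. The helper names are hypothetical, and the two error values are placeholders chosen only so the arithmetic reproduces 66.31%; they are not measurements from the paper.

```python
def per_frame_latency_ms(fps):
    """Average time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def relative_reduction_percent(reference_error, new_error):
    """Percentage by which new_error is lower than reference_error."""
    return 100.0 * (reference_error - new_error) / reference_error


# 42 frames per second is the throughput reported in the summary.
latency_ms = per_frame_latency_ms(42)

# Placeholder errors in arbitrary units (NOT values from the paper),
# picked so the result matches the reported 66.31% reduction.
reduction = relative_reduction_percent(reference_error=100.0, new_error=33.69)
```

At 42 frames per second the model has a little under 24 ms per frame on average.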
Quotes
"The proposed data-driven approach achieves superior performance particularly in robustness and speed compared to the frequently used multi-modular approaches for device tracking."
"The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics."