Core Concepts
Multitrans, a multimodal fusion framework built on the Transformer architecture and the self-attention mechanism, can effectively combine non-contrast computed tomography (NCCT) images with discharge diagnosis reports to accurately predict functional outcomes of stroke treatment.
Abstract
The study proposes Multitrans, a multimodal detection architecture that uses a Transformer-based approach to combine NCCT images and discharge diagnosis reports and predict the functional outcomes of stroke treatment.
The key highlights and insights are:
Single-modal text classification performs significantly better than single-modal image classification, but the multimodal combination outperforms either single modality.
Although the Transformer model performs poorly on imaging data alone, combining it with clinical meta-diagnostic information lets the two modalities supply complementary information, which contributes to accurate prediction of stroke treatment effects.
Ablation experiments with different Transformer architectures (ViT, Swin Transformer, RoBERTa, ALBERT) found that the information extracted from images is limited; improving textual information extraction and fusing more modalities may be the focus of future work.
The self-attention mechanism used for fusion is a key component, and further research on more effective fusion methods and structures is needed.
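To make the idea of self-attention fusion concrete, here is a minimal NumPy sketch. It is not the Multitrans implementation: the function name, the token counts, the embedding size, the random projection matrices, and the mean pooling are all assumptions chosen for illustration. It only shows how image tokens and text tokens can be concatenated into one sequence so that every token attends to tokens from both modalities.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_fusion(image_tokens, text_tokens, w_q, w_k, w_v):
    """Hypothetical single-head fusion: joint self-attention over both modalities."""
    # Concatenate along the sequence axis: (n_img + n_txt, d).
    tokens = np.concatenate([image_tokens, text_tokens], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product attention; each row of `weights` sums to 1.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    fused = weights @ v
    # Mean pooling (an assumption) gives one fused vector for a classifier head.
    return fused.mean(axis=0), weights

# Illustrative random inputs; real inputs would come from image and text encoders.
rng = np.random.default_rng(0)
d = 8
image_tokens = rng.normal(size=(4, d))   # e.g. 4 image patch embeddings
text_tokens = rng.normal(size=(6, d))    # e.g. 6 report token embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

pooled, weights = self_attention_fusion(image_tokens, text_tokens, w_q, w_k, w_v)
print(pooled.shape, weights.shape)
```

In this sketch, the off-diagonal blocks of `weights` (image rows against text columns, and the reverse) are where cross-modal information is exchanged.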
Stats
Stroke refers to a range of disorders caused by occlusion or haemorrhage of blood vessels supplying the brain.
Acute ischaemic stroke is the most common type of stroke and is one of the leading causes of disability and death worldwide.
The study collected diagnostic results and data from 128 patients at Capital Medical University with acute ischaemic stroke caused by intracranial artery occlusion.
Of these patients, 42 received intra-arterial treatment and 86 received usual care.
Quotes
"The results show that the performance of single-modal text classification is significantly better than single-modal image classification, but the effect of multi-modal combination is better than any single modality."
"Although the Transformer model only performs worse on imaging data, when combined with clinical meta-diagnostic information, both can learn better complementary information and make good contributions to accurately predicting stroke treatment effects."