This paper presents a large-scale benchmarking study on multimodal recommendation systems, with a focus on evaluating the impact of different multimodal feature extractors. The authors first review the current state of multimodal datasets and feature extractors used in the literature, highlighting the lack of standardized and comprehensive benchmarking approaches.
To address this gap, the authors propose an end-to-end pipeline that combines two recent frameworks, Ducho and Elliot, to enable extensive benchmarking of multimodal recommender systems. Ducho is used for multimodal feature extraction, supporting a wide range of state-of-the-art models, while Elliot is used for training and evaluating the recommendation models.
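To make the two-stage structure of such a pipeline concrete, the following is a minimal conceptual sketch of the extraction-then-recommendation handoff. It is not Ducho's or Elliot's actual API; the file layout, function names, and the placeholder recommender step are assumptions for illustration only.

```python
# Conceptual sketch of a two-stage multimodal pipeline:
# stage 1 extracts and persists item embeddings, stage 2 feeds them to a recommender.
# NOT the real Ducho/Elliot interface; paths and names are hypothetical.
import numpy as np
from pathlib import Path

FEATURE_DIR = Path("features/visual")  # one .npy file per item, written by the extractor


def extraction_stage(item_images: dict[str, np.ndarray], extractor) -> None:
    """Stage 1: run a pretrained extractor over item images and persist embeddings."""
    FEATURE_DIR.mkdir(parents=True, exist_ok=True)
    for item_id, image in item_images.items():
        embedding = extractor(image)  # e.g. a 2048-d visual feature vector
        np.save(FEATURE_DIR / f"{item_id}.npy", embedding)


def recommendation_stage(interactions: list[tuple[str, str]]) -> dict[str, np.ndarray]:
    """Stage 2: load the precomputed embeddings and hand them to the recommender."""
    item_features = {path.stem: np.load(path) for path in FEATURE_DIR.glob("*.npy")}
    # A multimodal recommendation model would consume `interactions` plus
    # `item_features` here; training and evaluation are omitted in this sketch.
    return item_features
```

The key design point is the decoupling: because the recommender only sees serialized embeddings, the feature extractor can be swapped without touching the training and evaluation code, which is what makes large-scale benchmarking of extractor combinations tractable.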
The authors conduct experiments on five popular Amazon product datasets, evaluating 12 recommendation algorithms (6 classical and 6 multimodal approaches) with different multimodal feature extractor combinations. The results show that multimodal recommender systems significantly outperform classical approaches, and that the choice of multimodal feature extractor can substantially affect final recommendation performance.
Specifically, the authors find that multimodal-by-design feature extractors, such as CLIP, ALIGN, and AltCLIP, can provide substantial improvements over the commonly used ResNet50 and Sentence-BERT extractors. However, they also note that these advanced extractors may come at the cost of increased computational complexity, requiring careful hyperparameter tuning to reach a good performance-complexity trade-off.
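To illustrate the difference between the two extractor families, here is a minimal sketch contrasting a multimodal-by-design extractor (CLIP, which embeds images and text into a shared space) with the unimodal ResNet50 and Sentence-BERT pair. This is not the paper's Ducho configuration; the Hugging Face model IDs and the pooling choice for ResNet50 are assumptions for illustration.

```python
# Sketch: multimodal-by-design (CLIP) vs. unimodal (ResNet50 + Sentence-BERT) extractors.
# Model checkpoints below are illustrative choices, not those fixed by the paper.
import torch
from PIL import Image
from torchvision import models
from transformers import CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer

image = Image.open("item.jpg").convert("RGB")
text = "Wireless noise-cancelling headphones"

# --- Multimodal-by-design: CLIP projects both modalities into one shared space.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = proc(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = clip(**inputs)
clip_img_emb = out.image_embeds  # shape (1, 512), same space as the text embedding
clip_txt_emb = out.text_embeds   # shape (1, 512)

# --- Unimodal pair: ResNet50 for images, Sentence-BERT for text (separate spaces).
weights = models.ResNet50_Weights.IMAGENET1K_V2
resnet = models.resnet50(weights=weights)
resnet.fc = torch.nn.Identity()  # drop the classifier head, keep 2048-d pooled features
resnet.eval()
with torch.no_grad():
    resnet_img_emb = resnet(weights.transforms()(image).unsqueeze(0))  # shape (1, 2048)

sbert = SentenceTransformer("all-MiniLM-L6-v2")
sbert_txt_emb = sbert.encode([text])  # shape (1, 384), unrelated to the visual space
```

The sketch also hints at the complexity trade-off noted above: the multimodal-by-design models run a joint image-text encoder per item, whereas the unimodal pair is lighter but yields embeddings in unaligned spaces that the recommender must fuse itself.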
The authors make the code, datasets, and configurations used in this study publicly available, aiming to foster reproducibility and encourage further research in the area of multimodal recommendation.