Key Concepts
EchoSight is a retrieval-augmented vision-language system for knowledge-based visual question answering. It uses a two-stage search mechanism, visual-only retrieval followed by multimodal reranking, which significantly improves accuracy over existing VLMs.
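The two-stage search can be illustrated with a minimal sketch. This is not EchoSight's implementation: the cosine-similarity scoring, the function names, the dictionary keys (`id`, `image_vec`, `text_vec`), and the toy vectors are all assumptions made for illustration, and the real system may score candidates quite differently. It only shows the general pattern of a cheap visual-only search followed by a reranking pass over the surviving candidates.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(query_image_vec, query_multimodal_vec, knowledge_base, top_k=20):
    """Stage 1: visual-only search. Stage 2: multimodal rerank of the top_k hits.

    Each knowledge-base entry is a dict with hypothetical keys:
    'id', 'image_vec' (image embedding) and 'text_vec' (text embedding).
    """
    # Stage 1: rank every entry by image-to-image similarity only.
    by_visual = sorted(
        knowledge_base,
        key=lambda e: cosine(query_image_vec, e["image_vec"]),
        reverse=True,
    )
    candidates = by_visual[:top_k]

    # Stage 2: reorder only the candidates, using a query vector that is
    # assumed to encode both the image and the question.
    reranked = sorted(
        candidates,
        key=lambda e: cosine(query_multimodal_vec, e["text_vec"]),
        reverse=True,
    )
    return [e["id"] for e in reranked]


# Toy data (made up): A looks most like the query image, B matches the question best.
kb = [
    {"id": "A", "image_vec": [1.0, 0.0], "text_vec": [0.0, 1.0]},
    {"id": "B", "image_vec": [0.9, 0.1], "text_vec": [1.0, 0.0]},
    {"id": "C", "image_vec": [0.0, 1.0], "text_vec": [1.0, 0.0]},
]
print(retrieve_then_rerank([1.0, 0.0], [1.0, 0.0], kb, top_k=2))  # ['B', 'A']
```

In this toy case, entry C is dropped in stage 1 because it is visually dissimilar, and stage 2 promotes B above A. A candidate that stage 1 misses can never be recovered by reranking, which is why the retrieval cutoff matters.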
Statistics
EchoSight achieves an accuracy of 41.8% on Encyclopedic VQA.
EchoSight achieves an accuracy of 31.3% on InfoSeek.
With Eva-CLIP-8B as the visual backbone, retrieval achieves a Recall@20 of 48.8%.
Multimodal reranking improves Recall@1 from 13.3% to 36.5% on the Encyclopedic VQA (E-VQA) benchmark.
Multimodal reranking improves Recall@1 from 45.6% to 53.2% on the InfoSeek benchmark.
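For readers unfamiliar with the metric, Recall@k is commonly computed as the fraction of queries whose correct knowledge-base entry appears among the top k retrieved results. The sketch below assumes exactly one correct entry per query; the function name and the toy rankings are hypothetical and are not taken from the EchoSight evaluation.

```python
def recall_at_k(ranked_ids_per_query, gold_id_per_query, k):
    """Fraction of queries whose gold entry is within the top-k ranked ids."""
    if not gold_id_per_query:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for ranked, gold in zip(ranked_ids_per_query, gold_id_per_query)
        if gold in ranked[:k]
    )
    return hits / len(gold_id_per_query)


# Four toy queries (made-up data); the correct entry is "a" for each of them.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["b", "c", "a"],
]
gold = ["a", "a", "a", "a"]

print(recall_at_k(rankings, gold, 1))  # 0.25
print(recall_at_k(rankings, gold, 2))  # 0.5
print(recall_at_k(rankings, gold, 3))  # 1.0
```

Reranking changes only the order of the candidates, so it can raise Recall@1 while leaving recall at the candidate cutoff (for example Recall@20 when 20 candidates are reranked) unchanged.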