Core Concepts
Large Language Models (LLMs) enhance urban region profiling by integrating the text modality into visual representations.
Summary
The paper introduces UrbanCLIP, a framework that uses LLMs to generate textual descriptions of web-sourced urban imagery and integrates this text modality into visual representations. It addresses the lack of textual information in urban imagery and reports superior performance in predicting urban indicators.
Abstract:
Urban region profiling from web-sourced data is crucial for urban computing.
Introducing the text modality via LLMs enhances urban region profiling.
UrbanCLIP integrates textual knowledge into visual representations for improved performance.
Introduction:
High costs limit manual surveys as a way of gathering urban statistics.
Web-sourced data provides consistent updates and accessibility for machine learning models.
Methodology:
Text Generation: detailed location descriptions for each image generated using LLaMA-Adapter V2.
Single-modality Representation Learning: visual and textual representations encoded separately.
Cross-modality Representation Learning: modality alignment and interaction tasks detailed (see the alignment sketch after this section).
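A minimal PyTorch sketch of the modality-alignment step, assuming it follows CLIP-style image-text contrastive learning, as the framework's name suggests. The projection dimensions, temperature initialization, and the random features standing in for satellite-image and generated-text encodings are illustrative assumptions, not UrbanCLIP's published configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PairEncoder(nn.Module):
    """Projects image and text backbone features into a shared embedding space."""
    def __init__(self, img_dim=768, txt_dim=768, embed_dim=256):  # assumed sizes
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)  # applied after a visual backbone
        self.txt_proj = nn.Linear(txt_dim, embed_dim)  # applied after a text backbone
        # Learnable temperature, initialized to ln(1/0.07) as in CLIP-style training.
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, img_feats, txt_feats):
        # L2-normalize so the dot product below is cosine similarity.
        img = F.normalize(self.img_proj(img_feats), dim=-1)
        txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img, txt, self.logit_scale.exp()

def contrastive_loss(img, txt, scale):
    """Symmetric InfoNCE: matched image-text pairs lie on the diagonal."""
    logits = scale * img @ txt.t()                   # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: random features stand in for encoder outputs on 8 region pairs.
encoder = PairEncoder()
img_feats = torch.randn(8, 768)
txt_feats = torch.randn(8, 768)
img, txt, scale = encoder(img_feats, txt_feats)
print(contrastive_loss(img, txt, scale).item())

The symmetric cross-entropy pulls each region's image embedding toward its paired description and pushes it away from the other descriptions in the batch, which is how text knowledge is infused into the visual representation.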
Experiments:
Performance Comparison: UrbanCLIP outperforms baselines across all datasets and indicators.
Ablation Studies: Effectiveness of textual modality, refined text, and knowledge infusion demonstrated.
Stats
The results show that UrbanCLIP, the state-of-the-art method, outperforms the other baselines.