Leveraging YOLO-World and GPT-4V Large Multimodal Models for Zero-Shot Person Detection and Action Recognition in Drone Imagery
YOLO-World demonstrates good detection performance for persons in drone imagery, while GPT-4V struggles to classify actions accurately but delivers promising results in filtering out unwanted region proposals and in providing a general description of the scene.
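
The filtering role described above can be illustrated with a minimal sketch. It assumes the detector returns scored bounding boxes and that a second model answers a yes/no question for each box. The names `Proposal`, `filter_proposals`, `stub_verifier`, and `min_score` are hypothetical and introduced only for illustration. No real YOLO-World or GPT-4V API is called; the stub verifier uses a simple aspect-ratio rule as a placeholder for a multimodal model query and is not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Proposal:
    """A hypothetical region proposal from an open-vocabulary detector."""
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    score: float                    # detector confidence in [0, 1]


def filter_proposals(
    proposals: List[Proposal],
    is_person: Callable[[Proposal], bool],
    min_score: float = 0.1,
) -> List[Proposal]:
    """Keep proposals that pass a score threshold and a second-stage check.

    `is_person` stands in for a query to a multimodal model about the
    cropped region; here it is any callable returning True or False.
    """
    kept = []
    for p in proposals:
        if p.score < min_score:
            continue  # discard very low-confidence detections first
        if is_person(p):
            kept.append(p)
    return kept


def stub_verifier(p: Proposal) -> bool:
    """Placeholder verifier (assumption): accept boxes taller than wide."""
    x1, y1, x2, y2 = p.box
    return (y2 - y1) > (x2 - x1)


proposals = [
    Proposal(box=(10, 10, 30, 60), score=0.8),      # tall box, confident
    Proposal(box=(100, 100, 180, 120), score=0.6),  # wide box, rejected by stub
    Proposal(box=(5, 5, 10, 20), score=0.05),       # below score threshold
]
kept = filter_proposals(proposals, stub_verifier)
print(kept)
```

In a real pipeline, `stub_verifier` would be replaced by a function that crops the region from the image and asks the multimodal model whether it shows a person; keeping the verifier as a plain callable makes that swap straightforward.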