Core Concepts
Leveraging the contextual understanding capabilities of large language models to formulate appearance knowledge elements and integrate them with visual cues can significantly enhance pedestrian detection performance.
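To make the idea concrete, here is a minimal sketch of the integration step, assuming LLM-derived appearance embeddings are precomputed and fused into a detector's feature map with cross-attention. The module, dimensions, and names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AppearanceFusion(nn.Module):
    """Fuse LLM-derived appearance elements into a detector feature map.

    Illustrative only: the attention-based fusion, dimensions, and names
    are assumptions, not the paper's actual design.
    """
    def __init__(self, vis_dim=256, txt_dim=512, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(txt_dim, vis_dim)  # align text width to visual width
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_feats, app_elements):
        # vis_feats:    (B, C, H, W) backbone feature map
        # app_elements: (K, txt_dim) appearance knowledge embeddings
        b, c, h, w = vis_feats.shape
        q = vis_feats.flatten(2).transpose(1, 2)             # (B, H*W, C) queries
        kv = self.proj(app_elements).unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.attn(q, kv, kv)                      # visual cues attend to knowledge
        out = self.norm(q + fused)                           # residual + norm
        return out.transpose(1, 2).reshape(b, c, h, w)
```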
Abstract
The paper introduces a novel approach that harnesses the strength of large language models (LLMs) in understanding contextual appearance variations and transfers this knowledge into a pedestrian detection model.
Key highlights:
The authors establish a description corpus containing numerous narratives that describe various appearances of pedestrians and other instances (see the example descriptions below).
By feeding the description corpus through an LLM, they extract appearance knowledge sets containing representations of appearance variations.
They then perform a task-prompting process to obtain appearance elements, i.e., representative appearance knowledge guided toward the pedestrian detection task (a sketch of these two steps follows this list).
The obtained appearance elements are adaptable to various detection frameworks and can be integrated with visual cues to enhance pedestrian detection performance.
Comprehensive experiments with different pedestrian detection frameworks demonstrate the adaptability and effectiveness of the proposed method, which achieves state-of-the-art results with significant gains on the public CrowdHuman and WiderPedestrian benchmarks.
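The two middle steps of the pipeline, encoding the corpus into appearance knowledge and task-prompting it down to detection-relevant elements, might look roughly like the following. This is a hedged sketch under assumed choices (a frozen text encoder, cosine-similarity selection, softmax reweighting); none of these names come from the paper.

```python
import torch
import torch.nn.functional as F

def extract_knowledge(descriptions, text_encoder):
    """Encode each corpus narrative into an appearance-knowledge vector.

    `text_encoder` is a stand-in for whatever frozen LLM/text encoder
    produces a (D,) embedding per description.
    """
    with torch.no_grad():
        return torch.stack([text_encoder(d) for d in descriptions])   # (N, D)

def task_prompt_elements(knowledge, prompt_emb, num_elements=16, temperature=0.07):
    """Distill the knowledge set into task-guided appearance elements.

    knowledge:  (N, D) embeddings of corpus descriptions
    prompt_emb: (D,)   embedding of a task prompt such as
                "a photo of a pedestrian on the street"
    """
    sim = F.cosine_similarity(knowledge, prompt_emb.unsqueeze(0), dim=-1)  # (N,)
    top = sim.topk(num_elements).indices        # keep the most task-relevant entries
    weights = F.softmax(sim[top] / temperature, dim=0)
    return knowledge[top] * weights.unsqueeze(-1)    # (num_elements, D)
```

The resulting elements would then feed a fusion module like the one sketched under Core Concepts, which is what would make the approach adaptable across detectors.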
Example Descriptions from the Corpus
A low resolution rendering of a small person wearing a yellow jacket.
A cropped photo of a short girl wearing a yellow t-shirt.
A bright rendering of a big pedestrian wearing a red dress.
A good picture of a fat stroller wearing a red hat.
A close-up rendering of a horse.
Stock Photo: Horse grazing on field.
A photo of a lamp post.
A bad picture of the street lamp.
A rendering of the street sign in the scene.
A pixelated photo of a stop sign.
A bright rendering of a small guy playing a baseball.
A rendering of a short guy playing a tennis.
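For illustration, descriptions like the ones above could be produced by filling slot templates. The slot vocabularies below are guessed from the sample sentences and are not the paper's actual template set.

```python
import itertools
import random

# Slot vocabularies inferred from the sample sentences above (assumptions).
QUALITIES = ["A low resolution", "A cropped", "A bright", "A good", "A pixelated"]
MEDIA     = ["photo", "rendering", "picture"]
SUBJECTS  = ["small person", "short girl", "big pedestrian", "fat stroller"]
ATTIRE    = ["a yellow jacket", "a yellow t-shirt", "a red dress", "a red hat"]

def pedestrian_narratives(n=5, seed=0):
    """Sample n template-filled pedestrian descriptions."""
    rng = random.Random(seed)
    combos = list(itertools.product(QUALITIES, MEDIA, SUBJECTS, ATTIRE))
    rng.shuffle(combos)
    return [f"{q} {m} of a {s} wearing {a}." for q, m, s, a in combos[:n]]

print("\n".join(pedestrian_narratives()))
```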