Key Idea
A weakly-supervised approach to learn subject-aware image cropping from a large collection of professional stock photos, without requiring manual annotations beyond the existing stock image dataset.
Summary
The paper proposes GenCrop, a weakly-supervised method for learning subject-aware image cropping from a large collection of professional stock photos. The key challenge is that the stock images are already cropped and their original uncropped versions are unavailable, so there is no uncropped input to pair with each good crop.
The authors address this by combining the stock image dataset with a pre-trained text-to-image diffusion model. Each stock image serves as a pseudo-label for a good crop, and the diffusion model outpaints it, i.e., generates plausible content beyond its borders to produce an approximate uncropped version. This automatically yields a large dataset of uncropped-cropped image pairs for training a cropping model.
The cropping model is subject-aware: it takes both the input image and a subject mask as input, and it combines a CNN feature extractor, a transformer encoder, and a composition branch to predict the final crop.
The authors evaluate GenCrop on existing subject-aware cropping benchmarks and on new evaluation sets they created for several subject categories (humans, cats, dogs, birds, horses, and cars). GenCrop performs competitively with fully-supervised methods and outperforms comparable weakly-supervised baselines. Qualitative evaluation also shows that GenCrop makes fewer compositional errors than prior weakly-supervised approaches.
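Cropping benchmarks commonly score a predicted crop against a reference crop with intersection-over-union (IoU). This summary does not state which metrics GenCrop is evaluated with, so the following is only a generic sketch of that measure; the box convention (x0, y0, x1, y1) in pixels and the name `crop_iou` are assumptions.

```python
def crop_iou(a: tuple[float, float, float, float],
             b: tuple[float, float, float, float]) -> float:
    """Intersection-over-union of two crop boxes given as (x0, y0, x1, y1)."""
    # Overlap rectangle (zero area if the boxes do not intersect).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 10, 10)
    reference = (5, 0, 15, 10)
    print(round(crop_iou(predicted, reference), 3))  # → 0.333
```

A benchmark score of this kind would typically be the mean of `crop_iou` over all test images.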
The authors also explore extending GenCrop to allow conditional control over the crop aspect ratio and tightness, demonstrating the flexibility of their approach.
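One way to picture the conditional variant: each generated training pair already provides a target crop and a subject region, so an aspect ratio and a tightness value can be measured from it and supplied to the model as extra inputs, and at inference time they become user controls. The sketch only computes such values. The function name `crop_conditions` and the tightness definition (the fraction of the crop covered by the subject's bounding box) are assumptions for illustration, since the summary does not say how GenCrop defines or injects these conditions.

```python
def crop_conditions(crop: tuple[float, float, float, float],
                    subject: tuple[float, float, float, float]) -> tuple[float, float]:
    """Hypothetical conditioning signals for one (crop, subject) pair.

    Boxes are (x0, y0, x1, y1). Returns (aspect_ratio, tightness), where
    aspect_ratio = crop width / crop height and tightness = fraction of the
    crop area covered by the subject box (an assumed definition).
    """
    cw, ch = crop[2] - crop[0], crop[3] - crop[1]
    if cw <= 0 or ch <= 0:
        raise ValueError("crop box must have positive width and height")
    # Clip the subject box to the crop so tightness stays within [0, 1].
    sx0, sy0 = max(subject[0], crop[0]), max(subject[1], crop[1])
    sx1, sy1 = min(subject[2], crop[2]), min(subject[3], crop[3])
    subject_area = max(0, sx1 - sx0) * max(0, sy1 - sy0)
    return cw / ch, subject_area / (cw * ch)


if __name__ == "__main__":
    ratio, tightness = crop_conditions((0, 0, 400, 300), (100, 50, 300, 250))
    print(round(ratio, 3), round(tightness, 3))  # → 1.333 0.333
```

Under this assumed scheme, requesting a 16:9 crop or a looser framing would amount to passing a chosen ratio or a smaller tightness value instead of the measured ones.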
Statistics
"We filter for images that include an identifiable subject (e.g., person in portraiture; Fig. 2a). This is done with metadata tags first and then with an object detector (Ultralytics 2023)."
"We randomly downscale the image with bilinear interpolation and paste it into a surrounding 512×512 canvas to obtain an image x (Fig. 2c)."
"We also compute a binary mask m with 1's in the area corresponding to valid pixels."
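The canvas construction described in these excerpts can be sketched in plain Python. It is a geometry-only sketch under stated assumptions: the scale range, the uniform random placement, and the helper name `make_training_canvas` are hypothetical, and the bilinear resampling and the diffusion outpainting step are omitted. What it illustrates is that the stock image is pasted at a known position, so the crop box that recovers it is known exactly, while the mask m marks the valid pixels and the rest of the canvas is what the outpainting model fills in.

```python
import random
from dataclasses import dataclass

CANVAS = 512  # canvas side length, as stated in the excerpt (512x512)


@dataclass
class CanvasSample:
    scale: float                    # downscale factor applied to the stock image
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1) of the pasted image = target crop
    mask: list[list[int]]           # m: 1 = valid pasted pixel, 0 = area left for outpainting


def make_training_canvas(width: int, height: int, rng: random.Random,
                         min_frac: float = 0.4, max_frac: float = 0.8) -> CanvasSample:
    """Hypothetical helper: place a width x height stock image on the canvas.

    Only the geometry is computed; the bilinear resize and pixel copy are omitted.
    The [min_frac, max_frac] range for the longer side is an assumed choice.
    """
    # Randomly downscale (never upscale) so the longer side covers part of the canvas.
    scale = min(1.0, rng.uniform(min_frac, max_frac) * CANVAS / max(width, height))
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Random top-left corner keeping the pasted image fully inside the canvas.
    x0 = rng.randint(0, CANVAS - new_w)
    y0 = rng.randint(0, CANVAS - new_h)
    box = (x0, y0, x0 + new_w, y0 + new_h)
    # Binary mask m with 1's over the valid (pasted) pixels.
    mask = [[1 if x0 <= x < x0 + new_w and y0 <= y < y0 + new_h else 0
             for x in range(CANVAS)]
            for y in range(CANVAS)]
    return CanvasSample(scale, box, mask)


if __name__ == "__main__":
    sample = make_training_canvas(1200, 800, random.Random(0))
    x0, y0, x1, y1 = sample.box
    print("target crop:", sample.box)
    print("valid pixels match box area:",
          sum(map(sum, sample.mask)) == (x1 - x0) * (y1 - y0))
```

In this construction the model input would be the outpainted canvas and the label would be `box`, which is why no manual crop annotation is needed.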
Quotes
"Our proposed method, GenCrop, addresses this challenge by combining a readily available dataset of stock images with powerful, pre-trained image generation models to synthesize the required inputs."
"The key advantage of GenCrop is that it is weakly-supervised, requiring no new manual crop or scoring annotations beyond access to the original professional image collection."