Action Reimagined presents ReimaginedAct, a method for text-to-pose video editing that predicts human action changes from open-ended textual inputs. The approach uses an LLM to answer open-ended questions about the desired action change, generates pose videos aligned with the source video, and employs a diffusion model for high-quality video generation. Experimental results show superior performance on action editing tasks compared to existing methods.
Large-scale diffusion-based text-to-video models struggle to manipulate human actions, and existing methods require reference videos or additional conditions while supporting only a limited range of actions, such as dancing and simple movement. ReimaginedAct instead aims to predict open-ended human action changes in videos from textual inputs alone.
ReimaginedAct also introduces WhatifVideo-1.0, a new evaluation dataset intended to facilitate research in text-to-pose video editing. The dataset covers scenarios of varying difficulty, each paired with questions and text prompts for evaluation.
The proposed method combines pose matching, Grounded-SAM for segmentation masks, and timestep attention blending to edit actions consistently while preserving fidelity to the original content. Results show it outperforms baselines on both accuracy and consistency metrics.
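This summary does not spell out how timestep attention blending operates; below is a minimal PyTorch sketch of one plausible form, in which the edited branch's attention is kept inside the actor mask and the source branch's attention outside it during early, high-noise denoising timesteps. The function name `blend_attention`, the threshold `t_blend`, and the blend direction are illustrative assumptions, and `mask` stands in for a Grounded-SAM segmentation mask.

```python
import torch

def blend_attention(src_attn: torch.Tensor,
                    edit_attn: torch.Tensor,
                    mask: torch.Tensor,
                    t: int,
                    t_blend: int = 500) -> torch.Tensor:
    """Hypothetical timestep attention blending.

    At early (noisy) timesteps t >= t_blend, keep the edited branch's
    attention inside the actor mask and the source branch's attention
    outside it, confining the action edit to the actor while the
    background stays faithful to the source video. At later timesteps
    the edited attention is used everywhere.
    """
    if t >= t_blend:
        return mask * edit_attn + (1.0 - mask) * src_attn
    return edit_attn

# Toy usage: one 16x16 spatial attention map per frame; the mask
# (a Grounded-SAM segmentation in the actual pipeline) covers the
# left half of each frame.
frames, h, w = 8, 16, 16
src_attn = torch.rand(frames, h, w)
edit_attn = torch.rand(frames, h, w)
mask = torch.zeros(frames, h, w)
mask[:, :, : w // 2] = 1.0

blended_early = blend_attention(src_attn, edit_attn, mask, t=800)  # masked blend
blended_late = blend_attention(src_attn, edit_attn, mask, t=200)   # pure edit
```

The mask-gated blend is what ties the two stated goals together: the edit is spatially confined to the actor, while the timestep condition controls how long the source video's attention continues to anchor the background.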