Learning visual representations of editing components is crucial for video creation tasks, supported by a novel dataset and method.