ODGEN, a novel method leveraging fine-tuned diffusion models and object-wise conditioning, effectively generates high-quality, domain-specific synthetic images with bounding box annotations, significantly improving object detection model performance.
DiffLM is a novel framework that leverages variational autoencoders (VAEs) and diffusion models to improve the quality and controllability of synthetic data generated by large language models (LLMs) for various structured data formats.
The DEREC-SIMPRO framework improves the fidelity and evaluation of synthetic data in data clean rooms by addressing limitations of existing multi-table synthesizers and evaluation metrics.
SimGen leverages the strengths of both real-world data and driving simulators to generate diverse and controllable synthetic driving scenes, addressing limitations of previous methods reliant on static datasets.
Training language models specifically for data synthesis, rather than general question-answering, significantly improves the quality and effectiveness of the generated data, especially when prompt masking and training data size are carefully managed.
CK4Gen is a novel framework that leverages knowledge distillation from Cox Proportional Hazards (CoxPH) models to generate high-utility synthetic survival datasets in healthcare, addressing limitations of existing generative models like VAEs and GANs in preserving clinically relevant risk profiles.
SoftSRV, a novel soft prompting framework, leverages frozen large language models (LLMs) to generate targeted synthetic text sequences for fine-tuning smaller language models, outperforming traditional hard-prompting methods in terms of both downstream task performance and similarity to the target data distribution.
HS3F, a novel method for generating synthetic tabular data, surpasses the existing Forest Flow method in speed, handling of mixed data types, and robustness to changes in initial conditions.
ERGAN is a novel framework that leverages ensemble learning and recurrent GANs to generate synthetic residential load data that accurately reflects real-world patterns while preserving diversity and statistical properties.
Montessori-Instruct, a new data synthesis framework, enhances the training of large language models by optimizing the generation of synthetic training data to align with the specific learning preferences of student models, leading to significant performance improvements.