The DEREC-SIMPRO framework improves both the fidelity of synthetic data in data clean rooms and how that data is evaluated, addressing limitations of existing multi-table synthesizers and evaluation metrics.
SimGen leverages the strengths of both real-world data and driving simulators to generate diverse and controllable synthetic driving scenes, addressing limitations of previous methods reliant on static datasets.
Training language models specifically for data synthesis, rather than general question-answering, significantly improves the quality and effectiveness of the generated data, especially when carefully managing prompt masking and training data size.
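A minimal sketch of the prompt-masking idea mentioned above, assuming a Hugging Face-style tokenizer and causal LM: loss is computed only on the completion tokens, not the prompt tokens. The function name and constants are illustrative, not taken from the paper; -100 is simply the default ignore index of PyTorch's cross-entropy loss.

```python
# Illustrative sketch of prompt masking for synthesis-oriented fine-tuning.
# Assumes a Hugging Face-style tokenizer/causal LM; names are hypothetical.
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def build_masked_example(tokenizer, prompt: str, completion: str, max_len: int = 1024):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]
    completion_ids = completion_ids + [tokenizer.eos_token_id]

    input_ids = (prompt_ids + completion_ids)[:max_len]
    # Mask the prompt portion so gradients come only from the synthesized target.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + completion_ids)[:max_len]

    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }
```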
CK4Gen is a novel framework that leverages knowledge distillation from Cox Proportional Hazards (CoxPH) models to generate high-utility synthetic survival datasets in healthcare, addressing limitations of existing generative models like VAEs and GANs in preserving clinically relevant risk profiles.
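To make the CoxPH "teacher" concrete, here is a minimal sketch using the lifelines library: a CoxPH model is fit on a survival dataset and its per-patient partial hazards are turned into risk strata that a downstream generator could be conditioned on. The quartile-based stratification and dataset are illustrative assumptions, not CK4Gen's actual distillation procedure.

```python
# Minimal sketch: fit a CoxPH "teacher" and derive per-patient risk strata.
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()  # example survival dataset: duration "week", event "arrest"

cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")

# Partial hazards act as scalar risk scores for each patient.
risk = cph.predict_partial_hazard(df)

# Example conditioning signal for a generator: quartile-based risk strata.
df["risk_stratum"] = pd.qcut(risk, q=4, labels=["low", "mid-low", "mid-high", "high"])
print(df[["week", "arrest", "risk_stratum"]].head())
```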
SoftSRV, a novel soft prompting framework, leverages frozen large language models (LLMs) to generate targeted synthetic text sequences for fine-tuning smaller language models, outperforming traditional hard-prompting methods in terms of both downstream task performance and similarity to the target data distribution.
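For readers unfamiliar with soft prompting, the sketch below shows the general idea in PyTorch: a small set of trainable prompt vectors is prepended to the frozen LLM's input embeddings, and only those vectors receive gradient updates. Class and parameter names are illustrative and do not reflect SoftSRV's actual parameterization.

```python
# Minimal sketch of soft prompting with a frozen backbone (names are hypothetical).
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, num_virtual_tokens: int, embed_dim: int):
        super().__init__()
        # Trainable "virtual token" embeddings, initialized with small noise.
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, embed_dim)
        batch = token_embeddings.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeddings], dim=1)

# Usage sketch (backbone-specific names are assumptions):
# for p in frozen_lm.parameters():
#     p.requires_grad_(False)
# soft_prompt = SoftPrompt(num_virtual_tokens=20, embed_dim=frozen_lm.config.hidden_size)
# inputs_embeds = soft_prompt(frozen_lm.get_input_embeddings()(input_ids))
# outputs = frozen_lm(inputs_embeds=inputs_embeds, labels=padded_labels)
```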
HS3F, a novel method for generating synthetic tabular data, surpasses the existing Forest Flow method in generation speed, handling of mixed data types, and robustness to changes in initial conditions, making it a notable advance in synthetic tabular data generation.
This paper introduces ERGAN, a novel framework leveraging ensemble learning and recurrent GANs to generate synthetic residential load data that accurately reflects real-world patterns while preserving diversity and statistical properties.
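As a rough illustration of the recurrent GAN component, the sketch below pairs an LSTM generator that maps noise sequences to 24-hour load profiles with an LSTM discriminator that scores them. The dimensions, non-negativity constraint, and the "one GAN per customer cluster" ensemble note are assumptions for illustration, not ERGAN's exact architecture.

```python
# Minimal sketch of a recurrent GAN for daily load profiles (illustrative only).
import torch
import torch.nn as nn

class LoadGenerator(nn.Module):
    def __init__(self, noise_dim: int = 16, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(noise_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, 24, noise_dim) -> load profile: (batch, 24, 1)
        h, _ = self.lstm(z)
        return torch.relu(self.head(h))  # loads are non-negative

class LoadDiscriminator(nn.Module):
    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 24, 1) -> real/fake logit per profile
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])

# In an ensemble setup, one generator/discriminator pair could be trained per
# customer cluster and their outputs pooled to preserve diversity.
fake = LoadGenerator()(torch.randn(8, 24, 16))
score = LoadDiscriminator()(fake)
```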
Montessori-Instruct, a new data synthesis framework, enhances the training of large language models by optimizing the generation of synthetic training data to align with the specific learning preferences of student models, leading to significant performance improvements.
This paper introduces a novel framework that combines large language models (LLMs) with a retrieval-reasoning approach to generate synthetic clinical trials carrying binary success/failure labels. The synthetic trials can augment real datasets, improve model training for clinical trial outcome prediction, and accelerate clinical research while upholding patient privacy.
This paper introduces an algorithm for generating large, realistic synthetic datasets of power injections in electric power grids, addressing the challenge of limited access to real-world operational data for training machine learning models in the power systems domain.