An integrated data processing framework that automates data cleaning, deduplication, and quality evaluation to improve pretraining data for foundation models.
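
As a rough illustration of the kind of step such a framework automates, the sketch below performs exact deduplication by content hash plus a trivial length-based quality filter. The function name and threshold are hypothetical and do not reflect the framework's actual API; a minimal sketch only.

```python
import hashlib

def clean_and_deduplicate(docs, min_chars=200):
    """Toy cleaning pass: drop very short documents and exact duplicates.

    `min_chars` is an illustrative threshold, not one taken from the framework.
    """
    seen_hashes = set()
    kept = []
    for text in docs:
        text = text.strip()
        if len(text) < min_chars:  # crude quality filter
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # exact-duplicate check
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept

corpus = ["Some web page text ... " * 20, "Some web page text ... " * 20, "too short"]
print(len(clean_and_deduplicate(corpus)))  # -> 1 (one duplicate and one short doc removed)
```

Real pretraining pipelines typically add fuzzy (near-duplicate) matching and model-based quality scoring on top of exact checks like this.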
Dataverse is an open-source, user-friendly ETL (Extract, Transform, Load) pipeline designed to efficiently process and analyze massive datasets for large language model development.
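
A minimal sketch of the extract-transform-load pattern such a pipeline wraps, assuming hypothetical stage functions; Dataverse's real interface is configuration-driven and built for distributed execution, so treat this only as a conceptual outline.

```python
from typing import Callable, Iterable, List

def run_etl(extract: Callable[[], Iterable[str]],
            transforms: List[Callable[[str], str]],
            load: Callable[[Iterable[str]], None]) -> None:
    """Apply each transform to every extracted record, then hand the results to the loader."""
    processed = []
    for record in extract():
        for transform in transforms:
            record = transform(record)
        processed.append(record)
    load(processed)

# Hypothetical stages: in-memory extraction, whitespace normalization, printing as the "load" step.
run_etl(
    extract=lambda: ["  Hello   world \n", " LLM   pretraining  data "],
    transforms=[str.strip, lambda s: " ".join(s.split())],
    load=lambda recs: print(list(recs)),
)
```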
ShuffleBench is a new benchmark for evaluating the performance of stream processing frameworks on large-scale data shuffling operations.
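
For intuition, the kind of workload such a benchmark measures can be simulated by hash-partitioning keyed records across a fixed number of consumers and timing the redistribution. The record count, payload size, and partition count below are arbitrary, and ShuffleBench's actual workload generator and metrics are far more elaborate.

```python
import random
import string
import time
from collections import defaultdict

def simulate_shuffle(num_records=200_000, num_partitions=16, key_space=1_000):
    """Hash-partition keyed records into buckets and report throughput in records/s."""
    records = [(random.randrange(key_space),
                "".join(random.choices(string.ascii_letters, k=64)))
               for _ in range(num_records)]
    buckets = defaultdict(list)
    start = time.perf_counter()
    for key, payload in records:
        buckets[hash(key) % num_partitions].append(payload)  # the shuffle step
    elapsed = time.perf_counter() - start
    return num_records / elapsed, {p: len(v) for p, v in buckets.items()}

throughput, sizes = simulate_shuffle()
print(f"{throughput:,.0f} records/s across {len(sizes)} partitions")
```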
WanJuan-CC is a safe and high-quality English webtext dataset derived from Common Crawl, created through a meticulous process to ensure data safety and quality.
DynaWarp is an efficient indexing structure that offers significant storage savings and higher query throughput for large-scale log data processing.
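
DynaWarp's internal layout is not described here; as a generic point of comparison, the sketch below indexes log segments with one Bloom-style filter per segment, so a term query only scans segments whose filter might contain the term. The class name, sizes, and log format are all hypothetical and not DynaWarp's actual design.

```python
import hashlib

class SegmentBloomIndex:
    """Per-segment membership filter for log search (generic sketch, not DynaWarp's design)."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.segments = []  # list of (bitset, raw log lines)

    def _positions(self, term):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{term}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add_segment(self, lines):
        bits = bytearray(self.num_bits // 8)
        for line in lines:
            for token in line.split():
                for pos in self._positions(token):
                    bits[pos // 8] |= 1 << (pos % 8)
        self.segments.append((bits, lines))

    def query(self, term):
        hits = []
        for bits, lines in self.segments:
            if all(bits[p // 8] & (1 << (p % 8)) for p in self._positions(term)):
                hits.extend(l for l in lines if term in l)  # scan candidate segments only
        return hits

index = SegmentBloomIndex()
index.add_segment(["user=alice action=login", "user=bob action=logout"])
index.add_segment(["user=carol action=login"])
print(index.query("user=bob"))  # -> ['user=bob action=logout']
```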