Core Concepts
Datasets of code used to train large language models may contain license inconsistencies, highlighting the need for best practices in dataset creation and management.
Abstract
The study investigates the presence of code under strong copyleft licenses in datasets used to train large language models. It analyzes 53 models trained on file-level code, finds license inconsistencies, and recommends best practices for dataset creation and management. It also examines publicly available datasets, identifying overlaps with a strong copyleft dataset. Key insights include:
Growing interest in using permissively licensed code for training.
Presence of exact duplicates of strong copyleft-licensed code in various datasets.
Identification of ownership/copyright disclaimers in comments within code files.
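The study does not publish its detection code; the sketch below is a minimal, hypothetical illustration of how ownership/copyright disclaimers could be spotted in code comments, assuming simple keyword matching over comment lines (the patterns and the `find_disclaimer_comments` helper are illustrative assumptions, not the authors' method):

```python
import re

# Hypothetical keyword patterns; the study's actual detection method is unspecified.
DISCLAIMER_PATTERNS = [
    re.compile(r"copyright\s*\(c\)", re.IGNORECASE),
    re.compile(r"all rights reserved", re.IGNORECASE),
    re.compile(r"gnu general public license", re.IGNORECASE),
]

def find_disclaimer_comments(source: str, comment_prefix: str = "#") -> list:
    """Return comment lines that look like ownership/copyright disclaimers."""
    hits = []
    for line in source.splitlines():
        stripped = line.strip()
        # Only inspect comment lines, not code.
        if not stripped.startswith(comment_prefix):
            continue
        if any(p.search(stripped) for p in DISCLAIMER_PATTERNS):
            hits.append(stripped)
    return hits

example = "# Copyright (C) 2021 Example Corp.\n# All rights reserved.\nx = 1\n"
print(find_disclaimer_comments(example))
```

A real pipeline would also need per-language comment syntax handling and multi-line comment support; this sketch only covers single-line comments.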
Stats
Our analysis revealed that every dataset examined contained license inconsistencies despite being selected based on associated repository licenses.
Of the 514 million code files analyzed, we found 38 million exact duplicates of files in our strong copyleft dataset.
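The study's deduplication tooling is not reproduced here; one common way to find exact duplicates at this scale is content hashing, sketched below under the assumption that "exact duplicate" means byte-identical file contents (the function names and the SHA-256 choice are illustrative, not the authors' implementation):

```python
import hashlib

def file_fingerprint(content: bytes) -> str:
    """SHA-256 over raw file bytes; identical bytes yield identical fingerprints."""
    return hashlib.sha256(content).hexdigest()

def count_exact_duplicates(candidate_files, copyleft_hashes):
    """Count candidate files whose fingerprint appears in the copyleft hash set."""
    return sum(
        1
        for content in candidate_files
        if file_fingerprint(content) in copyleft_hashes
    )

# Tiny worked example: one of the two candidates matches the copyleft set.
copyleft = {file_fingerprint(b"int main(void) { return 0; }\n")}
candidates = [b"int main(void) { return 0; }\n", b"print('hello')\n"]
print(count_exact_duplicates(candidates, copyleft))  # 1
```

Hashing only catches byte-identical copies; near-duplicates (reformatted or lightly edited files) would need fuzzy techniques such as MinHash.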
Quotes
"Every dataset we examined contained license inconsistencies."
"Our recommendation is to prioritize the development and adoption of best practices for dataset creation and management."