Core Concepts
The paper introduces PreciseBugCollector, a bug-fix collection approach that addresses limitations of existing bug datasets by combining CVEs from NVD, bugs from OSS-Fuzz, and injection-based bugs.
Abstract
PreciseBugCollector collects bug data from several sources to create a diverse and extensive bug-fix dataset. It addresses challenges in bug dataset creation by providing precise bug types and execution information for over 1 million bugs across multiple programming languages.
The paper describes how bugs are collected from NVD and OSS-Fuzz and how additional bugs are generated through bug injection. It highlights the value of project-specific bugs in industrial settings and the need for diverse bug types in datasets.
Key points include the introduction of PreciseBugCollector, its two components (the bug tracker and the bug injector), its data extraction methods, a comparison with existing datasets, implications for industrial settings, and the evaluation questions the paper answers through detailed analysis.
The dataset comprises over 1 million bugs collected from thousands of open-source projects using different approaches to ensure precision and diversity in bug types. The focus is on creating a valuable resource for software maintenance tasks like bug detection, fault localization, and automated program repair.
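To make the idea of injection-based, project-specific bugs more concrete, here is a minimal sketch of the general technique. It is not the bug injector described in the paper, whose operators and implementation this summary does not specify. The class CompareFlipper, the function inject_bug, and the bug type label "relational_operator_swap" are all hypothetical names chosen for illustration. The sketch takes working code, swaps one `<` for `<=`, and records the buggy/fixed pair together with the known bug type, which is how injection can yield a precise label for each bug.

```python
import ast
from typing import Optional


class CompareFlipper(ast.NodeTransformer):
    """Hypothetical injector: turns the first `<` it reaches into `<=`."""

    def __init__(self) -> None:
        self.injected = False

    def visit_Compare(self, node: ast.Compare) -> ast.Compare:
        # Visit nested expressions first so they are also candidates.
        self.generic_visit(node)
        if not self.injected:
            for i, op in enumerate(node.ops):
                if isinstance(op, ast.Lt):
                    node.ops[i] = ast.LtE()
                    self.injected = True
                    break
        return node


def inject_bug(fixed_source: str) -> Optional[dict]:
    """Return a bug-fix record, or None if no injection site was found."""
    tree = ast.parse(fixed_source)
    injector = CompareFlipper()
    buggy_tree = injector.visit(tree)
    if not injector.injected:
        return None
    return {
        # The bug type is known exactly because we chose the mutation.
        "bug_type": "relational_operator_swap",
        "buggy": ast.unparse(buggy_tree),  # requires Python 3.9+
        "fixed": fixed_source,
    }


if __name__ == "__main__":
    original = "def has_room(count, limit):\n    return count < limit\n"
    record = inject_bug(original)
    print(record["bug_type"])
    print(record["buggy"])
```

Because the injector runs on a project's own source, the resulting buggy code keeps that project's identifiers and coding style, which is the property the paper values for industrial settings.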
Stats
To date, PreciseBugCollector comprises 1 057 818 bugs extracted from 2 968 open-source projects.
Of these bugs, 12 602 are sourced from NVD and OSS-Fuzz repositories while the remaining 1 045 216 are project-specific bugs generated by the bug injector.
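As a quick sanity check on these figures, the short script below confirms that the two sources add up to the stated total and computes each source's share. The counts are copied from the lines above. The variable names and the per-project average are illustrative additions, and the average says nothing about how bugs are actually distributed across projects.

```python
# Counts copied from the Stats section.
TOTAL_BUGS = 1_057_818
PROJECTS = 2_968
NVD_OSSFUZZ_BUGS = 12_602
INJECTED_BUGS = 1_045_216


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


components_sum = NVD_OSSFUZZ_BUGS + INJECTED_BUGS
tracked_share = share(NVD_OSSFUZZ_BUGS, TOTAL_BUGS)
injected_share = share(INJECTED_BUGS, TOTAL_BUGS)
bugs_per_project = TOTAL_BUGS / PROJECTS  # simple mean, illustration only

print(f"sum matches total: {components_sum == TOTAL_BUGS}")
print(f"NVD + OSS-Fuzz share: {tracked_share:.2f}%")
print(f"injected share: {injected_share:.2f}%")
print(f"mean bugs per project: {bugs_per_project:.1f}")
```

The two counts sum exactly to 1 057 818. Roughly 1.2% of the dataset comes from NVD and OSS-Fuzz and about 98.8% from the bug injector, with a mean of about 356 bugs per project.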
Quotes
"Addressing the industry challenge of imprecise bug-fix datasets requires both components to build deep learning models that can learn broadly and in-depth."
"Project-specific bugs hold significant value in industrial settings as they align with domain knowledge and coding styles employed in real-world projects."