Core Concepts
The DISL dataset provides a large and diverse collection of real-world Solidity smart contracts to aid in research, tool development, and machine learning tasks.
Abstract
The DISL dataset comprises 514,506 unique Solidity files from Ethereum mainnet.
It is larger and more recent than existing datasets, addressing the need for real-world smart contract data.
DISL serves as a valuable resource for developing machine learning systems and benchmarking software engineering tools.
The dataset is publicly available on Hugging Face for researchers and practitioners.
The processes for dataset collection, deduplication, and metadata inclusion are described in detail.
Applications of the DISL dataset include AI-based tool development and benchmarking of smart contract software engineering tools.
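To make the deduplication step mentioned above concrete, here is a minimal sketch of one common way to drop duplicate source files: hash each file after collapsing whitespace and keep only the first file seen for each hash. This is an illustrative assumption, not the procedure the DISL authors used, which this summary does not specify. The names `normalize` and `deduplicate` are hypothetical.

```python
import hashlib


def normalize(source: str) -> str:
    """Collapse runs of whitespace so trivially reformatted copies compare equal."""
    return " ".join(source.split())


def deduplicate(files):
    """Return the files with exact duplicates (after normalization) removed.

    Assumption: two files count as duplicates when their whitespace-normalized
    text is identical. The first occurrence is kept and input order is preserved.
    """
    seen = set()
    unique = []
    for source in files:
        digest = hashlib.sha256(normalize(source).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(source)
    return unique


if __name__ == "__main__":
    sample = ["contract A {}", "contract  A  {}\n", "contract B {}"]
    print(len(deduplicate(sample)))
```

Exact hashing only catches copies that differ in whitespace. Near-duplicates, such as contracts that differ in a comment or a constant, would need a similarity measure instead, at a higher computational cost.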
Stats
"DISL contains the source code for all smart contracts on Etherscan from the genesis block to January 15, 2024."
"DISL only includes the verified source code of smart contracts to ensure it comprises solely real contracts in use."
"After filtering we obtained 514,506 Solidity files, consolidating our decomposed dataset."