Core Concepts
The rise of open science has led to the accumulation of molecular dynamics (MD) simulation data in generalist data repositories. This data constitutes the "dark matter" of MD: it is technically accessible but not indexed, curated, or easily searchable. This work presents a strategy to index and analyze this publicly available MD data, both to shed light on current simulation practices and to enable better data sharing and reuse.
Abstract
The authors describe an original "Explore and Expand" (Ex2) strategy to index and analyze molecular dynamics (MD) simulation data deposited in generalist data repositories such as Zenodo, Figshare, and Open Science Framework.
Key highlights:
They indexed about 250,000 files and 2,000 datasets representing 14 TB of MD data, with a focus on files generated by the Gromacs MD software.
Analysis of the Gromacs-related files revealed insights into the types of molecular systems simulated, the simulation parameters used (temperature, thermostat, barostat, etc.), and the scale of the simulations in terms of system size and number of frames.
The authors found a large number of trajectory (.xtc) and topology (.itp, .top) files, which could be valuable resources for the community if properly indexed and annotated.
They propose guidelines for better sharing of MD simulation data, including avoiding zip archives, providing extensive metadata, and linking datasets to related research articles and software.
The authors also discuss strategies to improve the metadata of currently available MD data, such as extracting simulation parameters from the files and using natural language processing techniques.
To facilitate exploration of the collected data, the authors developed a prototype web application called "MDverse data explorer".
Overall, this work highlights the vast potential of the "dark matter" of MD data and calls for community efforts to improve data sharing practices and metadata to enable better reuse of these valuable resources.
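As an illustration of what extracting simulation parameters from the files could look like, here is a minimal Python sketch that reads the text of a Gromacs .mdp parameter file and pulls out the thermostat, barostat, and reference temperatures. It is not the authors' implementation. The function names parse_mdp and summarize are hypothetical, the example input is invented, and the sketch assumes the usual "key = value ; comment" layout, treats hyphens and underscores in keys as interchangeable, and falls back to "no" when a coupling keyword is absent.

```python
def parse_mdp(text):
    """Parse the text of a Gromacs .mdp file into a dict of raw string values.

    Assumption: one "key = value" pair per line, with ';' starting a comment.
    """
    params = {}
    for raw_line in text.splitlines():
        # Drop the comment part, then surrounding whitespace.
        line = raw_line.split(";", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Normalise keys so that "ref-t" and "ref_t" map to the same entry.
        key = key.strip().lower().replace("-", "_")
        params[key] = value.strip()
    return params


def summarize(params):
    """Extract the coupling settings discussed in the text (hypothetical helper)."""
    # ref_t holds one temperature per coupling group, separated by whitespace.
    temperatures = [float(t) for t in params.get("ref_t", "").split()]
    return {
        "thermostat": params.get("tcoupl", "no").lower(),
        "barostat": params.get("pcoupl", "no").lower(),
        "temperatures_K": temperatures,
    }


# Illustrative input only; not taken from the indexed datasets.
example = """
; illustrative parameter file
integrator = md
dt         = 0.002
tcoupl     = V-rescale
tc-grps    = Protein Non-Protein
ref-t      = 300 300
pcoupl     = Parrinello-Rahman ; pressure coupling
"""

summary = summarize(parse_mdp(example))
print(summary)
```

Keeping the raw values as strings in parse_mdp and converting only the fields of interest in summarize avoids guessing the type of every parameter, which matters when files come from many Gromacs versions.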
Stats
Simulated systems ranged in size from 2 coarse-grain particles to over 3 million atoms or particles.
Half of the analyzed .xtc trajectory files contained more than 10,000 frames.
The most common thermostat was V-rescale, often combined with the Parrinello-Rahman barostat.
The majority of simulations were performed at temperatures between 298 K and 310 K, with some simulations at higher temperatures up to 800 K.
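Statistics of this kind can be computed from per-file records once parameters have been extracted. The Python sketch below shows one possible way to do so. The record layout, the function name summarize_records, and all values in the toy list are assumptions made for illustration; none of the numbers come from the paper.

```python
from collections import Counter
from statistics import median

# Toy records, invented for illustration only (not data from the study).
records = [
    {"thermostat": "v-rescale", "barostat": "parrinello-rahman",
     "temperature_K": 300.0, "frames": 25001},
    {"thermostat": "v-rescale", "barostat": "parrinello-rahman",
     "temperature_K": 310.0, "frames": 10001},
    {"thermostat": "nose-hoover", "barostat": "parrinello-rahman",
     "temperature_K": 298.0, "frames": 501},
    {"thermostat": "v-rescale", "barostat": "berendsen",
     "temperature_K": 450.0, "frames": 2001},
]


def summarize_records(records, t_low=298.0, t_high=310.0):
    """Aggregate thermostat/barostat pairs, temperature range, and frame counts."""
    if not records:
        raise ValueError("at least one record is required")
    # Count how often each (thermostat, barostat) combination occurs.
    pairs = Counter((r["thermostat"], r["barostat"]) for r in records)
    # Number of simulations whose temperature lies in [t_low, t_high].
    in_range = sum(t_low <= r["temperature_K"] <= t_high for r in records)
    return {
        "most_common_pair": pairs.most_common(1)[0][0],
        "fraction_in_range": in_range / len(records),
        # By definition, half of the files have more frames than the median.
        "median_frames": median(r["frames"] for r in records),
    }


stats = summarize_records(records)
print(stats)
```

The median is the natural summary for a statement such as "half of the files contained more than N frames", since it is not skewed by a few very long trajectories.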
Quotes
"Storage is exceptionally cheap compared to the resources used to generate simulations data, and they represent a potential goldmine of information for researchers wanting to reanalyze them."
"We are qualifying this amount of scattered data as the dark matter of MD, and we believe it is essential to shed light onto this overlooked but high-potential volume of data."
"Globally, we indexed about 250,000 files and 2,000 datasets that represented 14 TB of data deposited between August 2012 and March 2023."