Core Concepts
A dictionary-based approach to efficiently compress SMILES-based datasets, enabling random access and readable output.
Summary
The paper proposes ZSMILES, a methodology to reduce the storage footprint of SMILES-based datasets used in extreme-scale virtual screening applications. ZSMILES employs a custom dictionary-based compression approach that leverages domain knowledge to achieve better compression ratios than state-of-the-art tools.
The key highlights and insights are:
- ZSMILES uses a preprocessing step to increase the reuse of ring enumerations in SMILES, improving the probability of finding common patterns.
- ZSMILES pre-populates the dictionary with the printable ASCII characters used in the SMILES format, avoiding the need to escape unknown patterns.
- The dictionary generation algorithm uses a greedy approach to select the most frequent substrings that provide the highest coverage of the input SMILES.
- ZSMILES compression and decompression algorithms are designed to maintain the separability of SMILES and enable random access to the compressed data.
- Experimental results show that ZSMILES reaches a compression ratio as low as 0.29 (compressed size relative to original size, so lower is better), outperforming state-of-the-art tools like FSST and SHOCO by up to 1.13x in similar scenarios.
- A CUDA-accelerated version of ZSMILES targeting NVIDIA GPUs demonstrates a potential speedup of 7x for compression and 2x for decompression compared to the serial C++ implementation.
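
The dictionary ideas listed above can be illustrated with a minimal Python sketch. It is not the ZSMILES implementation: the function names (`build_dictionary`, `compress`, `decompress`), the code range 128..255, the pattern length limits, and the scoring rule are all assumptions chosen for illustration, and the ring-enumeration preprocessing step is omitted. The sketch only shows three of the properties described: single printable characters act as their own codes (so nothing needs escaping), multi-character patterns are picked by a simple frequency-weighted ranking, and each SMILES is compressed on its own so that lines stay separable.

```python
from collections import Counter


def build_dictionary(smiles_list, max_entries=128, min_len=2, max_len=8):
    """Pick multi-character patterns to put in the dictionary.

    Simplified stand-in for a greedy selection: every substring is scored by
    occurrences * (length - 1), i.e. the characters saved if each occurrence
    were replaced by a one-character code. Overlaps are ignored (assumption).
    """
    counts = Counter()
    for s in smiles_list:
        for n in range(min_len, max_len + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    ranked = sorted(counts, key=lambda p: (-(counts[p] * (len(p) - 1)), p))
    # Codes U+0080..U+00FF are reserved for patterns (hypothetical layout).
    # Printable ASCII characters are implicitly "pre-populated": each one is
    # its own code, so an unseen character never needs an escape sequence.
    return {p: chr(128 + k) for k, p in enumerate(ranked[:max_entries])}


def compress(smiles, table, max_len=8):
    """Replace the longest matching pattern at each position with its code."""
    out = []
    i = 0
    while i < len(smiles):
        for n in range(min(max_len, len(smiles) - i), 1, -1):
            code = table.get(smiles[i:i + n])
            if code is not None:
                out.append(code)
                i += n
                break
        else:
            # No pattern matched here: emit the character itself.
            out.append(smiles[i])
            i += 1
    return "".join(out)


def decompress(compressed, table):
    """Expand pattern codes; any other character is copied unchanged."""
    inverse = {code: pattern for pattern, code in table.items()}
    return "".join(inverse.get(ch, ch) for ch in compressed)


if __name__ == "__main__":
    dataset = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "CCO", "CC(=O)O"]
    table = build_dictionary(dataset)
    for s in dataset:
        z = compress(s, table)
        print(len(s), len(z), decompress(z, table) == s)
```

Because no code in this sketch ever maps to a newline, compressed SMILES can be stored one per line, and any single line can be decompressed without reading the ones before it. Encoding the output as Latin-1 keeps every code at one byte.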
Statistics
The screening data of a virtual screening campaign on CINECA's Marconi100 was approximately 72 TB.