The paper proposes FastqZip, an improved reference-based genome sequence compression framework. Key highlights:
FastqZip uses a novel sequence matching procedure that can find a match between a read and the reference even when their Hamming distance is large, as long as their edit distance is small. This allows many previously unmatchable reads to be reconstructed from the reference sequence.
FastqZip employs read reordering and optional lossy quality score compression to further improve the compression ratio. Lossy quality score compression is achieved through bin-quantization or dominant bitmaps.
Comprehensive evaluations show that FastqZip outperforms state-of-the-art compression algorithms such as Genozip by around 10% in compression ratio, at the cost of an acceptable slowdown.
FastqZip scales better than existing algorithms when parallelized over many resources, as its architecture allows for high degrees of parallelism.
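To make the Hamming versus edit distance point concrete, here is a minimal sketch. It is not FastqZip's matching procedure. The sequences are made up for illustration, and the function names `hamming_distance` and `edit_distance` are hypothetical. It shows how a single deleted base shifts every later position, so a position-by-position comparison reports almost total disagreement while the edit distance stays at 2.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Illustrative data only (not taken from the paper).
reference = "ACGTACGTACGT"
# The read lost the 'C' at index 1; the sequencer then read one base further.
read = "AGTACGTACGTA"

print(hamming_distance(reference, read))  # 11: every position after the deletion is shifted
print(edit_distance(reference, read))     # 2: one deletion plus one trailing insertion
```

A matcher that only tolerates substitutions would reject this read, whereas one that tolerates a small number of insertions and deletions could still encode it against the reference.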
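The bin-quantization idea for lossy quality scores can be sketched as follows. This is an assumption-laden illustration, not FastqZip's actual scheme. The bin edges, the representative values, and the name `quantize_quality` are hypothetical, and the quality string is assumed to use the common Phred+33 encoding.

```python
import bisect

# Hypothetical bins: scores below 10, 10-19, 20-29, 30-39, and 40 or above.
BIN_EDGES = [10, 20, 30, 40]
# Hypothetical representative score emitted for each of the five bins.
BIN_VALUES = [5, 15, 25, 35, 40]


def quantize_quality(qual: str, offset: int = 33) -> str:
    """Map each Phred-encoded quality character to its bin's representative."""
    out = []
    for ch in qual:
        score = ord(ch) - offset
        bin_index = bisect.bisect_right(BIN_EDGES, score)
        out.append(chr(BIN_VALUES[bin_index] + offset))
    return "".join(out)


# Scores 0, 10, 20, 30, 40 map to representatives 5, 15, 25, 35, 40.
print(quantize_quality("!+5?I"))  # &0:DI
```

Collapsing the alphabet to a handful of symbols lowers the entropy of the quality stream, which is why a downstream entropy coder can compress it more tightly, at the price of losing the exact original scores.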
Source: arxiv.org