Core Concepts
Applying maximal exact match (MEM) based taxonomic classification to compressed genome representations, such as KATKA kernels and minimizer digests, can shrink the index substantially while keeping classification accuracy close to that of the uncompressed index.
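To make the notion of a MEM concrete, here is a minimal sketch that finds the maximal exact matches of a read against a reference by brute force. It is not the paper's method, which relies on augmented FM-indexes; the function names and the `min_len` parameter are hypothetical and exist only for illustration.

```python
def matching_statistics(pattern: str, text: str) -> list[int]:
    """For each start i, length of the longest prefix of pattern[i:] found in text."""
    stats = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time while it still occurs in text.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        stats.append(length)
    return stats


def maximal_exact_matches(pattern: str, text: str, min_len: int = 1) -> list[tuple[int, int]]:
    """Return (start, length) pairs of pattern substrings that occur in text
    and cannot be extended to the left or right and still occur in text."""
    stats = matching_statistics(pattern, text)
    mems = []
    for i, length in enumerate(stats):
        if length < min_len:
            continue
        # Right-maximal by construction; left-maximal unless the previous
        # start position already covers this match (stats[i-1] == length + 1).
        if i == 0 or stats[i - 1] <= length:
            mems.append((i, length))
    return mems


if __name__ == "__main__":
    # Toy reference and read, chosen only for illustration.
    print(maximal_exact_matches("ATTAGACA", "GATTACAT"))
    # [(0, 4), (4, 2), (5, 3)]  i.e. ATTA, GA, ACA
```

This brute-force version takes time quadratic or worse in the read length, which is why an index is needed for large genome collections.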
Abstract
The paper explores efficient methods for taxonomic classification of metagenomic sequences using maximal exact matches (MEMs).
Key highlights:
- Kraken, a popular taxonomic classifier, matches fixed-length k-mers, but recent research indicates that matching with MEMs can yield more accurate classifications.
- Finding MEMs efficiently is challenging, especially for large genome collections.
- The authors propose using compressed representations of the genome collection, such as KATKA kernels and minimizer digests, to build augmented FM-indexes that can approximate MEM tables.
- Experiments on a large dataset of bacterial genomes show that KATKA kernels of minimizer digests can achieve significant compression (up to 5x) while only slightly decreasing the true-positive classification rate (from 78.6% to 74.3%).
- The compressed indexes also maintain fast search times, sometimes even outperforming the index built on the full dataset.
- The authors conclude that KATKA kernels of minimizer digests can inherit the strengths of both compression techniques, providing a promising approach for taxonomic classification of metagenomic sequences.
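The minimizer digests mentioned in the highlights above can be illustrated with a short sketch. It assumes the common window-minimizer scheme with lexicographic ordering: slide a window over `w` consecutive k-mers, pick the smallest k-mer in each window, and record it whenever the chosen position changes. The parameters `k` and `w`, the ordering, and the function name are assumptions made for illustration; the summary does not state which values or ordering the authors use.

```python
def minimizer_digest(seq: str, k: int = 3, w: int = 4) -> list[str]:
    """Return the sequence of window minimizers of seq (assumed scheme:
    lexicographically smallest k-mer per window of w consecutive k-mers,
    leftmost on ties, emitted once per distinct position)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    digest = []
    last_pos = -1
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the first minimal element, so ties go to the leftmost k-mer.
        offset = min(range(w), key=lambda j: window[j])
        pos = start + offset
        if pos != last_pos:
            digest.append(kmers[pos])
            last_pos = pos
    return digest


if __name__ == "__main__":
    # Toy sequence, chosen only for illustration.
    print(minimizer_digest("GATTACAT"))  # ['ATT', 'ACA']
```

Because consecutive windows often share a minimizer, the digest is much shorter than the input, and similar genomes tend to produce similar digests. An index built over digests rather than raw sequences is therefore smaller, at the cost of only approximating the true MEMs.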
Stats
GATTACAT$AGATACAT$GATACAT$GATTAGAT$GATTAGATA$
ACTTAGCTGACGTTCCGGGTGTTTTTGGCCATCTTCTATAGATTTCCCAGAGACATACTAGGCGTGCTGAAGTTGTGACTCGCGGCCGTATTTCTAACG$
ACTTAGCTGACGTTCCGGGTGTTTTAGGCCATCTTCTATAGATTTCTCAGAGACATAGTAGGCGTGCTGAAGTTGTGACTCGCGGCCGTATTCCCTAACG$
ACTTAGCTGACGTTCCGGGTGTTTTAGGCCATCTTCTATAGTTTTCTCAGAGACATACTAGGCGTGCTGAAGTTGTCACGCGCGCCCGTATTTCCTAACG$
Quotes
"alternative approaches to traditional k-mer-based [lowest common ancestor] identification methods, such as those featured within KrakenHLL [4], Kallisto [3], and DUDes [21], will be required to maximize the benefit of longer reads coupled with ever-increasing reference sequence databases and improve sequence classification accuracy."
"limiting all analyses to a single choice of k causes other problems as well. First, some branches of the taxonomic tree are well studied and contain a large number of genome assemblies for diverse strains and species. Other branches are scientifically significant but harder to study, and contain only a few genomes. In the more richly sampled spaces, larger values of k will better allow for discrimination at deeper levels of the tree."