Core Concepts
Low-coverage whole genome sequencing combined with efficient imputation can provide valuable insights into the genetic factors underlying severe COVID-19 disease presentation and progression.
Abstract
This study explores the use of low-coverage whole genome sequencing (lcWGS) and imputation to characterize the genetic profiles of a cohort of severe COVID-19 patients. The researchers generated a dataset of 79 imputed variant call format (VCF) files using the GLIMPSE1 imputation tool; each file contains about 9.5 million single nucleotide variants on average.
The key findings of the study are:
Demographic and Genetic Characterization of the Cohort:
The patient cohort exhibited a right-skewed age distribution, with patients concentrated in the 45-64 age range, and included more male than female patients.
Principal component analysis revealed that most patients clustered within the European genetic ancestry group, with some individuals showing Admixed American or South Asian ancestry.
Hospital Stay and Intensive Care Unit (ICU) Admission Analysis:
The distribution of hospital stay durations was right-skewed: most patients required relatively short stays, but a subset experienced substantially longer ones.
Male patients exhibited greater variability in hospital stay durations, with some outliers requiring unusually long stays.
Approximately 25% of the cohort was admitted to the ICU, and a much larger proportion of male than female patients required ICU admission.
Comprehensive Clinical Phenotyping:
The researchers developed a specialized set of 28 standardized medical terms to characterize the clinical phenotypes of the severe COVID-19 patients.
The Pulmonary category, including pneumonia and ARDS, was the most prevalent, followed by Extra-Pulmonary, Coagulation, and Systemic phenotypes.
Correlation analysis revealed moderate associations between certain phenotypes, for example between neurological conditions and each of exanthema, myopathies, and bone marrow abnormalities.
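The kind of phenotype correlation analysis described above can be sketched as follows. This is a minimal illustration, not the study's method: it assumes each phenotype is recorded as a 0/1 flag per patient and uses the Pearson correlation (equivalent to the phi coefficient for binary data), whereas the summary does not state which coefficient the researchers used. The `phenotypes` table and the helper names are hypothetical toy values.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when a phenotype never varies
    return cov / sqrt(vx * vy)


def pairwise_correlations(table):
    """Correlation for every unordered pair of phenotype columns."""
    return {(a, b): pearson(table[a], table[b])
            for a, b in combinations(table, 2)}


# Hypothetical toy data: one 0/1 flag per patient for each phenotype.
phenotypes = {
    "neurological": [1, 0, 1, 0, 0, 1, 0, 0],
    "exanthema":    [1, 0, 1, 0, 0, 0, 0, 0],
    "pneumonia":    [1, 1, 1, 1, 0, 1, 1, 1],
}

for pair, r in pairwise_correlations(phenotypes).items():
    print(pair, round(r, 2))
```

With only 79 patients and some rare phenotypes, coefficients of this kind are sensitive to a handful of individuals, so they are best read as descriptive rather than confirmatory.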
Validation of Imputation Accuracy:
The researchers validated the imputation accuracy of the GLIMPSE1 algorithm using a high-coverage genome from an independent Iberian Population in Spain (IBS) individual, sequenced on both Illumina and MGI platforms.
The validation showed that GLIMPSE1 can accurately impute variants with minor allele frequencies as low as 2%, with an aggregate squared Pearson correlation of approximately 0.97 across all minor allele frequency bins.
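To make the accuracy metric concrete, the sketch below computes the squared Pearson correlation between high-coverage genotypes and imputed dosages, grouped by minor allele frequency bin. It is an illustration under stated assumptions rather than the study's pipeline: the record layout `(maf, true_genotype, imputed_dosage)`, the bin edges, the function names, and the toy values are all hypothetical.

```python
from collections import defaultdict


def r_squared(truth, imputed):
    """Squared Pearson correlation; None if it cannot be computed."""
    n = len(truth)
    if n < 2:
        return None
    mt, mi = sum(truth) / n, sum(imputed) / n
    cov = sum((t - mt) * (d - mi) for t, d in zip(truth, imputed))
    vt = sum((t - mt) ** 2 for t in truth)
    vi = sum((d - mi) ** 2 for d in imputed)
    if vt == 0 or vi == 0:
        return None  # no variation in one of the two series
    return cov * cov / (vt * vi)


def r2_by_maf_bin(variants, bin_edges):
    """Group (maf, truth, dosage) records into (lo, hi] bins, then score each bin."""
    bins = defaultdict(lambda: ([], []))
    for maf, truth, dosage in variants:
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo < maf <= hi:
                bins[(lo, hi)][0].append(truth)
                bins[(lo, hi)][1].append(dosage)
                break
    return {b: r_squared(t, d) for b, (t, d) in sorted(bins.items())}


# Hypothetical records: truth is the 0/1/2 alternate-allele count from the
# high-coverage genome, dosage is the imputed expected allele count.
variants = [
    (0.01, 0, 0.1), (0.01, 1, 0.4), (0.015, 0, 0.0),
    (0.03, 1, 0.9), (0.04, 0, 0.1), (0.05, 2, 1.8),
    (0.2, 0, 0.0), (0.3, 1, 1.0), (0.4, 2, 2.0), (0.5, 1, 1.0),
]

for maf_bin, r2 in r2_by_maf_bin(variants, [0.0, 0.02, 0.05, 0.5]).items():
    print(maf_bin, r2)
```

Binning by frequency matters because rare variants are harder to impute: a single pooled r^2 would be dominated by common variants and could hide poor accuracy in the low-frequency bins.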
The methods and findings presented in this study show that low-coverage whole genome sequencing combined with efficient imputation can help uncover the genetic determinants of severe COVID-19 outcomes. The dataset and the insights it yields can serve as resources for future genomic research on COVID-19 and other complex diseases.
Stats
Approximately 325 GB of FASTQ data, 156 GB of CRAM data, and 6 GB of VCF data were generated for the 79 severe COVID-19 patient samples.
The average number of high-confidence single nucleotide variants per VCF file was 9.49 million (95% CI: 9.37 million to 9.61 million).
The aggregate squared Pearson correlation (r^2) between high-coverage and imputed genotypes for the validation IBS001 genome was approximately 0.97 across all minor allele frequency bins.
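A confidence interval for a mean, like the one reported for the per-file variant counts, can be obtained as sketched below. The summary does not say how the interval was derived, so this assumes a normal approximation (mean plus or minus z times the standard error); the `mean_ci` helper and the `counts` values are hypothetical and are not the study's data.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(values, level=0.95):
    """Return (mean, lower, upper) using a normal approximation."""
    n = len(values)
    m = mean(values)
    se = stdev(values) / sqrt(n)               # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return m, m - z * se, m + z * se


# Hypothetical per-file SNV counts, for illustration only.
counts = [9.2e6, 9.6e6, 9.5e6, 9.4e6, 9.7e6, 9.3e6]

m, lo, hi = mean_ci(counts)
print(f"{m / 1e6:.2f} million (95% CI: {lo / 1e6:.2f} to {hi / 1e6:.2f} million)")
```

For a sample of 79 files the normal quantile and the Student t quantile differ only slightly, so either choice would give a similar interval.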
Quotes
"Despite continuous improvements in genotype imputation algorithms, lcWGS imputation remains underutilised as an economical alternative over higher-coverage sequencing."
"The validation of our imputation and filtering process shows that GLIMPSE1, with the 1000 Genomes Project Phase 3 as the reference panel, can be used to confidently impute variants with MAF up to approximately 2%."