Author: Nagpal, Sargun; Pal, Ridam; Ashima,; Tyagi, Ananya; Tripathi, Sadhana; Nagori, Aditya; Ahmad, Saad; Mishra, Hara Prasad; Kutum, Rintu; Sethi, Tavpritesh
Title: Genomic Surveillance of COVID-19 Variants with Language Models and Machine Learning
Cord-id: 6mp90ff2
Document date: 2021_8_26
ID: 6mp90ff2
Document: The global efforts to control COVID-19 are threatened by the rapid emergence of novel SARS-CoV-2 variants that may display undesirable characteristics such as immune escape or increased pathogenicity. Early prediction of emerging strains could be vital to pandemic preparedness but remains an open challenge. Here, we developed Strainflow to learn the latent dimensions of 0.9 million high-quality SARS-CoV-2 genome sequences, and used machine learning algorithms to predict upcoming caseloads of SARS-CoV-2. In the Strainflow model, SARS-CoV-2 genome sequences were treated as documents and codons as words to learn unsupervised codon embeddings (latent dimensions). We discovered that codon-level changes lead to a change in the entropy of the latent dimensions. We used a machine learning algorithm to find the most relevant latent dimensions of SARS-CoV-2 spike genes, called Dimensions of Concern (DoCs), and demonstrated their potential to provide a lead time for predicting new caseloads in several countries. The DoCs capture codons associated with global Variants of Concern (VOCs) and Variants of Interest (VOIs), and may be surveilled to predict country-specific emergence and spread of SARS-CoV-2 variants.

Highlights
- We developed Strainflow, a genomic surveillance model for SARS-CoV-2 genome sequences, in which sequences were treated as documents and codons as words to learn the codon context of 0.9 million spike genes using the skip-gram algorithm.
- Time series analysis of the information content (entropy) of the latent dimensions learned by Strainflow shows a leading relationship with the monthly COVID-19 cases for seven countries (including the USA, Japan, and India).
- Machine learning modeling of the entropy of the latent dimensions helped us develop an epidemiological early warning system for COVID-19 caseloads.
- The top codons associated with the most relevant latent dimensions (DoCs) were linked to SARS-CoV-2 variants, and these DoCs may be used as a surrogate to track the country-specific spread of the variants.
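
The "sequences as documents, codons as words" idea and the entropy of a latent dimension can be illustrated with a minimal Python sketch. This is not the Strainflow implementation: the function names (`codons`, `skipgram_pairs`, `shannon_entropy`), the window size, the fixed reading frame, and the histogram-based entropy estimate with a fixed number of bins are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math
from collections import Counter


def codons(seq):
    """Split a nucleotide sequence into non-overlapping codons ("words").

    Assumes the reading frame starts at position 0; trailing bases that
    do not fill a complete codon are dropped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def skipgram_pairs(words, window=2):
    """Return the (center, context) pairs a skip-gram model would train on.

    The window size is a hypothetical choice, not taken from the paper.
    """
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs


def shannon_entropy(values, bins=4):
    """Shannon entropy (in bits) of a histogram over one latent dimension.

    `values` holds that dimension's coordinate for each sequence in a
    time window (e.g. one month). Equal-width binning is an assumption.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # all values identical: no uncertainty
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

In practice, the pairs produced by `skipgram_pairs` would be fed to an embedding trainer, and `shannon_entropy` would be evaluated per latent dimension and per month to build the time series that is compared against caseloads.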