Author: Mullick, Baishali; Magar, Rishikesh; Jhunjhunwala, Aastha; Farimani, Amir Barati
                    Title: Understanding Mutation Hotspots for the SARS-CoV-2 Spike Protein Using Shannon Entropy and K-Means Clustering  Cord-id: rqnho47m  Document date: 2021_10_5
                    ID: rqnho47m
                    
Document: The SARS-CoV-2 virus, like many other viruses, has continually evolved, giving rise to new variants through mutations, most commonly substitutions and indels. In some cases these mutations give the virus a survival advantage, making the mutants dangerous. In general, laboratory investigation must be carried out to determine whether new variants have characteristics that make them more lethal and contagious, so complex and time-consuming analyses are required to establish the exact impact of a particular mutation. The time these analyses take makes it difficult to understand variants of concern, which limits the preventive action that can be taken against their rapid spread. In this analysis, we use a statistical measure, Shannon entropy, to identify the positions in the SARS-CoV-2 spike protein sequence that are most susceptible to mutation. We then use machine-learning-based clustering techniques to group known dangerous mutations by similarity in their properties. This work uses embeddings generated by a protein language model, ProtBERT, to identify mutations of a similar nature and to pick out regions of interest based on their proneness to change. Our entropy-based analysis successfully predicted fifteen hotspot regions, among which we were able to validate ten known variants of interest. As the SARS-CoV-2 situation evolves rapidly, we believe that the remaining nine mutational hotspots may contain variants that could emerge in the future. We believe this approach may help the research community devise therapeutics based on probable new mutation zones in the viral sequence and on resemblances in the properties of various mutations.
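The entropy step described in the abstract can be illustrated with a minimal sketch. It assumes the standard per-position Shannon entropy, H = -Σ p·log2(p), computed over the residues observed at each column of a set of aligned sequences. The function names, the toy sequences, and the choice to treat every character (including any gap symbol) as an ordinary residue are assumptions made for illustration; this is not the authors' pipeline, and the abstract does not specify how hotspot regions were delimited.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def positional_entropy(aligned_seqs):
    """Entropy at every position of equal-length, pre-aligned sequences."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all sequences must have the same aligned length")
    # zip(*seqs) transposes the alignment into per-position columns.
    return [column_entropy(col) for col in zip(*aligned_seqs)]


def top_positions(entropies, k):
    """1-based positions of the k highest-entropy columns (ties keep order)."""
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    return [i + 1 for i in ranked[:k]]


# Hypothetical toy alignment, not real spike protein data.
toy_alignment = ["MFVFL", "MFVFL", "MFAFL", "MYVFL"]
entropies = positional_entropy(toy_alignment)
hotspots = top_positions(entropies, 2)
print(hotspots)  # positions 2 and 3 vary; all others are fully conserved
```

A fully conserved column has entropy 0, and entropy grows as more distinct residues appear with more even frequencies, which is why high-entropy positions are natural candidates for mutation-prone sites.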
 