Author: Enrico Lavezzo; Michele Berselli; Ilaria Frasson; Rosalba Perrone; Giorgio Palù; Alessandra R. Brazzale; Sara N. Richter; Stefano Toppo
Title: G-quadruplex forming sequences in the genome of all known human viruses: a comprehensive guide
  • Document date: 2018_6_11
  • ID: c3lmmll6_42
    Document: In this line of thinking, we asked whether the observed number of G4s, and more precisely its statistical significance with respect to the two random assembling scenarios, is representative of a particular virus class. To answer this question, we checked whether each virus of the six classes considered, that is, dsDNA, ssDNA, dsRNA, ssRNA (+), ssRNA (-) and ssRNA (RT), can be assigned to its own class based on how significant its median G4 counts are. We chose a classifier built on multinomial logistic regression, as this method is both interpretable and robust to unbalanced group sizes, provided that each group is large enough. To keep very small groups out of the model fit, we excluded the hepatitis B virus, the only virus classified as dsDNA (RT), and the two unclassified Hepatitis delta and Hepatitis E viruses. Six features were used to classify the viruses, i.e. the six mid-P values (those calculated for GG-, GGG- and GGGG-islands, in both the positive and the negative strand), which qualify the G4 content of the real viral sequences. The values were multiplied by 1 or -1 depending on whether the median G4 count was over- or underrepresented. Since real and corresponding simulated sequences contain the same base or G-island composition, the classification model based on G4 content does not depend on the highly variable G/C content of the different virus classes but specifically reflects the presence or absence of G4s in each viral class. Furthermore, 34 viruses with no G4 count in all three G-island types in both the positive and negative strand, and with non-significant mid-P values at the 10% level, were excluded from the analysis. We reclassified every viral genome used in our assessment using the discriminant function obtained from a leave-one-out analysis. 
This latter technique allowed us to accurately estimate how our classifier performs without the need to split our data into a training and a test set.
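
The procedure described above (signed mid-P features, a multinomial logistic regression classifier, and leave-one-out reclassification) can be illustrated with a minimal sketch. This is not the authors' code. All function names are hypothetical, the sign convention (+1 for over-represented, -1 for under-represented) is an assumption, the model is fitted by plain gradient descent with a small L2 penalty chosen only for numerical convenience, and the toy data are synthetic, with three made-up classes instead of the six virus classes.

```python
import math

def signed_feature(mid_p, overrepresented):
    # Assumed convention: +mid_p if the median G4 count is over-represented,
    # -mid_p if it is under-represented.
    return mid_p if overrepresented else -mid_p

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(W, x):
    xb = list(x) + [1.0]  # append a constant 1.0 for the bias term
    return [sum(w * v for w, v in zip(row, xb)) for row in W]

def fit_multinomial_logreg(X, y, n_classes, lr=0.3, epochs=300, l2=1e-3):
    # Softmax (multinomial logistic) regression fitted by batch gradient descent.
    # Returns one weight row per class; the last entry of each row is the bias.
    n_cols = len(X[0]) + 1
    W = [[0.0] * n_cols for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_cols for _ in range(n_classes)]
        for xi, yi in zip(X, y):
            xb = list(xi) + [1.0]
            probs = softmax(class_scores(W, xi))
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == yi else 0.0)
                for j, v in enumerate(xb):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_cols):
                W[k][j] -= lr * (grad[k][j] / len(X) + l2 * W[k][j])
    return W

def predict(W, x):
    scores = class_scores(W, x)
    return max(range(len(W)), key=lambda k: scores[k])

def leave_one_out_accuracy(X, y, n_classes):
    # Refit the model once per sample, each time holding that sample out,
    # then check whether the held-out sample is assigned to its own class.
    hits = 0
    for i in range(len(X)):
        W = fit_multinomial_logreg(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_classes)
        if predict(W, X[i]) == y[i]:
            hits += 1
    return hits / len(X)

# Synthetic toy data (NOT from the paper): six signed mid-P values per "virus".
base = [
    [0.8, 0.7, 0.9, 0.8, 0.7, 0.9],
    [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
    [0.7, 0.9, 0.8, 0.7, 0.9, 0.8],
    [0.8, 0.8, 0.8, 0.8, 0.8, 0.8],
]
X, y = [], []
for row in base:
    X.append([signed_feature(p, True) for p in row])            # class 0: all over
    y.append(0)
    X.append([signed_feature(p, False) for p in row])           # class 1: all under
    y.append(1)
    X.append([signed_feature(p, j < 3) for j, p in enumerate(row)])  # class 2: mixed
    y.append(2)

accuracy = leave_one_out_accuracy(X, y, n_classes=3)
print("leave-one-out accuracy:", accuracy)
```

Because every sample is predicted by a model that never saw it, the resulting accuracy plays the role of a held-out estimate while still using all the data for fitting, which matters when some classes contain few viruses.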
