
Author: Zhengqiao Zhao; Bahrad A. Sokhansanj; Gail L. Rosen
Title: Characterizing geographical and temporal dynamics of novel coronavirus SARS-CoV-2 using informative subtype markers
  • Document date: 2020_4_9
  • ID: 9sk11214_2
    Document: $H = -\sum_{k \in L} p_k \log_2(p_k)$, where $L$ is the list of unique characters in all sequences and $p_k$ is the probability of character $k$. We estimated $p_k$ from the frequency of characters. We refer to characters in the preceding because, in addition to the bases A, C, G, and T, the sequences include additional characters representing gaps (-) and ambiguities, which are listed in 1 (see footnote 5).

    Sites N and - (representing a fully ambiguous site and a gap, respectively) are substantially less informative. Therefore, we further define a masked entropy: entropy calculated without considering sequences containing N and - at a given nucleotide position in the genome. That is, the masked entropy calculation ignores N and -. With the help of this masked entropy calculation, we can focus on truly informative positions, instead of positions at the start and end of the sequence in which there is substantial uncertainty due to artifacts in the sequencing process.

    Finally, high-entropy positions are selected by two criteria: 1) entropy > 0.5, and 2) the percentage of N and - is less than 25%. This yielded 17 distinct positions along the viral genome sequence. We then extract Informative [...] identified these two criteria. The left-hand side of the plot shows that there is a peak with entropy greater than 0.5, which we sought to retain. Looking to the right-hand side of the plot, setting the threshold to 0.25 will keep the peak on the left, which represents the most informative group of sites in the genome.

    Footnote 4: https://github.com/nextstrain/ncov/blob/master/data/metadata.tsv
    Footnote 5: The sequences are of cDNA derived from viral RNA, so there is a T substituting for the U that would appear in the viral RNA sequence.

    Error correction to resolve ambiguities in sequence data and remove spurious ISMs
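The masked entropy calculation and the two selection criteria described in the excerpt can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`column_entropy`, `masked_fraction`, `select_positions`) are hypothetical, the input is assumed to be a list of equal-length aligned sequences, and it is assumed that the entropy > 0.5 criterion is applied to the masked entropy.

```python
import math
from collections import Counter

# Characters ignored by the masked entropy: fully ambiguous site and gap.
MASKED = {"N", "-"}


def column_entropy(column, masked=False):
    """Shannon entropy in bits, H = -sum_k p_k log2(p_k), of one alignment column.

    p_k is estimated from character frequencies. With masked=True,
    N and - are dropped before the frequencies are computed.
    """
    chars = [c for c in column if not (masked and c in MASKED)]
    if not chars:
        return 0.0
    total = len(chars)
    counts = Counter(chars)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def masked_fraction(column):
    """Fraction of sequences with N or - at this position."""
    return sum(c in MASKED for c in column) / len(column)


def select_positions(aligned, min_entropy=0.5, max_masked=0.25):
    """Return 0-based positions passing both criteria (assumed reading):
    masked entropy > min_entropy and fraction of N and - < max_masked."""
    return [
        i
        for i, col in enumerate(zip(*aligned))
        if column_entropy(col, masked=True) > min_entropy
        and masked_fraction(col) < max_masked
    ]


# Toy alignment (made up for illustration, not real SARS-CoV-2 data).
toy_alignment = ["ACGT", "ACGA", "ANGA", "A-GT"]
selected = select_positions(toy_alignment)
```

In the toy alignment, only the last column varies among unambiguous bases, so it is the only position that passes both thresholds; the second column is excluded because half of its entries are N or -.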
