Author: Marie Hoffmann; Michael T. Monaghan; Knut Reinert
Title: PriSeT: Efficient De Novo Primer Discovery
  • Document date: 2020_4_7
  • ID: 3b3hv53b_79_0
    Document: bioRxiv preprint, doi: https://doi.org/10.1101/2020.04.06.027961. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.

    Table 8. De novo computation and statistical comparison with published primers, ranked by Coverage (left) or Variation (right). From both sets, the top-performing primer pair in terms of coverage or variation was picked. Frequency reflects the raw number of occurrences in the clade, whereas Coverage measures how many taxa of each clade were covered in proportion to the taxa with references, and Variation is the total number of amplicons. Note that PriSeT computed primer pairs with narrowed constraints; primer pairs like SSU, EUKA, etc., would not emerge in the result set (see Table 6). When ranking by coverage, PriSeT identifies at least one new primer pair with a higher coverage rate than the published primers for 11 out of 19 clades. When ranking by amplicon variation, PriSeT found at least one better or equally performant primer pair for seven out of 19 clades.

    We set the relative frequency cutoff to 5%. The number of k-mer locations therefore grows linearly with the library size. We have chosen not to show the performance on a synthetic data set, because k-mer frequencies and dropout rates are defined by inherent sequence properties (entropy, repeats, etc.), which are obviously not homogeneous over all clades. A single clade or synthetic data set would produce non-representative, and even misleading, results. For example, clade 6657 has a library that is 7 MB larger than that of clade 3041.
Surprisingly, clade 6657 produces only 4.83 million k-mers, compared to 43.9 million k-mers for clade 3041 (see Figure 7). For the largest data set (clade 451864 of Dikarya), we had to raise the frequency cutoff to 10% to avoid running into memory issues caused by the vast number of k-mers (see Discussion 5). We sorted the data sets by size (abscissa) and log2-scaled abscissa and ordinate for readability. The total runtime for the smallest data set, Perkinsidae (clade 27999 with 0.13 MB), is ≤ 1 second, for Fungi (clade 112252 with 10.21 MB) 33 seconds, and for a large data set like Nematoda (clade 6231 with 82.55 MB) 70 min. The frequency computation contributes the most to the total runtime. This is due to the large number of possible k-mers within a library. After the filter & transform step the number of k-mers is reduced drastically, such that the cost of the expensive combine step, whose runtime is in the order of the number of all k-mer positions times the window size, remains relatively low. The k-mer dropout rate during filter & transform and combine is highly dependent on the sequence structure within the clades, as indicated by the non-linearity of the filter & transform and combine runtimes in proportion to the original library size. Theoretical upper limits can be seen in Table 9. The frequency computation performs a single k-mer look-up in O(k), where k is the length of the k-mer to be looked up. Additionally, all occurrence locations need to be gathered, which depends on the number of occurrences occ and the total library length N. This can be done in O(k + occ) by exploiting the lexicographical ordering of the index (Ferragina & M
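
The idea of gathering all occurrences of a k-mer from a lexicographically ordered index can be illustrated with a small sketch. Note the swap: this is not the index cited above, but a plain suffix array searched by binary search, so the look-up costs O(k log N + occ) rather than O(k + occ). The function names `build_suffix_array` and `find_occurrences` are hypothetical and not part of PriSeT. What it shares with the text is only that all suffixes starting with the same k-mer form one contiguous block in the sorted order, so the occ locations can be read off in a single sweep.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes, sorted lexicographically.

    Quadratic-ish construction, acceptable only for a toy example.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, kmer):
    """Return all start positions of `kmer` in `text`, in increasing order."""
    k = len(kmer)

    # Lower bound: first suffix whose length-k prefix is >= kmer.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < kmer:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-k prefix is > kmer.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= kmer:
            lo = mid + 1
        else:
            hi = mid

    # All matches are contiguous in the suffix array: sa[start:lo].
    return sorted(sa[start:lo])


library = "ACGTACGTAC"  # toy "library", assumed for illustration
sa = build_suffix_array(library)
print(find_occurrences(library, sa, "ACG"))  # [0, 4]
```

Each binary-search step compares at most k characters, which is where the k log N term comes from; reading the contiguous block contributes the occ term.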

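The effect of the relative frequency cutoff (5%, raised to 10% for the largest data set) can be sketched as follows. The excerpt does not spell out how the relative frequency is defined, so this sketch assumes it is the fraction of reference sequences in a library that contain a given k-mer. The function name `frequent_kmers` and the toy sequences are hypothetical and not taken from PriSeT.

```python
from collections import Counter


def frequent_kmers(sequences, k, cutoff):
    """Keep k-mers that occur in at least `cutoff` (a fraction) of the sequences.

    Assumption: each k-mer is counted once per sequence, so the relative
    frequency is the share of references containing it.
    """
    counts = Counter()
    for seq in sequences:
        # A set removes repeats of the same k-mer within one sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    min_count = cutoff * len(sequences)
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


refs = ["ACGTAC", "ACGTTT", "GGGTAC", "ACGAAA"]  # toy references
print(frequent_kmers(refs, k=3, cutoff=0.5))
print(frequent_kmers(refs, k=3, cutoff=0.75))
```

Under this assumption, raising the cutoff can only shrink the set of surviving k-mers, which is consistent with the text's use of a higher cutoff to keep memory consumption manageable.
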