
Author: Jiao Chen; Jiayu Shang; Jianrong Wang; Yanni Sun
Title: A binning tool to reconstruct viral haplotypes from assembled contigs
  • Document date: 2019_7_16
  • ID: 2basllfv_42
    Document: With the available HIV haplotypes, simulated reads were generated by ART-illumina (Huang et al., 2012) as error-containing MiSeq paired-end reads, with a read length of 250 bp, an average insert size of 600 bp, and a standard deviation of 150 bp. With a total coverage of 1000x, three sets of reads were produced using different abundance distributions. The first is based on the power law equation (Barbosa et al., 2012). The second and third sets of reads represent challenging cases where different haplotypes have similar abundances, which create difficulties for abundance-based binning algorithms. The abundance differences in the second and third data sets are 0.06 and 0.03, respectively. In total, there are 38,914 simulated reads for the 5 HIV haplotypes. The relative abundances of the five haplotypes in each read set can be found in Table 1. As the total coverage is 1000x, the sequencing coverage of each haplotype is the product of the total coverage and that haplotype's relative abundance.

    Contig simulation

    For each reference genome (denote its length as L), we randomly generated a list of location pairs (p_1, p_2), where 1 ≤ p_1 < p_2 ≤ L. Each location pair represents a candidate contig's starting and ending positions. Among the simulated contigs, we keep only those of at least 500 bp (i.e. p_2 − p_1 + 1 ≥ 500). In addition, we would like to simulate the hard case where the contigs cannot be extended any further using large overlaps. Thus, we sort all the remaining contigs by p_1 and remove those that overlap previous contigs in the sorted list by more than 100 bp. The five sets of simulated contigs have different N50 values and are referred to as "1000" to "5000", indicating the upper bound of the contig length in each set. Table S1 in the Supplementary Data file shows the detailed properties of the five sets of contigs. All the simulated data sets can be downloaded from VirBin's GitHub repository.
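
The coverage rule and the contig simulation steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the `n_candidates` and `seed` parameters, and the example abundances are hypothetical. "Previous contigs" is read here as the contigs already kept, and the per-set upper bound on contig length ("1000" to "5000") is modelled as a simple `max_len` filter; the text does not spell out either detail.

```python
import random


def haplotype_coverage(total_coverage, relative_abundances):
    """Coverage of each haplotype = total coverage x relative abundance."""
    return [total_coverage * a for a in relative_abundances]


def overlap_size(a, b):
    """Number of positions shared by two inclusive intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def simulate_contigs(genome_length, n_candidates, min_len=500,
                     max_len=1000, max_overlap=100, seed=0):
    """Return (p1, p2) contig coordinates, 1-based and inclusive.

    Assumptions (not confirmed by the text): overlaps are checked against
    contigs already kept, and max_len caps the contig length of a set.
    """
    rng = random.Random(seed)

    # Draw random location pairs with 1 <= p1 < p2 <= L and apply the
    # length filter p2 - p1 + 1 >= min_len (plus the assumed upper bound).
    candidates = []
    for _ in range(n_candidates):
        p1, p2 = sorted(rng.sample(range(1, genome_length + 1), 2))
        if min_len <= p2 - p1 + 1 <= max_len:
            candidates.append((p1, p2))

    # Sort by start position, then drop any contig that overlaps a
    # previously kept contig by more than max_overlap bases.
    candidates.sort()
    kept = []
    for contig in candidates:
        if all(overlap_size(contig, prev) <= max_overlap for prev in kept):
            kept.append(contig)
    return kept


if __name__ == "__main__":
    # Hypothetical abundances; the real values are in Table 1 of the paper.
    print(haplotype_coverage(1000, [0.5, 0.3, 0.2]))
    print(simulate_contigs(genome_length=9000, n_candidates=2000)[:5])
```

Because the surviving contigs share at most 100 bp with one another, an assembler that relies on large overlaps cannot merge them, which is the hard case the text sets out to simulate.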
