Author: Jiao Chen; Jiayu Shang; Jianrong Wang; Yanni Sun
Title: A binning tool to reconstruct viral haplotypes from assembled contigs
Document date: 2019_7_16
ID: 2basllfv_58
Document: We use the contigs assembled by PEHaplo as input for VirBin. PEHaplo produced 24 contigs from the real MiSeq HIV data set, which cover about 92% of the five HIV-1 strains. These contigs have an N50 value of 2,223 bp, and the longest contig is 9,133 bp.

Haplotype number estimation

VirBin was applied to the aligned contigs to estimate the number of haplotypes. All windows were sorted in descending order of window length. Out of the top 50 windows, 27 contain 5 contigs, 16 contain 6 contigs, and 2 contain 4 contigs. Out of the top 25 windows, 17 contain 5 contigs, 5 contain 6 contigs, and 1 contains 4 contigs. As with the simulated data, using the consensus window depth (i.e. 5) correctly predicted the haplotype number.

Clustering results

The clustering algorithm was applied to cluster the contigs into 5 groups. For each contig, its originating haplotype is determined by comparing the contig with all reference genomes: the haplotype with the highest similarity is assigned, provided that similarity is above 98%. The outputs of VirBin and MaxBin are shown in Table 2. StrainPhlAn and ConStrains were also applied to this real HIV data set. StrainPhlAn was able to identify the HIV species but could not report any strain information. ConStrains could not align enough reads to marker genes for further strain-level analysis.

Compared with the results on the simulated contigs and on contigs assembled from simulated reads, the results of VirBin on the real sequencing data have generally lower sensitivity and precision. There are two major reasons. First, the contigs assembled from real reads are more likely to contain errors. Second, this data set has several haplotypes with very similar average abundances. Referring to Fig. 8, the abundance difference between the 2 least abundant haplotypes is < 2%. Thus, the clustering algorithm could mix contigs from these haplotypes.
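
The three quantities used in this section (N50, consensus window depth, and the 98% similarity rule for labelling a contig) can be illustrated with a minimal Python sketch. This is not VirBin's code. The function names, the (window_length, contig_count) input format, the window lengths, and the strain labels are all hypothetical. The sketch also assumes that "consensus window depth" means the most frequent contig count among the top-ranked windows, which is one plausible reading of the text.

```python
from collections import Counter


def n50(lengths):
    """Return the N50: the length of the contig at which the running sum of
    lengths, taken from longest to shortest, first reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def consensus_window_depth(windows, top_k):
    """Return the most frequent contig count among the top_k longest windows.

    `windows` is a list of (window_length, contig_count) pairs; this input
    format is an assumption made for illustration.
    """
    ranked = sorted(windows, key=lambda w: w[0], reverse=True)[:top_k]
    counts = Counter(depth for _, depth in ranked)
    return counts.most_common(1)[0][0]


def assign_haplotype(similarities, threshold=0.98):
    """Return the reference with the highest similarity, or None when even
    the best match is not above the threshold.

    `similarities` maps a reference name to a similarity in [0, 1].
    """
    best = max(similarities, key=similarities.get)
    return best if similarities[best] > threshold else None


# Contig counts reported for the top 50 windows; the window length (1000)
# is a placeholder because the text does not give per-window lengths.
reported = [(1000, 5)] * 27 + [(1000, 6)] * 16 + [(1000, 4)] * 2
print(consensus_window_depth(reported, top_k=50))

# Hypothetical similarities of one contig against two reference genomes.
print(assign_haplotype({"strain_A": 0.991, "strain_B": 0.975}))
```

With the reported counts, the most frequent depth is 5, matching the haplotype number stated in the text. A contig whose best match does not exceed the threshold is left unassigned (`None`) rather than forced into a group.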