Author: Bikash K. Bhandari; Paul P. Gardner; Chun Shen Lim
Title: Solubility-Weighted Index: fast and accurate prediction of protein solubility
Document date: 2020_2_16
ID: 2rpr7aph_9
Document: We applied this arithmetic mean approach (i.e., sequence composition scoring) to the PSI:Biology dataset and compared four sets of previously published, normalised B-factors (Bhaskaran and Ponnuswamy 1988; Ragone et al. 1989; M. Vihinen, Torkkila, and Riikonen 1994; Smith et al. 2003). Among these sets of B-factors, sequence composition scoring using the most recently published set of normalised B-factors produced the highest AUC score. To improve the prediction accuracy of solubility, we iteratively refined the weights of amino acid residues using the Nelder-Mead optimisation algorithm (Nelder and Mead 1965). To avoid training and testing on similar sequences, we generated 10 cross-validation sets with maximised heterogeneity between subsets (i.e., no similar sequences shared between subsets). We first clustered all 12,216 PSI:Biology protein sequences with USEARCH at a 40% similarity threshold, producing 5,050 clusters with remote similarity (see Methods and Supplementary Fig S4). We then manually grouped the clusters into 10 cross-validation sets of approximately 1,200 sequences each. We did not select a representative sequence for each cluster because about 12% of clusters contain a mix of soluble and insoluble proteins (Supplementary Fig S4C). More importantly, to address the issues of sequence similarity and imbalanced classes, we performed 1,000 bootstrap resamplings for each cross-validation step (Fig 2A and Supplementary Fig S5). For each cross-validation step, we calculated the solubility scores from the optimised weights as in Equation 1, and then the AUC scores. Our training and test AUC scores were 0.72 ± 0.00 and 0.71 ± 0.01, respectively (mean ± standard deviation; Fig 2B and Supplementary Table S3), showing an improvement over flexibility (i.e., normalised B-factors) in solubility prediction.
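The scoring and evaluation steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy weights and the toy data are hypothetical. Two choices are assumptions made for illustration only. First, the greedy fold assignment stands in for the manual grouping of clusters reported in the text. Second, the bootstrap draws equal numbers of soluble and insoluble proteins, which is one possible way to address class imbalance. The weight refinement by Nelder-Mead is not shown.

```python
import random
from statistics import mean, pstdev


def composition_score(sequence, weights):
    """Arithmetic mean of per-residue weights (sequence composition scoring)."""
    values = [weights[aa] for aa in sequence if aa in weights]
    if not values:
        raise ValueError("sequence contains no residues with a weight")
    return sum(values) / len(values)


def auc(scores, labels):
    """AUC as the fraction of (soluble, insoluble) pairs ranked correctly.

    Ties count as half a correct pair. Labels: 1 = soluble, 0 = insoluble.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def group_clusters_into_folds(clusters, n_folds=10):
    """Assign whole clusters to folds so similar sequences never span folds.

    Assumption: a greedy heuristic (largest cluster goes to the currently
    smallest fold) replaces the manual grouping described in the text.
    """
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(clusters, key=len, reverse=True):
        min(folds, key=len).extend(cluster)
    return folds


def bootstrap_auc(scores, labels, n_resamples=1000, seed=0):
    """Mean and standard deviation of AUC over class-balanced resamples.

    Assumption: each resample draws, with replacement, the same number of
    soluble and insoluble proteins (the size of the smaller class).
    """
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    k = min(len(pos), len(neg))
    aucs = []
    for _ in range(n_resamples):
        sample = rng.choices(pos, k=k) + rng.choices(neg, k=k)
        aucs.append(auc(sample, [1] * k + [0] * k))
    return mean(aucs), pstdev(aucs)


# Toy usage with hypothetical weights and sequences (not real data).
toy_weights = {"A": 0.2, "K": 0.8, "E": 0.9, "W": 0.1}
toy_sequences = ["KKEE", "KEKA", "AWWA", "WAAW", "AWAK"]
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [composition_score(seq, toy_weights) for seq in toy_sequences]
toy_mean_auc, toy_sd_auc = bootstrap_auc(toy_scores, toy_labels, n_resamples=100)
```

Keeping every cluster inside a single fold is what prevents similar sequences from appearing in both the training and test sets, which is the purpose of the clustering step in the text.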