Author: Bikash K. Bhandari; Paul P. Gardner; Chun Shen Lim
Title: Solubility-Weighted Index: fast and accurate prediction of protein solubility
Document date: 2020_2_16
ID: 2rpr7aph_31
Document: To improve protein solubility prediction, we optimised the most recently published set of normalised B-factors using the PSI:Biology dataset (Smith et al. 2003) (Fig 2). To avoid including homologous sequences in both the test and training sets, we clustered the PSI:Biology targets using USEARCH v11.0.667, 32-bit (Edgar 2010). His-tag sequences were removed from all sequences before clustering to avoid false cluster inclusions. We obtained 5,050 clusters using the parameters: -cluster_fast -id 0.4 -msaout -threads 4. These clusters were manually divided into 10 subsets of approximately 1,200 sequences each. The subsequent steps were done with the His-tag sequences retained. Starting from Smith et al.'s normalised B-factors as the initial weights, we maximised AUC on these 10 subsets using 10-fold cross-validation. Since AUC is non-differentiable, we used the Nelder-Mead optimisation method (implemented in SciPy v1.2.0), which is a derivative-free, heuristic, simplex-based optimisation (Oliphant 2007; Millman and Aivazis 2011; Nelder and Mead 1965). For each step in cross-validation, we used 1,000 bootstrap resamples, each containing 1,000 soluble and 1,000 insoluble proteins. Optimisation was carried out for each resample, giving 1,000 sets of weights. The arithmetic mean of these weights was used to determine the training and test AUC for the cross-validation step (Fig 2A).
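The bootstrap optimisation described above can be illustrated with a minimal sketch. It is not the authors' code. The function names (composition, auc, optimise_weights) are hypothetical, and the sketch assumes that a protein's score is the mean per-residue weight, i.e. the dot product of its amino-acid composition with the 20 weights. AUC is computed from ranks (the Mann-Whitney U statistic), and each balanced bootstrap resample is optimised with SciPy's Nelder-Mead implementation before the fitted weights are averaged.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the 20-element amino-acid frequency vector of a sequence."""
    counts = np.array([seq.count(a) for a in AMINO_ACIDS], dtype=float)
    return counts / max(len(seq), 1)


def auc(scores, labels):
    """Rank-based AUC (Mann-Whitney U); tied scores get average ranks."""
    labels = np.asarray(labels, dtype=bool)
    ranks = rankdata(scores)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def optimise_weights(X, y, w0, n_boot=1000, n_per_class=1000,
                     seed=0, maxiter=None):
    """Average Nelder-Mead solutions over balanced bootstrap resamples.

    X  : (n_proteins, 20) composition matrix (assumed scoring model)
    y  : boolean array, True for soluble proteins
    w0 : initial weights, e.g. a set of normalised B-factors
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=bool)
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y), np.flatnonzero(~y)
    fitted = []
    for _ in range(n_boot):
        # Sample with replacement, the same number from each class.
        idx = np.concatenate([rng.choice(pos, n_per_class),
                              rng.choice(neg, n_per_class)])
        Xb, yb = X[idx], y[idx]
        # Minimise the negative AUC; no gradient is required.
        res = minimize(lambda w: -auc(Xb @ w, yb), w0,
                       method="Nelder-Mead",
                       options={"maxiter": maxiter})
        fitted.append(res.x)
    # The arithmetic mean of the fitted weight sets is the final estimate.
    return np.mean(fitted, axis=0)
```

In a 10-fold cross-validation, optimise_weights would be called once per step on the nine training subsets, and auc would then be evaluated on both the training subsets and the held-out subset using the averaged weights. With 20 weights and 1,000 resamples per step the run time can be substantial, so smaller n_boot and maxiter values are useful for a quick trial.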