Author: Nicholas Fountain-Jones; Gustavo Machado; Scott Carver; Craig Packer; Mariana Mendoza; Meggan E Craft
Title: How to make more from exposure data? An integrated machine learning pipeline to predict pathogen exposure
Document date: 2019_3_6
ID: jc5c87b9_1
Document: An individual's risk of infection by a pathogen is dependent upon a wide variety of host and [...]

2.2 Pre-processing

It is important to account for missing data, either by imputation or removal, prior to model construction (Fig. 2). Some machine learning algorithms, such as gradient boosting, bin missing data as a separate node in the decision tree (Friedman, 2002; Fig. S1); however, other algorithms such as SVM are less flexible. In order to compare predictive performance across models, missing data can either be imputed or removed from the dataset. Although providing specific advice on whether to include missing data or not is outside the scope of this paper (see Nakagawa & Freckleton, 2008), we provide an option if imputation is suitable for the study problem. We integrated the 'missForest' (Stekhoven & Bühlmann, 2012) machine-learning imputation routine (using the RF algorithm) into our pipeline, as it has been found to have low [...]

The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. bioRxiv preprint doi: https://doi.org/10.1101/569012

[Figure caption fragment] Yellow boxes indicate which data split is being tested in that particular 'fold'.

We incorporated an internal repeated 10-fold cross-validation (CV) process to estimate model performance. CV can help prevent overfitting and artificial inflation of accuracy due to use of the [...]

[...] sensitivity and specificity for classification models). Another advantage of this package is that it can perform classification or regression using 237 different types of models, from generalized linear models (GLMs such as logistic regression) to complex machine learning and Bayesian models, using a standardized approach (see Kuhn, 2008 for a complete list of models).
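
The repeated 10-fold cross-validation and the sensitivity/specificity measures mentioned above can be illustrated with a minimal sketch. This is not the authors' pipeline, which appears to be built on R packages such as 'missForest'; it is a standard-library Python illustration of the splitting logic only. The function names `repeated_kfold` and `sensitivity_specificity`, the default of 3 repeats, and the random seed are assumptions made for this example, since the text does not state how many repeats were used.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper: `repeats` and `seed` are illustrative defaults,
    not values taken from the paper.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # draw a fresh random partition for each repeat
        folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
        for f, test in enumerate(folds):
            held_out = set(test)
            # every sample not in the held-out fold is used for training
            train = [i for i in idx if i not in held_out]
            yield r, f, train, sorted(test)


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    # guard against folds that contain only one class
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec
```

In each repeat, every sample is held out exactly once, so averaging a metric over all folds and repeats gives a performance estimate that never scores a model on data it was trained on. A model would be fitted on `train_idx` and scored on `test_idx` inside the loop; that step is omitted here because it depends on the chosen algorithm.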