…is the number of tumor types in the dataset, which depends on the miRNA's relative position, yielding the result displayed in Figure 1.

As the objective is to find and validate a reduced list of miRNAs to be used as a signature, feature selection must be performed on the dataset. Popular approaches to feature selection range from univariate statistical tests to iterated runs of the same classifier with a progressively reduced number of features, in order to assess the contribution of each feature to the overall result. As the considered case study is particularly complex, however, relying upon simple statistical analyses or a single classifier might not suffice. Following the idea behind ensemble feature selection [31-33], we use multiple algorithms to obtain a more robust predictive performance. For this purpose, we train a set of classifiers and extract a sorted list of the most relevant features from each. Since, intuitively, a feature considered important by the majority of classifiers in the set is likely to be relevant for our aim, the information from all classifiers is then compiled to find the most common relevant features.

Starting from a thorough comparison of 22 different state-of-the-art classifiers on the considered dataset, presented in [34], a subset of those classifiers is selected in this work, considering both (i) high accuracy and (ii) the availability of a way to extract the relative importance of features from the trained classifier. After preliminary tests to set the algorithms' hyperparameters, 8 classifiers are chosen, all featuring an average accuracy higher than 90% on a 10-fold cross-validation:

• Bagging
• GradientBoosting
• LogisticRegression
• PassiveAggressive
• RandomForest
• Ridge
• SGD (Stochastic Gradient Descent)
• SVC (Support Vector Machines Classifier with a linear kernel) [42]

All considered classifiers are implemented in the scikit-learn Python toolbox [43]. Overall, the selected classifiers fall into two broad typologies: those exploiting ensembles of classification trees [44] (Bagging, GradientBoosting, RandomForest), and those optimizing the coefficients of linear models to separate classes (LogisticRegression, PassiveAggressive, Ridge, SGD, SVC). Depending on the classifier typology, relative feature importance is extracted in one of two ways, as sketched below. For classifiers based on classification trees, the features used in the splits are counted and sorted by frequency, from the most to the least common. For classifiers based on linear models, the values of the coefficients associated with each feature are used as a proxy for relative importance, sorting the coefficients from the largest to the smallest in absolute value. As the two extraction methods return heterogeneous numeric values, only the relative ordering of features provided by each classifier is considered. We arbitrarily decide to extract the top 100 most relevant features, assigning to each feature f a simple score S_f = N_t/N_c, where N_t is the number of times that specific feature appears among the top 100 of a classifier instance, while N_c is the total number of classifier instances used; for instance, a feature appearing among the 100 most relevant in 73% of the classifier instances would obtain a score S_f = 0.73. In order to increase the generalizability of our results, each selected classifier is run 10 times, using a 10-fold stratified cross-validation, so that each fold preserves the percentage of samples of each class.
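As an illustration of the two extraction strategies, the minimal sketch below (with hypothetical helper names, not code from the study) counts the features used in splits across all trees of a scikit-learn ensemble, and ranks linear-model features by coefficient magnitude. For multiclass linear models, summing absolute coefficients over the per-class rows is one reasonable aggregation choice; the text does not specify it.

```python
import numpy as np
from collections import Counter

def split_count_ranking(tree_ensemble):
    """Rank features by how often they appear in splits across the ensemble.

    np.ravel handles both Bagging/RandomForest (a flat list of trees) and
    GradientBoosting (a 2-D array of per-class trees).
    """
    counts = Counter()
    for tree in np.ravel(tree_ensemble.estimators_):
        node_features = tree.tree_.feature  # split feature per node; <0 marks leaves
        counts.update(int(f) for f in node_features if f >= 0)
    # most_common() sorts from the most to the least frequent split feature
    return [f for f, _ in counts.most_common()]

def coefficient_ranking(linear_clf):
    """Rank features by |coefficient|; for multiclass models, coef_ has one
    row per class, so magnitudes are summed over rows (an assumption)."""
    importance = np.abs(linear_clf.coef_).sum(axis=0)
    return list(np.argsort(importance)[::-1])  # largest to smallest
```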
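The score S_f = N_t/N_c can then be obtained by counting, over all per-classifier rankings, how often each feature enters a top-100 list. A minimal sketch, assuming each ranking is a list of feature indices sorted from most to least relevant:

```python
from collections import Counter

def compute_signature_scores(rankings, top_k=100):
    """Compute S_f = N_t / N_c for every feature.

    rankings: one feature ranking per trained classifier instance;
    N_c is the number of instances, N_t counts appearances in a top-k list.
    """
    n_c = len(rankings)
    counts = Counter()
    for ranking in rankings:
        counts.update(ranking[:top_k])  # each appearance increments N_t
    return {feature: n_t / n_c for feature, n_t in counts.items()}
```

With this aggregation, a feature found in the top 100 of 73% of the instances receives S_f = 0.73, matching the example in the text.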
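Finally, one possible way to tie the pieces together for a single classifier is scikit-learn's RepeatedStratifiedKFold, which matches the "10 runs of 10-fold stratified cross-validation" protocol; the dataset below is a synthetic stand-in for the actual miRNA expression matrix, and the helpers come from the sketches above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical stand-in data: the real study uses miRNA expression values
# as features and tumor types as class labels.
X, y = make_classification(n_samples=500, n_features=200, n_informative=40,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# 10 repetitions of 10-fold stratified CV: every fold preserves the
# class proportions of the full dataset.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

rankings = []
for train_idx, _ in cv.split(X, y):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    rankings.append(split_count_ranking(clf))   # tree-based ranking sketch

scores = compute_signature_scores(rankings)     # S_f aggregation sketch
signature = sorted(scores, key=scores.get, reverse=True)[:100]
```

In the full procedure, the same loop would be repeated for each of the 8 selected classifiers (using coefficient_ranking for the linear models) and the rankings pooled before computing S_f.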