is the number of tumor types in the dataset, which depends on the miRNA's relative position, yielding the result displayed in Figure 1.

As the objective is to find and validate a reduced list of miRNAs to be used as a signature, feature selection must be performed on the dataset. Popular approaches to feature selection range from univariate statistical considerations to iterated runs of the same classifier with a progressively reduced number of features, in order to assess the contribution of each feature to the overall result. As the considered case study is particularly complex, however, relying upon simple statistical analyses or a single classifier might not suffice. Following the idea behind ensemble feature selection [31] [32] [33], we use multiple algorithms to obtain a more robust predictive performance. For this purpose, we train a set of classifiers and then extract a sorted list of the most relevant features from each. As, intuitively, a feature considered important by the majority of classifiers in the set is likely to be relevant for our aim, the information from all classifiers is then compiled to find the most commonly relevant features.

Starting from a thorough comparison of 22 different state-of-the-art classifiers on the considered dataset presented in [34], in this work a subset of those classifiers is selected, considering both (i) high accuracy and (ii) the availability of a way to extract the relative importance of the features from the trained classifier. After preliminary tests to set the algorithms' hyperparameters, 8 classifiers are chosen, all featuring an average accuracy higher than 90% on a 10-fold cross-validation:

• Bagging
• GradientBoosting
• LogisticRegression
• PassiveAggressive
• RandomForest
• Ridge
• SGD
• SVC (Support Vector Machines Classifier with a linear kernel) [42]

All considered classifiers are implemented in the scikit-learn Python toolbox [43]. Overall, the selected classifiers fall into two broad typologies: those exploiting ensembles of classification trees [44] (Bagging, GradientBoosting, RandomForest), and those optimizing the coefficients of linear models to separate classes (LogisticRegression, PassiveAggressive, Ridge, SGD, SVC). Depending on the classifier typology, there are two different ways of extracting relative feature importance. For classifiers based on classification trees, the features used in the splits are counted and sorted by frequency, from the most to the least common. For classifiers based on linear models, the values of the coefficients associated with each feature can be used as a proxy of their relative importance, sorting coefficients from the largest to the smallest in absolute value. As the two extraction methods return heterogeneous numeric values, only the relative ranking of features provided by each classifier is considered. We arbitrarily decide to extract the top 100 most relevant features, assigning to each feature f a simple score S_f = N_t / N_c, where N_t is the number of times that specific feature appears among the top 100 of a specific classifier instance, while N_c is the total number of classifier instances used; for instance, a feature appearing among the 100 most relevant in 73% of the classifier instances would obtain a score S_f = 0.73. In order to increase the generalizability of our results, each selected classifier is run 10 times, using a 10-fold stratified cross-validation, so that each fold preserves the percentage of samples of each class. Table 2 comp
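To illustrate the two extraction strategies described above, the sketch below counts split frequencies over the trees of a fitted ensemble and ranks linear-model features by absolute coefficient value. This is a minimal sketch rather than the authors' code: the function names are ours, it assumes a flat `estimators_` list of decision trees (as in RandomForest or Bagging; GradientBoosting stores a 2-D array of per-class trees that would first need flattening), and summing absolute coefficients across classes is one possible multi-class aggregation, not specified in the text.

```python
import numpy as np

def rank_by_split_frequency(ensemble):
    """Rank features by how often they are used in the splits of a
    fitted tree ensemble, from the most to the least common."""
    counts = np.zeros(ensemble.n_features_in_, dtype=int)
    for tree in ensemble.estimators_:          # flat list of fitted trees
        split_features = tree.tree_.feature    # negative values mark leaves
        used, n = np.unique(split_features[split_features >= 0],
                            return_counts=True)
        counts[used] += n
    return np.argsort(counts)[::-1]            # feature indices, best first

def rank_by_coefficient(model):
    """Rank features by absolute coefficient value; in the multi-class
    case the absolute values are summed across classes (an assumption,
    since the text only specifies sorting by absolute coefficient)."""
    coef = np.atleast_2d(model.coef_)          # (n_classes, n_features)
    return np.argsort(np.abs(coef).sum(axis=0))[::-1]
```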
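The per-instance top-100 lists can then be collected over the 10 runs of 10-fold stratified cross-validation. One possible outline, again with hypothetical names (`make_classifier` is any zero-argument factory returning a fresh classifier, `rank_features` one of the two helpers above):

```python
from sklearn.model_selection import StratifiedKFold

def collect_top_lists(make_classifier, rank_features, X, y,
                      runs=10, folds=10, top_k=100):
    """Fit one classifier instance per fold and per run, keeping each
    instance's top_k most relevant features."""
    top_lists = []
    for run in range(runs):
        # stratified folds preserve the class proportions of the dataset
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=run)
        for train_idx, _ in skf.split(X, y):
            clf = make_classifier().fit(X[train_idx], y[train_idx])
            top_lists.append(rank_features(clf)[:top_k])
    return top_lists
```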
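Finally, the score S_f = N_t / N_c follows directly from the collected lists; the function name here is again ours:

```python
from collections import Counter

def score_features(top_lists):
    """Compute S_f = N_t / N_c: the fraction of classifier instances
    whose top-100 list contains feature f."""
    n_c = len(top_lists)
    n_t = Counter(f for top in top_lists for f in set(top))
    return {f: count / n_c for f, count in n_t.items()}
```

A feature present in 73 of 100 classifier instances would thus receive S_f = 0.73, matching the example given above.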