Selected article for: "accuracy test and additional bias"

Author: Gurjit S. Randhawa; Maximillian P.M. Soltysiak; Hadi El Roz; Camila P.E. de Souza; Kathleen A. Hill; Lila Kari
Title: Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study
  • Document date: 2020-02-04
  • ID: cetdqgff_20
    Document: Test-1 starts at the highest available level and classifies the viral sequences into 11 families and the Riboviria realm (Table 1). There is only one realm available in the viral taxonomy, so all of the families that belong to the realm Riboviria are placed into a single cluster, from which a random collection of 500 sequences is selected. No realm is defined for the remaining 11 families. The objective is to train the classification models with the known viral genomes and then predict the labels of the COVID-19 virus sequences. The maximum classification accuracy score of 95% was obtained using the Quadratic SVM model. This test demonstrates that MLDSP-GUI can distinguish between different viral families. The trained models are then used to predict the labels of 29 COVID-19 sequences. As expected, all classification models correctly predict that the COVID-19 sequences belong to the Riboviria realm (see Table 2).

    Test-2 is composed of 12 families from the Riboviria (see Table 1), and the goal is to test whether MLDSP-GUI is sensitive enough to classify the sequences at the next lower taxonomic level. It should be noted that as we move down the taxonomic levels, sequences become much more similar to one another and the classification problem becomes more challenging. MLDSP-GUI is still able to distinguish between the sequences within the Riboviria realm, with a maximum classification accuracy of 91.1% obtained using the Linear Discriminant classification model. When COVID-19 sequences are tested using the models trained on Test-2, all of the models correctly predict the COVID-19 sequences as Coronaviridae (Table 2).

    [...] four genera (Alphacoronavirus, Betacoronavirus, Deltacoronavirus, Gammacoronavirus), see Table 1. MLDSP-GUI distinguishes sequences at the genus level with a maximum classification accuracy score of 98%, obtained using the Linear Discriminant model. This is a very high accuracy rate considering that no alignment is involved and the sequences are very similar. All trained classification models correctly predict the COVID-19 sequences as Betacoronavirus (see Table 2).

    Test-3a has Betacoronavirus as the largest cluster, and it can be argued that the higher accuracy could be a result of this bias. To rule this out, we ran an additional test, Test-3b, which removes the smallest cluster (Gammacoronavirus) and limits the size of the remaining three clusters to that of the cluster with the fewest sequences, i.e. 20. MLDSP-GUI obtains 100% classification accuracy for this additional test and still predicts all of the COVID-19 sequences as Betacoronavirus. These tests confirm that the COVID-19 sequences are from the genus Betacoronavirus.

    (bioRxiv preprint, not peer-reviewed; doi: https://doi.org/10.1101/2020.02.03.932350)
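
    The cluster balancing described for Test-3b (drop the smallest cluster, then cap every remaining cluster at the size of the smallest one left) can be illustrated with a minimal Python sketch. This is not MLDSP-GUI code: the function name `balance_clusters`, its parameters, and the cluster sizes in the usage example are assumptions made for illustration. Only the resulting cap of 20 sequences per cluster comes from the text.

```python
import random
from collections import Counter, defaultdict


def balance_clusters(labels, drop_smallest=True, seed=0):
    """Return sorted indices of a class-balanced subsample.

    Hypothetical helper, not part of MLDSP-GUI. `labels` holds one
    cluster label per sequence, in dataset order.
    """
    # Group sequence indices by their cluster label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Optionally discard the smallest cluster entirely.
    if drop_smallest and len(by_label) > 1:
        smallest = min(by_label, key=lambda k: len(by_label[k]))
        del by_label[smallest]

    # Cap every remaining cluster at the size of the smallest one left.
    target = min(len(members) for members in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    keep = []
    for members in by_label.values():
        keep.extend(rng.sample(members, target))
    return sorted(keep)


# Illustrative cluster sizes (assumed, not taken from the paper's Table 1).
labels = (
    ["Alphacoronavirus"] * 50
    + ["Betacoronavirus"] * 120
    + ["Deltacoronavirus"] * 20
    + ["Gammacoronavirus"] * 9
)
keep = balance_clusters(labels)
print(Counter(labels[i] for i in keep))
```

    With these assumed sizes, Gammacoronavirus is dropped and each of the three remaining genera contributes 20 sequences, so no single cluster dominates training.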
