Author: Kanduri, Chakravarthi; Pavlović, Milena; Scheffer, Lonneke; Motwani, Keshav; Chernigovskaya, Maria; Greiff, Victor; Sandve, Geir K.
Title: Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification
  • Cord-id: coupcqje
  • Document date: 2021_9_3
    Document: Background: Machine learning (ML) methodology development for classification of immune states in adaptive immune receptor repertoires (AIRR) has seen a recent surge of interest. However, so far, there does not exist a systematic evaluation of scenarios where classical ML methods (such as penalized logistic regression) already perform adequately for AIRR classification. This hinders investigative reorientation to those scenarios where further method development of more sophisticated ML approaches may be required. Results: To identify those scenarios where a baseline method is able to perform well for AIRR classification, we generated a collection of synthetic benchmark datasets encompassing a wide range of dataset architecture-associated and immune state-associated sequence pattern (signal) complexity. We trained ≈1300 ML models with varying assumptions regarding immune signal on ≈850 datasets with a total of ≈210’000 repertoires containing ≈42 billion TCRβ CDR3 amino acid sequences, thereby surpassing the sample sizes of current state-of-the-art AIRR ML setups by two orders of magnitude. We found that L1-penalized logistic regression achieved high prediction accuracy even when the immune signal occurs only in 1 out of 50’000 AIR sequences. Conclusions: We provide a reference benchmark to guide new AIRR ML classification methodology by: (i) identifying those scenarios characterised by immune signal and dataset complexity, where baseline methods already achieve high prediction accuracy and (ii) facilitating realistic expectations of the performance of AIRR ML models given training dataset properties and assumptions. Our study serves as a template for defining specialized AIRR benchmark datasets for comprehensive benchmarking of AIRR ML methods.
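
    To give a feel for how sparse a signal of 1 in 50’000 sequences is, the approximate figures quoted in the abstract can be combined in a back-of-the-envelope calculation. This is only a rough sketch: it assumes every repertoire has the same size, which the abstract does not state, and the variable names are illustrative rather than taken from the study.

```python
# Approximate figures quoted in the abstract.
total_sequences = 42_000_000_000   # ≈42 billion TCRβ CDR3 sequences
total_repertoires = 210_000        # ≈210'000 repertoires
signal_one_in = 50_000             # signal occurs in 1 out of 50'000 sequences

# Assumption (not stated in the abstract): all repertoires are equally sized.
sequences_per_repertoire = total_sequences // total_repertoires

# Expected number of signal-carrying sequences in one such repertoire.
expected_signal_sequences = sequences_per_repertoire / signal_one_in

print(sequences_per_repertoire)
print(expected_signal_sequences)
```

    Under that equal-size assumption, a repertoire would hold about 200’000 sequences, of which only about 4 would be expected to carry the immune signal at the lowest rate mentioned.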
