Author: Zhan, X.; Humbert-Droz, M.; Mukherjee, P.; Gevaert, O.
Title: Structuring clinical text with AI: old vs. new natural language processing techniques evaluated on eight common cardiovascular diseases
  • Cord-id: ih75w4up
  • Document date: 2021_1_29
    Document: Mining the structured data in electronic health records (EHRs) enables many clinical applications, while the information in free-text clinical notes often remains untapped. Free-text notes are unstructured data that are harder to use in machine learning, while structured diagnostic codes can be missing or even erroneous. To improve the quality of diagnostic codes, this work extracts structured diagnostic codes from the unstructured notes concerning cardiovascular diseases. Five old and new word embeddings were used to vectorize over 5 million progress notes from Stanford EHR, and logistic regression was used to predict eight ICD-10 codes of common cardiovascular diseases. The models were interpreted through the words most important to their predictions and through analyses of false positive cases. The transferability of the models, trained on Stanford notes, was tested by predicting the corresponding ICD-9 codes of the MIMIC-III discharge summaries. The word embeddings and logistic regression showed good performance in diagnostic code extraction, with TF-IDF as the best word embedding model, showing AUROC ranging from 0.9499 to 0.9915 and AUPRC ranging from 0.2956 to 0.8072. The models also showed transferability when tested on the MIMIC-III data set, with AUROC ranging from 0.7952 to 0.9790 and AUPRC ranging from 0.2353 to 0.8084. Model interpretability was shown by the important words, whose clinical meanings matched each disease. This study shows the feasibility of accurately extracting structured diagnostic codes, imputing missing codes and correcting erroneous codes from free-text clinical notes with models that clinicians can interpret, which helps improve the data quality of diagnostic codes for information retrieval and downstream machine-learning applications.
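The pipeline the abstract describes (TF-IDF vectors fed to logistic regression, then interpreted through the highest-weighted words) can be illustrated with a minimal sketch. This is not the authors' code: the four notes and their labels are invented toy data, the function names (`fit_tfidf`, `transform`, `train_logreg`, `predict_proba`) are hypothetical, and the plain gradient-descent training loop is an assumption chosen to keep the example dependency-free.

```python
import math
from collections import Counter

# Invented toy notes (NOT Stanford EHR or MIMIC-III data).
# Label 1 means the hypothetical diagnostic code is present.
notes = [
    "patient with atrial fibrillation on anticoagulation",
    "atrial fibrillation with rapid ventricular response",
    "follow up for knee pain with swelling",
    "seasonal allergies and mild cough",
]
labels = [1, 1, 0, 0]


def fit_tfidf(docs):
    """Learn a vocabulary and smoothed IDF weights from the training notes."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    return vocab, idf


def transform(doc, vocab, idf):
    """Map one note to an L2-normalised TF-IDF vector; unseen words are ignored."""
    counts = Counter(doc.lower().split())
    vec = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit logistic regression with batch gradient descent (an assumed optimiser)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


vocab, idf = fit_tfidf(notes)
X = [transform(d, vocab, idf) for d in notes]
w, b = train_logreg(X, labels)


def predict_proba(doc):
    """Probability that the code applies to a new note."""
    x = transform(doc, vocab, idf)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Interpretation step: the words with the largest positive weights.
top_words = [t for _, t in sorted(zip(w, vocab), reverse=True)[:2]]
```

In this toy setting, `top_words` plays the role of the "important words" used for interpretation, and `predict_proba` scores an unseen note. A real study would additionally hold out data and report ranking metrics such as AUROC and AUPRC, which this sketch omits.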
