Author: Cai, M.; Li, J.; Nali, M.; MacKey, T. K.
Title: Evaluation of Hybrid Unsupervised and Supervised Machine Learning Approach to Detect Self-Reporting of COVID-19 Symptoms on Twitter
Cord-id: zcpn3nz5
Document date: 2021_1_1
ID: zcpn3nz5
Document: With over 127 million cases globally, the COVID-19 pandemic marks a sentinel event in global health. However, true case estimates have been elusive due to a lack of testing and diagnostic capacity, asymptomatic cases, and individuals who do not get tested or seek care. Concomitantly, new digital surveillance tools to detect, characterize, and report COVID-19 cases are emerging, including tools that use structured and unstructured data from users self-reporting COVID-19-related experiences on the Internet and social media platforms. In this study, we develop and evaluate a hybrid unsupervised and supervised machine learning approach to detect self-reported COVID-19-related symptoms on Twitter during the early stages of the pandemic. Tweets were collected from the public API stream from March 3 to 31, 2020, and filtered for COVID-19-related terms. For the first 18 days of tweets, we used the biterm topic model to cluster tweets into theme-associated groups, which were then extracted and manually annotated to identify users self-reporting suspected COVID-19 symptoms or status. Using this manually annotated data as a training set, we used an XLNet deep learning model to classify symptom-related tweets from a larger corpus and also evaluated model performance. From 4,492,954 tweets collected, the unsupervised learning process yielded 3,465 (<1%) symptom tweets used to form our ground-truth COVID-19 symptoms dataset (n = 11,550). The XLNet text classifier achieved the highest accuracy (0.91) and F1 score (0.62) compared to the baseline models evaluated for classification. After re-training with an adjusted loss function, we boosted the classifier's precision to 0.81 while maintaining a high F1 score (0.66), resulting in the identification of an additional 2,622 symptom-related tweets when applied to an additional 11 days of collected tweets. Our study used a hybrid machine learning approach to enable high-precision identification of Twitter user-generated COVID-19 symptom discussions.
The model is a digital epidemiology tool that can identify social media users who self-report symptoms during the early periods of an outbreak. © 2021 IEEE.
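The precision and F1 figures reported in the abstract are linked by the standard definition F1 = 2PR / (P + R). The following is a minimal Python sketch of that relationship, not the authors' evaluation code. The function names and the confusion counts in the usage lines are hypothetical, and the recall it derives is only what the rounded values 0.81 and 0.66 would imply, assuming both refer to the symptom class. The abstract itself does not report recall.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall exists for this precision/F1 pair")
    return f1 * precision / denom


# Recall implied by the rounded figures in the abstract (an inference, not a reported value).
recall_estimate = implied_recall(precision=0.81, f1=0.66)

# Hypothetical counts chosen only to illustrate a precision of 0.81.
p, r, f1 = precision_recall_f1(tp=81, fp=19, fn=64)
```

Under these assumptions the implied recall is roughly 0.56, which illustrates the trade-off the abstract describes: the adjusted loss function favors precision, so fewer symptom tweets are retrieved but more of the retrieved ones are correct.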