Author: Murshed, Belal Abdullah Hezam; Mallappa, Suresha; Ghaleb, Osamah A. M.; Al-ariki, Hasib Daowd Esmail
Title: Efficient Twitter Data Cleansing Model for Data Analysis of the Pandemic Tweets
Document date: 2021-03-21
ID: bbkbzb7k
Document: Twitter data tends to be unstructured, noisy, disorganized, and written in informal language. In this paper, we propose an intelligent Twitter data cleansing model that addresses the data quality problems associated with Twitter text. The model corrects a wide variety of anomalies: slang, typos, elongated words (repeated characters), transpositions, concatenated words, complex spelling mistakes such as unorthodox use of acronyms, multiple abbreviations of the same word, and word boundary errors. We investigate and evaluate the effect of the full range of Twitter Data Cleansing Model (TDCM) tasks on sentiment classification performance, using feature models and three common classifiers. We conducted our experiments on two pandemic Twitter datasets: a COVID-19 dataset and a Dengue dataset. The primary objective of this paper is to improve the accuracy and quality of Twitter data by cleansing it for further analysis. The experimental results indicate that sentiment classification accuracy increases once the data quality problems in the Twitter text are resolved. On the COVID-19 Twitter dataset, the best accuracy, recall, and F1-score after cleansing were obtained with the Random Forest classifier, at 84.7%, 88.5%, and 86.3% respectively, while the best precision, 84.5%, was obtained with the SVM classifier. On the Dengue Twitter dataset, the best accuracy, precision, and F1-score after cleansing were 81.7%, 83.7%, and 88.6% respectively, obtained with the Random Forest classifier, while the best recall, 94.9%, was obtained with the SVM classifier.
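To make two of the anomaly types named in the abstract concrete, the following is a minimal Python sketch of elongation squeezing and slang/abbreviation expansion. It is an illustration under stated assumptions, not the authors' TDCM: the `SLANG` table, the function names `squeeze_elongation` and `clean_tweet`, and the rule of collapsing runs of three or more repeated characters down to two are all hypothetical choices made for this example.

```python
import re

# Hypothetical slang/abbreviation lookup for illustration only;
# the dictionaries actually used by the paper are not given in the abstract.
SLANG = {"u": "you", "gr8": "great", "pls": "please", "b4": "before"}


def squeeze_elongation(token: str) -> str:
    """Collapse any run of 3+ identical characters down to 2 characters."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token)


def clean_tweet(text: str) -> str:
    """Lowercase, tokenize, squeeze elongated tokens, then expand known slang."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    cleaned = []
    for tok in tokens:
        tok = squeeze_elongation(tok)
        # Replace the token only if it appears in the (assumed) slang table.
        cleaned.append(SLANG.get(tok, tok))
    return " ".join(cleaned)


if __name__ == "__main__":
    print(clean_tweet("Pls stay home, COVID is sooooo scary u all"))
```

Squeezing to two characters rather than one keeps legitimate double letters such as "all" intact, so a token like "soo" would still need a later spelling-correction step to reach "so". That step, along with transposition, concatenated-word, and word-boundary repair, is outside this sketch.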