Author: El Emam, Khaled; Mosquera, Lucy; Jonker, Elizabeth; Sood, Harpreet
Title: Evaluating the utility of synthetic COVID-19 case data
Cord-id: ndsrxda5
Document date: 2021_3_1
ID: ndsrxda5
Document: BACKGROUND: Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy-protective manner. OBJECTIVES: Evaluate the utility of synthetic data by comparing analysis results between real and synthetic data. METHODS: A gradient boosted classification tree was built to predict death using Ontario’s 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was developed on a synthesized dataset and compared with the model built on the original data. RESULTS: The AUROC and AUPRC for the real data model were 0.945 [95% confidence interval (CI), 0.941–0.948] and 0.34 (95% CI, 0.313–0.368), respectively. The synthetic data model had an AUROC and AUPRC of 0.94 (95% CI, 0.936–0.944) and 0.313 (95% CI, 0.286–0.342), with confidence interval overlap of 45.05% and 52.02%, respectively, when compared with the real data. The most important predictors of death for both the real and synthetic models were, in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two datasets. The attribute disclosure risk was 0.0585, and the membership disclosure risk was low. CONCLUSIONS: This synthetic dataset could be used as a proxy for the real dataset.
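The abstract reports confidence interval overlap between the real and synthetic models but does not state how it was computed. The sketch below assumes one common definition: the width of the intersection of the two intervals, expressed as a fraction of each interval's width, then averaged. The function name `ci_overlap` and the choice to clamp disjoint intervals to zero are illustrative assumptions, not taken from the source.

```python
def ci_overlap(real_lo, real_hi, synth_lo, synth_hi):
    """Average fractional overlap of two confidence intervals.

    Assumed definition (not stated in the abstract): the intersection
    width divided by each interval's own width, averaged over the two.
    Disjoint intervals are clamped to 0.0 here; that is a choice made
    for this sketch.
    """
    # Width of the region shared by both intervals (negative if disjoint).
    intersection = min(real_hi, synth_hi) - max(real_lo, synth_lo)
    if intersection <= 0:
        return 0.0
    real_width = real_hi - real_lo
    synth_width = synth_hi - synth_lo
    return 0.5 * (intersection / real_width + intersection / synth_width)


# Rounded interval endpoints as reported in the abstract.
auroc_overlap = ci_overlap(0.941, 0.948, 0.936, 0.944)
auprc_overlap = ci_overlap(0.313, 0.368, 0.286, 0.342)

print(f"AUROC CI overlap: {auroc_overlap:.1%}")
print(f"AUPRC CI overlap: {auprc_overlap:.1%}")
```

With the rounded endpoints above, this gives roughly 40% for AUROC and 52% for AUPRC. These are near, but not equal to, the reported 45.05% and 52.02%; the gap may come from the rounding of the published endpoints or from a different overlap definition in the original analysis.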