Medical data sharing and synthetic clinical data generation - maximizing biomedical resource utilization and minimizing participant re-identification risks.

Publication Type: Journal Article
Year of Publication: 2025
Authors: Marino S, Cassidy R, Nanni J, Wang Y, Liu Y, Tang M, Yuan Y, Chen T, Sinha A, Pandian B, Dinov ID, Burns ML
Journal: NPJ Digit Med
Volume: 8
Issue: 1
Pagination: 526
Date Published: 2025 Aug 16
ISSN: 2398-6352
Abstract:

The sensitive nature of electronic health records (EHR) and wearable data presents challenges in sharing biomedical resources while minimizing re-identification risks. This article introduces an end-to-end, titratable pipeline that generates privacy-preserving "digital twin" datasets from complex EHR and wearable-device records (Apple Watch data from 3029 participants) using DataSifter and Synthetic Data Vault (SDV) methods. Various obfuscation levels were applied (DataSifter: small, medium, large; SDV: CTGAN, Gaussian Copula) and benchmarked using utility (statistical fidelity, machine learning performance) and privacy (re-identification risk, detection likelihood) metrics. The highest-obfuscation DataSifter twin delivered the strongest privacy protection (0.83) while preserving key statistical and predictive signals (83.1% confidence interval overlap in regression models), outperforming SDV, particularly for longitudinal data. Despite declining performance in machine learning tasks with higher obfuscation, utility was generally preserved. The study underscores the importance of digital twin datasets and highlights DataSifter's adaptability in privacy-utility trade-offs, advocating its utility for secure data sharing.
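
The abstract reports utility partly as a "confidence interval overlap" between regression models fitted to the original and synthetic data. The record does not give the exact formula the authors used, so the following is only a minimal sketch of one definition commonly used for synthetic data: the shared length of the two intervals, averaged as a fraction of each interval's width. The function names, the clipping of disjoint intervals to zero, and the example intervals are all illustrative assumptions, not values or code from the study.

```python
from statistics import mean


def ci_overlap(orig_ci, synth_ci):
    """Overlap of two confidence intervals, in [0, 1].

    Assumed definition (not taken from the paper): the length of the
    intersection, averaged as a fraction of each interval's width.
    Disjoint intervals are clipped to 0.0, which is a simplifying choice.
    """
    lo_o, hi_o = orig_ci
    lo_s, hi_s = synth_ci
    shared = min(hi_o, hi_s) - max(lo_o, lo_s)
    if shared <= 0:
        return 0.0
    return 0.5 * (shared / (hi_o - lo_o) + shared / (hi_s - lo_s))


def mean_ci_overlap(pairs):
    """Average the overlap across coefficients of a regression model."""
    return mean(ci_overlap(orig, synth) for orig, synth in pairs)


# Hypothetical 95% intervals for two coefficients: (original, synthetic).
example_pairs = [
    ((0.10, 0.30), (0.12, 0.28)),
    ((-0.50, -0.10), (-0.40, 0.00)),
]

if __name__ == "__main__":
    print(f"mean CI overlap: {mean_ci_overlap(example_pairs):.1%}")
```

Under this definition, identical intervals score 1 and intervals that do not intersect score 0, so a summary such as the 83.1% quoted above would be read as an average over model coefficients, if the authors used a comparable measure.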

DOI: 10.1038/s41746-025-01935-1
Alternate Journal: NPJ Digit Med
PubMed ID: 40818998
PubMed Central ID: PMC12357896
Grant List:
1916425, 1734853, 1636840 / / NSF /
R01 MH126137 / MH / NIMH NIH HHS / United States
R01MH126137, R41TR004515, T32GM141746 / GF / NIH HHS / United States
R41 TR004515 / TR / NCATS NIH HHS / United States
T32 GM141746 / GM / NIGMS NIH HHS / United States