Dept. of Medical Informatics and Clinical Epidemiology
Oregon Health & Science University
From the breakthrough research of Latanya Sweeney we have learned that the secondary use of healthcare data may be privacy-revealing. Common techniques for ensuring privacy focus more on security and validation and are vulnerable to privacy attacks. Yet, it is only through the secondary use of healthcare data that we can ultimately achieve Elias Zerhouni's vision to "Transform Medicine from Curative to Preemptive". After all, it is the analysis and sharing of data that is crucial to understanding patterns within the healthcare population. But we cannot do this unless we can ensure the privacy of the individual patient. In this paper we review some of the common methods used to ensure privacy, including the use of privacy algorithms, and emphasize the use of differential privacy algorithms and the application of differential privacy via Privacy Integrated Queries (PINQ). Differential privacy protects the patient by adding exponentially distributed random noise to the results of a query against a dataset. Exponentially distributed random noise has some interesting properties that provide privacy guarantees. Within the framework of PINQ, one can apply the differential privacy algorithm, specify the amount of accuracy (epsilon) desired, and PINQ translates this to the units of privacy that it can guarantee (i.e., providing risk enforcement). The concern, especially with healthcare datasets, which are typically small in size, is that the addition of random noise, while providing privacy guarantees, will significantly reduce statistical accuracy. By applying differential privacy using PINQ against a healthcare dataset, we were able to successfully alleviate this concern. By analyzing historical data, we narrowed down the range of candidate epsilon values. We then replicated the statistical tests performed prior to perturbation at the bounds of the 95% confidence interval to determine the ideal epsilon value.
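The noise mechanism the abstract describes is commonly realized as the Laplace (double-exponential) mechanism. The sketch below, which is an illustration rather than the thesis's or PINQ's actual implementation, shows a differentially private count query: the true count is perturbed with Laplace noise whose scale is the query's sensitivity divided by epsilon. The function names are our own.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from a Laplace(0, scale) distribution via inverse
    transform sampling on a Uniform(-0.5, 0.5) variate."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(records, predicate, epsilon: float) -> float:
    """Differentially private count: the exact count plus
    Laplace(sensitivity / epsilon) noise. A counting query has
    sensitivity 1, because adding or removing one patient changes
    the answer by at most 1, so a scale of 1/epsilon suffices."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Example: a private count of patients over 65 in a toy dataset.
ages = [34, 71, 68, 45, 80, 29, 66, 52]
private_result = noisy_count(ages, lambda a: a > 65, epsilon=0.1)
```

Note the trade-off the abstract raises: a smaller epsilon means stronger privacy but a larger noise scale, which matters most when, as in many healthcare datasets, the true counts are small.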
It is important that values with larger sample sizes receive a lower epsilon value so that we can, in turn, apply a higher epsilon value to the values with smaller sample sizes. Altogether, this allows us to find and publish the ideal epsilon value that ensures statistical accuracy while providing privacy guarantees.
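The accuracy-versus-privacy trade-off behind this epsilon selection can be made concrete for a single Laplace-noised count. The thesis's actual procedure replicated full statistical tests at the 95% confidence bounds; the sketch below is a simplified analytic stand-in under that same 95% criterion, and the function name is our own. For Laplace noise with scale b, P(|noise| > t) = exp(-t/b), so keeping the noise within a tolerance t with probability c requires b <= t / ln(1/(1-c)).

```python
import math

def min_epsilon_for_tolerance(tolerance: float,
                              sensitivity: float = 1.0,
                              confidence: float = 0.95) -> float:
    """Smallest epsilon (i.e., strongest privacy) such that Laplace
    noise of scale sensitivity/epsilon stays within `tolerance` of
    the true answer with probability `confidence`.

    Derivation: P(|noise| > t) = exp(-t * epsilon / sensitivity),
    so requiring exp(-t * epsilon / sensitivity) <= 1 - confidence
    gives epsilon >= sensitivity * ln(1/(1-confidence)) / t."""
    return sensitivity * math.log(1.0 / (1.0 - confidence)) / tolerance

# Example: to keep a count (sensitivity 1) within +/-3 of the truth
# 95% of the time, epsilon must be at least about 1.0.
eps = min_epsilon_for_tolerance(tolerance=3.0)
```

This mirrors the observation above: a larger sample tolerates a larger absolute error before test conclusions change, so it can be published under a lower (more protective) epsilon, freeing budget for the smaller subgroups.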
School of Medicine
Lee, Denny Guang-Yeu, "Protecting patient data confidentiality using differential privacy" (2008). Scholar Archive. 392.