Date

December 2008

Document Type

Capstone

Degree Name

M.B.I.

Department

Dept. of Medical Informatics and Clinical Epidemiology

Institution

Oregon Health & Science University

Abstract

From the breakthrough research of Latanya Sweeney we have learned that the secondary use of healthcare data may be privacy-revealing. Common techniques for ensuring privacy focus more on security and validation and remain vulnerable to privacy attacks. Yet it is only through the secondary use of healthcare data that we can ultimately achieve Elias Zerhouni’s vision to “Transform Medicine from Curative to Preemptive”. After all, it is the analysis and sharing of data that is crucial to understanding patterns within the healthcare population. But we cannot do this unless we can ensure the privacy of the individual patient. In this paper we review some of the common methods used to ensure privacy, including the use of privacy algorithms, and emphasize differential privacy algorithms and the application of differential privacy via Privacy Integrated Queries (PINQ). Differential privacy protects the patient by adding Laplace (symmetric exponential) distributed random noise to the results of a query against a data set. Noise drawn from this distribution has properties that provide privacy guarantees. Within the framework of PINQ, one can apply the differential privacy algorithm and specify the amount of accuracy (epsilon) desired, and PINQ translates this to the units of privacy that it can guarantee (i.e., it provides risk enforcement). The concern, especially within healthcare datasets, which are typically small in size, is that the addition of random noise, while providing privacy guarantees, will significantly reduce statistical accuracy. By applying differential privacy using PINQ against a healthcare dataset, we were able to successfully alleviate this concern. By analyzing historical data, we narrowed down the range of candidate epsilon values. We then replicated the statistical tests performed prior to perturbation at the bounds of the 95% confidence interval to determine the ideal epsilon value. It is important that values with higher sample sizes have a lower epsilon value so that we can in turn apply a higher epsilon value to the values with smaller sample sizes. Altogether this allows us to find and publish the ideal epsilon value that ensures statistical accuracy while providing privacy guarantees.
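
The noise-addition step described in the abstract can be illustrated with a minimal sketch. This is not PINQ (a C#/LINQ library) and not the procedure used in the capstone; it is a plain Python illustration of the standard Laplace mechanism for a counting query, assuming a query sensitivity of 1. The function names (`laplace_noise`, `noisy_count`, `noise_bound_95`) and the sample ages are hypothetical, chosen only for this example.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential draws with the same
    mean is Laplace distributed, i.e. a symmetric exponential.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng=None):
    """Return a differentially private count of records matching predicate.

    Assumption: adding or removing one patient changes a count by at
    most 1, so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def noise_bound_95(epsilon):
    """Magnitude that the added noise stays within 95% of the time.

    For Laplace noise with scale b, P(|X| > t) = exp(-t / b), so the
    95% bound is b * ln(20) with b = 1 / epsilon.
    """
    return math.log(20) / epsilon


if __name__ == "__main__":
    ages = [34, 71, 65, 50, 80]  # hypothetical patient ages
    for eps in (0.1, 1.0):
        released = noisy_count(ages, lambda age: age >= 65, eps)
        print(eps, round(released, 2), round(noise_bound_95(eps), 2))
```

A smaller epsilon widens the noise, which is why small healthcare datasets raise the accuracy concern noted in the abstract: the same absolute noise is a larger fraction of a small count.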

Identifier

doi:10.6083/M4BG2KZ3

School

School of Medicine
