March 2013

Document Type


Degree Name



Dept. of Medical Informatics and Clinical Epidemiology


Oregon Health & Science University


Today’s genomics data requires substantial preprocessing before it can be used to analyze biological questions. This work details the steps and requirements for processing genome-wide association study (GWAS) data in preparation for analysis. The scripting language Python is used to open and read genomic dataset files, including phenotypic, genotypic, and demographic data, from GWAS performed by the Harvard Brain Tissue Resource Center (HBTRC) as well as the Alzheimer’s Disease Neuroimaging Initiative (ADNI). The subjects in the data files remain deidentified. The raw files are then processed in Python to create hypothesis-dependent edited versions suitable for use in a bioinformatics investigative genomics study. Exploratory data analysis (EDA), including simple graphs, is performed in R to describe the datasets and explore their suitability for the investigative study. Reasons for dataset rejection as well as accept




School of Medicine


