Dept. of Computer Science and Engineering
Oregon Health & Science University
Genome-wide studies are sensitive to the quality of the annotation data included in their analyses, and they often involve overlaying both computationally derived and experimentally generated data onto a genomic scaffold. A framework for successful integration of data from diverse sources needs to address, at a minimum, the conceptualization of biological identity in the data sources, the relationship between the sources in terms of the data present, the independence of the sources, and any discrepancies in the data. The outcome of the process should either resolve these discrepancies or incorporate them into downstream analyses. In this thesis we identify factors that are important in detecting errors within and between sources and present a generalized framework to detect discrepancies. An implementation of our workflow is used to demonstrate the utility of the approach in the construction of a genome-wide mouse transcription factor binding map and in the classification of single nucleotide polymorphisms. We also present the impact of these discrepancies on downstream analyses. The framework is extensible, and we discuss future directions, including summarization of the discrepancies in a biologically relevant manner.
OGI School of Science and Engineering
Ramakrishnan, Ranjani, "A data cleaning and annotation framework for genome-wide studies" (2007). Scholar Archive. 157.