Document Type


Degree Name



Department of Medical Informatics & Clinical Epidemiology


Shotgun proteomics is the most common mass spectrometry-based proteomics method for identifying and quantifying the proteins present in a sample. Despite improvements in mass spectrometry tools, the problem of inferring proteins from peptides, and of quantifying them, persists. The choice of protein sequence database and the underlying genomic complexity of the organism are often considered sources of peptide degeneracy that lead to varying protein identifications and quantifications. In this thesis, four protein sequence database sources (UniProt, NCBI, Ensembl, and IPI) were compared using redundant and non-redundant protein counts and shared and unique peptide counts for two higher eukaryotic organisms (human and mouse) and a representative lower eukaryotic organism (yeast). It was also demonstrated that, for real biological samples from higher eukaryotic organisms, basic parsimony logic in the protein inference process yields protein and peptide identifications that depend on the protein sequence database chosen. To address the shortcomings of parsimony logic, two extended parsimony clustering algorithms (Proteomic Analysis Workflow (PAW) clustering and Scaffold-like clustering), which group proteins with highly significant shared peptide evidence but low unique peptide evidence, were implemented and tested. For human samples, these extended parsimony clustering algorithms significantly reduced mean shared peptide proportions across databases relative to basic parsimony logic, and produced protein identification numbers that are largely independent of protein sequence database choice. Few differences in protein and peptide characteristics were observed for yeast samples before or after applying the PAW or Scaffold-like clustering algorithms.
Silhouette scores and gene enrichment analysis of the clusters produced by the extended parsimony clustering algorithms demonstrated that these clusters are biologically and functionally coherent. From a quantitative perspective, mean quantitative information content (QIC) in human samples increased significantly after PAW or Scaffold-like clustering compared to QIC computed after basic parsimony logic. The variation in QIC values across databases also decreased significantly for human samples after applying the PAW or Scaffold-like algorithms.
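
As an illustration of the terms used above, the following is a minimal sketch of shared versus unique peptide counting and of a simplified form of basic parsimony logic (merging proteins with identical peptide sets and dropping proteins whose peptides are a proper subset of another protein's). It is not the thesis implementation, and it does not reproduce PAW or Scaffold-like clustering; the function names and the protein-to-peptide mapping are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def peptide_counts(protein_to_peptides):
    """Split peptides into unique (one parent protein) and shared (several)."""
    owners = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            owners[peptide].add(protein)
    unique = {pep for pep, prots in owners.items() if len(prots) == 1}
    shared = {pep for pep, prots in owners.items() if len(prots) > 1}
    return unique, shared


def basic_parsimony(protein_to_peptides):
    """Simplified parsimony: group indistinguishable proteins, drop subsets.

    Assumption: this covers only the grouping and subset-removal steps, not
    any further minimal-set reduction a full pipeline might apply.
    """
    # Step 1: proteins with identical peptide sets cannot be told apart.
    groups = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        groups[frozenset(peptides)].append(protein)

    # Step 2: discard groups whose peptides are a proper subset of another's.
    kept = {}
    for pepset, proteins in groups.items():
        if not any(pepset < other for other in groups):
            kept[tuple(sorted(proteins))] = set(pepset)
    return kept


if __name__ == "__main__":
    # Hypothetical identifications: protein accession -> observed peptides.
    example = {
        "P1": {"AAK", "BBR", "CCK"},
        "P2": {"AAK", "BBR"},
        "P3": {"AAK", "BBR", "CCK"},
        "P4": {"AAK", "DDR"},
    }
    print(basic_parsimony(example))
    print(peptide_counts(example))
```

In this toy input, P1 and P3 are indistinguishable and would be reported as one group, P2 is subsumed by them, and P4 is retained because it has a peptide no other protein explains. Proteins like P4, which share most of their evidence with another group, are the kind of case that the extended clustering approaches described above are meant to handle differently.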




School of Medicine


