Oregon Health & Science University
Biomarker discovery is a challenging process. It is rare that a single marker can accurately classify an outcome. Often, classification is improved by finding a combination of markers that can better distinguish between patients with and without a condition of interest. The present study describes a novel method that can potentially function as a diagnostic algorithm to isolate a parsimonious combination of markers with good classification properties. The method takes advantage of the well-studied properties of receiver operating characteristic (ROC) curves and logistic regression models to select a combination of variables that maximizes the partial area under the ROC curve (pAUROC), a clinically relevant metric.
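For reference, the pAUROC can be written in its standard form as below. The symbols t_1 and t_2 are notation assumed here for the limits of the selected false positive fraction range (with t_1 = 0 a common choice); they do not come from the thesis itself.

```latex
\mathrm{pAUROC}(t_1, t_2) = \int_{t_1}^{t_2} \mathrm{ROC}(t)\, dt ,
\qquad 0 \le t_1 < t_2 \le 1 ,
```

where ROC(t) is the true positive fraction attained at false positive fraction t. Setting t_1 = 0 and t_2 = 1 recovers the full area under the ROC curve.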
The procedure is as follows. The partial area under the ROC curve is determined for each potential marker. The marker with the maximum pAUROC over the selected false positive fraction (FPF) range is the first variable to step into the model. Next, an adaptation of the jagged-ordered nonparametric algorithm is used to select from the remaining markers based on improvement in the pAUROC. A potential marker is retained in the classification model only if the resultant integrated discrimination index (IDI) is above a preset threshold. Thus, by using a combination of different classification metrics, the new method homes in on a combination model with good sensitivity and specificity from a moderate-size pool of potential markers.
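The first step and the retention rule described above can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes that higher marker values indicate the condition, that the FPF range starts at 0, and that the IDI takes its usual form: the gain in mean predicted probability among cases minus the gain among controls. The function names (`roc_points`, `partial_auc`, `best_single_marker`, `idi`) are hypothetical, and the logistic regression fits and the nonparametric selection step are left out.

```python
def roc_points(scores, labels):
    """Empirical ROC points (FPF, TPF) from (0, 0) to (1, 1); labels are 1 (case) or 0 (control)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case and one control")
    # Sort by descending score: lowering the threshold adds subjects one tie group at a time.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Tied scores are processed together, which yields a single diagonal segment.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def partial_auc(scores, labels, max_fpf):
    """Trapezoidal area under the empirical ROC curve for FPF in [0, max_fpf] (assumed range)."""
    pts = roc_points(scores, labels)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpf:
            break
        if x1 > max_fpf:
            # Interpolate the TPF linearly where the segment crosses max_fpf.
            y1 = y0 + (y1 - y0) * (max_fpf - x0) / (x1 - x0)
            x1 = max_fpf
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def best_single_marker(markers, labels, max_fpf):
    """Name of the marker with the largest pAUROC; `markers` maps name -> list of values."""
    return max(markers, key=lambda name: partial_auc(markers[name], labels, max_fpf))


def idi(p_old, p_new, labels):
    """Gain in mean predicted probability among cases minus the gain among controls."""
    def mean(values):
        values = list(values)
        return sum(values) / len(values)

    gain_cases = mean(n - o for o, n, y in zip(p_old, p_new, labels) if y == 1)
    gain_controls = mean(n - o for o, n, y in zip(p_old, p_new, labels) if y == 0)
    return gain_cases - gain_controls
```

In a full pipeline, `p_old` and `p_new` would be the fitted probabilities from logistic models without and with the candidate marker, and the candidate would be kept only when `idi(p_old, p_new, labels)` exceeds the preset threshold.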
In contrast to traditional variable selection methods (for example, stepwise selection), which are often based on measures of association, the current method is specifically focused on classification metrics. Hence, it eliminates the need to fit models for every possible combination of candidate markers and vastly improves the speed of the variable selection process. This is demonstrated by the performance of the method in isolating a combination model for classifying intra-amniotic inflammation in women with preterm labor with intact membranes. The method selected a combination of cervicovaginal proteins that had optimum classification performance yet contained relatively few proteins, which is desirable from a clinical perspective. The results were comparable to those of similar parsimonious models built using traditional, protracted methods of data mining followed by regression, thus supporting the efficiency of the current method.
Division of Biostatistics
School of Medicine
Kaimal, Rajani, "A Receiver operating characteristic, curve-logistic regression based variable selection method" (2015). Scholar Archive. 3668.