
Predicting Classifier Performance with Limited Training Data: Applications to Computer-Aided Diagnosis in Breast and Prostate Cancer

Abstract

Clinical trials increasingly employ medical imaging data in conjunction with supervised classifiers, where the latter require large amounts of training data to accurately model the system. Yet, a classifier selected at the start of the trial based on smaller and more accessible datasets may yield inaccurate and unstable classification performance. In this paper, we aim to address two common concerns in classifier selection for clinical trials: (1) predicting expected classifier performance for large datasets based on error rates calculated from smaller datasets and (2) the selection of appropriate classifiers based on expected performance for larger datasets. We present a framework for comparative evaluation of classifiers using only limited amounts of training data by using random repeated sampling (RRS) in conjunction with a cross-validation sampling strategy. Extrapolated error rates are subsequently validated via comparison with leave-one-out cross-validation performed on a larger dataset. The ability to predict error rates as dataset size increases is demonstrated on both synthetic data as well as three different computational imaging tasks: detecting cancerous image regions in prostate histopathology, differentiating high and low grade cancer in breast histopathology, and detecting cancerous metavoxels in prostate magnetic resonance spectroscopy. For each task, the relationships between 3 distinct classifiers (k-nearest neighbor, naive Bayes, Support Vector Machine) are explored. Further quantitative evaluation in terms of interquartile range (IQR) suggests that our approach consistently yields error rates with lower variability (mean IQRs of 0.0070, 0.0127, and 0.0140) than a traditional RRS approach (mean IQRs of 0.0297, 0.0779, and 0.305) that does not employ cross-validation sampling for all three datasets.

Introduction

A growing amount of clinical research employs computerized classification of medical imaging data to develop quantitative and reproducible decision support tools [1–3]. A key issue during the development of image-based classifiers is the accrual of sufficient data to achieve a desired level of statistical power and, hence, confidence in the generalizability of the results. Computerized image analysis systems typically involve a supervised classifier that needs to be trained on a set of annotated examples, which are often provided by a medical expert who manually labels the samples according to their disease class (e.g. high or low grade cancer) [4]. Unfortunately, in many medical imaging applications, accumulating large cohorts is very difficult due to (1) the high cost of expert analysis and annotation and (2) overall data scarcity [3, 5]. Hence, the ability to predict the amount of data required to achieve a desired classification accuracy in large-scale trials, based on experiments performed on smaller pilot studies, is vital to the successful planning of clinical research.

Another issue in utilizing computerized image analysis for clinical research is the need to select the best classifier at the onset of a large-scale clinical trial [6]. The selection of an optimal classifier for a specific dataset usually requires large amounts of annotated training data [7] since the error rate of a supervised classifier tends to decrease as training set size increases [8]. Yet, in clinical trials, this decision is often based on the assumption (which may not necessarily hold true [9]) that the relative performance of classifiers on a smaller dataset will remain the same as more data becomes available.

In this paper, we aim to overcome the major constraints on classifier selection in clinical trials that employ medical imaging data, namely (1) the selection of an optimal classifier using only a small subset of the full cohort and (2) the prediction of long-term performance in a clinical trial as data becomes available sequentially over time.

To this end, we aim to address crucial questions that arise early in the development of a classification system, namely:

  • Given a small pilot dataset, can we predict the error rates associated with a classifier assuming that a larger data cohort will become available in the future?
  • Will the relative performance between multiple classifiers hold true as data cohorts grow larger?

Traditional power calculations aim to determine confidence in an error estimate using repeated classification experiments [10], but do not address the question of how error rate changes as more data becomes available. Also, they may not be ideal for analyzing biomedical data because they assume an underlying Gaussian distribution and independence between variables [11]. Repeated random sampling (RRS) approaches, which characterize trends in classification performance via repeated classification using training sets of varying sizes, have thus become increasingly popular, especially for extrapolating error rates in genomic datasets [6, 11, 12]. Drawbacks of RRS include (1) no guarantee that all samples will be selected at least once for testing and (2) the large number of repetitions required to account for the variability associated with random sampling. In particular, traditional RRS may suffer in the presence of highly heterogeneous datasets (e.g. biomedical imaging data [13]) due to the use of fixed training and testing pools. This is exemplified in Fig 1 by the variability in calculated (black boxes) and extrapolated (blue curves) error rates resulting from the use of different training and testing pools from the same dataset. More recently, methods such as repeated independent design and test (RIDT) [14] have aimed to improve RRS by simultaneously modeling the effects of different testing set sizes in addition to different training set sizes. This approach, however, requires allocation of larger testing sets than RRS, thereby reducing the number of samples available in the training set for extrapolation. It is important to note that the concept of predicting error rates for large datasets should not be confused with semi-supervised learning techniques, e.g. active learning (AL) [15], that aim to maximize classifier performance while mitigating the costs of compiling large annotated training datasets [5]. Since AL methods are designed to optimize classification accuracy during the acquisition of new data, they are not appropriate for a priori prediction of classifier performance using only a small dataset.

Fig 1. Traditional repeated random sampling (RRS) of prostate cancer histopathology leads to unstable estimation of error rates.

Application of RRS to the classification of cancerous and non-cancerous prostate cancer histopathology (dataset 𝓓1) suggests that heterogeneous medical imaging data can produce highly variable calculated (black boxes) and extrapolated (blue curves) mean error rates. Each set of error rates is derived from an independent RRS trial that employs different training and testing pools for classification. The yellow star represents the leave-one-out cross-validation error (i.e. the expected lower bound on error) produced by a larger validation cohort.

https://doi.org/10.1371/journal.pone.0117900.g001

Due to the heterogeneity present in biomedical imaging data, we extend the RRS-based approach originally used to model gene microarray data [11] by incorporating a K-fold cross-validation framework to ensure that all samples are used for both classifier training and testing (Fig 2). First, the dataset is split into K distinct, stratified pools where one pool is used for testing while the remaining K − 1 are used for training. A bootstrap subsampling procedure is used to create multiple subsets of various sizes from the training pool. Each subset is used to train a classifier, which is then evaluated against the testing pool. The pools are rotated K times to ensure that all samples are evaluated once and error rates are averaged for each training set size. The resulting mean error rates are used to determine the three parameters of the power-law model (rate of learning, decay rate, and Bayes error) that characterize the behavior of error rate as a function of training set size.

Fig 2. A flowchart describing the methodology used in this paper.

First, a dataset is partitioned into training and testing pools using a K-fold sampling strategy (red box). Each of the K training pools undergoes traditional repeated random sampling (RRS), in which error rates are calculated at different training set sizes n via a subsampling procedure. A permutation test is used to identify statistically significant error rates, which are then used to extrapolate learning curves and predict error rates for larger datasets. The extension to pixel-level data employs the same sampling and error rate estimation strategies shown in this flowchart; however, the classifiers used for calculating the relevant error rates are trained and evaluated on pixel-level features from the training sets and testing pool, respectively.

https://doi.org/10.1371/journal.pone.0117900.g002

Application of the RRS model to patient-level medical imaging data, where each patient or image is described by a single set of features, is relatively well-understood. Yet disease classification in radiological data (e.g. MRI) occurs at the pixel-level, in which each patient has pixels from both classes (e.g. diseased and non-diseased states) and each pixel is characterized by a set of features [16]. In this work, we present an extension to RRS that employs two-stage sampling in order to mitigate the sampling bias occurring from high intra-patient correlation between pixels. The first stage requires all partitioning of the dataset to be performed at the patient-level, ensuring that pixels from a single patient will not be included in both the training and testing sets. In the second stage, pixel-level classification is performed by training the classifier using pixels from all images (and both classes) in the training set and evaluating against pixels from all images in the testing set. The resulting error rates are used to extrapolate classifier performance as previously described for the traditional patient-level RRS.

This paper focuses on comparing the performance of three exemplar classifiers: (1) the non-parametric k-nearest neighbor (kNN) classifier [8], (2) the probabilistic naive Bayes (NB) classifier that assumes an underlying Gaussian distribution [8], and (3) a non-probabilistic Support Vector Machine (SVM) classifier that aims to maximize class separation using a radial basis function (RBF) kernel. Each of these classifiers has previously been used for a variety of computerized image analysis tasks in the context of medical imaging [17, 18]. All classifiers are evaluated on three distinct classification problems: (1) detection of cancerous image regions in prostate cancer (PCa) histopathology [5], (2) grading of cancerous nuclei in breast cancer (BCa) histopathology [19], and (3) detection of cancerous metavoxels on PCa magnetic resonance spectroscopy (MRS) [16].

The novel contributions of this work include (1) more stable learning curves due to the incorporation of cross-validation into the RRS scheme, (2) a comparison of performance across multiple classifiers as dataset size increases, and (3) enabling a power analysis of classifiers operating on the pixel/voxel level (as opposed to patient/sample level), which cannot be currently done via standard sample power calculators.

The remainder of the paper is organized as follows. First, the Methods section presents a description of the methodology used in this work. Experimental Design includes a description of the datasets and experimental parameters used for evaluation. Results and Discussion are subsequently presented for all experiments, followed by Concluding Remarks.

Methods

Notation

For all experiments, a dataset 𝓓 is divided into independent training (𝓝 ⊂ 𝓓) and testing (𝓣 ⊂ 𝓓) pools, where 𝓝 ∩ 𝓣 = ∅. The class label of a sample x ∈ 𝓓 is denoted by y ∈ {ω1, ω2}. A set of training set sizes N = {n1, n2, …} is defined such that 1 ≤ n ≤ |𝓝| for all n ∈ N, where |·| denotes set cardinality.

Subsampling test to calculate error rates for multiple training set sizes

The estimation of classifier performance first requires the construction of multiple classifiers trained on repeated subsampling of the limited dataset. For each training set size n ∈ N, a total of T1 subsets 𝓢(n) ⊂ 𝓝, each containing n samples, are created by randomly sampling the training pool 𝓝. For each n ∈ N and i ∈ {1, 2, …, T1}, the subset Si(n) ∈ 𝓢(n) is used to train a corresponding classifier Hi(n). Each Hi(n) is evaluated on the entire testing set 𝓣 to produce an error rate ei(n). The mean error rate for each n ∈ N is calculated as

e(n) = (1/T1) ∑_{i=1}^{T1} ei(n).   (1)
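As an illustration, this subsampling step can be sketched in Python as follows; the function and variable names are our own, a 3-nearest-neighbor classifier is used only as a placeholder, and stratification of the subsets is omitted for brevity.

import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier

def subsampling_error_rates(X_pool, y_pool, X_test, y_test,
                            sizes, T1=50, base_clf=None, rng=None):
    """For each training set size n, draw T1 random subsets S_i(n) of the
    training pool, train a classifier H_i(n) on each, and record its error
    rate e_i(n) on the fixed testing pool (Eq 1 averages these over i)."""
    rng = np.random.default_rng(rng)
    base_clf = base_clf if base_clf is not None else KNeighborsClassifier(n_neighbors=3)
    errors = {}  # n -> array of T1 error rates e_i(n)
    for n in sizes:
        e = np.empty(T1)
        for i in range(T1):
            idx = rng.choice(len(y_pool), size=n, replace=False)
            clf = clone(base_clf).fit(X_pool[idx], y_pool[idx])
            e[i] = np.mean(clf.predict(X_test) != y_test)
        errors[n] = e
    return errors  # mean error rate e(n) = errors[n].mean()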

Permutation test to evaluate statistical significance of error rates

Permutation tests are a well-established, non-parametric approach for implicitly determining the null distribution of a test statistic and are primarily employed in situations involving small training sets that contain insufficient data to make assumptions about the underlying data distribution [11, 20]. In this work, the null hypothesis states that the performance of the actual classifier is similar to the "intrinsic noise" of a randomly trained classifier. Here, a randomly trained classifier is modeled through repeated calculation of error rates from classifiers trained on data with randomly selected class labels.

To ensure the statistical significance of the mean error rates e(n) calculated in Eq 1, the performance of each training set Si(n) is compared against the performance of randomly labeled training data. For each Si(n) ∈ 𝓢(n), a total of T2 subsets Ŝi,j(n) ⊂ 𝓝, each containing n samples, are created. For each n ∈ N, i ∈ {1, 2, …, T1}, and j ∈ {1, 2, …, T2}, the subset Ŝi,j(n) is assigned randomized class labels yr ∈ {ω1, ω2} and used to train a corresponding classifier Ĥi,j(n). Each Ĥi,j(n) is evaluated on the entire testing set 𝓣 to produce an error rate êi,j(n). For each n, a p-value is calculated as

Pn = (1/(T1·T2)) ∑_{i=1}^{T1} ∑_{j=1}^{T2} θ(e(n) − êi,j(n)),   (2)

where θ(z) = 1 if z ≥ 0 and 0 otherwise. Pn is thus the fraction of randomly-labeled classifiers Ĥi,j(n) whose error rates êi,j(n) do not exceed the mean error rate e(n), n ∈ N. The mean error rate e(n) is deemed valid for model-fitting only if Pn < 0.05, i.e. fewer than 5% of the randomly trained classifiers perform as well as the actual classifier, indicating a statistically significant difference between e(n) and {êi,j(n) : i ∈ {1, 2, …, T1}, j ∈ {1, 2, …, T2}}. Hence, the set of valid training set sizes M = {n : n ∈ N, Pn < 0.05} includes only those n ∈ N that have passed the significance test.
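A corresponding sketch of the permutation test is shown below; it assumes the same hypothetical data layout as the previous snippet and interprets Pn as the fraction of randomly-labeled classifiers that perform at least as well as the actual classifier.

import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier

def permutation_p_value(X_pool, y_pool, X_test, y_test, n, e_n,
                        T1=50, T2=50, base_clf=None, rng=None):
    """Estimate P_n (Eq 2): the fraction of classifiers trained on randomly
    labeled subsets of size n whose testing-pool error is at or below the
    observed mean error rate e(n). Keep n for model fitting only if P_n < 0.05."""
    rng = np.random.default_rng(rng)
    base_clf = base_clf if base_clf is not None else KNeighborsClassifier(n_neighbors=3)
    classes = np.unique(y_pool)
    hits = 0
    for _ in range(T1 * T2):
        idx = rng.choice(len(y_pool), size=n, replace=False)
        y_rand = rng.choice(classes, size=n)        # randomized class labels y_r
        clf = clone(base_clf).fit(X_pool[idx], y_rand)
        e_hat = np.mean(clf.predict(X_test) != y_test)
        hits += (e_hat <= e_n)                      # theta(e(n) - e_hat) = 1
    return hits / (T1 * T2)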

Cross-validation strategy for selection of training and testing pools

The selection of training 𝓝 and testing 𝓣 pools from the limited dataset 𝓓 is governed by a K-fold cross-validation strategy. In this paper, the dataset 𝓓 is partitioned into K = 4 pools in which one pool is used for evaluation while the remaining K − 1 pools are used for training, producing mean error rates ek(n), where k ∈ {1, 2, …, K}. The pools are then rotated and the subsampling and permutation tests are repeated until all pools have been evaluated exactly once. This process is repeated over R cross-validation trials, yielding mean error rates ek,r(n), where r ∈ {1, 2, …, R}. For all training set sizes that have passed the significance test, i.e. ∀n ∈ M, learning curves are generated from a comprehensive mean error rate

e(n) = (1/(K·R)) ∑_{r=1}^{R} ∑_{k=1}^{K} ek,r(n),   (3)

calculated over all cross-validation folds k ∈ {1, 2, …, K} and iterations r ∈ {1, 2, …, R}.
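The cross-validation rotation could be sketched as follows, reusing the subsampling_error_rates function from the earlier snippet; for brevity, the permutation-based filtering of training set sizes is not repeated here.

import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated_rrs(X, y, sizes, K=4, R=10, T1=50, base_clf=None, seed=0):
    """Rotate K stratified pools over R repetitions. In each fold the held-out
    pool is the testing pool and the remaining K-1 pools form the training
    pool for the subsampling test; fold-level means e_{k,r}(n) are then
    averaged into the comprehensive mean error rate of Eq 3."""
    fold_means = {n: [] for n in sizes}
    for r in range(R):
        skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in skf.split(X, y):
            errs = subsampling_error_rates(X[train_idx], y[train_idx],
                                           X[test_idx], y[test_idx],
                                           sizes, T1=T1, base_clf=base_clf,
                                           rng=seed + r)
            for n in sizes:
                fold_means[n].append(errs[n].mean())   # e_{k,r}(n)
    return {n: float(np.mean(v)) for n, v in fold_means.items()}  # Eq 3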

Estimation of power law model parameters

The power-law model [11] describes the relationship between error rate and training set size as

e(n) = a·n^(−α) + b,   (4)

where e(n) is the comprehensive mean error rate (Eq 3) for training set size n, a is the learning rate, and α is the decay rate. The Bayes error rate b is defined as the lowest possible error given an infinite amount of training data [8]. The model parameters a, α, and b are calculated by solving the constrained non-linear minimization problem

min_{a, α, b} ∑_{n ∈ M} [e(n) − (a·n^(−α) + b)]²,   (5)

subject to a, α, b ≥ 0.
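A minimal sketch of the model fit is given below, using SciPy's bounded least-squares routine as one way to solve the constrained minimization of Eq 5; the initial guess is arbitrary and the commented usage lines are purely illustrative.

import numpy as np
from scipy.optimize import curve_fit

def fit_power_law(sizes, mean_errors):
    """Fit e(n) = a * n**(-alpha) + b (Eq 4) by bounded least squares, which
    corresponds to the constrained minimization of Eq 5 with a, alpha, b >= 0."""
    def model(n, a, alpha, b):
        return a * np.power(n, -alpha) + b
    n = np.asarray(sizes, dtype=float)
    e = np.asarray(mean_errors, dtype=float)
    (a, alpha, b), _ = curve_fit(model, n, e, p0=[1.0, 0.5, 0.05],
                                 bounds=(0.0, np.inf))
    return a, alpha, b

# Example (hypothetical values): extrapolate the expected error to a cohort of 500
# a, alpha, b = fit_power_law([25, 30, 35, 40, 45, 50, 55], mean_error_rates)
# predicted_error = a * 500.0 ** (-alpha) + b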

Extension of error rate prediction to pixel- and voxel-level data

The methodology presented in this work can be extended to pixel- or voxel-level data by first selecting training set sizes N at the patient-level. Definition of the K training and testing pools, as well as creation of each subsampled training set Si(n) ∈ 𝓢(n), is also performed at the patient-level. Training of the corresponding classifier Hi(n), however, is performed at the pixel-level by aggregating pixels from all patients in Si(n). A similar aggregation is done for all patients in the testing pool 𝓣. By ensuring that all pixels from a given patient remain together, we are able to perform pixel-level calculations while avoiding the sampling bias that occurs when pixels from a single patient span both training and testing sets.
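The two-stage sampling can be sketched as below; the array names and the patient-identifier bookkeeping are our own illustrative conventions rather than the study's implementation.

import numpy as np

def pixel_level_subset(pixel_X, pixel_y, patient_ids, train_patients, n, rng=None):
    """Two-stage sampling: draw a patient-level subset S_i(n) of n patients
    from the training pool, then aggregate all pixels (or metavoxels) of the
    chosen patients to train a pixel-level classifier. Because selection is
    done per patient, no patient contributes pixels to both training and
    testing sets."""
    rng = np.random.default_rng(rng)
    chosen = rng.choice(np.asarray(list(train_patients)), size=n, replace=False)
    mask = np.isin(np.asarray(patient_ids), chosen)
    return pixel_X[mask], pixel_y[mask]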

Experimental Design

Our methodology is evaluated on a synthetic dataset and 3 actual classification tasks traditionally affected by limitations in the availability of imaging data (Table 1). All experiments have a number of parameters in common, including T1 = 50 subsampling trials, T2 = 50 permutation trials, and R = 10 independent trials of K = 4 fold cross-validation. In addition, all experiments employ the k-nearest neighbor (kNN), naive Bayes (NB), and Support Vector Machine (SVM) classifiers. A more detailed description of each classifier is presented in S1 Appendix. In each experiment, validation is performed via leave-one-out (LOO) classification on a larger dataset, which allows us to maximize the number of training samples used for classification while yielding the expected lower bound of the error rate.
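For concreteness, the experiment-wide configuration might be expressed as follows; the specific hyperparameter values shown for each classifier (k, C, gamma) are placeholders rather than the settings used in the study.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Experiment-wide parameters reported in the text
T1, T2 = 50, 50          # subsampling and permutation trials
R, K = 10, 4             # cross-validation repetitions and folds

# The three exemplar classifiers; hyperparameters here are illustrative only
classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "NB":  GaussianNB(),
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),
}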

Table 1. List of the breast cancer and prostate cancer datasets used in this study.

https://doi.org/10.1371/journal.pone.0117900.t001

Ethics Statement

The three different datasets used in this study were retrospectively acquired from independent patient cohorts, where the data was initially acquired under written informed consent at each collecting institution. All 3 datasets comprised de-identified medical image data and were provided to the authors under IRB protocol # E09-481, titled "Computer-Aided Diagnosis of Cancer" and approved by the Rutgers University Office of Research and Sponsored Programs. The requirement for further written informed consent was waived by the IRB, as all data was analyzed retrospectively after de-identification.

Experiment 1: Identifying cancerous tissue in prostate cancer histopathology

Automated systems for detecting PCa on biopsy specimens have the potential to act as (1) a triage mechanism to help pathologists spend less time analyzing samples without cancer and (2) an initial step for decision support systems that aim to quantify disease aggressiveness via automated Gleason grading [5]. Dataset 𝓓1 comprises anonymized hematoxylin and eosin (H & E) stained needle-core biopsies of prostate tissue digitized at 20x optical magnification on a whole-slide digital scanner. Regions corresponding to PCa were manually delineated by a pathologist and used as ground truth. Slides were divided into non-overlapping 30 × 30-pixel tissue regions and converted to a grayscale representation. A total of 927 features including first-order statistical, Haralick co-occurrence [21], and steerable Gabor filter features were extracted from each image [22] (Table 2). Due to the small number of training samples used in this study, the feature set was first reduced to two descriptors via the minimum redundancy maximum relevance (mRMR) feature selection scheme [23], primarily to avoid the curse of dimensionality [8]. A relatively small dataset of 100 image regions, with training set sizes N = {25, 30, 35, 40, 45, 50, 55}, was used to extrapolate error rates (Table 1). LOO cross-validation was subsequently performed on a larger dataset comprising 500 image regions.
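As a hedged illustration of the texture-feature computation, the snippet below extracts a handful of co-occurrence descriptors with scikit-image (version 0.19+ naming assumed); it reproduces only a small fraction of the 927-feature set, and the mRMR selection step is not shown.

import numpy as np
from skimage.feature import graycomatrix, graycoprops

def cooccurrence_features(region, distances=(1,), angles=(0, np.pi / 2)):
    """Haralick-style co-occurrence descriptors for one 30 x 30 grayscale
    tissue region (expects a 2-D uint8 array). Only four classical measures
    are shown; the study's full feature set also includes first-order
    statistics and steerable Gabor filter responses."""
    glcm = graycomatrix(region, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "correlation", "energy", "homogeneity")
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])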

Table 2. A summary of all features extracted from prostate cancer histopathology images in dataset 𝓓1.

All textural features were extracted separately for red, green, and blue color channels.

https://doi.org/10.1371/journal.pone.0117900.t002

Experiment 2: Distinguishing high and low tumor grade in breast cancer histopathology

Nottingham, or modified Bloom-Richardson (mBR), grade is routinely used to characterize tumor differentiation in breast cancer (BCa) histopathology [24]; yet, it is known to suffer from high inter- and intra-pathologist variability [25]. Hence, researchers have aimed to develop quantitative and reproducible classification systems for differentiating mBR grade in BCa histopathology [19]. Dataset 𝓓2 comprises 2000 × 2000-pixel image regions taken from anonymized H & E stained histopathology specimens of breast tissue digitized at 20x optical magnification on a whole-slide digital scanner. Ground truth for each image was determined by an expert pathologist to be either low (mBR < 6) or high (mBR > 7) grade. First, boundaries of 30–40 representative epithelial nuclei were manually segmented in each image region (Fig 3). Using the segmented boundaries, a total of 2343 features were extracted from each nucleus to quantify both nuclear morphology and nuclear texture (Table 3). A single feature vector was subsequently defined for each image region by calculating the median feature values of all constituent nuclei. Similar to Experiment 1, mRMR feature selection was used to isolate the two most important descriptors. Error rates were extrapolated from a small dataset comprising 45 images with training set sizes N = {20, 22, 24, 26, 28, 30, 32}, while LOO cross-validation was subsequently performed on a larger dataset comprising 116 image regions (Table 1).
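The per-image aggregation of nuclear descriptors reduces to a simple median across nuclei, as sketched below (array names are illustrative).

import numpy as np

def image_feature_vector(nuclear_features):
    """Collapse a (num_nuclei x num_features) matrix of per-nucleus morphology
    and texture descriptors into a single feature vector for the image region
    by taking the median over the 30-40 segmented nuclei."""
    return np.median(np.asarray(nuclear_features, dtype=float), axis=0)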

Fig 3. Breast cancer (BCa) histopathology images from dataset 𝓓2.

Examples of (a), (b) low modified-Bloom-Richardson (mBR) grade and (c), (d) high mBR grade images are shown with boundary annotations (green outline) for exemplar nuclei. A variety of morphological and textural features are extracted from the nuclear regions, including (e)-(h) the Sum Variance Haralick textural response.

https://doi.org/10.1371/journal.pone.0117900.g003

Table 3. A summary of all features extracted from breast cancer histopathology images in dataset 𝓓2.

All textural features were extracted separately for red, green, and blue color channels from the RGB color space and the hue, saturation, and intensity color channels from the HSV color space.

https://doi.org/10.1371/journal.pone.0117900.t003

Experiment 3: Identifying cancerous metavoxels in prostate cancer magnetic resonance spectroscopy

Magnetic resonance spectroscopy (MRS), a non-imaging modality that measures the concentrations of specific metabolic markers and biochemicals in the prostate, has previously been shown to supplement magnetic resonance imaging (MRI) in the detection of PCa [16, 26]. These metabolites include choline, creatine, and citrate, and changes in their relative concentrations (choline/citrate or [choline+creatine]/citrate) have been shown to be linked to the presence of PCa [27]. Radiologists typically assess presence of PCa on MRS by comparing ratios between choline, creatine, and citrate peaks to predefined normal ranges. Dataset 𝓓3 comprises 34 anonymized 1.5 Tesla T2-weighted MRI and MRS studies obtained prior to radical prostatectomy, where ground truth (cancerous vs. benign metavoxels) was defined via visual inspection of MRI and MRS by an expert radiologist [16] (Fig 4). Six MRS features were defined for each metavoxel by calculating expression levels for each metabolite as well as ratios between each pair of metabolites. Similar to Experiment 1, mRMR feature selection was used to identify the two most important features in the dataset. Error rates were extrapolated from a dataset of 16 patients using training set sizes N = {2, 4, 6, 8, 10, 12}, followed by LOO cross-validation on a larger dataset of 34 patients (Table 1).
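A sketch of the six-dimensional metavoxel descriptor is shown below; the orientation of the pairwise ratios is an assumption, as the text does not specify it.

import numpy as np

def mrs_metavoxel_features(choline, creatine, citrate, eps=1e-12):
    """Six per-metavoxel descriptors: the three metabolite expression levels
    plus their three pairwise ratios (ratio direction assumed). eps guards
    against division by zero in suppressed or noisy spectra."""
    return np.array([choline, creatine, citrate,
                     choline / (creatine + eps),
                     choline / (citrate + eps),
                     creatine / (citrate + eps)])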

Fig 4. Magnetic resonance spectroscopy (MRS) data from dataset 𝓓3.

(a) A study from dataset 𝓓3 showing an MR image of the prostate with MRS metavoxel locations overlaid. (b) For ground truth, each MRS spectrum is labeled as either cancerous (red and orange boxes) or benign (blue boxes). Green boxes correspond to metavoxels outside the prostate for which MRS spectra were suppressed during acquisition.

https://doi.org/10.1371/journal.pone.0117900.g004

Comparison with traditional RRS via interquartile range (IQR)

This experiment compares the results of Experiment 1 with the traditional RRS approach, using both dataset 𝓓1 and the corresponding experimental parameters from Experiment 1. However, since traditional RRS does not use cross-validation, a total of T̂1 = T1·K·R subsampling procedures are used to ensure that the same number of classification tasks is performed for both approaches. Evaluation is performed via (1) comparison of the learning curves between the two methods and (2) the interquartile range (IQR), a measure of statistical variability defined as the difference between the 25th and 75th percentile error rates from the subsampling procedure.
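The IQR-based comparison reduces to the following computation over the per-size error-rate samples (the dictionary layout matches the earlier subsampling sketch).

import numpy as np

def mean_iqr(errors_by_n):
    """Mean interquartile range over training set sizes: for each n, take the
    spread between the 25th and 75th percentiles of the subsampled error
    rates, then average across all n."""
    iqrs = [np.percentile(e, 75) - np.percentile(e, 25)
            for e in errors_by_n.values()]
    return float(np.mean(iqrs))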

Synthetic experiment using pre-defined data distributions

The ability of our approach to produce accurate learning curves with low variance was evaluated using a 2-class synthetic dataset, in which each class is defined by randomly selected samples from a two-dimensional Gaussian distribution (Fig 5). Learning curves are created from a small dataset comprising 100 samples (Fig 5(a)) using training set sizes N = {25, 30, 35, 40, 45, 50, 55} in conjunction with a kNN classifier. Validation is subsequently performed on a larger dataset containing 500 samples (Fig 5(b)).
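A sketch of the synthetic data generation is given below; the Gaussian means, unit covariances, and class separation are illustrative choices, since the exact distribution parameters are not reported in the text.

import numpy as np

def make_synthetic(n_per_class, separation=2.0, seed=0):
    """Two-class, two-dimensional Gaussian data with unit variance per class."""
    rng = np.random.default_rng(seed)
    X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n_per_class, 2))
    X1 = rng.normal(loc=[separation, separation], scale=1.0, size=(n_per_class, 2))
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

# Small 100-sample set for learning-curve estimation, 500-sample set for validation
X_small, y_small = make_synthetic(50, seed=1)
X_large, y_large = make_synthetic(250, seed=2)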

Fig 5. A synthetic dataset is used to validate our cross-validated repeated random sampling (RRS) method.

In this dataset, each class is defined by samples drawn randomly from an independent two-dimensional Gaussian distribution. (a) A small set comprising 100 samples is used for creation of the learning curves and (b) a larger set comprising 500 samples is used for validation.

https://doi.org/10.1371/journal.pone.0117900.g005

Results and Discussion

Experiment 1: Distinguishing cancerous and non-cancerous regions in prostate histopathology

Error rates predicted by the NB and SVM classifiers are similar to their respective LOO error rates of 0.1312 and 0.1333 (Fig 6(b) and 6(c)). In comparison to the learning curves, the slightly lower error rate produced by the validation set is to be expected since LOO classification is known to produce an overly optimistic estimate of the true error rate [28]. The kNN classifier appears to overestimate error considerably compared to the LOO error of 0.1560, which is not surprising because kNN is a non-parametric classifier that is expected to be more unstable for heterogeneous datasets (Fig 6(a)). Comparison across classifiers suggests that both NB and SVM will outperform kNN as dataset size increases (Fig 6(d)). Although the differences between the mean NB and SVM learning curves are minimal, the 25th and 75th percentile curves suggest that the prediction made by NB is more stable and has lower variance than the SVM prediction.

Fig 6. Experimental results for dataset 𝓓1.

Learning curves (blue line) generated for dataset 𝓓1 using mean error rates (black squares) calculated from (a) kNN, (b) NB, and (c) SVM classifiers. Each classifier is accompanied by curves for the 25th (green dashed line) and 75th (red dashed line) percentile of the error as well as LOO error on the validation cohort (yellow star). (d) A direct classifier comparison is made in terms of the mean error rate predicted by each learning curve in (a)-(c).

https://doi.org/10.1371/journal.pone.0117900.g006

Experiment 2: Distinguishing low and high grade cancer in breast histopathology

Learning curves from kNN and NB classifiers yield predicted error rates similar to their LOO cross-validation errors (0.1552 for both classifiers) as shown in Fig 7(a) and 7(b). By contrast, while error rates predicted by the SVM classifier are reasonable (Fig 7(c)), they appear to underestimate the LOO error of 0.1724. One reason for this discrepancy may be the class imbalance present in the validation dataset (79 low grade and 37 high grade), since SVM classifiers have been demonstrated to perform poorly on datasets where the positive class (i.e. high grade) is underrepresented [29]. Similar to 𝓓1, a comparison between the learning curves reflects the superiority of both NB and SVM classifiers over the kNN classifier as dataset size increases (Fig 7(d)). However, the relationship between the NB and SVM classifiers is more complex. For small training sets, the NB classifier appears to outperform the SVM classifier; yet, the SVM classifier is predicted to yield lower error rates for larger datasets (n > 60). This suggests that the classifier yielding the best results for the smaller dataset may not necessarily be the optimal classifier as the dataset increases in size.

Fig 7. Experimental results for dataset 𝓓2.

Learning curves (blue line) generated for dataset 𝓓2 using mean error rates (black squares) calculated from (a) kNN, (b) NB, and (c) SVM classifiers. Each classifier is accompanied by curves for the 25th (green dashed line) and 75th (red dashed line) percentile of the error as well as LOO error on the validation cohort (yellow star). (d) A direct classifier comparison is made in terms of the mean error rate predicted by each learning curve in (a)-(c).

https://doi.org/10.1371/journal.pone.0117900.g007

Experiment 3: Distinguishing cancerous and non-cancerous metavoxels in prostate MRS

Similar to dataset 𝓓1, the LOO errors for the NB and SVM classifiers (0.2248 and 0.2468, respectively) fall within the range of the predicted error rates (Fig 8(b) and 8(c)). Once again, the kNN classifier overestimates the LOO error (0.2628), which is most likely due to the high level of variability in the mean error rates used for extrapolation (Fig 8(a)). While both NB and SVM classifiers outperform the kNN classifier, their learning curves show a clearer separation between the extrapolated error rates for all dataset sizes, suggesting that the optimal classifier selected from the smaller dataset will hold true even as dataset size increases (Fig 8(d)).

Fig 8. Experimental results for dataset 𝓓3.

Learning curves (blue line) generated for dataset 𝓓3 using mean error rates (black squares) calculated from (a) kNN, (b) NB, and (c) SVM classifiers. Each classifier is accompanied by curves for the 25th (green dashed line) and 75th (red dashed line) percentile of the error as well as LOO error on the validation cohort (yellow star). (d) A direct classifier comparison is made in terms of the mean error rate predicted by each learning curve in (a)-(c).

https://doi.org/10.1371/journal.pone.0117900.g008

Comparison with traditional RRS

The quantitative results in Tables 4–6 suggest that employing a cross-validation sampling strategy yields more consistent error rates. In Table 4, traditional RRS yielded a mean IQR of 0.0297 across all n ∈ N, whereas our approach demonstrated a lower mean IQR of 0.0070. Furthermore, a closer look at the learning curves for these error rates (Fig 9) suggests that traditional RRS is sometimes unable to accurately extrapolate learning curves. Similarly, Tables 5 and 6 show lower mean IQR values for our approach (0.0127 and 0.0140, respectively) than for traditional RRS (0.0779 and 0.305, respectively) for datasets 𝓓2 and 𝓓3. This phenomenon is most likely due to the high level of heterogeneity in medical imaging data and demonstrates the importance of rotating the training and testing pools to avoid biased error rates that do not generalize to larger datasets.

Table 4. Mean interquartile range (IQR) demonstrates decreased variability of cross-validated random repeated sampling (RRS) over traditional RRS in dataset 𝓓1.

https://doi.org/10.1371/journal.pone.0117900.t004

Table 5. Mean interquartile range (IQR) demonstrates decreased variability of cross-validated random repeated sampling (RRS) over traditional RRS in dataset 𝓓2.

https://doi.org/10.1371/journal.pone.0117900.t005

Table 6. Mean interquartile range (IQR) demonstrates decreased variability of cross-validated random repeated sampling (RRS) over traditional RRS in dataset 𝓓3.

https://doi.org/10.1371/journal.pone.0117900.t006

Fig 9. Comparison between traditional random repeated sampling (RRS) and our cross-validated approach.

Learning curves generated for dataset 𝓓1 using (a) traditional RRS and (b) cross-validated RRS in conjunction with a Naive Bayes classifier. For both figures, mean error rates from the subsampling procedure (black squares) are used to extrapolate learning curves (solid blue line). Corresponding learning curves for 25th (green dashed line) and 75th (red dashed line) percentile of the error are also shown. The error rate from leave-one-out cross-validation is illustrated by a yellow star.

https://doi.org/10.1371/journal.pone.0117900.g009

Evaluation of synthetic dataset

The reduced variability of cross-validated RRS over traditional RRS is further validated by learning curves generated from the synthetic dataset (Fig 10). Error rates from our approach demonstrate low variability and yield learning curves that accurately predict the error rate of the validation set (Fig 10(b)). The cross-validated RRS approach yields a mean IQR of 0.0066, an order of magnitude lower than that of traditional RRS (0.074).

Fig 10. Evaluation of our cross-validated repeated random sampling (RRS) on the synthetic dataset.

Learning curves generated for the synthetic dataset using (a) traditional RRS and (b) cross-validated RRS in conjunction with a kNN classifier (k = 3). For both figures, mean error rates from the subsampling procedure (black squares) are used to extrapolate learning curves (solid blue line). Corresponding learning curves for 25th (green dashed line) and 75th (red dashed line) percentile of the error are also shown. The error rate from leave-one-out cross-validation is illustrated by a yellow star.

https://doi.org/10.1371/journal.pone.0117900.g010

Concluding Remarks

The rapid development of biomedical imaging-based classification tasks has resulted in the need for predicting classifier performance for large data cohorts given only smaller pilot studies with limited cohort sizes. This is important because, early in the development of a clinical trial, researchers need to: (1) predict long-term error rates when only small pilot studies may be available and (2) select the classifier that will yield the lowest error rates when large datasets are available in the future. Predicting classifier performance from small datasets is difficult because the resulting classifiers often produce unstable decisions and yield high error rates. In these scenarios, traditional RRS approaches have previously been used to extrapolate classifier performance (e.g. for gene expression data). Due to the heterogeneity present in biomedical imaging data, we employ an extension of RRS in this work that uses cross-validation sampling to ensure that all samples are used for both training and testing the classifiers. In addition, we apply RRS to voxel-level studies where data from both classes is found within each patient study, a concept that has previously been unexplored in this regard. Evaluation was performed on three classification tasks, including cancer detection in prostate histopathology, cancer grading in breast histopathology, and cancer detection in prostate MRS.

We demonstrated the ability to calculate error rates with relatively low variance from three distinct classifiers (kNN, NB, and SVM). A direct comparison of the learning curves showed that the more robust NB and SVM classifiers yielded lower error rates than the kNN classifier for both small and large datasets. A limitation of this work is that all datasets comprise an equal number of samples from each class in order to reduce classifier bias from a machine learning standpoint. However, future work will focus on application to imbalanced datasets where class distribution is based on the overall population (e.g. clinical trials). In addition, we will incorporate additional improvements to the RRS method (e.g. subsampling of testing set as in RIDT) while maintaining a robust cross-validation sampling strategy. Additional directions for future research include analyzing the effect of (a) noisy data on different classifiers [30] and (b) ensemble classification methods (e.g. Bagging) on classifier variability in small training sets.

Supporting Information

S1 Appendix. Description of classifiers.

Each experiment in this paper employs three classifiers: k-nearest neighbor (kNN), Naive Bayes (NB), and Support Vector Machine (SVM). For the reader's convenience, we provide a methodological summary of each of these classifiers with appropriate descriptions and equations.

https://doi.org/10.1371/journal.pone.0117900.s001

(PDF)

Acknowledgments

The authors would like to thank Dr. Scott Doyle for providing datasets used in this work.

Author Contributions

Conceived and designed the experiments: AB SV AM. Performed the experiments: AB. Analyzed the data: AB SV AM. Contributed reagents/materials/analysis tools: AB SV AM. Wrote the paper: AB SV AM. Designed the software used in analysis: AB SV.

References

  1. Evans AC, Frank JA, Antel J, Miller DH. The role of MRI in clinical trials of multiple sclerosis: comparison of image processing techniques. Ann Neurol. 1997 Jan;41(1):125–132. pmid:9005878
  2. Shin DS, Javornik NB, Berger JW. Computer-assisted, interactive fundus image processing for macular drusen quantitation. Ophthalmology. 1999 Jun;106(6):1119–1125. pmid:10366080
  3. Vasanji A, Hoover BA. Art & Science of Imaging Analytics. Applied Clinical Trials. 2013 March;22(3):38.
  4. Madabhushi A, Agner S, Basavanhally A, Doyle S, Lee G. Computer-aided prognosis: Predicting patient and disease outcome via quantitative fusion of multi-scale, multi-modal data. Computerized medical imaging and graphics. 2011;35(7):506–514. Available from: http://www.sciencedirect.com/science/article/pii/S089561111100019X. pmid:21333490
  5. Doyle S, Monaco J, Feldman M, Tomaszewski J, Madabhushi A. An active learning based classification strategy for the minority class problem: application to histopathology annotation. BMC bioinformatics. 2011;12(1):424. pmid:22034914
  6. Berrar D, Bradbury I, Dubitzky W. Avoiding model selection bias in small-sample genomic datasets. Bioinformatics. 2006;22(10):1245–1250. Available from: http://bioinformatics.oxfordjournals.org/content/22/10/1245.short. pmid:16500931
  7. Didaci L, Giacinto G, Roli F, Marcialis GL. A study on the performances of dynamic classifier selection based on local accuracy estimation. Pattern Recognition. 2005;38.
  8. Duda RO, Hart PE, Stork DG. Pattern Classification. Wiley; 2001.
  9. Basavanhally A, Doyle S, Madabhushi A. Predicting classifier performance with a small training set: Applications to computer-aided diagnosis and prognosis. In: Biomedical Imaging: From Nano to Macro, 2010 IEEE International Symposium on. IEEE; 2010. p. 229–232.
  10. Adcock C. Sample size determination: a review. Journal of the Royal Statistical Society: Series D (The Statistician). 1997;46(2):261–283. Available from: http://onlinelibrary.wiley.com/doi/10.1111/1467-9884.00082/abstract.
  11. Mukherjee S, Tamayo P, Rogers S, Rifkin R, Engle A, Campbell C, et al. Estimating dataset size requirements for classifying DNA microarray data. J Comput Biol. 2003;10(2):119–142. pmid:12804087
  12. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American statistical association. 2002;97(457):77–87.
  13. Brooks FJ, Grigsby PW. Quantification of heterogeneity observed in medical images. BMC Med Imaging. 2013;13:7. pmid:23453000
  14. Wickenberg-Bolin U, Göransson H, Fryknäs M, Gustafsson MG, Isaksson A. Improved variance estimation of classification performance via reduction of bias caused by small sample size. BMC bioinformatics. 2006;7(1):127. pmid:16533392
  15. Freund Y, Seung HS, Shamir E, Tishby N. Selective sampling using the query by committee algorithm. Machine learning. 1997;28(2–3):133–168.
  16. Tiwari P, Viswanath S, Kurhanewicz J, Sridhar A, Madabhushi A. Multimodal wavelet embedding representation for data combination (MaWERiC): integrating magnetic resonance imaging and spectroscopy for prostate cancer detection. NMR Biomed. 2012 Apr;25(4):607–619.
  17. Xu Y, van Beek EJ, Hwanjo Y, Guo J, McLennan G, Hoffman EA. Computer-aided classification of interstitial lung diseases via MDCT: 3D adaptive multiple feature method (3D AMFM). Academic radiology. 2006;13(8):969–978. Available from: http://www.sciencedirect.com/science/article/pii/S1076633206002716. pmid:16843849
  18. Sertel O, Kong J, Shimada H, Catalyurek U, Saltz JH, Gurcan MN. Computer-aided prognosis of neuroblastoma on whole-slide images: Classification of stromal development. Pattern Recognition. 2009;42(6):1093–1103. Available from: http://www.sciencedirect.com/science/article/pii/S0031320308003439. pmid:20161324
  19. Basavanhally A, Ganesan S, Feldman M, Shih N, Mies C, Tomaszewski J, et al. Multi-Field-of-View Framework for Distinguishing Tumor Grade in ER+ Breast Cancer from Entire Histopathology Slides. IEEE Transactions on Biomedical Engineering. 2013; Available from: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6450064. pmid:23392336
  20. Good P. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. New York: Springer-Verlag; 1994.
  21. Haralick RM, Shanmugam K, Dinstein IH. Textural features for image classification. Systems, Man and Cybernetics, IEEE Transactions on. 1973;SMC-3(6):610–621. Available from: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4309314.
  22. Doyle S, Feldman M, Tomaszewski J, Madabhushi A. A boosted bayesian multiresolution classifier for prostate cancer detection from digitized needle biopsies. Biomedical Engineering, IEEE Transactions on. 2012;59(5):1205–1218. Available from: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5491097.
  23. Peng H, Long F, Ding C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2005;27(8):1226–1238. Available from: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1453511.
  24. Elston C, Ellis I. Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology. 1991;19(5):403–410. pmid:1757079
  25. Meyer JS, Alvarez C, Milikowski C, Olson N, Russo I, Russo J, et al. Breast carcinoma malignancy grading by Bloom-Richardson system vs proliferation index: reproducibility of grade and advantages of proliferation index. Modern pathology. 2005;18(8):1067–1078. Available from: http://www.nature.com/modpathol/journal/vaop/ncurrent/full/3800388a.html. pmid:15920556
  26. Scheidler J, Hricak H, Vigneron DB, Yu KK, Sokolov DL, Huang LR, et al. Prostate cancer: localization with three-dimensional proton MR spectroscopic imaging-clinicopathologic study. Radiology. 1999 Nov;213(2):473–480. pmid:10551229
  27. Kurhanewicz J, Swanson MG, Nelson SJ, Vigneron DB. Combined magnetic resonance imaging and spectroscopic imaging approach to molecular imaging of prostate cancer. J Magn Reson Imaging. 2002 Oct;16(4):451–463. pmid:12353259
  28. Breiman L. Heuristics of instability and stabilization in model selection. The annals of statistics. 1996;24(6):2350–2383. Available from: http://projecteuclid.org/euclid.aos/1032181158.
  29. Wu G, Chang EY. Class-boundary alignment for imbalanced dataset learning. In: ICML 2003 workshop on learning from imbalanced data sets II, Washington, DC; 2003. p. 49–56.
  30. Al-Kadi OS. Texture measures combination for improved meningioma classification of histopathological images. Pattern recognition. 2010;43(6):2043–2053. Available from: http://www.sciencedirect.com/science/article/pii/S0031320310000373.