Research Article

Identification of a 5-Protein Biomarker Molecular Signature for Predicting Alzheimer's Disease

  • Published: September 03, 2008
  • DOI: 10.1371/journal.pone.0003111

Abstract

Background

Alzheimer's disease (AD) is a progressive brain disease with a huge cost to human lives. The impact of the disease is also a growing concern for the governments of developing countries, in particular due to the increasingly high number of elderly citizens at risk. Alzheimer's is the most common form of dementia, a general term for memory loss and other cognitive impairments. There is currently no cure for AD, but there are drug and non-drug based approaches for its treatment. In general, the drug treatments are directed at slowing the progression of symptoms. They have proved to be effective in a large group of patients, but success is directly correlated with identifying patients at the early stages of the disease. This justifies the need for timely and accurate forms of diagnosis via molecular means. We report here a 5-protein biomarker molecular signature that achieves, on average, a 96% total accuracy in predicting clinical AD. The signature is composed of the abundances of IL-1α, IL-3, EGF, TNF-α and G-CSF.

Methodology/Principal Findings

Our results are based on a recent molecular dataset that has attracted worldwide attention. Our paper illustrates that improved results can be obtained with the abundance of only five proteins. Our methodology consisted of the application of an integrative data analysis method. This four step process included: a) abundance quantization, b) feature selection, c) literature analysis, d) selection of a classifier algorithm which is independent of the feature selection process. These steps were performed without using any sample of the test datasets. For the first two steps, we used the application of Fayyad and Irani's discretization algorithm for selection and quantization, which in turn creates an instance of the (alpha-beta)-k-Feature Set problem; a numerical solution of this problem led to the selection of only 10 proteins.

Conclusions/Significance

The previous study by Ray et al. has provided an extremely useful dataset for the identification of AD biomarkers. However, our subsequent analysis also revealed several important facts worth reporting:

1. A 5-protein signature (which is a subset of the 18-protein signature of Ray et al.) has the same overall performance (when using the same classifier).

2. Using more than 20 different classifiers available in the widely-used Weka software package, our 5-protein signature has, on average, a smaller prediction error indicating the independence of the classifier and the robustness of this set of biomarkers (i.e. 96% accuracy when predicting AD against non-demented control).

3. Using very simple classifiers, such as Simple Logistic or Logistic Model Trees, we achieved the following results on the 92 samples of the AD dataset: 100 percent success in predicting Alzheimer's Disease and 92 percent in predicting Non Demented Control.

Introduction

Recently, Ray et al. [1] made a significant contribution to the quest of finding a superior molecular test for an earlier diagnosis of Alzheimer's disease (AD). The method appears to have significantly improved on the state-of-the-art and, as a consequence, their results attracted immediate worldwide attention. Using the abundance of 120 signalling proteins on a training set of 83 archived plasma samples, they produced an 18-protein signature. On two separate test sets of 92 (“AD” Alzheimer's samples against control) and 47 (“MCI” mild cognitive impairment samples) the signature was able to show an overall effectiveness of 81% and 91% for AD predictability.

We started this project by analysing the dataset made available and we are glad to report that we have been able to perfectly reproduce their mathematical methods and results from the available datasets. However, our subsequent analysis also produced several important facts worth reporting. Using an integrative bioinformatics approach, we identified a 6-protein signature that halves the number of prediction errors of the previously proposed signature (on the “AD” dataset) when using the same classifier (PAM). A 5-protein signature (which is a subset of the 18-protein signature of Ray et al.) has the same overall performance. Finally, using more than 20 different classifiers available in the widely-used Weka software package [2], our 5-protein signature has, on average, a smaller prediction error, indicating the independence of the classifier and the robustness of this set of biomarkers (i.e. 96% accuracy when predicting AD against non-demented control).

The 6-protein signature is composed of the abundances of IL-1α, IL-3, IL-6, EGF, TNF-α and G-CSF. We remark that IL-6 was not selected by Ray et al. in the preliminary gene selection, and as a consequence it is not part of their 18-protein signature. Recognising that the importance of IL-6 as a biomarker for AD is debatable and that many classifiers do not make use of its abundance to inform decisions, we also present our results for a 5-protein signature that ignores IL-6.

Results

Base case–analysis of the performance of randomly selected signatures

Before reporting our experimental results, it is important to understand the performance that can be expected from a set of k proteins when they are selected at random (from the available 120 proteins under study). We report the results of two experiments that aim at quantifying this. In the first, we measured the classification performance of 20 signatures, each with 18 proteins selected at random with a uniform distribution (we chose 18 because it is the number of proteins in the signature proposed by Ray et al.). In the second, we performed the same experiment constrained to select only six proteins at random (since we later present comparative results for signatures that employ only 6 and 5 proteins).

The two collections of 20 randomly generated signatures were chosen using an equal probability for each of the 120 proteins in the set (not allowing repetitions, and constrained to have either 18 or 6 different proteins in total). For this experiment, we decided to use a random forests algorithm (RF) as a base classifier (we used the algorithm implemented in [2] for reproducibility purposes), generating 150 trees. As the chosen classifier also has a stochastic nature, for each signature we ran 10 experiments with different seeds, and the results we found are quite interesting.
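The sampling step described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `random_signatures`, the use of integer indices 0 to 119 in place of protein names, and the seeds are assumptions made for the example.

```python
import random

def random_signatures(n_proteins=120, size=18, count=20, seed=0):
    """Draw `count` signatures of `size` distinct proteins each.

    Proteins are represented by integer indices (an assumption for this
    sketch). random.sample draws without replacement, so every protein
    has the same probability and no signature repeats a protein.
    """
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_proteins), size)) for _ in range(count)]

# Twenty 18-protein and twenty 6-protein random signatures.
sigs18 = random_signatures(size=18, seed=0)
sigs6 = random_signatures(size=6, seed=1)
```

Each signature would then be passed to the classifier once per seed (ten seeds in the experiment above) and the error counts averaged.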

For these twenty 18-protein signatures, the average number of errors over the 92 samples of the “AD” test set is 15.13, corresponding to an effectiveness of 84% (see Table 1). For the 6-protein case, an average of 30.5 errors was observed, corresponding to a lower expected effectiveness of 67% (see Table 2). From these results we can infer that the original selection of the 120 proteins is quite remarkable for revealing biomarkers for the prediction of clinical AD, since a simple, yet robust, classification method yields “good” 18-protein predictors from nothing more than a random selection restricted to these 120 proteins. Table 3, Figure 1 and Figure 2 summarize the experiment.
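The relation between the average error figures and the quoted effectiveness can be checked with a short calculation. The sketch below assumes that both averages (15.13 and 30.5) are error counts over the 92 samples of the “AD” test set, a reading that is consistent with the reported 84% and 67%; the function name `effectiveness` is ours.

```python
def effectiveness(mean_errors, n_samples=92):
    """Fraction of correctly classified samples, given the mean error count."""
    return 1.0 - mean_errors / n_samples

print(round(effectiveness(15.13), 3))  # 0.836, i.e. about 84%
print(round(effectiveness(30.5), 3))   # 0.668, i.e. about 67%
```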

Figure 1. Histograms of the number of errors of the random forest classifier using 20 randomly selected signatures with 18 proteins.

The arrow indicates the results under the same conditions of the 18-protein signature proposed by Ray et al.

doi:10.1371/journal.pone.0003111.g001

Figure 2. Histograms of the number of errors considering the random forest classifier and the 20 randomly selected signatures with 6 proteins.

The arrow indicates the results under the same conditions of our 6-protein signature.

doi:10.1371/journal.pone.0003111.g002

Table 1. Number of errors from the 18-genes randomly selected signatures on the “AD” validation test set.

doi:10.1371/journal.pone.0003111.t001

Table 2. Number of errors from the 6-genes randomly selected signatures on the “AD” validation test set.

doi:10.1371/journal.pone.0003111.t002

Table 3. Random experiments report.

doi:10.1371/journal.pone.0003111.t003

It is remarkable that by choosing 18 proteins at random we were able to obtain a very good signature, at least for this classifier, under the conditions explained above. Perhaps the reason for obtaining such good signatures is that a smaller number of proteins, which all these signatures have in common, is all that is needed for a predictive molecular signature. Figures 1 and 2 show the relation between the proposed signatures with 18 and 6 proteins and the random ones.

Computational studies: Results obtained with four different signatures

We report all the results obtained using a set of 24 classifiers selected from the Weka software suite [2], aiming at sampling different algorithmic methodologies in current practice. These classifiers were applied to the four different signatures using the same training set. To ensure reproducibility of our reported methods, no parameter was modified from the classifier's default setting in Weka's downloaded code. In this way we did not bias the experiment with ad hoc parameter selection and we ensure the complete reproducibility of our claims. We are also aware that better results are possible when adjusting the parameters of each classifier considering only the samples of the training set. Nevertheless, with these tests our objective is to show the robustness of our methods to discover biomarkers, by showing the independence of the signature performance from the selected classifier.

It is interesting to note that the mathematical model and algorithms we have used pointed at Interleukin-6 and included it in the 10-protein signature. It is well known that IL-6, along with other cytokines, has been the subject of many studies of biomarkers for Alzheimer's disease [4]–[6]. Using an integrative bioinformatic approach, described in the next sections, we turned our attention to a smaller signature. The 6-protein signature was obtained by the analysis of the protein-relation graph and, interestingly enough, IL-6 is also included in this new core signature. Finally, in the 5-protein signature, IL-6 is excluded to provide another comparison, and the five proteins now become a proper subset of the 18 original proteins uncovered by Ray et al. Table 4 presents the genes included in each signature, indicating the protein name, Entrez GeneID and official name.

Table 4. Protein name for each signature used in the computational experiment.

doi:10.1371/journal.pone.0003111.t004

Tables 5, 6, 7 and 8 show the results of the 24 classifiers for all the signatures considered. The classifiers marked with a star have a random component; therefore the average of ten runs with different seeds is reported. Finally, Tables 9 and 10 summarize the results.

Table 5. Report of the results of the 24 classifiers when using the 18-Protein biomarker.

doi:10.1371/journal.pone.0003111.t005

Table 6. Report of the results of the 24 classifiers when using the 10-Protein biomarker.

doi:10.1371/journal.pone.0003111.t006

Table 7. Report of the results of the 24 classifiers when using the 6-Protein biomarker.

doi:10.1371/journal.pone.0003111.t007

Table 8. Report of the results of the 24 classifiers when using the 5-Protein biomarker.

doi:10.1371/journal.pone.0003111.t008

The results of our 5-protein signature are reported in Table 8. On the “AD” test set, the 5-protein signature achieves an average accuracy (over 24 classifiers) of 96% when predicting AD and 90% when predicting non-demented control. It is also worth mentioning that four different classifiers achieve almost 100% accuracy (i.e. at most one error) when predicting AD on the “AD” test set. These results are achieved without losing accuracy when predicting non-demented controls on the same dataset.

One feature of the experiments in Table 9 is worth commenting on: all the signatures drop at least 30% in accuracy on the “MCI” dataset. This is understandable since the classifiers have no sample labelled “MCI” in the training set.

Table 9. Average results for each signature over 24 classifiers.

doi:10.1371/journal.pone.0003111.t009

The best overall result, considering both test sets, is obtained by the 6-protein and 5-protein signatures. Each makes 18 errors, and for both signatures this result is obtained twice, with the LMT and Simple Logistic classifiers (Tables 7 and 8).

In Table 10, the standard deviations of the number of errors are almost constant for all signatures, in all datasets. This reinforces our previous claim that the poor performance of the signatures on the “MCI” dataset is related to the fact that the signatures were not trained to distinguish between AD and MCI.

Table 10. The standard deviation of each test is shown on this table.

doi:10.1371/journal.pone.0003111.t010

To present the experiment results in another form, we compared the performance of each signature in each test. Table 11 presents the comparison between the signatures when considering all the test sets (“AD”+“MCI”) totalling 139 samples. It is remarkable that the 5-protein signature not only has a better average performance, but also presents the best result on 16 of the 24 algorithms used for classification (the number of errors highlighted in bold text indicates the best performance for this particular classifier).
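The per-classifier comparison described above amounts to counting, for each signature, the classifiers on which it attains the fewest errors. A minimal sketch follows; the classifier names and error counts in `toy_errors` are invented for illustration and are not taken from Table 11, and ties are credited to every tied signature (an assumption of this sketch).

```python
def count_best(errors):
    """Count, per signature, the classifiers on which it has the fewest errors.

    `errors` maps classifier name -> {signature name -> number of errors}.
    Tied signatures are all credited with the win.
    """
    wins = {}
    for per_signature in errors.values():
        best = min(per_signature.values())
        for signature, n_errors in per_signature.items():
            if n_errors == best:
                wins[signature] = wins.get(signature, 0) + 1
    return wins

# Hypothetical error counts over 139 samples (not the paper's data).
toy_errors = {
    "clf_A": {"18": 30, "10": 28, "6": 25, "5": 25},
    "clf_B": {"18": 27, "10": 29, "6": 26, "5": 24},
    "clf_C": {"18": 22, "10": 31, "6": 30, "5": 33},
}
wins = count_best(toy_errors)
```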

Table 11. Number of errors for each classifier when considering both test sets together (139 samples).

doi:10.1371/journal.pone.0003111.t011

In Table 12, the same comparison is made considering only the “AD” test set. Once again, the performance of the 5-protein signature stands out: it obtains not only the best average result but also the best individual results, presenting 3 errors on 3 occasions.

Table 12. Number of errors for each classifier when considering the “AD” test set (92 samples).

doi:10.1371/journal.pone.0003111.t012

Finally, Table 13 presents the same analysis for the “MCI” test set. In this case the most remarkable observation is the poor quality of the predictions. The better performance of the larger signatures is related to the fact that they contain more proteins: because no signature was trained to distinguish MCI patients, the use of more proteins allows a slightly better performance. Nevertheless, even the best signature for this case (the 10-protein signature) performs poorly when compared with the previous results.

Table 13. Number of errors for each classifier when considering the “MCI” test set (47 samples).

doi:10.1371/journal.pone.0003111.t013

Discussion

In conclusion, it is clear that the experiment performed by Ray et al. provided an extremely useful dataset for the identification of Alzheimer's disease biomarkers. Using their data, we have uncovered a robust 5-protein signature with near 97% accuracy in predicting AD against non-demented controls. Our signature has fewer than one third of the proteins of the signature proposed in the original paper, and at least the same level of prediction performance.

The next step in this important quest is to set up an independent experimental procedure that considers samples with mild cognitive impairment (but without AD) in the training set. This has not been done and warrants further investigation: we do not agree with the methodology of using a training set without MCI samples to select biomarkers that differentiate between AD and MCI [1]. Only with MCI samples in the training set can we uncover useful biomarkers to discriminate between AD and MCI.

On the positive side, our methods reveal the true predictive potential of testing for Alzheimer's disease using this panel of signalling proteins. We also believe that our methods show promise and warrant their application in other settings. It is clear that Alzheimer researchers can benefit directly from our identification of more robust biomarkers. The method is revealed to be useful, simple yet very powerful, and warrants its application in other multifactorial diseases.

Methods

Our methodology consisted of the application of an integrative data analysis method. We used four steps: a) abundance quantization, b) feature selection, c) literature analysis, d) selection of a classifier algorithm which is independent of the feature selection process. These steps were performed without using any of the test datasets. For the first two steps, we applied Fayyad and Irani's discretization algorithm [7] for selection and quantization, which in turn creates an instance of the (alpha-beta)-k-Feature Set problem [8]–[10]. Fayyad and Irani's method retained only 14 of the 120 proteins of the training set (i.e. those proteins for which no threshold was selected were filtered out). After quantization, samples 7, 43 (AD, “Alzheimer's Disease”) and 48 (NDC, “Nondemented Control”) of the training set were “in conflict”, which means that their quantized values (for all 14 selected proteins) are identical although they belong to different classes. These conflicts were then removed, i.e. the three samples were eliminated from the training set and we applied our algorithms to the remaining 80 samples. The numerical solution of the (alpha-beta)-k-Feature Set problem led to the selection of only 10 proteins (Table 4). For a detailed explanation of the methods and other applications, readers can consult our referenced publications and the references therein [11]–[13].
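To make the quantization and conflict-removal steps concrete, the sketch below shows two pieces under stated assumptions. `mdlp_cut` evaluates only the first binary cut of a single feature with the minimum description length acceptance rule of Fayyad and Irani [7] (the full method applies the rule recursively to obtain several intervals); a feature for which it returns `None` would be filtered out. `find_conflicts` lists samples whose quantized profiles coincide with a sample of another class. The function names and the toy data are ours, and this is not the implementation used in the study.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdlp_cut(values, labels):
    """Best entropy-minimising threshold if the MDL criterion accepts it, else None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    ys = [y for _, y in pairs]
    ent_s, k = entropy(ys), len(set(ys))
    best = None  # (weighted class entropy after the cut, split position)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        e = (i * entropy(ys[:i]) + (n - i) * entropy(ys[i:])) / n
        if best is None or e < best[0]:
            best = (e, i)
    if best is None:
        return None
    e, i = best
    left, right = ys[:i], ys[i:]
    gain = ent_s - e
    delta = math.log2(3 ** k - 2) - (
        k * ent_s - len(set(left)) * entropy(left) - len(set(right)) * entropy(right)
    )
    # Accept the cut only if the information gain pays for its description cost.
    if gain > (math.log2(n - 1) + delta) / n:
        return (pairs[i - 1][0] + pairs[i][0]) / 2
    return None

def find_conflicts(quantized, classes):
    """Indices of samples sharing a quantized profile with a sample of another class."""
    groups = defaultdict(list)
    for idx, row in enumerate(quantized):
        groups[tuple(row)].append(idx)
    out = []
    for members in groups.values():
        if len({classes[i] for i in members}) > 1:
            out.extend(members)
    return sorted(out)

# Toy data (hypothetical): one separable feature, one uninformative feature.
cut_good = mdlp_cut([1, 2, 3, 4, 10, 11, 12, 13], ["N"] * 4 + ["A"] * 4)
cut_none = mdlp_cut([1, 2, 3, 4, 5, 6, 7, 8], ["N", "A"] * 4)
conflicting = find_conflicts([(0, 1), (0, 1), (1, 1), (0, 0)], ["AD", "NDC", "AD", "NDC"])
```

In this toy example the first feature receives a threshold midway between the two classes, the second receives none, and the first two samples are flagged as being in conflict.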

To guarantee the reproducibility of all our experiments, we used algorithms from the Weka package [2] as classifiers. All the classifiers were used with their default parameters; we are convinced that better results could be found if the parameters of each classifier were adjusted (considering only its results on the training set).

The first signature we uncovered contains 10 proteins (see Table 4). Using the Pathway Studio software [3], we generated an undirected graph of the known ‘direct relations’ of these 10 proteins. Each node in the graph corresponds to a protein, and an edge exists if the Pathway Studio software produced a ‘direct relation’, indicating an important association already observed in the life sciences literature. On this graph we looked for its maximum clique (Fig. 3a). We denote this graph as G = (V,E). Each vertex in V has a one-to-one correspondence with a protein. A pair of vertices is connected by an edge in E if and only if at least one direct relation between the two proteins is reported in the literature. A clique in G is a subset X of V such that its induced graph G[X] is complete. In other words, we are looking for the largest subset of proteins in which every pair of proteins already has a direct relationship identified between them; we consider this set the core of our 10-protein signature (this core contains the 6 proteins listed above, see Fig. 3b).
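For a graph of only 10 vertices, a maximum clique can be found by exhaustive search over subsets, from largest to smallest. The sketch below illustrates the definition; the node labels `P1` to `P5` and the edge list are a made-up toy graph, not the relations produced by Pathway Studio, and the exhaustive strategy is our choice for the example rather than the procedure used in the study.

```python
from itertools import combinations

def maximum_clique(nodes, edges):
    """Return a largest subset of `nodes` in which every pair is joined by an edge.

    Exhaustive search: exponential in general, but instant for ~10 nodes.
    """
    adjacent = {frozenset(edge) for edge in edges}
    for size in range(len(nodes), 0, -1):
        for subset in combinations(nodes, size):
            if all(frozenset(pair) in adjacent for pair in combinations(subset, 2)):
                return set(subset)
    return set()

# Hypothetical toy graph: a triangle P1-P2-P3 with a tail P3-P4-P5.
toy_nodes = ["P1", "P2", "P3", "P4", "P5"]
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4"), ("P4", "P5")]
core = maximum_clique(toy_nodes, toy_edges)
```

Here `core` is the triangle `{"P1", "P2", "P3"}`, the only subset of three mutually related nodes in the toy graph.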

Figure 3. Classification and prediction of clinical Alzheimer's diagnosis in subjects with Alzheimer's disease.

(a) An undirected graph, where each node corresponds to a different protein belonging to the 10-protein signature we identified; each edge indicates the existence of a direct relation as obtained by searching the PubMed database (using the Pathway Studio software). (b) Identification of the maximum clique of the graph, uncovering a robust 6-protein signature; each node in the clique has a direct relation with every other node. Simple Logistic was used to classify and predict the Alzheimer's (AD) and non-Alzheimer's classes in the training set (c) and the blinded test set ‘AD’ (d). All the results are shown as confusion matrices. For the training set, 10-fold cross-validation was applied 10 times; in both cases Simple Logistic was used with the default parameters of the Weka package. All the p-values were calculated using the Fisher exact test.

doi:10.1371/journal.pone.0003111.g003

Our first benchmark test for this 6-protein signature was done using Simple Logistic (SL) [14], perhaps the simplest classifier from the Weka software suite. With our 6-protein signature, SL had a performance of 86% after applying 10-fold cross-validation 10 times over the training set (Fig. 3c). On the “AD” test set, our 6-protein signature with SL was able to classify with 97% accuracy. For AD samples we achieved 100% positive agreement and for NDC samples 92% negative agreement (Fig. 3d).
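The agreement figures and the Fisher exact test mentioned in Fig. 3 can both be computed from the four cells of a 2×2 confusion matrix. The sketch below is illustrative only: the counts passed to the functions are invented (chosen so that the agreements come out at 100% and 92%) and are not the paper's confusion matrix, and the two-sided p-value uses the common convention of summing all tables no more probable than the observed one.

```python
from math import comb

def agreement(tp, fn, tn, fp):
    """Positive agreement, negative agreement and accuracy of a 2x2 confusion matrix."""
    return {
        "positive": tp / (tp + fn),
        "negative": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: 10 AD all predicted AD; 25 NDC with 2 predicted AD.
metrics = agreement(tp=10, fn=0, tn=23, fp=2)
p_value = fisher_exact_two_sided(10, 0, 2, 23)
```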

When using the second test set (labelled “MCI”), which includes samples that had an initial diagnosis of mild cognitive impairment, the number of errors increases for all signatures. It is reasonable to expect that our very trimmed classifiers will show some degradation of performance, as they have not been trained to distinguish confirmed AD samples from those that have MCI. When using the same signature to differentiate between AD and other samples of MCI patients, the occurrence of more errors is an expected outcome (Table 9). In spite of this fact, the overall performance of all signatures seems very robust.

Acknowledgments

The authors would like to thank the editors for their useful comments that have helped to improve the presentation of these results.

Author Contributions

Conceived and designed the experiments: MGR PM. Performed the experiments: MGR PM. Analyzed the data: MGR PM. Contributed reagents/materials/analysis tools: MGR PM. Wrote the paper: MGR PM.

References

  1. Ray S, Britschgi M, Herbert C, Takeda-Uchimura Y, Boxer A, et al. (2007) Classification and prediction of clinical Alzheimer's diagnosis based on plasma signaling proteins. Nat Med 13: 1359–1362.
  2. Witten IH, Frank E (2005) Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.
  3. Ariadne Genomics I (2007) Pathway Studio™. 5.0 ed.
  4. Bruunsgaard H, Andersen-Ranberg K, Jeune B, Pedersen AN, Skinhoj P, et al. (1999) A high plasma concentration of TNF-alpha is associated with dementia in centenarians. J Gerontol A Biol Sci Med Sci 54: M357–364.
  5. Finch CE, Morgan TE (2007) Systemic Inflammation, Infection, ApoE Alleles, and Alzheimer Disease: A Position Paper. Current Alzheimer Research 4: 185–189.
  6. Magaki S, Mueller C, Dickson C, Kirsch W (2007) Increased production of inflammatory cytokines in mild cognitive impairment. Experimental Gerontology 42: 233–240.
  7. Fayyad UM, Irani KB (1993) Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In: Bajcsy R, editor. Morgan Kaufmann. pp. 1022–1029.
  8. Berretta R, Costa W, Moscato P (2008) Combinatorial Optimization Models for Finding Genetic Signatures from Gene Expression Datasets. In: Keith JM, editor. Bioinformatics, Volume II: Structure, Function and Applications. Humana Press.
  9. Berretta R, Mendes A, Moscato P (2007) Selection of Discriminative Genes in Microarray Experiments Using Mathematical Programming. Journal of Research and Practice in Information Technology 39: 13.
  10. Cotta C, Sloper C, Moscato P (2004) Evolutionary Search of Thresholds for Robust Feature Set Selection: Application to the Analysis of Microarray Data. Applications of Evolutionary Computing 21–30.
  11. Cotta C, Langston MA, Moscato P (2007) Combinatorial and Algorithmic Issues for Microarray Analysis. In: Gonzalez TF, editor. Handbook of Approximation Algorithms and Metaheuristics. Chapman & Hall/CRC. pp. 74.71–74.14.
  12. Mendes A, Scott RJ, Moscato P (2008) Microarrays—Identifying Molecular Portraits for Prostate Tumors with Different Gleason Patterns. Clinical Bioinformatics 131–151.
  13. Moscato P, Berretta R, Hourani Ma, Mendes A, Cotta C (2005) Genes Related with Alzheimer's Disease: A Comparison of Evolutionary Search, Statistical and Integer Programming Approaches. Applications on Evolutionary Computing 84–94.
  14. Landwehr N, Hall M, Frank E (2005) Logistic Model Trees. Mach Learn 59: 161–205.