Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Association of Protein Translation and Extracellular Matrix Gene Sets with Breast Cancer Metastasis: Findings Uncovered on Analysis of Multiple Publicly Available Datasets Using Individual Patient Data Approach

  • Nilotpal Chowdhury ,

    nilotpal.chowdhury@gmail.com

    Affiliation Department of Pathology & Laboratory Medicine, All India Institute of Medical Sciences, Rishikesh, Uttarakhand, India

  • Shantanu Sapru

    Current address: Department of Radiation Oncology, Dr. Ram Manohar Lohia Institute of Medical Sciences, Lucknow, Uttar Pradesh, India

    Affiliation Department of Radiotherapy, All India Institute of Medical Sciences, Rishikesh, Uttarakhand, India

Abstract

Introduction

Microarray analysis has revolutionized the role of genomic prognostication in breast cancer. However, most studies are single series studies, and suffer from methodological problems. We sought to use a meta-analytic approach in combining multiple publicly available datasets, while correcting for batch effects, to reach a more robust oncogenomic analysis.

Aim

The aim of the present study was to find gene sets associated with distant metastasis free survival (DMFS) in systemically untreated, node-negative breast cancer patients, from publicly available genomic microarray datasets.

Methods

Four microarray series (having 742 patients) were selected after a systematic search and combined. Cox regression for each gene was done for the combined dataset (univariate, as well as multivariate – adjusted for expression of Cell cycle related genes) and for the 4 major molecular subtypes. The centre and microarray batch effects were adjusted by including them as random effects variables. The Cox regression coefficients for each analysis were then ranked and subjected to a Gene Set Enrichment Analysis (GSEA).

Results

Gene sets representing protein translation were independently negatively associated with metastasis in the Luminal A and Luminal B subtypes, but positively associated with metastasis in Basal tumors. Proteinaceous extracellular matrix (ECM) gene set expression was positively associated with metastasis, after adjustment for expression of cell cycle related genes on the combined dataset. Finally, the positive association of the proliferation-related genes with metastases was confirmed.

Conclusion

To the best of our knowledge, the results depicting mixed prognostic significance of protein translation in breast cancer subtypes are being reported for the first time. We attribute this to our study combining multiple series and performing a more robust meta-analytic Cox regression modeling on the combined dataset, thus discovering 'hidden' associations. This methodology seems to yield new and interesting results and may be used as a tool to guide new research.

Introduction

Microarray analysis has revolutionized our understanding of breast cancer. A molecular classification of breast cancer based on the expression of certain genes has gained acceptance in the last two decades. Luminal A, Luminal B, HER2 positive and Basal subtypes are the major subtypes identified in multiple independent cohorts[15], and this classification is being refined further by identification of more subtypes[68]. Analyses done on microarrays have suggested that genes related to the cell cycle are of great prognostic importance[914].

However, to the best of our knowledge, a comprehensive analysis concentrated on finding prognostically important cellular pathways and processes in the different breast cancer subtypes has not been done. The few studies which have been done have either tried to extract a prognostic gene signature[1522] or have focused on a small number of pathways[12,23,24]. Furthermore, pathways which are prognostically important independent of the cell cycle related genes have also not been looked at. A significant short coming of existing gene signature studies using microarray technology are methodological flaws related to non-correction of batch effects in microarray data analysis, which may potentially reduce their robustness[25,26].

In order to understand the natural history of breast cancer, it is important to exclude factors that may impact on survival, and which may have changed over the past few decades. We realized that node negative tumors afforded us the best opportunity to study long term survivors and their gene signatures. At present, breast cancer patients are given a variety of treatments, and the responsiveness of these treatments may have an effect on the clinical end point selected, as well as may themselves be associated with gene expression[2730]. Including patients receiving systemic therapy and those deserving but not receiving any such therapy would increase the heterogeneity of the analysis, thus rendering any modeling non-robust, and the results difficult to interpret. This problem is compounded further by the non-availability of the exact treatment regimen in most genomics data sets. The possibility of different treatment regimens being associated in a variable, non-uniform manner with gene expression renders the task of appropriate modeling and interpretation of results even more difficult. In order to exclude such treatment related effects, we selected only systemically untreated patients for our study. This would have the advantage of studying the true tumor related natural history un-confounded by treatment related effects. We included only those microarray data sets from which the batch in which the analysis was done could be deduced, so that the batch effects could be controlled for while analyzing the data.

The present study focused on finding significant pathways and processes associated with distant metastasis in node negative breast cancer. We tried finding not only the main processes associated with distant metastasis in the entire dataset, but also those which were significant independent of cell cycle. Analysis was also carried out on the different molecular subtypes separately.

Methods

Data sources and study selection

The Gene Expression Omnibus (GEO) database was queried in June 2014 for node negative breast cancer using the search terms “breast cancer” and “breast neoplasms”, combining them with Boolean operator "OR", filtered to include only Gene Expression Series (“gse”), for organism “Homo sapiens” containing “Expression profiling by array”. An initial search resulted in 1142 series. From the 1142 series, only those having 50 or more samples were included. This resulted in 317 data series. Of these 317 data series, only 5 contained arrays exclusively from node-negative breast cancer patients without any systemic adjuvant therapy. The rest 312 were excluded because they did not meet our inclusion criteria of node negative patients not receiving any systemic adjuvant therapy and having follow up data of a minimum of 05 years. Of the five remaining series, one did not contain information about the batch or date in which the microarrays were analyzed, and was excluded. Finally we were left with four series, viz. GSE2034[31], GSE5327[32], GSE7390[33], and GSE11121[34], all having confirmed node negative patients who did not receive any form of systemic therapy. The flow chart summarizing the steps in study selection is given in Fig 1.

thumbnail
Fig 1. The PRISMA Flow diagram showing the method behind the inclusion of studies for the analysis.

https://doi.org/10.1371/journal.pone.0129610.g001

Data extraction and synthesis

The raw microarray data from each series was downloaded and pre-processed separately, as recommended by Hahne et al [35]using the frozen Robust Multi-array Average (fRMA) algorithm[36]. The different series were then combined together with respect to our primary end point of interest Distant Metastasis Free Survival (DMFS) or distant recurrence, depending on coding of particular data series. The occurrence of distant metastasis was taken as the event of interest for which time-to-event analyses were done. The series were censored at 10 years after diagnosis.

Prior to using a Cox regression model, the probes were collapsed to genes, by the WGCNA package[37]. Probes which gave the maximum average expression were selected to represent the gene.

Only the four major molecular sub-types were investigated using the PAM50 classifier[5], i.e. Basal, HER2 Positive, Luminal A and Luminal B subtypes.

Six different analyses were conducted combining data from the four studies as follows: (a) Univariate analysis on the entire combined dataset, (b) Multivariate analysis on the combined dataset adjusting for expression of cell cycle related genes, (c) Univariate analysis on the HER2 subtype extracted from the combined dataset, (d) Univariate analysis on the basal subtype extracted from the combined dataset, (e) Univariate analysis on the Luminal A subtype extracted from the combined dataset, and, (f) Univariate analysis on the Luminal B subtype extracted from the combined dataset.

To adjust for expression of cell cycle related genes for the multivariate analysis done in analysis (b) listed above, the AURKA module score[12] as implemented in the genefu package in Bioconductor and R was added to the Cox regression model.

Each analysis (a-f above), was evaluated by Cox regression models which included the gene expression as a fixed effects variable and one of the following as a random effects variable: i) Batch (as estimated from the scan date of the original microarray analysis), ii) Study series, iii) Centre or institution, iv) Batch as a nested effect in the Series, and, v) Batch as a nested effect in the Centre.

A sixth model (vi) was also performed as a control which did not include the batch or centre (a simple marginal model with the gene expression as the sole variable).

Therefore, in each analysis (a-f above), for each gene, six different Cox regression models were evaluated. For each analysis, the model which gave the lowest median Akaike Information Criterion (AIC)[38] value was selected as the most appropriate model for that particular analysis.

The coefficients obtained from the most appropriate Cox regression models (for each analysis) were then ranked according to their value. Since the standard GSEA (Gene Set Enrichment Analysis) currently lacks the software implementation for adjusting for batch effects, as well as being incapable of performing time-to event analysis, hence the coefficients obtained were subjected to a pre-ranked GSEA[39,40].

The following downloaded gene set collections were used in the present study: i) The gene sets representing the Canonical pathways (presented as C2: CP gene set collections), which included gene sets from the Biocarta[41], KEGG,[42,43] Reactome[44] and Protein Interaction Database[45] repositories, and, ii) The Gene Ontology[46,47] gene sets (presented as C5 gene set collections).

A Family Wise Error Rate (FWER)-adjusted p-value of ≤0.05 was selected as our level of significance for the pre-ranked GSEA. Ten thousand permutations were used (default being one thousand); the default setting was kept for all the other parameters of the GSEA study. The gene sets which were found significant in the pre-ranked GSEA analysis were then extracted.

The entire workflow is described in Fig 2.

thumbnail
Fig 2. Flow chart of present analysis till the Pre-ranked GSEA to find DMFS associated gene sets.

https://doi.org/10.1371/journal.pone.0129610.g002

Visualization of the relationship between the different significant gene sets from the pre-ranked GSEA above was done by means of the Enrichment Map Cytoscape plugin[48], which groups the highly redundant gene sets in clusters. The Overlap coefficient[48], at a threshold of 0.5, was used for clustering the gene sets, as recommended in the Enrichment Map manual.

The "leading edges" of significant gene sets from each of the six analyses (a-f) were then studied after the pre-ranked GSEA. Genes which were the most commonly distributed among the leading edges of the significant gene sets were identified for each analysis.

A standard GSEA was performed again at the end to find the relationship between each gene set included in each analysis (a-f above) and the AURKA module score for each data series. For this analysis, the “Pearson” metric for ranking genes was used, as recommended by the GSEA user manual; no correction was made for the batch effects, and one thousand permutations were used in the analysis of each data series. A final adjusted p-value was calculated by combining the p-values obtained from the four studies by the Stouffers method[49]. The FWER was calculated by the method of Holm[50]. An FWER-adjusted p-value of ≤0.05 was selected as our level of significance.

Microarray preprocessing, categorization into subtypes and Cox regression were carried out by the statistical environment R v3.1.0[51] using the coxme[52], GEOquery[53], affy[54], genefu[55], hgu133a.db[56] and WGCNA[37] packages.

All the gene sets used in the GSEA were downloaded from the website of the Broad institute[57]:

The GSEA analyses were carried out by GSEA2-2.0.14 software downloaded from the Broad institute.

Results

The Clinical characteristics of the patients in the data series are given in Table 1, while the tumor-related characteristics are detailed in Table 2.

thumbnail
Table 1. Summary of patient characteristics of the four series analysed.

https://doi.org/10.1371/journal.pone.0129610.t001

thumbnail
Table 2. Summary of tumor characteristics of the four series analysed.

https://doi.org/10.1371/journal.pone.0129610.t002

The comparability of the expression values from the different samples was visualized using boxplots. fRMA pre-processed expression data were seen to be comparable between samples, with the distribution of the expression values between different arrays being similar to one another (S1 Fig). This supported the contention that the fRMA pre-processed data from different series could be combined together for analysis.

Of the various mixed models considered, the model with just batch effects as random variable (Cox regression model (i) in Methods above) gave the lowest AIC values for the analyses from (a)-(d) and analysis (f) above. For analysis (e), i.e. Univariate analysis on the Luminal A subtype, the cox regression model with Batch nested in Centre as RE Variable (model (v) in methods above) gave the lowest AIC values. These models (which gave the lowest AIC value for an analysis) were selected as the most appropriate model for that particular analysis.

Details of the AIC values are shown in S1 Table.

Note on interpretation of the GSEA results: GSEA is a competitive gene set analysis method. In this context, a gene set negatively associated with metastasis means that it is more negatively associated with metastasis than random gene sets formed from genes other than those in the gene set tested. A gene set positively associated with metastasis means that it is more positively associated with metastasis than random gene sets formed from genes other than those in the gene set tested.

I) Results of the analysis on the entire combined dataset

When the combined dataset was analyzed in a univariate setting (analysis (a) in Methods above), gene sets related to protein translation were found to be negatively associated with distant metastasis (good prognosis), as depicted in Table 3 and Fig 3; while gene sets related to the Cell cycle (i.e. proliferation-related gene sets) were the only gene sets to be significantly positively associated with metastasis (poor prognosis) (Table 4, Fig 3).

thumbnail
Fig 3. Enrichment Map visualization of DMFS-associated gene sets on Univariate analysis in the Combined Dataset.

Red nodes represent gene sets positively associated with metastasis. Blue nodes represent gene sets negatively associated with metastasis. Two nodes are connected by edges (solid green lines) only if the overlap coefficient between the nodes is greater than 0.5, and the edge thickness increases with increased Overlap coefficient.

https://doi.org/10.1371/journal.pone.0129610.g003

thumbnail
Table 3. Gene sets negatively associated with distant metastasis (good prognosis) on univariate analysis of the combined dataset.

https://doi.org/10.1371/journal.pone.0129610.t003

thumbnail
Table 4. Gene sets positively associated with distant metastasis (bad prognosis) on univariate analysis of the combined dataset.

https://doi.org/10.1371/journal.pone.0129610.t004

On performing the multivariate analysis adjusted for the Cell cycle by adding the AURKA module to the Cox regression model (analysis (b)), gene sets representing the integrin1 pathway and the proteinaceous extracellular matrix were positively associated with metastasis (Table 5).

thumbnail
Table 5. Gene sets positively associated with distant metastasis (bad prognosis) after adjustment for proliferation using the AURKA module score in the combined dataset.

https://doi.org/10.1371/journal.pone.0129610.t005

II) Results of the subtype analysis

The results for the gene sets significantly associated with distant metastasis in the individual molecular subtypes are detailed below.

Univariate analysis on HER2 tumors (Analysis (c) in Methods).

Gene sets related to interferon gamma signaling was the only gene set found negatively associated with distant metastasis (Table 6). No gene sets were found to be positively correlated with distant metastasis.

thumbnail
Table 6. Gene sets negatively associated with distant metastasis (good prognosis) in the different subtypes of breast cancer.

https://doi.org/10.1371/journal.pone.0129610.t006

Univariate analysis on Basal tumors (Analysis (d)).

Gene sets representing protein translation were the only sets significantly positively associated with metastasis (Table 7, Fig 4).

thumbnail
Fig 4. Enrichment Map visualization of DMFS-associated gene sets in the Basal Subtype.

Red nodes represent gene sets positively associated with metastasis. Blue nodes represent gene sets negatively associated with metastasis (none present in this figure). Two nodes are connected by edges (solid green lines) only if the overlap coefficient between the nodes is greater than 0.5, and the edge thickness increases with increased Overlap coefficient.

https://doi.org/10.1371/journal.pone.0129610.g004

thumbnail
Table 7. Gene Sets positively associated with distant metastasis (bad prognosis) in the different subtypes of breast cancer.

https://doi.org/10.1371/journal.pone.0129610.t007

Univariate analysis on Luminal A tumors (Analysis (e)).

Gene sets representing translation were negatively associated with distant metastasis (Table 6, Fig 5). One gene set representing Kinesins (thus possibly related to mitosis)were significantly positively associated with metastasis (i.e. negatively associated with survival) (Table 7, Fig 5)

thumbnail
Fig 5. Enrichment Map visualization of DMFS-associated gene sets in the Luminal A Subtype.

https://doi.org/10.1371/journal.pone.0129610.g005

Univariate analysis on Luminal B tumors (Analysis (f)).

Gene sets representing translation were negatively associated with distant metastasis (Table 6, Fig 6). Gene sets associated with cell cycle (proliferation) dominated the sets significantly positively associated with metastasis (i.e. negatively associated with survival) (Table 7, Fig 6).

thumbnail
Fig 6. Enrichment Map visualization of DMFS-associated gene sets in the Luminal B Subtype.

Red nodes represent gene sets positively associated with metastasis. Blue nodes represent gene sets negatively associated with metastasis. Two nodes are connected by edges (solid green lines) only if the overlap coefficient between the nodes is greater than 0.5, and the edge thickness increases with increased Overlap coefficient.

https://doi.org/10.1371/journal.pone.0129610.g006

III) Results of the leading edge analysis

Univariate analysis on the entire combined dataset.

Genes encoding for various ribosomal proteins were commonly distributed among the leading edges of the gene sets significantly associated with good prognosis, while genes encoding for replication proteins A1 and A3, DNA directed polymerase and Cyclin dependent Kinase 2 were commonly distributed among the leading edges of the significant gene sets associated with bad prognosis (Tables 8 and 9).

thumbnail
Table 8. Genes which are found in a large number of leading edges of significant gene sets negatively associated with distant metastasis (good prognosis) for each analysis.

https://doi.org/10.1371/journal.pone.0129610.t008

thumbnail
Table 9. Genes which are found in a large number of leading edges of significant gene sets positively associated with distant metastasis (bad prognosis) for each analysis.

https://doi.org/10.1371/journal.pone.0129610.t009

Multivariate analysis on the entire combined dataset.

Genes encoding for various types of collagens, fibrillin 1 and laminins were commonly distributed among the leading edges of the significant gene sets associated with bad prognosis (Table 9).

Univariate analysis on HER2 tumors.

Since only one gene set was found significant in our analysis, a leading edge analysis could not be done on HER2 tumors.

Univariate analysis on the Basal tumors.

Genes encoding for various ribosomal proteins were commonly distributed among the leading edges of the significant gene sets associated with bad prognosis (Table 9).

Univariate analysis on the Luminal A tumors.

Genes encoding for various ribosomal proteins were commonly distributed among the leading edges of the significant gene sets associated with good prognosis (Table 8). Only one gene set (Reactome_Kinesins) was found significantly associated with bad prognosis in Luminal A tumors; hence a leading edge analysis could not be done for bad prognosis gene sets in Luminal A tumors.

Univariate analysis on Luminal B tumors.

Genes encoding for various ribosomal proteins were commonly distributed among the leading edges of the gene sets significantly associated with good prognosis, while genes encoding for proteins involved in cell division (including cyclins, cyclin dependent kinases, proteasomes and replication factors) were commonly distributed among the leading edges of the significant gene sets associated with bad prognosis (Tables 8 and 9).

IV) Results of the GSEA performed for the relationship between the AURKA module score and the other gene sets

In each data series, as well as in the combined dataset, a strong negative association was found between the AURKA module score and the gene sets related to the extracellular matrix (Table 10). The association of the gene sets representing translation did not reach statistical significance (FWER ≤ 0.05), though they were consistently expressed in a direction opposite to that of the AURKA score.

thumbnail
Table 10. GSEA results for association with the AURKA module score of the non-proliferation related gene sets found to be statistically significant on the univariate and multivariate analyses shown in Tables 3, 4 and 5.

https://doi.org/10.1371/journal.pone.0129610.t010

Discussion

Most studies done on microarray data in breast cancer have found Cell cycle-related genes to have prognostic importance in breast cancer[914,58]; some studies have found the importance of immune related networks[9,16,19] and RNA splicing[9]. However, the prognostic importance of many processes may have been missed because: i) the aim of many of the studies was to find a gene signature predictive of outcome, and not the important cellular pathways and processes[1522], and, ii) other studies only studied a few processes[12,59]; or did not include gene sets from many databases[23].

In contrast to the above studies, the present study focused on biological pathways and processes. Different datasets were combined to make the results more robust, with adjustment for study-specific differences being made by random effects models. Furthermore, batch effects were also treated as random effects, in order to adjust for any potential confounding effects thereof. The aim of the present study was to find processes and pathways associated with distant metastasis in the different molecular subtypes of breast cancer, as well as those processes associated with DMFS which were important in breast cancer taken as a whole.

The important pathways and processes were estimated from the entire list of genes ranked in order of their prognostic strength in each molecular subtype as well as overall breast cancer cases analyzed. In order not to miss out any important processes from the analysis, multiple gene sets were tested. These gene sets represent biological processes curated by multiple independent reliable sources, as mentioned in Methods above.

The appropriate design in the present case is a time-to-event analysis. Of the many gene set analysis implementations, we had to select one which could make sense of a time-to-event analysis and yet adjust for batch effects. Therefore it was felt suitable to perform mixed-effects Cox regression on each gene after adjustment for Centre and batch effects as a random variable, and then analyze the ranked list of the Cox regression coefficients.

A pre-ranked Gene Set Enrichment analysis (GSEA)[39,40] was preferred to analyze the ranked list, since the alternative(over-representation analysis[60] using the hypergeometric test) is limited by an arbitrary cut-off [61]value for gene selection and the statistical significance is also related to the size of the gene-set in the analysis. It may be noted that the standard GSEA currently lacks the software implementation for adjusting for batch effects, as well as being incapable of performing time-to event analysis.

To counteract the false positives, a strict criterion for significance was adopted by keeping the Family-wise Error Rate (FWER) ≤0.05. Pre-ranked GSEA results use gene set permutation rather than the more stringent phenotype permutation used by standard GSEA. The GSEA User Guide[62] suggests a cut-off of False Detection Rate (FDR) of 0.05 for analyses done with gene set permutation. This study consists of multiple independent analyses, which necessitate a stringent cut off to exclude false positives. An FDR of 0.05 means that in any given single analysis, the total number of false positives would be 5%; however, an FWER ≤0.05 means that in any given single analysis, the probability of having even one single false positive is not more than 5%. Therefore, we have used this more stringent threshold than suggested by the GSEA user guide. An even stricter cut off may have resulted in missing out many significant gene sets. To make the results of the pre-ranked GSEA robust, ten thousand permutations were used for finding the prognostically important gene sets. However, for the standard GSEA used to find gene sets associated with the AURKA module score, we were more interested in the general direction of association in each of the four data series, hence the standard recommended one thousand permutations were thought to confer sufficient robustness to the results.

Probably the most striking aspect of the results is the association of genes involved in protein translation with distant metastasis in breast cancer. Although the importance of protein translation in the molecular biology of cancers is well known, the association of this process with the prognosis of breast cancers is under appreciated. To our knowledge, this is possibly the first study to report on the prognostic importance of gene sets related to protein translation in breast cancer subtypes. On univariate analysis of the entire dataset, gene sets representing protein translation and the immune system were the only gene sets positively associated with survival (negatively associated with metastasis, i.e. having a good prognosis). As expected, gene sets representing cell cycle (or proliferation) were found to be negatively associated with survival in this analysis The leading edges of the significant translation-related gene sets in the different analyses were predominantly populated by genes encoding for ribosomal proteins.

On adjustment for proliferation in the multivariate analysis, protein translation ceased to retain its significance. On first impression, it may seem that the association of protein translation with DMFS is spurious. However, on univariate analysis of the individual molecular sub-types, it was found that these gene sets were associated with DMFS in the Luminal A and Basal sub-types, even though in such analyses proliferation-related gene sets were not found to be associated with DMFS (Tables 6 and 7). Thus, protein translation seems to have an association independent of cellular proliferation with survival in certain molecular sub-types of breast cancer.

It was interesting to note that the direction of association of gene sets related to protein translation with metastasis was different in the different sub-types. In the Luminal A and B tumors, protein translation was found to be negatively associated with metastasis (good prognosis), but in the Basal subtype, it was found to be positively associated with the same (poor prognosis). This was corroborated by the high degree of overlap of the leading edge genes which were associated with favorable prognosis in Luminal A and B tumors, while being associated with a poor prognosis in Basal tumors.

Why translation should have differing prognostic effects in different sub-types is unclear. We offer a provisional hypothesis that the prognostic effect of translation might be related to the deranged proteins in the cells. Increased protein translation in Basal tumors might imply greater translation of mutated proteins which cause further dysregulation of growth (leading to a worse prognosis); while increased protein translation in the Luminal tumors might imply an increased expression of normal proteins which tends the cells towards homeostasis (leading to a better prognosis). Thus, different protein translation derangement mechanisms may confer different prognostic significance in the various breast cancer subtypes.

To our knowledge, ours is the first study to link protein translation as a process/pathway to clinical outcomes in breast cancer. However, since ours is an exploratory analysis, our findings need to be confirmed on larger datasets. However, such a novel finding is intriguing and may well give us insights about the oncobiology and natural history of breast cancer and its sub-types.

The different molecular subtypes have different expressions of proliferation-related genes, as a result of which the range of expression of these genes are reduced in the different subtypes. Therefore, it is not surprising that, in our study, cell cycle has a relatively lower prognostic importance in many of the individual molecular subtypes, as compared to analysis on the entire breast cancer cases. Even then, proliferation retained its prime position in Luminal B tumors. A gene set representing Kinesins also attained significance in Luminal A tumors. The genes in the leading edge of this gene set are all kinesins which play an important role in mitosis.

Only a few studies have found the extracellular matrix (ECM) to be of prognostic importance in breast cancer[9,63,64]. One study[9] found that a module made of ECM genes was related to a prognostic gene signature. However, the above mentioned study finally selected Proliferation, Immune Response and RNA splicing as the main cellular events predictive of outcome in breast cancer. Some proteins related to ECM were identified as having prognostic importance in breast cancer by network analysis[63]; even then, the importance of the ECM was understated. Only one study [64] found gene sets related to ECM to be of primary importance.

In our study, the ECM-related gene sets did not show prognostic association with breast cancer on any univariate analysis. However, when adjusted for proliferation (in the multivariate analysis), there was a positive association between the ECM-related gene sets and metastasis, underscoring a negative prognostic significance of the expression of ECM-related genes in breast cancer. This was bolstered by the finding of collagen-related genes being over-represented in the 'leading edge' analysis performed separately.

We investigated as to why the prognostic effect of ECM-related genes remained hidden in the univariate analysis, while being apparent in the multivariate setting. Considering that the ECM-related gene sets showed association with metastasis after adjustment for proliferation (Table 5), we hypothesized that there could be suppression of the effect of ECM-related genes due to association of their expression with proliferation-related genes. To test this hypothesis, standard GSEA was performed to assess the relationship between the proliferation-related AURKA module score and all the gene sets in individual data series. This analysis confirmed a strong negative association between the AURKA score and the ECM gene sets, i.e. increased expression of ECM-related genes is associated with decreased expression of the proliferation-related genes, and vice versa (Table 10). Since proliferation is known to be one of the strongest markers of poor prognosis, hence the decreased expression of proliferation-related genes might mask the poor prognostic association of the increased expression of ECM-related genes, thus explaining our seemingly paradoxical results. This negative correlation between the expression of ECM-related genes and proliferation may also be a possible explanation for the prognostic non-significance of ECM-related genes in other studies which were not adjusted for proliferation. The overexpression of ECM and integrins being associated with poor prognosis (Table 5) makes biological sense, as they represent a part of the molecular mechanism of metastasis as is presently known[65].

A few caveats while interpreting the results have to be made: i) The study consisted of systemically untreated node negative females pooled from various studies which differed in their time periods from the 1980s through the 2000s. This has the advantage of having a more homogeneous cohort compared to random datasets having a variety of systemic treatments which would have complicated the analysis greatly. However, this criterion may itself lead to a dataset consisting of a biased population.

After combining data from the four series, 499 patients had ER positive tumors, while 295 had PR positive tumors–the data series did not allow for an interpretation whether these were mutually exclusive or overlapping in some way. However, despite being hormone receptor positive, none received any adjuvant hormonal therapy, the reasons for the exclusion from systemic treatment being unknown. Similarly, there were 325 patients who had T2 or larger tumors, but who, for unknown reasons, did not receive chemotherapy. We do not know whether the characteristic of the patients which led to their exclusion from systemic therapy would be a confounding variable in itself, as well as the possibility that since they did not receive any systemic therapy that itself could have affected their metastasis free survival in any way. It is certainly probable that the characteristics of patients and tumors (for instance, preferential evolutionary clonal selection[66]) in the different datasets changed with time as systemic treatment became more prevalent and few patients were recommended no systemic treatment.

ii) Since the population consisted of patients who were systemically untreated, the present study gives a good picture of the natural history of breast cancer unmodified by any systemic iatrogenic factor. However, the very same characteristic causes it to be unsuitable for the formation of a prognostic gene signature; prognosis being a much more complex end process, having highly multi-factorial stochastic and causal elements to it. At present, breast cancer patients are given a variety of treatments, and the responsiveness of these treatments may themselves be associated with gene expression[2730].

iii) A third limitation of the study concerns the relatively small sample size of this study. Despite combining four large studies in the database, our study had 742 patients, and relatively little number of events of interest (i.e. distant metastases) had occurred at the time of data censoring. At the time of data censoring, a total of 200 events of interest had occurred– 52, 34, 47 and 59 for the Basal, HER2, Luminal A and Luminal B tumors respectively–which we feel might represent a complex interplay between differences in tumor biology and changing patterns of survival over the last few decades.

Conclusion

This study, in addition to confirming the known prognostic role of proliferation in breast cancer, to the best of our knowledge, reports for the first time, the hitherto unappreciated prognostic effect of translation in breast cancer and its various molecular sub-types. It also shows the opposing prognostic association of the translation associated genes in the various breast cancer subtypes. Finally, the study highlights the independent prognostic significance of the genes related to integrin1 pathway and the extracellular matrix. This is possibly the first study to demonstrate the confounding effect of proliferation. Importantly, this study showed how the significant prognostic effects of biologically meaningful pathways and processes may be hidden by their association with a prognostically strong confounder (proliferation).

This study was undertaken as we felt that previous studies had methodological flaws, and no one has yet reported any results by combining different series in a single large dataset using a meta-analytic approach for the different breast cancer subtypes. The primary aim of this study was to better understand the processes and pathways affecting prognosis in breast cancer, and possibly identify novel pathways of interest for further analysis and hypothesis generation, rather than making a predictive signature. Confirmatory studies, preferably on large datasets, are needed to validate the findings of this study.

Supporting Information

S1 Fig. Boxplot of the entire microarray data before and after fRMA processing.

The y-axis of the data before pre-processing is plotted after log 2 transformation. Data after fRMA preprocessing are log 2 transformed during pre-processing, hence no further log transformation is done.The different colors represent different data series (orange for GSE2034, blue for GSE5327, yellow for GSE11121 and green for GSE7390).

https://doi.org/10.1371/journal.pone.0129610.s001

(TIFF)

S1 Folder. A zipped folder containing the R script for the analysis along with the follow up data, instructions for analysis, and the.rnk files as well as text files having the GSEA parameters with which to reproduce the results of the present study.

https://doi.org/10.1371/journal.pone.0129610.s002

(ZIP)

S1 Spreadsheet. Excel spreadsheet giving the full Leading edge results.

1 represents membership of the gene in the geneset, 0 denotes non-membership. A total denoting total number of leading edges the gene is part of is also given.

https://doi.org/10.1371/journal.pone.0129610.s003

(XLSX)

S1 Table. Supplementary Appendix 1 shows the AIC values for each Cox Regression model used in each analysis.

In each analysis, Gene Expression was the fixed effects variable. In analysis (b) the AURKA module score was an additional fixed effects variable. For any given analysis, the particular model having the least median AIC values (indicated in bold red) was chosen as the best model, for the respective analysis, to be used in the pre-ranked GSEA. (RE = Random effects).

https://doi.org/10.1371/journal.pone.0129610.s004

(DOCX)

S2 Table. Summary of results obtained in the six analyses showing the association of various gene sets with distant metastasis.

https://doi.org/10.1371/journal.pone.0129610.s005

(DOCX)

Author Contributions

Conceived and designed the experiments: NC SS. Performed the experiments: NC. Analyzed the data: NC SS. Contributed reagents/materials/analysis tools: NC. Wrote the paper: NC SS.

References

  1. 1. Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, et al. Molecular portraits of human breast tumours. Nature. 2000 Aug 17;406(6797):747–52. pmid:10963602
  2. 2. Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci U S A. 2001 Sep 11;98(19):10869–74. pmid:11553815
  3. 3. Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A. 2003 Jul;100(14):8418–23. pmid:12829800
  4. 4. Sørlie T, Wang Y, Xiao C, Johnsen H, Naume B, Samaha RR, et al. Distinct molecular mechanisms underlying clinically relevant subtypes of breast cancer: gene expression analyses across three different platforms. BMC Genomics. 2006 Jan;7(1):127.
  5. 5. Parker JS, Mullins M, Cheang MCU, Leung S, Voduc D, Vickery T, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009 Mar 10;27(8):1160–7. pmid:19204204
  6. 6. Chanrion M, Fontaine H, Rodriguez C, Negre V, Bibeau F, Theillet C, et al. A new molecular breast cancer subclass defined from a large scale real-time quantitative RT-PCR study. BMC Cancer. 2007 Jan;7(1):39.
  7. 7. Prat A, Parker JS, Karginova O, Fan C, Livasy C, Herschkowitz JI, et al. Phenotypic and molecular characterization of the claudin-low intrinsic subtype of breast cancer. Breast Cancer Res. 2010 Jan;12(5):R68. pmid:20813035
  8. 8. Pusztai L, Mazouni C, Anderson K, Wu Y, Symmans WF. Molecular classification of breast cancer: Limitations and potential. Oncologist. 2006;11(8):868–77. pmid:16951390
  9. 9. Reyal F, van Vliet MH, Armstrong NJ, Horlings HM, de Visser KE, Kok M, et al. A comprehensive analysis of prognostic signatures reveals the high predictive capacity of the proliferation, immune response and RNA splicing modules in breast cancer. Breast Cancer Res. 2008 Jan;10(6):R93. pmid:19014521
  10. 10. Wirapati P, Sotiriou C, Kunkel S, Farmer P, Pradervand S, Haibe-Kains B, et al. Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Res. 2008 Jan;10(4):R65. pmid:18662380
  11. 11. Venet D, Dumont JE, Detours V. Most random gene expression signatures are significantly associated with breast cancer outcome. Rigoutsos I, editor. PLoS Comput Biol. Public Library of Science; 2011 Oct;7(10):e1002240. pmid:22028643
  12. 12. Desmedt C, Haibe-Kains B, Wirapati P, Buyse M, Larsimont D, Bontempi G, et al. Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes. Clin Cancer Res. 2008 Aug 15;14(16):5158–65. pmid:18698033
  13. 13. Haibe-Kains B, Desmedt C, Sotiriou C, Bontempi G. A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics. 2008 Oct 1;24(19):2200–8. pmid:18635567
  14. 14. Chen J, Sam L, Huang Y, Lee Y, Li J, Liu Y, et al. Protein interaction network underpins concordant prognosis among heterogeneous breast cancer signatures. J Biomed Inform. 2010 Jun;43(3):385–96. pmid:20350617
  15. 15. Liu JC, Voisin V, Bader GD, Deng T, Pusztai L, Symmans WF, et al. Seventeen-gene signature from enriched Her2/Neu mammary tumor-initiating cells predicts clinical outcome for human HER2+:ERα- breast cancer. Proc Natl Acad Sci U S A. 2012 Apr 10;109(15):5832–7. pmid:22460789
  16. 16. Teschendorff AE, Miremadi A, Pinder SE, Ellis IO, Caldas C. An immune response gene expression module identifies a good prognosis subtype in estrogen receptor negative breast cancer. Genome Biol. 2007 Jan;8(8):R157. pmid:17683518
  17. 17. Yau C, Esserman L, Moore DH, Waldman F, Sninsky J, Benz CC. A multigene predictor of metastatic outcome in early stage hormone receptor-negative and triple-negative breast cancer. Breast Cancer Res. 2010 Jan;12(5):R85. pmid:20946665
  18. 18. Staaf J, Ringnér M, Vallon-Christersson J, Jönsson G, Bendahl P-O, Holm K, et al. Identification of subtypes in human epidermal growth factor receptor 2—positive breast cancer reveals a gene signature prognostic of outcome. J Clin Oncol. 2010 Apr 10;28(11):1813–20. pmid:20231686
  19. 19. Rody A, Holtrich U, Pusztai L, Liedtke C, Gaetje R, Ruckhaeberle E, et al. T-cell metagene predicts a favorable prognosis in estrogen receptor-negative and HER2-positive breast cancers. Breast Cancer Res. 2009 Jan;11(2):R15. pmid:19272155
  20. 20. Rody A, Karn T, Liedtke C, Pusztai L, Ruckhaeberle E, Hanker L, et al. A clinically relevant gene signature in triple negative and basal-like breast cancer. Breast Cancer Res. 2011 Jan;13(5):R97. pmid:21978456
  21. 21. Sabatier R, Finetti P, Cervera N, Lambaudie E, Esterni B, Mamessier E, et al. A gene expression signature identifies two prognostic subgroups of basal breast cancer. Breast Cancer Res Treat. 2011 Apr;126(2):407–20. pmid:20490655
  22. 22. Griffith OL, Pepin F, Enache OM, Heiser LM, Collisson EA, Spellman PT, et al. A robust prognostic signature for hormone-positive node-negative breast cancer. Genome Med. 2013 Jan;5(10):92. pmid:24112773
  23. 23. Sabatier R, Finetti P, Mamessier E, Raynaud S, Cervera N, Lambaudie E, et al. Kinome expression profiling and prognosis of basal breast cancers. Mol Cancer. 2011 Jan;10:86. pmid:21777462
  24. 24. Hallett RM, Dvorkin-Gheva A, Bane A, Hassell JA. A gene signature for predicting outcome in patients with basal-like breast cancer. Sci Rep. 2012 Jan;2:227. pmid:22355741
  25. 25. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.; 2010 Oct;11(10):733–9. pmid:20838408
  26. 26. Kupfer P, Guthke R, Pohlers D, Huber R, Koczan D, Kinne RW. Batch correction of microarray data substantially improves the identification of genes differentially expressed in rheumatoid arthritis and osteoarthritis. BMC Med Genomics. 2012 Jan;5(1):23.
  27. 27. Mulligan JM, Hill LA, Deharo S, Irwin G, Boyle D, Keating KE, et al. Identification and validation of an anthracycline/cyclophosphamide-based chemotherapy response assay in breast cancer. J Natl Cancer Inst. 2014 Jan 1;106(1):djt335. pmid:24402422
  28. 28. Dressman HK, Hans C, Bild A, Olson J a, Rosen E, Marcom PK, et al. Gene expression profiles of multiple breast cancer phenotypes and response to neoadjuvant chemotherapy. Clin Cancer Res. 2006 Feb 1;12(3 Pt 1):819–26. pmid:16467094
  29. 29. Haibe-Kains B, Desmedt C, Di Leo A, Azambuja E, Larsimont D, Selleslags J, et al. Genome-wide gene expression profiling to predict resistance to anthracyclines in breast cancer patients. Genomics Data. 2013 Dec;1:7–10.
  30. 30. Ma X-J, Wang Z, Ryan PD, Isakoff SJ, Barmettler A, Fuller A, et al. A two-gene expression ratio predicts clinical outcome in breast cancer patients treated with tamoxifen. Cancer Cell. 2004 Jun;5(6):607–16. pmid:15193263
  31. 31. Wang Y, Klijn JGM, Zhang Y, Sieuwerts AM, Look MP, Yang F, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005 Jan 19;365(9460):671–9. pmid:15721472
  32. 32. Minn AJ, Gupta GP, Padua D, Bos P, Nguyen DX, Nuyten D, et al. Lung metastasis genes couple breast tumor size and metastatic spread. Proc Natl Acad Sci U S A. 2007 Apr 17;104(16):6740–5. pmid:17420468
  33. 33. Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, Haibe-Kains B, et al. Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res. 2007 Jun 1;13(11):3207–14. pmid:17545524
  34. 34. Schmidt M, Böhm D, von Törne C, Steiner E, Puhl A, Pilch H, et al. The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res. 2008 Jul 1;68(13):5405–13. pmid:18593943
  35. 35. Hahne F, Huber W, Gentleman R, Falcon S. Processing Affymetrix Expression Data. Bioconductor Case Studies. New York, NY: Springer New York; 2008. p. 25–46.
  36. 36. McCall MN, Bolstad BM, Irizarry RA. Frozen robust multiarray analysis (fRMA). Biostatistics. 2010 Apr 1;11(2):242–53. pmid:20097884
  37. 37. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008 Jan;9(1):559.
  38. 38. Akaike H. A new look at the statistical model identification. IEEE Trans Automat Contr. IEEE; 1974 Dec 1;19(6):716–23.
  39. 39. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. National Academy of Sciences; 2005;102(43):15545–50. pmid:16199517
  40. 40. Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003 Jul;34(3):267–73. pmid:12808457
  41. 41. Nishimura D. BioCarta. Biotech Softw Internet Rep. Mary Ann Liebert, Inc.; 2001 Jun 5;2(3):117–20.
  42. 42. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012 Jan;40(Database issue):D109–14. pmid:22080510
  43. 43. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D277–80. pmid:14681412
  44. 44. Croft D, O’Kelly G, Wu G, Haw R, Gillespie M, Matthews L, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011 Jan 9;39(Database issue):D691–7. pmid:21067998
  45. 45. Schaefer CF, Anthony K, Krupa S, Buchoff J, Day M, Hannay T, et al. PID: the Pathway Interaction Database. Nucleic Acids Res. 2009 Jan 1;37(Database issue):D674–9. pmid:18832364
  46. 46. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004 Jan 1;32(Database issue):D258–61. pmid:14681407
  47. 47. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. Nature America Inc.; 2000 May;25(1):25–9. pmid:10802651
  48. 48. Merico D, Isserlin R, Stueker O, Emili A, Bader GD. Enrichment map: a network-based method for gene-set enrichment visualization and interpretation. Ravasi T, editor. PLoS One. Public Library of Science; 2010 Jan;5(11):e13984. pmid:21085593
  49. 49. Darlington RB, Hayes AF. Combining independent p values: extensions of the Stouffer and binomial methods. Psychol Methods. 2000 Dec;5(4):496–515. pmid:11194210
  50. 50. Gordon AY, Salzman P. Optimality of the Holm procedure among general step-down multiple testing procedures. Stat Probab Lett. 2008 Sep 15;78(13):1878–84. pmid:19759804
  51. 51. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2014.
  52. 52. Therneau T. coxme: Mixed Effects Cox Models. 2012.
  53. 53. Davis S, Meltzer P. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics. 2007;14:1846–7.
  54. 54. Gautier L, Cope L, Bolstad BM, Irizarry RA. affy—-analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. Oxford, UK: Oxford University Press; 2004;20(3):307–15. pmid:14960456
  55. 55. Haibe-Kains B, Schroeder M, Bontempi G, Sotiriou C, Quackenbush J. genefu: Relevant Functions for Gene Expression Analysis, Especially in Breast Cancer. 2013.
  56. 56. Carlson M. hgu133a.db: Affymetrix Human Genome U133 Set annotation data (chip hgu133a).
  57. 57. GSEA Team, MSigDB Scientific Advisory Board. GSEA | MSigDB | MSigDB Collections [Internet]. 2013 [cited 2014 Oct 1]. Available from: http://www.broadinstitute.org/gsea/msigdb/collections.jsp
  58. 58. Cheng W-Y, Ou Yang T-H, Anastassiou D. Development of a prognostic model for breast cancer survival in an open challenge environment. Sci Transl Med. 2013 Apr 17;5(181):181ra50. pmid:23596205
  59. 59. Al-Ejeh F, Simpson PT, Sanus JM, Klein K, Kalimutho M, Shi W, et al. Meta-analysis of the global gene expression profile of triple-negative breast cancer identifies genes for the prognostication and treatment of aggressive breast cancer. Oncogenesis. Nature Publishing Group; 2014 Jan;3(4):e100.
  60. 60. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009 Jan 1;37(1):1–13. pmid:19033363
  61. 61. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. Public Library of Science; 2012 Jan 23;8(2):e1002375. pmid:22383865
  62. 62. GSEA Team, MSigDB Scientific Advisory Board. GSEA User Guide [Internet]. 2014 [cited 2014 Oct 1]. Available: http://www.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html
  63. 63. Chuang H-Y, Lee E, Liu Y-T, Lee D, Ideker T. Network-based classification of breast cancer metastasis. Mol Syst Biol. EMBO and Nature Publishing Group; 2007 Jan 16;3:140. pmid:17940530
  64. 64. Woltmann A, Chen B, Lascorz J, Johansson R, Eyfjörd JE, Hamann U, et al. Systematic pathway enrichment analysis of a genome-wide association study on breast cancer survival reveals an influence of genes involved in cell adhesion and calcium signaling on the patients’ clinical outcome. PLoS One. Public Library of Science; 2014 Jan 2;9(6):e98229. pmid:24886783
  65. 65. Bogenrieder T, Herlyn M. Axis of evil: molecular mechanisms of cancer metastasis. Oncogene. 2003 Sep 29;22(42):6524–36. pmid:14528277
  66. 66. Greaves M, Maley CC. Clonal evolution in cancer. Nature. Nature Publishing Group; 2012 Jan 19;481(7381):306–13. pmid:22258609