Research Article

Using Canonical Correlation Analysis to Discover Genetic Regulatory Variants

  • Melissa G. Naylor mail,

    Affiliation: Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America

  • Xihong Lin,

    Affiliation: Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America

  • Scott T. Weiss,

    Affiliation: Channing Laboratory, Brigham and Women's Hospital, Boston, Massachusetts, United States of America

  • Benjamin A. Raby,

    Affiliation: Channing Laboratory, Brigham and Women's Hospital, Boston, Massachusetts, United States of America

  • Christoph Lange

    Affiliations: Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America, Channing Laboratory, Brigham and Women's Hospital, Boston, Massachusetts, United States of America

  • Published: May 13, 2010
  • DOI: 10.1371/journal.pone.0010395



Discovering genetic associations between genetic markers and gene expression levels can provide insight into gene regulation and, potentially, mechanisms of disease. Such analyses typically involve a linkage or association analysis in which expression data are used as phenotypes. This approach leads to a large number of multiple comparisons and may therefore lack power. We assess the potential of applying canonical correlation analysis to partitioned genomewide data as a method for discovering regulatory variants.

Methodology/Principal Findings

Simulations suggest that canonical correlation analysis has higher power than standard pairwise univariate regression to detect single nucleotide polymorphisms when the expression trait has low heritability. The increase in power is even greater under the recessive model. We demonstrate this approach using the Childhood Asthma Management Program data.


Our approach reduces multiple comparisons and may provide insight into the complex relationships between genotype and gene expression.


The usefulness of examining associations between genetic markers and gene expression is due to the immediate and direct relationship between the gene expression phenotype and DNA sequence variation. As Rockman and Kruglyak stated, “The road from genotype to phenotype runs through gene expression” [1].

Most studies of transcriptional regulation have relied on univariate tests to find significant associations, each between a single genetic marker and a single expression probe. In genomewide association scans, one can easily imagine having a million single nucleotide polymorphisms (SNPs) and thousands of expression probes. The number of tests required to search for associations between individual SNPs and individual probes severely reduces power.

Canonical correlation analysis (CCA) is a statistical method that can reduce the number of tests by using multiple phenotypes and genotypes in each test. CCA compares two sets of variables (in this case, a set of SNP genotypes and a set of expression levels) to assess the correlation between them [2]. CCA finds a linear combination of the genotypes and a linear combination of the expression levels such that the correlation between the two is maximized. As it is, CCA cannot be applied to all SNPs and expression probes in a genomewide association study since the number of variables is greater than the number of subjects. Two modifications of CCA have recently been proposed for use with genetic marker and gene expression data: penalized CCA [3] and sparse CCA [4]. These methods are computationally intensive and are sometimes sensitive to starting parameters. Futhermore, they have not been simulated or applied to datasets as large as a genomewide association study. Here, we assess the power of a more straightforward approach: partitioning the data so that CCA can be applied to each subset of the data. CCA is used to construct one association test for a group of gene expression traits and a group of SNPs, thus reducing the burden of multiple comparisons. Using CCA in this way, we can not only pick out regions of the genome in which genotype is highly associated with gene expression, but can also potentially discover more complex relationships among the variables by examining the coefficients of the linear combinations with maximum correlation. We examine the potential of CCA to assess correlation between SNP data and expression data using simulations and a data analysis.


Canonical Correlation Analysis

In the th proband, , we denote the expression measurements to be used in a single test by , and the genotypes at the SNPs by . The corresponding matrices of expression profiles and genotypes are given by and . Without loss of generality, we assume the number of measured profiles, , is less than or equal to the number of SNPs, . (If not, the notation for profiles and the genotypes can be switched.) and must each be less than the number of subjects, .

We denote the covariance matrix of gene expressions by , the covariance matrix of genotypes by , and the covariance of gene expressions and genotypes by . can be thought of as a measure of the linkage disequilibrium among SNPs. In the application, all three variance/covariance matrices are obtained from their empirical estimators based on and . Note that and must be chosen to be small enough to ensure that the covariance matrices are not nearly singular.

The maximum correlation between the set of expression profiles and the set of SNPs is given by the maximum eigenvalue of the matrix where [5]. The weights for the linear combination of the expression profiles and for the combination of marker scores are given by and , respectively, where the vector is defined to be the eigenvector associated with the maximum eigenvalue of and , the eigenvector associated with the maximum eigenvalue of .

This yields the maximally correlated linear combinations, often referred to as the first pair of canonical variates:
Further canonical variates that maximize correlation between two linear combinations, under the constraint that they are independent of all preceding canonical variates, can easily be calculated. The eigenvalues of , , are the cannonical correlations for the sets of canonical variates.

Several tests have been developed to assess whether the two linear combinations derived using CCA are truly correlated. Bartlett proposed the following likelihood ratio test of versus , i.e., whether the covariance between the gene expression and genotypes is zero [6]:
This is equivalent to testing whether all the canonical correlations equal zero.

Simulation Methods

To apply CCA in our simulations, we chose a region of the genome containing three genes and picked out 20 tag SNPs to simulate. Using CCA, we found two linear combinations, one of the expression profiles and one of the SNP data, such that the correlation between the two linear combinations is maximized. The observed correlation between the two sets was then tested for association, using the Bartlett test statistic.

To simulate realistic genotypes, we mimicked the linkage disequilibrium structure of 20 tag SNPs in the HapMap CEU sample from a 350 kb region of chromosome 2 (positions 118250000–118600000, NCBI build 36, Figure S1). Tag SNPs were chosen such that each SNP not chosen had with at least one tag SNP in the HapMap CEU sample. To do this we used Haploview software v3.32 [7]. To estimate the linkage disequilibrium structure between each pair of consecutive SNPs, we estimated the haplotype frequencies for each pair of consecutive SNPs using the expectation-maximization (EM) algorithm implemented in the haplotype procedure of SAS/Genetics™ software.

We simulated the allele of the first SNP as a Bernoulli random variable with probability equal to the SNP's estimated allele frequency. Each consecutive allele in the 20 SNP haplotype was simulated conditional on the previous SNP's allele (using the conditional probabilities from the SAS haplotype procedure). haplotypes were simulated.

Three genes lie in the chromosome 2 region: DDX18, CCDC93 and INSIG2. To determine a realistic correlation structure for the expression levels of these three genes, we examined microarray expression data from a study on aging done at Children's Hospital of Boston [8]. The covariance matrix,
and mean expression levels,
were then used to simulate expression levels for three genes assuming a multivariate normal distribution.

Simulations assuming varying heritabilty of the expression trait, sample size (200 or 400 individuals), and genetic model (additive or recessive) were run to assess type one error rate and power. For each combination of heritability, sample size, and genetic model, at least 10,000 replications were run with 20 SNPs and three gene expression levels generated for each individual each time. CCA was performed using the cancor function in R [9], which standardizes the variables in order to reduce the chance that the magnitude of the coefficients would be unduly influenced by variables with larger magnitudes. We first simulated independent SNP and expression data to determine the size of Bartlett's likelihood ratio test for no correlation between the two variables. To model an association between a single SNP and a single gene we allowed the mean expression level to vary according to the genotype of a single SNP. Gene expression levels were modeled as multivariate normal with covariance equal to the covariance matrix from the Children's Hospital Boston data. The mean was set equal to zero for genes unaffected and equal to the genetic effect size times a function of the genotype for affected genes. For the additive genetic model, the effect size was multiplied by the number of copies of the risk allele. For the recessive model, the effect size is multiplied by two if and only if there are two copies of the risk allele. Simulating a recessive model is equivalent to simulating a dominant model using the opposite allele. Since we were not interested in simulating a signal between any particular SNP and gene, we randomly chose which SNP and gene would be correlated in each replication, giving all possible pairs equal probability. All simulations were performed in R v2.2.1 [9].


Simulation Results

Under the null hypothesis, the estimated probability of a type one error using Bartlett's likelihood ratio test decreases with increasing sample size (Table 1). For sample sizes greater than 60, the type one error is near 5%. Under the alternative hypothesis of association, the power to detect a significant correlation with Bartlett's test is compared with the power to detect the simulated association by regressing the expression quantitative trait locus (eQTL) of interest on the number of copies of the risk allele using Bonferroni correction to adjust for 60 pairwise tests (the number of pairwise tests needed to test each of the three eQTLs with each of the 20 SNPs) (Table 2). The type one error of the univariate test, 0.2%, is lower than that of CCA because the Bonferroni correction is overly conservative.


Table 1. Type one error.


Table 2. Power.


We assessed the effectiveness of CCA by considering both the power to detect an association between the SNPs and eQTLs and the ranking of the magnitude of the coefficient of the SNP of interest relative to the coefficients of all non-associated SNPs (Figure 1). More specifically, we considered the power to detect a correlation (using Bartlett's test) and have the SNP of interest be the top ranking SNP or be in the top five ranking SNPs (Figure 2). When the heritability of the eQTL is moderately high (i.e., greater than 0.10) both CCA and univariate regression have high power to detect association, with univariate regression slightly out-powering CCA. For eQTLs with low heritability, CCA has greater power than univariate regression in our simulations. This effect is even more pronounced for the smaller sample size (200 individuals) and for the recessive model.


Figure 1. Histograms of the rank of the SNP of interest in simulations.

Each panel represents a distinct set of simulations conditions: number of subjects, genetic model, and heritability. For each simulated dataset in which there was a significant correlation between SNPs and gene expression traits ( for Bartlett's test), SNPs were ranked according to the magnitude of their coefficients in the top canonical variate. Each panel is a histogram of ranks of the SNP of interest. The first column shows that the rank is uniformly distributed under the null hypothesis. The axis labels on the upper left plot apply to all plots.


Figure 2. Power for three methods of determining a significant finding.

The red line shows power to detect a correlation in the simulated region via Bartlett's test with . The blue line shows power to detect a correlation and have the SNP of interest be in the top five ranking SNPs (based on magnitude of SNP coefficients). The green shows power to detect a correlation and have the SNP of interest be the top ranking SNP. The black line shows power to detect an association between the SNP of interest and at least one of the three genes using univariate regression.


Data Analysis - Childhood Asthma Management Program

To apply CCA to a genomewide association study, we partitioned the expression data into groups of three consecutive probes and then tested for association between the probes and SNPs that fell within the window spanned by the three probes and a 200 kb margin on each side.

We applied CCA to a subset of the data from the Childhood Asthma Management Program (CAMP), a multi-center, randomized, clinical trial involving 1,041 children with asthma [10]. The subset consisted of 156 children with CD4+ peripheral blood lymphocyte gene expression data and genomewide SNP genotype data (Infinium II 550K SNP array) available. Although complete trios were available in this dataset, we only used the offspring data to mimic a population-based design.

We searched for local regulatory variants by partitioning the probes (measuring gene expression) into sets of three and considering all SNPs located within 200kb from the ends of the two outer probes. CCA was performed with a maximum of 20 SNPs per group and also with a maximum of 50 SNPs per group. For many probe trios there were more than the maximum number (20 or 50) of SNPs in the region. In these cases, we decreased the number of probes used until the number of SNPs in the region was low enough. If there were too many SNPs for a single probe, then the SNPs were divided into groups such that each group was as close as possible to the maximum number of allowed SNPs. SNPs were always partitioned by genomic position. For each group of SNPs and nearby gene expression traits, we applied Bartlett's likelihood ratio test. For each test significant after Bonferroni correction, we examined the canonical variates (Table 3). As a comparison, each probe-SNP pair was tested for association using univariate regression (any SNP located within 200kb of the end of the probe was considered).


Table 3. Canonical correlation analysis of Childhood Asthma Management Study with maximum number of SNPs (ms) equal to 20 and 50.


For the analysis with a maximum of 20 SNPs per group, 908 of the SNPs found significant using univariate regression were located in a group of SNPs that was significantly correlated using CCA. Using a maximum of 50 SNPs, there were 575 such SNPs. Although the number of tests significant after Bonferroni correction was much lower for CCA than for univariate regression (412 for 20 SNPs or 177 for 50 SNPs vs. 1749), the number of SNPs that were in a significant region was very high (6434 for 20 SNPs or 5669 for 50 SNPs).

About the Data

The Childhood Asthma Management Program (CAMP) was a 4.5 year multi-center clinical trial of childhood asthmatics aged 5–12 designed to evaluate the long-term efficacy and safety of inhaled anti-asthma medications. Following completion of the clinical trial, subjects have been reevaluated yearly as a longitudinal follow-up of the natural history of childhood asthma. RNA was obtained from peripheral blood CD4+ lymphocytes collected during a follow-up visit. CD4+ lymphocytes were isolated by positive selection using anti-CD4+ antibody-coated microbeads (Miltenyi Biotec, Auburn, CA). RNA was extracted using the QIAGEN RNeasy Mini Protocol. Expression profiles were generated with Illumina HumanRef8 v2 BeadChip oligonucleotide arrays (Illumina, San Diego CA) according to protocol. Arrays were read using the BeadArray scanner (Illumina) and analyzed using BeadStudio (version 3.1.7) without background correction. Raw expression intensities were processed using the lumi package (Du et al., 2008), with background adjustment with RMA convolution (Irizarry et al., 2003), transformation for variance stabilization, and combined-sample quantile normalization. 20,589 transcripts (gene expression traits) were considered for association testing. Adequate DNA for genomewide genotyping was available for all members of 156 parent-child trios of self-reported white ancestry. Genotyping was performed using the Illumina Infinium II HumanHap550 Genotyping BeadChip. Genotyping was performed by Illumina (San Diego, CA) according to protocol. All downstream data analysis was performed locally at the Channing Laboratory. Genotype evaluation and cleaning was performed using PLINK. Marker quality was assessed using a variety of measures including Illumina GC scores, ability to map genomic position of assay sequences to unique sites, parent-offspring genotype incompatibilities, and genotype completion rates.


CCA has the potential to be a powerful tool for identifying relationships between genotype and gene expression. It out-powered pairwise univariate regression in simulations where one SNP affected the expression of one gene when heritability was low. This difference in power was largest under the recessive model with a sample size of 200. Univariate regression with Bonferroni corrected pvalues tends to be overly conservative when heritability is low. Our results suggest that CCA may be useful for locating regions with association under difficult circumstances such as small sample size, small effect size, and recessive genetic models. Furthermore, the simple and intuitive method of picking the SNP with the largest coefficient (disregarding sign) is a fairly well-powered way to recover which SNP was truly correlated among the set of SNPs found significant by CCA.

This study has several limitations. CCA is a method designed to detect linear relationships between normally distributed variables and may not be sensitive to nonlinear relationships between SNPs and gene expression traits. Also, the simulations were based on a single region and only one SNP was associated with a single eQTL. In the data analysis, no gold standard existed for determining which SNPs were truly associated with an eQTL. We focused on applying CCA to SNPs and expression quantitative trait loci (eQTLs) in the same region of the genome; however, this method could straightforwardly be applied to sets of SNPs and eQTLs defined by gene pathways or any other criteria. Furthermore, one could choose one set of eQTLs (or SNPs) and test for canonical correlation with each set of the partitioned regulatory variants (or associated eQTLs).

CCA could also straightforwardly be used as a screening method to find regions of strong regulation and reduce multiple comparisons. For instance, if the data include parental genotypes, then, in a preliminary screening stage, the expected offspring genotypes (given the parental genotypes) could be used in place of the true genotypes in CCA. Promising areas of the genome could then be further tested using FBAT [11]. In this case, the screening stage would not compromise the significance of the second stage since the FBAT conditions on parental genotypes and phenotypes. CCA might also be useful for determining hotspots of genetic regulation and as a way to search for regulatory variants controlling the expression of genes in pathways.

CCA is a method to be considered in genetical genomics studies. It reduces multiple comparisons, is more powerful under some of the scenarios where univariate regression methods are underpowered, and may yield further insight into the true relationships between multiple SNPs and eQTLs by providing canonical variates.

Supporting Information

Figure S1.

Linkage disequilibrium (measured by D′) between 20 tag SNPs. The color red indicates D′ = 1 and a LOD score≥2. Blue indicates D′ = 1 and LOD<2. Pink indicates D′<1 and LOD≥2. White indicates D′<1 and LOD<2.


(2.27 MB TIF)


We thank all subjects for their ongoing participation in this study. We acknowledge the CAMP investigators and research team, supported by NHLBI, for collection of CAMP Genetic Ancillary Study data. Special thanks to Anne Plunkett, Teresa Concordia, Debbie Bull, Denise Rodgers, and D. Sundstrom for their assistance with sample collection; to Huiqing Yin-DeClue, Ph.D, Michael McLane and Chris Allaire for their assistance with T cell isolations and RNA preparation; and to Ankur Patel for his assistance running the microarrays. All work on data collected from the CAMP Genetic Ancillary Study was conducted at the Channing Laboratory of the Brigham and Womens Hospital under appropriate CAMP policies and human subject protections.

Author Contributions

Conceived and designed the experiments: MGN XL BAR CL. Performed the experiments: MGN. Analyzed the data: MGN. Contributed reagents/materials/analysis tools: STW. Wrote the paper: MGN BAR.


  1. 1. Rockman MV, Kruglyak L (2006) Genetics of global gene expression. Nat Rev Genet 7: 862–872.
  2. 2. Hotelling H (1936) Relations between two sets of variates. Biometrika 28: 321–377.
  3. 3. Waaijenborg S, de Witt Hamer PCV, Zwinderman AH (2008) Quantifying the association between gene expressions and dna-markers by penalized canonical correlation analysis. Stat Appl Genet Mol Biol 7: Article3.
  4. 4. Parkhomenko E, Tritchler D, Beyene J (2009) Sparse canonical correlation analysis with application to genomic data integration. Stat Appl Genet Mol Biol 8: Article 1.
  5. 5. Johnson RA, Wichern DW (2001) Applied Multivariate Statistical Analysis. Prentice Hall; 5 edition.
  6. 6. Bartlett MS (1941) The statistical significance of canonical correlations. Biometrika 32: 29–37.
  7. 7. Barrett JC, Fry B, Maller J, Daly MJ (2005) Haploview: analysis and visualization of ld and haplotype maps. Bioinformatics 21: 263–265.
  8. 8. Lu T, Pan Y, Kao SY, Li C, Kohane I, et al. (2004) Gene regulation and dna damage in the ageing human brain. Nature 429: 883–891.
  9. 9. R Development Core Team (2009) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Available: ISBN 3-900051-07-0.
  10. 10. (1999) The childhood asthma management program (camp): design, rationale, and methods. childhood asthma management program research group. Control Clin Trials 20: 91–120.
  11. 11. Steen KV, McQueen MB, Herbert A, Raby B, Lyon H, et al. (2005) Genomic screening and replication using the same data set in family-based association testing. Nat Genet 37: 683–691.