The authors have declared that no competing interests exist.
Conceived and designed the experiments: AP NT SD. Performed the experiments: CA AP SD. Analyzed the data: CA AP SD. Contributed reagents/materials/analysis tools: NT AP SD. Wrote the paper: NT AP CA.
Gene-based tests of association are frequently applied to common SNPs (MAF>5%) as an alternative to single-marker tests. In this analysis we conduct a variety of simulation studies applied to five popular gene-based tests investigating general trends related to their performance in realistic situations. In particular, we focus on the impact of non-causal SNPs and a variety of LD structures on the behavior of these tests. Ultimately, we find that non-causal SNPs can significantly impact the power of all gene-based tests. On average, we find that the “noise” from 6–12 non-causal SNPs will cancel out the “signal” of one causal SNP across five popular gene-based tests. Furthermore, we find complex and differing behavior of the methods in the presence of LD within and between non-causal and causal SNPs. Ultimately, better approaches for
Increasingly, in the analysis of SNP microarray data, SNPs are aggregated into sets representing genes, pathways, or other biologically meaningful sets. Set-based tests are then conducted in addition to testing for genotype-phenotype association using single marker approaches. The set-based approach is part of a general trend in statistical genetics to leverage
One promising approach to address these limitations of traditional, agnostic, single marker analyses of SNP microarray data considers testing biologically meaningful sets of SNPs. The two main purposes of this approach are to gain power through (1) aggregating true genotype-phenotype signal across the members of the set and (2) reducing the multiple-testing penalty by reducing the number of tests of significance being conducted. A number of recent approaches for the analysis of common variant SNP data using sets of SNPs have been proposed
While SNP-set methods have frequently been cited in the methodological literature as improving power relative to single-marker tests, in practice, these methods have remained somewhat under-utilized. Much of the literature for SNP-set testing methods applied to common and/or rare variants has considered the question of how to aggregate SNP genotype-phenotype signals statistically and which approaches are most powerful. Less focus has been given to the question of how SNPs should be aggregated into sets. While there are numerous biologically relevant sets (e.g. pathways, genes), we will focus the remainder of our attention on gene-based sets since, to date, this is arguably the most common way to aggregate SNPs. As would be expected, in most situations gene-based tests of association assign all SNPs within a gene (intragenic SNPs) to that gene for the test.
Methods vary, however, when considering inter-genic SNPs (SNPs that exist outside of the start and stop positions of a gene's exonic and intronic regions). The most common approach to assign SNPs to genes is the “Window” approach. In essence, the window approach extends the start and stop positions of the gene an arbitrary amount, ranging from 5 to 500 kb. Typically the window size is the same for both ends of a gene, but it can differ
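The window approach can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation; the `Gene` record, the SNP tuples, the coordinates, and the function name `snps_in_window` are all hypothetical. The two flanks are parameterized separately because the window need not be symmetric.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # first base of the gene (bp)
    end: int    # last base of the gene (bp)


def snps_in_window(gene, snps, left_kb=0.0, right_kb=0.0):
    """Return IDs of SNPs on the gene's chromosome that fall inside the
    gene boundaries extended by left_kb / right_kb kilobases."""
    lo = gene.start - int(left_kb * 1000)
    hi = gene.end + int(right_kb * 1000)
    return [snp_id for snp_id, chrom, pos in snps
            if chrom == gene.chrom and lo <= pos <= hi]


# Hypothetical gene and SNP positions, for illustration only.
gene = Gene("GENE_A", "10", 100_000, 113_000)
snps = [
    ("snp1", "10", 96_000),   # 4 kb to the left of the gene
    ("snp2", "10", 105_000),  # intragenic
    ("snp3", "10", 120_000),  # 7 kb to the right of the gene
    ("snp4", "11", 105_000),  # different chromosome
]

intragenic_only = snps_in_window(gene, snps)
symmetric_5kb = snps_in_window(gene, snps, left_kb=5, right_kb=5)
asymmetric = snps_in_window(gene, snps, left_kb=5, right_kb=10)
print(intragenic_only, symmetric_5kb, asymmetric)
```

Widening the window only ever adds SNPs to the set, which is why the choice of window size directly controls how many potentially non-causal SNPs enter the test.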
To date, little work has considered the pros and cons of the various SNP-gene assignment approaches and their potential impact on the performance of gene-based tests. We consider the impact of the inclusion of non-causal SNPs and LD structure on set-based tests of association for common SNP variants using simulated genotype and phenotype data. In each of these scenarios, we consider five SNP-set tests of association, representing two broad classes of tests
In order to assess the impact of different methods of assigning SNPs to genes on gene-based tests of association, we simulated genotype and phenotype data. In the following paragraphs we describe the data simulation process. There were four separate genotype simulations conducted as part of this analysis: (1) A simulation of independent genotypes; (2) A simulation of genotypes with LD between non-causal variants only; (3) A simulation of genotypes with LD between causal and non-causal variants; and (4) A realistic genotype simulation involving complex LD structure. In all cases, in order to generate the samples used in the analysis, a large population of genotypes was simulated assuming HWE. Five hundred random case-control samples were generated for each simulation setting.
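As a rough illustration of the first step, genotypes at independent SNPs under HWE can be drawn as the sum of two independent allele draws, each carrying the minor allele with probability equal to the MAF. The sketch below covers only this genotype step; the function name, population size, and seed are assumptions made for the example, and the disease model and case-control sampling are not reproduced here.

```python
import random


def simulate_genotypes(n_individuals, mafs, seed=None):
    """Simulate minor-allele counts (0, 1 or 2) for independent SNPs.

    Under HWE each genotype is the sum of two independent Bernoulli(maf)
    allele draws, i.e. a Binomial(2, maf) count.
    """
    rng = random.Random(seed)
    return [
        [int(rng.random() < maf) + int(rng.random() < maf) for maf in mafs]
        for _ in range(n_individuals)
    ]


# Hypothetical population: 20,000 individuals, one SNP at each MAF used in the text.
mafs = [0.05, 0.30]
population = simulate_genotypes(20_000, mafs, seed=1)

# Estimated minor-allele frequency per SNP (two alleles per individual).
estimated = [
    sum(person[j] for person in population) / (2 * len(population))
    for j in range(len(mafs))
]
print(estimated)
```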
The simulation with no LD covered 288 separate settings, as follows. Sets of SNP genotypes contained 1, 2, 4, or 8 causal SNPs. The relative risk (defined later) of each SNP set was 1.25 or 2.00, the total sample size (split evenly between cases and controls) was either 2000 or 4000, and the causal SNP MAF was either 5% or 30%. Thus, there were a total of 32 ( = 4×2×2×2) different ways to generate causal SNPs in the set. For each of the 32 causal SNP settings, we considered 9 different non-causal SNP settings (0, 2, 4, 8 or 32 non-causal SNPs at either 5% or 30% MAF), for a total of 288 ( = 9×32) settings. To simulate a situation with no LD, all SNP genotypes were simulated independently.
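The setting counts can be checked by enumerating the grid directly. The sketch below assumes that the setting with zero non-causal SNPs counts once (MAF is irrelevant when no non-causal SNPs are present), which is the reading that yields 9 non-causal settings; the variable names are illustrative.

```python
from itertools import product

# Causal SNP parameters described in the text.
causal_counts = [1, 2, 4, 8]
relative_risks = [1.25, 2.00]
sample_sizes = [2000, 4000]
causal_mafs = [0.05, 0.30]

causal_settings = list(
    product(causal_counts, relative_risks, sample_sizes, causal_mafs)
)

# Non-causal SNP parameters: zero non-causal SNPs is a single setting
# (assumption: MAF is undefined there), plus 4 counts x 2 MAFs.
noncausal_settings = [(0, None)] + list(product([2, 4, 8, 32], [0.05, 0.30]))

all_settings = list(product(causal_settings, noncausal_settings))

print(len(causal_settings), len(noncausal_settings), len(all_settings))
```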
Genotypes were also simulated to include LD structure between the non-causal SNPs. Specifically, non-causal SNP genotypes were recreated for each of the settings described in the previous section, assuming all non-causal SNPs were in the same LD block, or one of two separate LD blocks. LD blocks were in either low (correlation,
A total of 896 simulation settings were considered. The settings were closely related to those used for the simulation with no LD. Specifically, the same 32 combinations of parameters for causal SNPs were used, along with the same options for non-causal SNPs with the added component that non-causal SNPs were either all in the same high or low LD block, or non-causal SNPs were evenly split between two separate high, low or high and low LD blocks.
We also considered LD structure between causal and non-causal SNPs. In this scenario, we assumed each causal SNP was in its own LD block, and that each non-causal SNP was in exactly one LD block with a causal SNP. For each LD block, every non-causal SNP was correlated with the causal SNP to the same degree (either
We also used the observed LD structure in a sample of real genotype data as a starting point for simulation. We started by inferring phased haplotypes and population haplotype frequencies in a ∼900 kb region using fastPHASE
As shown in
Disease status was simulated following a method similar to that of Li and Leal
The probability of disease given an observed genotype at each causal SNP can then be simulated using a Bernoulli random variable for each causal SNP, with probability
All simulated data was analyzed using five recently proposed SNP-set tests, namely GATES
The GATES method
The VEGAS procedure
Gauderman et al.
Gauderman et al.
As part of our evaluation of the performance of different SNP-set methods, we also extended a previous analysis by our group to an analysis of heart disease causing SNPs in the Framingham heart study sample. Details of the sample, genotyping technology and gene assignments are provided elsewhere
In order to discuss the practical implications of our simulation analyses in situations where genome-wide application of SNP-set tests will occur, we explored SNP data as provided by HapMap. Specifically, we considered the Phase 3 CEU HapMap (HapMap 3 draft release #2, NCBI B36) sample, representing individuals of northern and western European ancestry
We first consider the 288 simulation settings for which there is no LD between SNPs. We summarized the general impact of simulation parameters on power through the use of a multiple regression model predicting power by each of the 6 simulation parameters as main effects (relative risk, number of causal SNPs, number of non-causal SNPs, MAF of causal SNPs, MAF of non-causal SNPs, and sample size). A separate model was fit for each of the five gene-based testing methods. Main effects models yielded
As expected, for all five tests, power increased significantly as the relative risk, sample size, or MAF of causal SNPs increased, with similar estimated regression coefficients across the five methods (details not shown). The MAF of non-causal SNPs was not significantly related to power for any test except LR-PC, where power decreased as the MAF of non-causal SNPs increased. Changes in power across the other two simulation settings, the number of causal and non-causal SNPs, are the focus of our analysis here. First, for all five tests, power decreased as the number of non-causal SNPs increased, while power increased with the addition of causal SNPs to the analysis. Across all of our simulation settings, and averaged across the five SNP-set tests, power declined by an average absolute amount of 0.0026 for each additional non-causal SNP included in the test, but the decline could be as high as 0.0095.
Gene-Based Test | Estimated absolute power loss for one additional non-causal SNP# | Estimated absolute power gain for one additional causal SNP | Estimated number of non-causal SNPs which cancel out the power gained from one causal SNP |
GATES | 0.0025 | 0.0160 | 6.3 |
VEGAS-SUM | 0.0028 | 0.0299 | 10.6 |
VEGAS-MAX | 0.0025 | 0.0151 | 6.1 |
LR-PC | 0.0023 | 0.0267 | 11.5 |
LR | 0.0027 | 0.0305 | 11.4 |
Using a multiple regression model including all parameter and simulation results.
Assuming all other parameters are held constant.
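The last column of the table is the ratio of the estimated power gain per causal SNP to the estimated power loss per non-causal SNP. The sketch below recomputes that ratio from the rounded values printed in the table; because the inputs are rounded, the recomputed ratios differ slightly (by about 0.1) from the published column, which was presumably derived from unrounded regression coefficients.

```python
# (power loss per non-causal SNP, power gain per causal SNP), as printed in the table.
estimates = {
    "GATES": (0.0025, 0.0160),
    "VEGAS-SUM": (0.0028, 0.0299),
    "VEGAS-MAX": (0.0025, 0.0151),
    "LR-PC": (0.0023, 0.0267),
    "LR": (0.0027, 0.0305),
}

# Number of non-causal SNPs whose combined power loss offsets one causal SNP.
cancel_ratio = {test: gain / loss for test, (loss, gain) in estimates.items()}

# Average absolute power loss per additional non-causal SNP across the five tests.
mean_loss = sum(loss for loss, _ in estimates.values()) / len(estimates)

for test, ratio in cancel_ratio.items():
    print(f"{test}: about {ratio:.1f} non-causal SNPs per causal SNP")
print(f"mean loss per non-causal SNP: {mean_loss:.4f}")
```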
Next, we considered the impact of LD between the non-causal SNPs. The LD simulation parameters we added to the model were amount of LD (
For all five tests, power decreased very little when moving from one to two LD blocks; the main effect term in each of the regression models was non-significant. On the other hand, the amount of LD observed (high or low) was significantly related to the observed power of the test in three of the five models, though the direction of effect differed between tests. For the GATES test, high LD between non-causal SNPs yielded significantly more power than low LD. A similar trend was observed for LR, though it was not statistically significant. For both VEGAS-max and VEGAS-sum, increased LD was associated with significantly lower power. The LR-PC test performed very poorly in situations with high LD and a large number of non-causal SNPs. Further investigation found that this approach was eliminating the causal variants from the analysis, since principal components on the genotype matrix found more than 80% of the correlation in genotypes explained by non-causal variants alone.
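The LR-PC behavior can be illustrated with a simplified calculation. Assume, purely for illustration, that all non-causal SNPs sit in one block with a common pairwise correlation r and that causal SNPs are independent of everything else. The correlation matrix is then block diagonal, the block contributes one eigenvalue 1 + (k − 1)r and k − 1 eigenvalues 1 − r, and each independent causal SNP contributes an eigenvalue of 1. The value r = 0.9 below is an assumed number, not one taken from the simulation settings; the 80% threshold follows the figure quoted above.

```python
def block_eigenvalues(k, r):
    """Eigenvalues of a k x k correlation matrix whose off-diagonal
    entries all equal r (compound symmetry)."""
    return [1 + (k - 1) * r] + [1 - r] * (k - 1)


def components_needed(eigenvalues, threshold=0.80):
    """Number of leading principal components needed for the cumulative
    share of total variance to reach the threshold."""
    total = sum(eigenvalues)
    running = 0.0
    for i, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if running / total >= threshold:
            return i
    return len(eigenvalues)


# Assumed example: 32 non-causal SNPs in one block (r = 0.9), 1 independent causal SNP.
n_noncausal, r, n_causal = 32, 0.9, 1
eigs = block_eigenvalues(n_noncausal, r) + [1.0] * n_causal

top_share = max(eigs) / sum(eigs)   # variance share of the leading component
n_kept = components_needed(eigs)    # components retained at the 80% threshold

print(f"leading component explains {top_share:.1%}; components kept: {n_kept}")
```

In this simplified setting the leading component loads only on the non-causal block, so a rule that keeps components up to 80% of the variance retains that single component and discards the one carrying the causal SNP.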
Next, we considered the impact of LD between causal and non-causal SNPs. Regression models similar to those described in previous sections were used to assess overall trends in power in relation to the simulation parameters. Overall, the seven main effects terms explained most of the observed variability in power (model
When analyzing all SNPs simultaneously (causal and non-causal), the addition of non-causal SNPs was not related to changes in power for four of the five methods. The lone exception was VEGAS-max, which yielded lower power with larger numbers of non-causal SNPs. We note that in all simulation settings considered in this analysis, all non-causal SNPs are in LD with a causal SNP. As seen earlier when investigating the relationship between power and amount of LD, different methods yielded different results. In this case, three of the five methods (GATES, VEGAS-sum and LR) showed significant power gain with higher levels of LD, VEGAS-max showed significant power loss with higher levels of LD, and LR-PC showed no significant change in power.
A follow-up analysis which considered only non-causal SNPs in order to illustrate a situation where causal SNPs were not genotyped (e.g., if only using tagSNPs) showed similar patterns of association in almost all cases. The two exceptions were that the VEGAS-max test no longer showed significant power loss as LD increased, and the LR-PC test showed significant power gain as LD increased.
We now focus on how power changes for the different tests in a realistic LD situation (as depicted
We also considered the inclusion of intergenic regions around the SNP using a combined window-LD thresholding approach. In particular, only SNPs within a given window of the gene that exhibited LD of at least
We applied GATES, VEGAS-SUM, VEGAS-MAX, LR-PC and LR to sets of SNPs in and around VSTM4, a 13 kb gene located at approximately 49.9 Mb on chromosome 10.
Region (SNP-set) | Total Number of SNPs | Additional SNPs with p-value less than 0.002 (p-value)1 | GATES | VEGAS-SUM | VEGAS-MAX | LR-PC | LR |
VSTM4 | 5 | rs12245255 (0.00016) | 0.0006 | 0.0034 | 0.011 | 0.0021 | 0.02 |
VSTM4+/−5kb | 6 | rs4298825 (0.003) rs4488117 (0.002) | 0.0011 | 0.0023 | 0.035 | 0.0046 | 0.14 |
VSTM4+/−10kb | 10 | rs6537494 (0.0016) | 0.0014 | 0.0014 | 0.060 | 0.0047 | 0.20 |
VSTM4+/−15kb | 15 | rs4240498 (0.0008) rs7074818 (0.0016) | 0.0020 | 0.0007 | 0.16 | 0.013 | 0.22 |
VSTM4+/−25kb | 25 | none | 0.0030 | 0.0010 | 0.24 | 0.015 | 0.12 |
VSTM4+/−50kb | 38 | none | 0.0034 | 0.0024 | 0.33 | 0.041 | 0.003 |
1. To find the total number of significant SNPs in each SNP-set, include all significant SNPs listed in the row of interest and in the rows above it.
Location (bp downstream of VSTM4) | SNPID | rs7074818 | rs4240498 | rs6537494 | rs4298825 | rs4488117 | rs12245255 |
11696 | rs7074818 | 1.00 | 0.90 | 0.94 | 0.37 | 0.38 | 0.84 |
10038 | rs4240498 | 0.90 | 1.00 | 0.96 | 0.37 | 0.40 | 0.90 |
9824 | rs6537494 | 0.94 | 0.96 | 1.00 | 0.40 | 0.41 | 0.90 |
2915 | rs4298825 | 0.37 | 0.37 | 0.40 | 1.00 | 0.94 | 0.42 |
312 | rs4488117 | 0.38 | 0.40 | 0.41 | 0.94 | 1.00 | 0.45 |
Intragenic | rs12245255 | 0.84 | 0.90 | 0.90 | 0.42 | 0.45 | 1.00 |
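The LD table can be used to illustrate how an LD-threshold rule would choose among the inter-genic SNPs near VSTM4. The sketch below takes the pairwise LD values with the intragenic SNP rs12245255 from the last row of the table and keeps the SNPs at or above a cutoff. The cutoff of 0.8 and the function name are hypothetical choices for the example, not the threshold used in the analysis.

```python
# Pairwise LD with the intragenic SNP rs12245255, copied from the table above.
ld_with_intragenic = {
    "rs7074818": 0.84,
    "rs4240498": 0.90,
    "rs6537494": 0.90,
    "rs4298825": 0.42,
    "rs4488117": 0.45,
}


def select_by_ld(ld_values, threshold):
    """Return (sorted) IDs of SNPs whose LD with the reference SNP
    is at least the given threshold."""
    return sorted(snp for snp, ld in ld_values.items() if ld >= threshold)


# Hypothetical cutoff chosen for illustration only.
selected = select_by_ld(ld_with_intragenic, threshold=0.8)
print(selected)
```

With this cutoff, the three more distant SNPs (roughly 10 to 12 kb from the gene) are retained while the two nearest SNPs are dropped, which shows that physical distance alone is a poor proxy for LD with intragenic variation in this region.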
To provide a genome-wide view of the LD structure as it pertains to gene-based tests of association, we analyzed the LD structure of HapMap data.
Gene-based tests are being applied with increasing frequency to common SNPs (MAF>5%) directly measured by SNP microarrays or imputed in GWAS as an alternative to single-marker tests. Despite the promise that aggregating the signal from multiple causal variants will improve power and reduce multiple testing penalties, these methods have generally performed poorly in practice. In our analysis we investigated a variety of realistic factors potentially associated with power across two major classes of gene-based tests. First, we confirmed that all gene-based tests considered here show the well-known and expected relationships between power and sample size, relative risk, MAF and number of causal variants. Furthermore, we found that the inclusion of non-causal variants was detrimental to power for all methods. In fact, on average, it only took 6–12 independent non-causal variants to “cancel out” the effect of a single causal variant. This implies that unless more than 10–15% of all independent variants are causal, gene-based tests of common variants may be relatively ineffective. Complex LD structure, the differing responses of statistical methods to that LD structure, and variations in the impact of relative risk, MAF and sample size mean that we should be hesitant to generalize that result to all situations. However, the fact remains that non-causal SNPs substantially reduce the power of gene-based tests.
The impact of non-causal SNPs is compounded when we consider that many investigators include inter-genic SNPs in gene-based analyses. If no causal SNPs are present in the inter-genic space, then researchers should only include inter-genic SNPs that are in LD with SNPs inside the gene – and, in this case, it is only beneficial for certain methods (e.g., GATES, LR), while this approach appears detrimental to other methods (e.g., VEGAS). As more and more genomic information becomes available, utilizing LD information in gene-based tests is becoming more practical than ever.
Window-based approaches are only reasonable when causal SNPs are in the inter-genic space. Of course,
However, perhaps even more important than SNPs in the inter-genic space is the impact that better prioritization of intragenic SNPs will have on power. As more and more sequence data become available, we can anticipate that (a) all causal SNPs will be typed and (b) the predicted functional impact of SNPs can be integrated into the analysis. For example, given exonic sequence data, it may be practical to include only non-synonymous intragenic SNPs in the analysis, thus increasing the causal to non-causal SNP ratio and, potentially, improving statistical power. Additionally, if all SNPs are typed (directly sequenced or imputed), then we will no longer need to rely on tag SNPs (non-causal SNPs in LD with causal SNPs) to capture the causal signal. Further consideration is needed to explore how gene-based tests should be applied to common variants when investigating sequence data.
Recently, given the advent of next-generation sequencing data, gene-based testing methods which incorporate both common and rare variants have been proposed. Further work is needed to see how the conclusions found here apply to those methods. However, the effect of non-causal variants is likely the same since methods which focus only on rare variants have been shown to suffer power loss in the presence of non-causal variants (e.g., Li and Leal 2008). Ultimately, methods are needed which are more robust to the inclusion of non-causal variants. A promising approach has recently been proposed by Liu et al. (unpublished manuscript).
As more knowledge of “typical” genetic architectures becomes available, more sophisticated analyses comparing single and multiple marker methods will be possible that can explicitly consider the tradeoff of multiple testing penalties for power in the presence of differing numbers of causal variants, their relative risks and allele frequency distribution, as well as the impact of non-causal variants.
Lastly, our analysis considers only five of a very large, and growing, set of gene-based tests. Notably, we only considered self-contained tests and did not consider competitive tests in our analysis. We use the GATES/VEGAS tests as representatives of tests that combine single marker p-values and use LD structure to account for correlation between genotypes. LR and LR-PC were selected as representatives of gene-based tests that require the full genotype-phenotype matrix and use regression or regression-like approaches to assess significance of a set of markers. The disparate relationships between LD structure and power, even among the methods selected here, mean that some caution is needed when projecting our conclusions beyond these methods.
Our analysis suggests that one reason for the poor performance of gene-based tests of association for common variants is limited power in the presence of a large percentage of non-causal variants. This finding suggests that window-based methods of assigning SNPs to genes should not be used, especially in light of increasing knowledge of the human genome. Methods are needed which are more robust to the inclusion of non-causal variants, though better a priori prediction of causal variants using bioinformatics methods will also substantially improve power.
We acknowledge the use of the Hope College parallel computing cluster for assistance in data simulation and analysis.