Accounting for Population Stratification in Practice: A Comparison of the Main Strategies Dedicated to Genome-Wide Association Studies

Matthieu Bouaziz; Christophe Ambroise; Mickael Guedj

doi:10.1371/journal.pone.0028845

Abstract

Genome-Wide Association Studies are powerful tools to detect genetic variants associated with diseases. Their results have, however, been questioned, in part because of the bias induced by population stratification. This bias arises from systematic differences in allele frequencies due to differences in sample ancestry, and it can lead to both false positive and false negative findings. Many strategies are available to account for stratification, but their performances differ, for instance according to the type of population structure, the minor allele frequency of the disease susceptibility locus, the degree of sampling imbalance, or the sample size. We focus on the type of population structure and compare the most commonly used methods to deal with stratification: Genomic Control, Principal Component based methods such as those implemented in Eigenstrat, adjusted Regressions and Meta-Analysis strategies. Our assessment of the methods is based on a large simulation study involving several scenarios that correspond to many types of population structure. We focused on both false positive rate and power to determine which methods perform best. Our analysis showed that, when there is no population structure, none of the tests introduced a bias or decreased the power, except for the Meta-Analyses. When the population is stratified, adjusted Logistic Regressions and Eigenstrat are the best solutions to account for stratification, even though only the Logistic Regressions consistently maintain correct false positive rates. This study provides more details about these methods. Their advantages and limitations in different stratification scenarios are highlighted in order to propose practical guidelines to account for population stratification in Genome-Wide Association Studies.

Citation: Bouaziz M, Ambroise C, Guedj M (2011) Accounting for Population Stratification in Practice: A Comparison of the Main Strategies Dedicated to Genome-Wide Association Studies. PLoS ONE 6(12): e28845. https://doi.org/10.1371/journal.pone.0028845

Editor: Thomas Mailund, Aarhus University, Denmark

Received: July 21, 2011; Accepted: November 16, 2011; Published: December 21, 2011

Copyright: © 2011 Bouaziz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This study was funded by Pharnext SA, Paris, France and the Genome and Statistics Laboratory, Evry, France. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors declare that they received funding from Pharnext SA, Paris. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.

Introduction

Genome-wide association studies (GWAS) have become a widely used approach for gene mapping of complex diseases. With the development of high-throughput genotyping technologies, many markers are available to conduct these studies. The most common study design is the case-control design using unrelated individuals. The relevance of the results of such large-scale genetic studies is, however, questioned. Indeed, certain biases arise when conducting a GWAS, leading to false discoveries. As a consequence, only a few associations are consistently and convincingly replicated [1]. There can be many causes of such spurious findings and non-replications [2]–[4]. It is broadly considered that failure to account for the bias induced by population stratification is one of them. This phenomenon occurs when the sampling has been made within genetically non-homogeneous populations, i.e. there are systematic differences in allele frequencies due to ancestry and the baseline disease risks differ between the actual subpopulations. This can lead to finding spurious associations or to missing genuine ones [5]–[8]. Accounting for population stratification has now become a necessary step in the conduct of a GWAS, especially with the development of very large studies such as the ones undertaken by international consortia. These studies gather many cohorts of cases and controls, not always matched, with different ancestries.

The most widely used test to detect an association is Armitage's Trend test. Its test statistic follows a $\chi^2$ distribution under the null hypothesis of no association. In case of population stratification, this distribution is inflated and the test statistic follows a non-central $\chi^2$ distribution. Several main approaches exist to account for population stratification in GWAS: Genomic Control [9], [10], Principal Component Analysis (PCA) based methods [11], [12], Regression models [4], [13], and Meta-Analyses. Genomic Control aims at correcting the inflated null distribution of the Trend test statistic by estimating an inflation factor, usually called $\lambda$, using many markers. In practice we usually consider that a $\lambda$ below 1.05 indicates that there is no stratification [14]. The main assumption of this method is that the inflation factor is the same for all markers. PCA-based methods use markers to define continuous axes of variation, called principal components, that reduce the data to a few variables containing most of the information about the genetic variability. These axes often reflect the spatial distribution of the ancestries of the samples. Using such methods, Price et al. proposed an association test to account for stratification. It is implemented in the software Eigenstrat [11]. In practice, it is also common to use the principal components as covariates to adjust the classical association test for stratification. These models are adjusted Logistic Regression models, and other adjustments, such as on discrete population labels, can be used. Another possible approach to deal with population stratification is to conduct the analyses within subpopulations considered homogeneous and to combine the results with Meta-Analysis methods, such as Fisher's or Stouffer's Z-score methods [15]. It is also possible to use Structured Association methods to work around the stratification issues [16], [17]. These approaches aim at inferring the structure of the population using parametric models. The software Structure implements this sort of approach [16]. A corresponding association test is available in the software Strat [18], but it is less often used in practice. Note that other methods accounting for stratification, less used in practice, are described in [19]–[27].
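To make the reference test concrete, the following is a minimal sketch of Armitage's Trend test for one SNP, written in Python for illustration. It relies on the standard identity that the trend statistic equals N times the squared correlation between the genotype score (0, 1, 2) and the case indicator, and on the closed form of the upper tail of a chi-square distribution with one degree of freedom. The function name and the input layout are hypothetical and are not taken from the authors' R script.

```python
import math

def trend_test(case_counts, control_counts, scores=(0, 1, 2)):
    """Armitage trend statistic and p-value from genotype counts.

    case_counts / control_counts: numbers of samples carrying 0, 1 and 2
    copies of the tested allele (hypothetical input layout).
    The SNP must be polymorphic and both groups non-empty.
    """
    n_case = sum(case_counts)
    n_ctrl = sum(control_counts)
    n = n_case + n_ctrl
    totals = [r + s for r, s in zip(case_counts, control_counts)]

    # Means of the genotype score (x) and of the case indicator (y).
    mean_x = sum(x * t for x, t in zip(scores, totals)) / n
    mean_y = n_case / n

    # Sums of squares and cross-products of x and y over all samples.
    sxx = sum(t * (x - mean_x) ** 2 for x, t in zip(scores, totals))
    syy = n_case * (1 - mean_y) ** 2 + n_ctrl * mean_y ** 2
    sxy = sum(r * (x - mean_x) * (1 - mean_y) - s * (x - mean_x) * mean_y
              for x, r, s in zip(scores, case_counts, control_counts))

    stat = n * sxy ** 2 / (sxx * syy)          # N * r^2
    p_value = math.erfc(math.sqrt(stat / 2))   # chi-square(1) upper tail
    return stat, p_value
```

When cases and controls have identical genotype counts the statistic is zero and the p-value is one; stratification inflates the statistic at markers whose allele frequencies differ between ancestries.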

The potential of each approach to correct for population stratification actually depends on many factors, such as the degree of stratification or the degree of sampling imbalance. The latter corresponds to situations where the proportions of cases and controls are not the same within the subpopulations. Three types of population structure can be highlighted [26]: discrete structures, admixed populations and hierarchical structures. Discrete structures correspond to cohorts composed of several discrete populations (e.g. African and Caucasian cohorts). Admixture structures pertain to cohorts where the samples have admixed ancestries (e.g. African American). Hierarchical structures combine both discrete and admixture structures. The type of population structure is a very important parameter as it affects all the methods to varying degrees, rendering them more or less efficient.

Many reviews and comparison articles looking at approaches to account for population stratification have examined the potential of the methods [14], [28]–[32]. They focused on certain parameters affecting the stratification, such as the sampling imbalance, the minor allele frequency of the disease susceptibility locus or the sample size. Most of them, however, did not exhaustively consider the different types of population structure. The study that we propose in this paper carefully analyzes this very parameter. We compare the most commonly used methods by considering a large panel of stratification scenarios corresponding to the different types of population structure. Our study differs from the recent comparison proposed in [32] in the methods considered and the type of simulations conducted. In our study, numerous stratified datasets are simulated based on real data so that the structure of the population is well controlled and the data are similar to those used in real situations. We are interested in determining which methods tend to perform well, in terms of false positive rate and power, under various situations. More precisely, we aim at providing practical indications regarding which method(s) should be used with a given population structure because they properly account for the stratification bias. We address these questions for unstructured populations, admixed populations, and discrete and hierarchical ones. We also propose a solution for situations where the sampling design has led to subpopulations composed only of cases or only of controls that have not been genetically matched.

Materials and Methods

First, we present the different methods that we decided to compare. Then we describe our process to simulate genetic data under various stratification scenarios. Finally, we detail the comparison strategy, i.e. how we estimated the two statistical indicators of interest, the false positive rates and the powers of the methods.

A large panel of strategies compared

We decided to compare the performances of six broadly used strategies to account for stratification. First, we focused on the Genomic Control (GC) [9] and on the test proposed by Price et al. implemented in Eigenstrat (Eig) [11]. Then, we included adjusted Logistic Regressions (Reg). Many types of adjustment can be considered. We decided to focus on those most used in practice: adjustment on the first five principal components resulting from a PCA (Reg PCs), adjustment on the real population labels when this information is precisely known (Reg Real Pop) and adjustment on estimated population labels (Reg Est Pop). These latter labels were estimated using the method of Lee et al. [33]. We also studied one Meta-Analysis approach based on Fisher's score (Meta). Finally, we considered Armitage's Trend test, which does not account for stratification, as a reference to assess the level of stratification in the data.
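As an illustration of the Meta strategy, the sketch below combines p-values obtained separately in each subpopulation with Fisher's method, in which minus twice the sum of the log p-values follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, k being the number of combined tests. It is a generic Python sketch under the assumption that the per-subpopulation tests are independent; the exact procedure used in the study is described in Method S1 and may differ, and the function name is hypothetical.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values (all > 0) with Fisher's method.

    Returns the combined statistic and its p-value.
    """
    k = len(p_values)
    x = -2.0 * sum(math.log(p) for p in p_values)

    # Upper tail of a chi-square with 2k degrees of freedom, using the
    # closed form for even df: exp(-x/2) * sum_{j<k} (x/2)^j / j!
    half = x / 2.0
    term, tail = 1.0, 0.0
    for j in range(k):
        tail += term
        term *= half / (j + 1)
    return x, math.exp(-half) * tail
```

With a single p-value the function returns that p-value unchanged, which is a convenient sanity check before combining several subpopulations.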

Several additional adaptations of the Genomic Control, Regressions and Meta-Analysis were investigated as well. Since their results did not turn out to be significantly different from those of the original approaches, we will only consider them in the Discussion section. The six main methods investigated and their alternatives are detailed in Method S1, and an R script is available on request.

Simulation model

Our simulation model follows previously used approaches [34]–[36] and is based on the diplotype frequencies of real datasets. These frequencies are used as an empirical distribution of the range of possible diplotypes. Simulating this way leads to genetic patterns similar to those found in real data and allows us to finely control the type of population structure. We first simulate several datasets corresponding to the subpopulations of origin. Then we randomly mate each subpopulation and apply a genetic model to generate diseased and healthy samples. To simulate discrete subpopulations, the populations of origin are mated independently; to simulate admixed populations, these populations are mated with each other. The final simulated subpopulations are mixed together to produce a cohort of individuals with population structure. The type of population structure depends on the original datasets selected and on the parameters of the model.
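The difference between discrete and admixed mating can be illustrated with a deliberately simplified Python sketch. It is not the authors' diplotype-based procedure: it assumes small hypothetical pools of 0/1 haplotypes for two populations of origin and builds each child from one haplotype drawn in each parental pool. All names and data below are invented for illustration.

```python
import random

def mate(pool_a, pool_b, n_children, rng):
    """Draw children whose two haplotypes come from pool_a and pool_b."""
    children = []
    for _ in range(n_children):
        hap1 = rng.choice(pool_a)
        hap2 = rng.choice(pool_b)
        # Genotype = number of copies of allele 1 at each SNP.
        children.append([a + b for a, b in zip(hap1, hap2)])
    return children

rng = random.Random(0)

# Hypothetical haplotype pools (0/1 alleles at 4 SNPs) for two
# populations of origin; the first SNP is fixed for different alleles.
pop1 = [[0, 0, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
pop2 = [[1, 1, 0, 0], [1, 0, 0, 1], [1, 1, 0, 1]]

# Discrete structure: each population is mated with itself only.
discrete = mate(pop1, pop1, 50, rng) + mate(pop2, pop2, 50, rng)

# Admixed structure: the two populations are mated with each other.
admixed = mate(pop1, pop2, 100, rng)
```

In this toy example the first SNP separates the two discrete groups completely (genotypes 0 and 2), whereas every admixed child is heterozygous at that SNP.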

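The genetic model is based on Wright's model [37] applied to a bi-allelic marker with alleles A and a. Let $f_0$, $f_1$ and $f_2$ be the frequencies of genotypes aa, aA and AA, defined by

\[ f_0 = (1-p)^2 + F\,p(1-p), \qquad f_1 = 2p(1-p)(1-F), \qquad f_2 = p^2 + F\,p(1-p), \]

where $p$ is the minor allele frequency of the SNP (here the frequency of allele A) and $F$ is the consanguinity coefficient, which we consider null hereafter so that the Disease Susceptibility Locus (DSL) is under Hardy-Weinberg equilibrium.

We then want to compute the genotype frequencies of the DSL for cases and controls, $f_i^{\mathrm{case}}$ and $f_i^{\mathrm{control}}$, $i$ = 0, 1 or 2, using the disease prevalence $K$, the penetrances $\pi_0$, $\pi_1$ and $\pi_2$ of the genotypes and the mode of inheritance of the disease. The main modes of inheritance can be defined by considering the relative risks $RR_i$, $i$ = 1, 2, given by

\[ RR_i = \frac{\pi_i}{\pi_0}. \]

Using $K$, $RR_1$ and $RR_2$ and the Bayes formulas, we can easily derive the desired frequencies:

\[ f_i^{\mathrm{case}} = \frac{\pi_i f_i}{K}, \qquad f_i^{\mathrm{control}} = \frac{(1-\pi_i)\, f_i}{1-K}, \qquad \text{with } \pi_0 = \frac{K}{f_0 + RR_1 f_1 + RR_2 f_2} \text{ and } \pi_i = RR_i\,\pi_0. \tag{1} \]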
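The derivation of the case and control genotype frequencies can be sketched numerically as follows. This Python function is an illustration only: the argument names are hypothetical, genotypes are indexed by the number of copies of the risk allele, the relative risks are taken with respect to the genotype carrying no risk allele, and the baseline penetrance is chosen so that the average penetrance equals the prevalence.

```python
def case_control_frequencies(maf, prevalence, rr1, rr2, inbreeding=0.0):
    """Genotype frequencies at the DSL among cases and among controls.

    Genotypes are indexed by the number of risk alleles (0, 1, 2).
    The parameters must keep every penetrance between 0 and 1.
    """
    p, q = maf, 1.0 - maf
    # Wright's genotype frequencies for 0, 1 and 2 copies of the risk allele.
    f = [q * q + inbreeding * p * q,
         2.0 * p * q * (1.0 - inbreeding),
         p * p + inbreeding * p * q]
    rr = [1.0, rr1, rr2]

    # Baseline penetrance such that sum(penetrance_i * f_i) = prevalence.
    base = prevalence / sum(r * fi for r, fi in zip(rr, f))
    pen = [base * r for r in rr]

    # Bayes formula: P(genotype | case) and P(genotype | control).
    cases = [pi * fi / prevalence for pi, fi in zip(pen, f)]
    controls = [(1.0 - pi) * fi / (1.0 - prevalence)
                for pi, fi in zip(pen, f)]
    return cases, controls
```

With both relative risks equal to 1, cases and controls share the population genotype frequencies, which corresponds to the null hypothesis of no association.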

Data sources and stratification scenarios

We simulated our data according to the model described in the previous section, using the HapMap (http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/2010-08_phaseIIIII) populations. 5,500 SNPs with minor allele frequencies higher than 5% were randomly chosen, in equal numbers on each of the autosomes. We only considered SNPs present on the Affymetrix GeneChip Human Mapping 500K array so that these SNPs are among those commonly used in GWAS. Then, for each of our stratification scenarios, some of the HapMap populations were used to simulate our final data with 5,500 SNPs and one DSL following an additive model and randomly located among the available loci.

We aimed at covering several situations, as it may be harder to account for stratification with closely related populations than with very distant ones. Therefore, to get an exhaustive assessment of the strategies, we considered several scenarios corresponding to different types of population structure: no structure, admixed populations, discrete structures with populations more or less genetically close, and a hierarchical structure. The proportions of cases and controls simulated are different in the subpopulations, so that the design is not a simple random sampling. This, together with the differences between the populations, ensures that we induced and controlled a bias due to population stratification.

The different scenarios that we considered are described hereafter and graphically represented in Figure 1. In addition, Table S1 gives the simulation parameters for these scenarios.

Figure 1. Population structures of the different scenarios.

Samples are represented on the first two principal components (PCs) estimated on the genotype data.

https://doi.org/10.1371/journal.pone.0028845.g001

Scenario 1: One homogeneous population.

With only one such population there is no stratification. The idea is to determine whether the methods accounting for stratification are reliable when they are applied to a non-stratified population. Individuals from Han Chinese in Beijing, China (CHB) are used to simulate these data.

Scenario 2: Admixture.

We considered an admixture of two originally close populations: Chinese in Metropolitan Denver, Colorado (CHD) and Han Chinese in Beijing, China (CHB).

Scenario 3: Two fairly distant discrete populations.

The two relatively distant discrete populations are Utah residents with Northern and Western European ancestry from the CEPH collection (CEU) and Toscani in Italia (TSI).

Scenario 4: Two very distant discrete populations.

The two very distant discrete populations are Han Chinese in Beijing, China (CHB) and Utah residents with Northern and Western European ancestry from the CEPH collection (CEU).

Scenario 5: Hierarchical structure.

The hierarchical structure is composed of five populations: Yoruba in Ibadan, Nigeria (YRI), Luhya in Webuye, Kenya (LWK), Han Chinese in Beijing, China (CHB), Gujarati Indians in Houston, Texas (GIH) and Utah residents with Northern and Western European ancestry from the CEPH collection (CEU).

Scenario 6: Varying proportions of cases/controls.

This scenario uses the same populations as scenario 4 but with a varying proportion of cases between the two subpopulations. The proportion of controls is fixed and equal in the two populations, while the cases are split between them with an (r, 1 - r) ratio, with r varying. When r = 0, all the cases are in the CEU population, which is the less affected by the disease. When r = 1, all the cases are in the more affected population (CHB). Our goal is to observe the behavior of the methods as a function of the degree of sampling imbalance and to check whether they still perform well in the extreme case where all the cases come from only one of the populations. In this latter case, it is also of interest to determine whether the best way to account for population stratification is simply to restrict the analysis to the cohort composed of both cases and controls, excluding the samples that are not matched. The answer to this issue is particularly useful for large studies where controls with different ancestries are used to match the genotyped cases.
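The link between the ratio r and the stratification bias can be made explicit with a short worked equation. The notation is ours and illustrative only: consider a marker with no effect on the disease, with allele frequencies $p_{\mathrm{CHB}}$ and $p_{\mathrm{CEU}}$ in the two populations, a share r of the cases drawn from CHB, and controls drawn equally from both populations. The expected pooled allele frequencies are then:

```latex
\begin{align*}
E[\hat{p}_{\mathrm{cases}}]    &= r\,p_{\mathrm{CHB}} + (1-r)\,p_{\mathrm{CEU}}, \\
E[\hat{p}_{\mathrm{controls}}] &= \tfrac{1}{2}\,p_{\mathrm{CHB}} + \tfrac{1}{2}\,p_{\mathrm{CEU}}, \\
E[\hat{p}_{\mathrm{cases}}] - E[\hat{p}_{\mathrm{controls}}]
  &= \left(r - \tfrac{1}{2}\right)\left(p_{\mathrm{CHB}} - p_{\mathrm{CEU}}\right).
\end{align*}
```

Under these assumptions the spurious difference vanishes only when r = 1/2 or when the two populations share the same allele frequency, and it is largest in absolute value at r = 0 and r = 1, the extreme cases in which all the cases come from a single population.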

Comparison strategy

We analyzed the potential of the main approaches within a statistical framework that focuses on their false positive rates, also referred to as type-I error rates, and their powers. A statistical definition of these notions is provided in Method S2.

Note that population stratification is said to lead to spurious associations but also to mask true associations. This second effect is trickier to observe, but the statistical power can be used to do so. As the power corresponds to the proportion of truly associated SNPs that are detected as such, a loss of power between a situation with no stratification and a situation with stratification means that SNPs that were correctly detected in the first situation are no longer detected in the second. This corresponds to missed associations.

Both false positive rate and power can be expressed as functions of the test statistic. However, the distribution of this statistic is not always obvious, so we prefer using the p-values instead. Thus the false positive rate becomes $P_{H_0}(p\text{-value} \le \alpha)$ and the power $P_{H_1}(p\text{-value} \le \alpha)$, where $\alpha$ is the level of the test. In our simulations, each dataset is simulated with one disease susceptibility locus, for which the degree of association is controlled, and 5,500 additional SNPs to assess the population structure. By placing ourselves under the null hypothesis of no association, then under the alternative hypothesis of association, we can respectively assess the false positive rate and the power of the methods. To do so, we use a Monte-Carlo method and estimate the same quantity in both cases,

\[ \hat{P}(p\text{-value} \le \alpha) = \frac{\#\{\, b \in \{1, \dots, B\} : p_b \le \alpha \,\}}{B}, \]

where # represents the cardinal function, $p_b$ the p-value obtained on the $b$-th simulated dataset and $B$ the number of simulated datasets.
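The Monte-Carlo estimate amounts to counting how often the p-value at the DSL falls at or below the chosen level across the simulated datasets. The Python sketch below illustrates this, together with the usual binomial standard error of such a proportion; the function names are hypothetical, and the use of "at or below" rather than "strictly below" is an assumption.

```python
import math

def rejection_rate(p_values, alpha):
    """Proportion of simulated datasets whose p-value is at most alpha.

    Estimates the false positive rate when the datasets are simulated
    under the null hypothesis, and the power under the alternative.
    """
    return sum(1 for p in p_values if p <= alpha) / len(p_values)

def monte_carlo_se(rate, n_datasets):
    """Binomial standard error of a rate estimated on n_datasets runs."""
    return math.sqrt(rate * (1.0 - rate) / n_datasets)
```

The standard error indicates how precisely a rate is estimated for a given number of simulated datasets, which helps to judge whether an observed false positive rate really departs from the nominal level.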

All the simulated DSL, whether under the null hypothesis or the alternative, are differentiated. This implies that, for all the population structures, one DSL is simulated per subpopulation. These DSL are excluded from the mating process that the populations then undergo to reach the desired type of structure. That way, the properties of the DSL, such as the relative risk, are preserved whatever population structure is simulated.

Note that only methods with equivalent false positive rates can be compared in terms of power. This implies that a method with high power is no better than one with low power if the former does not maintain a correct false positive rate.

We simulated data for several DSL relative risks ranging from 1 (no association) to 2.5 (strong association). For each relative risk, B = 2,000 datasets were simulated to get an accurate estimation of the statistical quality indicators. We genuinely estimated the indicators with this process as we controlled the degree of association through the simulation model. Note that the power coincides with the false positive rate when the relative risk is equal to 1. The same level $\alpha$ was chosen for all the tests. Data simulations and comparison of the strategies were performed using the software R (http://cran.r-project.org).

Results

The results of the comparison are presented in this section for each scenario (Figures 2 to 7). Table S2 summarizes the estimations of $\lambda$ for the different scenarios. These estimations were conducted according to the methodology indicated in Method S1 by considering the median of Armitage's trend test statistics.
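A median-based estimate of the inflation factor can be sketched as follows in Python. This is the usual Genomic Control estimator, given here as an illustration rather than as the exact procedure of Method S1: the median of the observed trend statistics is divided by the median of a chi-square distribution with one degree of freedom (about 0.455). Flooring the factor at 1, so that statistics are never inflated by the correction, is a common convention that we assume here; the function name is hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1_MEDIAN = 0.4549364

def genomic_control(trend_stats):
    """Estimate the inflation factor and return the corrected statistics."""
    lam = statistics.median(trend_stats) / CHI2_1_MEDIAN
    lam = max(lam, 1.0)  # assumed convention: never deflate the correction
    corrected = [s / lam for s in trend_stats]
    return lam, corrected
```

The corrected statistics are then compared to the usual chi-square distribution with one degree of freedom, under the assumption that the inflation is the same for all markers.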