The authors have declared that no competing interests exist.
Conceived and designed the experiments: RR JH. Performed the experiments: RR SR. Analyzed the data: RR SR. Wrote the paper: RR SR JH.
The number of identified genetic variants associated with complex disease cannot fully explain heritability. This may be partially due to more complicated patterns of predisposition than previously suspected. Diseases such as multiple sclerosis (MS) may consist of multiple disease-causing mechanisms, each comprising several elements. We describe how the effect of subgroups can be calculated using the standard association measure, the odds ratio (OR), which is then manipulated to provide a formula for the true underlying association present within the subgroup. This is sensitive to the initial minor allele frequencies present in both cases and the subgroup of patients. The methodology is then extended to the χ2 statistic for two related scenarios. The first is to determine the true χ2 when phenocopies or disease subtypes reduce association and are reclassified as controls when calculating statistics. Here, the χ2 is given by
Advances in genotyping technology have allowed for large-scale genome-wide association studies using up to millions of SNPs in cohorts of several thousand cases and controls. The data produced contain a wealth of information, which often results in the discovery of new gene associations with a given disease. However, despite the tremendous advances in technology, meta-analyses of large cohorts are required to identify new disease-associated genes, which often have small effect sizes.
Complex diseases are defined as those which have multiple genetic components as well as environmental interaction
Multiple sclerosis (MS) is a complex autoimmune disorder which may involve different disease mechanisms, different genetic backgrounds, or both; that is, the genetic factors influencing an individual’s predisposition may vary. The Rothman pie model of sufficient causes postulates that subgroups of disease may exist within “pies” of a predetermined number of genetic and environmental factors
Complex diseases may be multiple disorders with similar phenotypic manifestations, or a disorder with multiple genetic causes (subclasses). Each of the subclasses may be a result of combinations of similar, or unique, predisposing genes.
The existence of genetic subgroups of disease likely confounds identification of genes contributing to the predisposition of complex disorders. A simulation study using reasonable values for sample size, effect size and allele frequencies estimated the effect of subgroups and modifier genes on detection thresholds
A common method for improving detection of genetic association is to stratify disease samples based on clinical characteristics. For example, in rheumatoid arthritis (RA), patients with antibodies to citrullinated peptide antigens (ACPA+) display clear differences in association from ACPA- patients
The failure of genome-wide association studies (GWAS) to identify new variants with strong effects on disease predisposition has led to the search for “missing heritability”. It has been proposed that detection of more variants and/or rare variants may be useful, particularly by conducting large scale sequencing of patient samples
Within a syndrome of diseases or one with several genetically distinct predispositions, an effect strong enough to alter the OR of the total sample may have a much higher effect in the genetic subgroup in which it is a predisposing element. A basic assumption of this relationship is that allele frequencies are altered within one or more subgroups, and remain similar to controls in “non-affected” subgroups.
If a single nucleotide polymorphism (SNP) has a certain genotype (e.g. AA, Aa/aa, aa) in all cases of one of n genetically distinct subclasses of disease, the OR will reflect an overall regression to the population’s allele frequencies at that SNP. Assume that a SNP has allele counts in cases given by
The difference in allele frequencies for a SNP considered classically (top) and as a complex disorder with subgroups (below). In the example, a hypothetical subgroup denoted Subclass 1 contains allele counts
The underestimation in OR can be measured as a ratio of the “true” OR of the subclass to the OR of the entire group
Allele 1 | Allele 2 | |
Allele 1 | Allele 2 | |
This effect of error factor on OR is plotted for various minor allele frequencies separately in
OR error factor present with various minor allele frequencies. Each curve represents the range of OR error factors due to subgroups for a given MAF (minor allele frequency) observed in all cases. The Y-axis indicates the OR error factor corresponding to an increase in the subgroup MAF (as compared with all cases) given by the X-axis. For example, in the second curve corresponding to a MAF of 0.10 in cases, a subgroup with an increase of MAF of 0.20 (or 0.30 MAF) would have an OR in that subgroup approximately four times that reported for the overall group.
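The relationship in this caption can be sketched numerically. The sketch below assumes that the OR error factor is the ratio of the subgroup OR to the overall OR, both computed against the same controls, so that the control odds cancel and only the two case MAFs matter. The function names `odds` and `or_error_factor` are illustrative and do not come from the original work.

```python
def odds(maf: float) -> float:
    """Convert an allele frequency into odds, maf / (1 - maf)."""
    if not 0.0 < maf < 1.0:
        raise ValueError("maf must lie strictly between 0 and 1")
    return maf / (1.0 - maf)


def or_error_factor(maf_cases: float, maf_subgroup: float) -> float:
    """Ratio of the subgroup OR to the OR of all cases.

    Assumption: both ORs use the same controls, so the control odds
    cancel and the ratio reduces to odds(subgroup) / odds(all cases).
    """
    return odds(maf_subgroup) / odds(maf_cases)


# Caption example: MAF of 0.10 in all cases, 0.30 in the subgroup.
print(round(or_error_factor(0.10, 0.30), 2))  # 3.86, i.e. roughly four
```

Under this assumption the factor does not depend on the control allele frequency, which would explain why each curve is indexed by the case MAF alone.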
A second and related application of this rationale is to determine the error in measured association due to the presence of disease subtypes or non-genetic causes of disease, usually termed phenocopies. These subsets of disease may be distinguishable from other clinical groups and either have a distinct etiology or constitute a different disorder.
Phenocopies have a measurable effect on the calculated χ2 statistic. To investigate the potential for omitting relevant SNPs due to this (Type II error), we assume that some proportion of cases are separate disease subtypes or not genetic in nature, and call this term σ. In order to estimate the true allele frequencies of a given SNP in the relevant cases, we remove the phenocopies, and add them to controls with the previously determined control frequency. We recalculate the allele frequency that was present in the remaining cases, and can determine the χ2 value which corresponds to the true frequency of the SNP in these cases. We have original observed and expected allele counts as follows: observed as given previously in
Allele 1 | Allele 2 | |
To find the ratio of error in χ2 values, we state that the χ2 value of the new allele distribution is χ2n. The ratio
an | bn |
cn | dn |
The function
The first term is equal to
This simplifies to
The function
The function
The presence of phenocopies overstates the impact of the χ2 statistic for the core disease group, which exists with other subtypes as a proportion of overall cases. Thus, reclassification of cases not within a particular subtype to controls assumes a lack of disease predisposition, which is clearly not true for correctly diagnosed patients. Therefore, a more conservative approach for calculating required sample sizes will be employed.
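The phenocopy reclassification described in this section can be illustrated with a short sketch. It assumes the standard Pearson χ2 for a 2x2 allele-count table (cases a, b; controls c, d) without continuity correction, and that a proportion σ of case alleles carries the control allele frequency and is moved from cases to controls. The function names and the input counts are illustrative assumptions, not taken from the original derivation.

```python
def chi2_2x2(a: float, b: float, c: float, d: float) -> float:
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (no correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


def reclassify_phenocopies(a, b, c, d, sigma):
    """Move a proportion sigma of case alleles to controls.

    Assumption: the moved alleles have the control allele frequency
    c / (c + d), as described in the text.
    """
    moved = sigma * (a + b)
    p1 = c / (c + d)                      # control frequency of allele 1
    m1, m2 = moved * p1, moved * (1.0 - p1)
    if a - m1 < 0 or b - m2 < 0:
        raise ValueError("sigma too large for these allele counts")
    return a - m1, b - m2, c + m1, d + m2


# Hypothetical counts chosen only for illustration.
a, b, c, d = 1000, 2000, 1200, 1800
a_n, b_n, c_n, d_n = reclassify_phenocopies(a, b, c, d, sigma=0.4)
print(chi2_2x2(a, b, c, d), chi2_2x2(a_n, b_n, c_n, d_n))
```

Because the moved alleles leave the control frequency unchanged while the remaining cases move further from it, the recalculated χ2 is larger than the original, consistent with the remark above that reclassification overstates the statistic.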
Next we examine cohorts with only a particular subgroup, or sum of subgroups, associated with the disease at a particular locus. Consider a SNP which is weakly associated with the disease, but wherein only a minority of cases are contained within the subgroup exhibiting the association. In this situation, we preserve the coding of the proportion in the subgroup, σ, but do not add the samples removed from a′n and b′n to c′n and d′n as illustrated in
a′n | b′n |
c′n | d′n |
A new term,
We now turn our attention to the relationship between the new statistic and the original one, in particular how the original sample with given allele frequencies relates to the statistic of the subgroup. If allele frequencies remain fixed, by how much must the original sample size increase to report the same association?
In order to determine the increase in cases and controls required to replicate this statistic without any reclassification of samples, a second 2×2 matrix is constructed to represent the new cohort size. A new variable, γ, is introduced: the factor by which the total number of samples must be increased in order to attain the χ2 statistic of the associated subgroup. Therefore, each term in the new matrix has its allele count multiplied by γ (
ar | br |
cr | dr |
In order to determine sample sizes required, this must be equal to
To decide for what increase in sample size
If a+b = c+d, i.e. for equal numbers of cases and controls, then the general equation for γ can be given as
This represents the most generic case. It can be shown that γ increases as σ increases by taking the first derivative of
The derivative becomes
If ad-bc = 0 then
A Python script to calculate γ for given values of a, b, c, d, and estimated σ via
The increase in samples (γ) required as heterogeneity, defined by σ, increases for given values of a, b, c, d, when these values approach being equal. For example, if a = 1000, b = 2000, c = 1200 and d = 1800, and σ is estimated as 40% of cases not in the subgroup on which a given SNP acts, a relative increase in samples of 2.1 is needed to attain similar association statistics in the entire cohort as that of the underlying subgroup.
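The worked example in this caption can be reproduced with a minimal sketch. It is not the authors' script; it assumes that (i) a proportion σ of case alleles lies outside the subgroup and carries the control allele frequency, (ii) those alleles are dropped from cases without being added to controls, and (iii) because the Pearson χ2 scales linearly with sample size at fixed allele frequencies, γ equals the ratio of the subgroup χ2 to the original χ2. The function names are illustrative.

```python
def chi2_2x2(a: float, b: float, c: float, d: float) -> float:
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (no correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


def gamma_factor(a, b, c, d, sigma):
    """Sample-size multiplier needed to match the subgroup chi-square.

    Assumptions: sigma is the proportion of cases outside the subgroup,
    those cases have the control allele frequency, and they are removed
    from cases but not added to controls.
    """
    removed = sigma * (a + b)
    p1 = c / (c + d)                      # control frequency of allele 1
    a_sub = a - removed * p1
    b_sub = b - removed * (1.0 - p1)
    original = chi2_2x2(a, b, c, d)
    if original == 0:
        raise ValueError("ad - bc = 0: no association to scale")
    return chi2_2x2(a_sub, b_sub, c, d) / original


print(round(gamma_factor(1000, 2000, 1200, 1800, 0.4), 1))  # 2.1
```

With the caption's values the sketch returns approximately 2.1, matching the stated relative increase in samples, which suggests these assumptions are at least consistent with the calculation described.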
This paper explores the consequences for association studies of the possibility that SNPs confer disease predisposition in only a subset of patients. Two scenarios have been explored. In the first, cases outside the subgroup are omitted, and an OR error factor is calculated from the remaining cases and all controls; these calculations can be extended to determine the sample sizes required to compensate for cases not in the subgroup. In the second, phenocopies are moved from cases to controls to determine allele frequencies, and a function relating the true χ2 statistic to the original calculation was derived.
The scenarios described, namely phenocopies and subgroups, are related and the determination of which to select for calculating the effect on OR, χ2 statistic or sample size is somewhat subjective. However, some examples utilizing overlapping clinical and genetic observations in both settings will be discussed, which may provide
In myasthenia gravis (MG), approximately 10–15% of patients display thymomas, which typically predate the disease and are considered to cause the symptoms
In RA, ACPA+ disease appears to differ from ACPA-, with independent analysis of each group yielding different ORs across three independent cohorts
When a disease subtype is not suspected, or a common disease etiology cannot be ruled out, the subgrouping scenario without reclassification may be more appropriate. The effects of subgroups are difficult to quantify, since no such genetic subgroups have been indisputably identified for MS and related disorders, and examples of subgroup frequencies are purely speculative even within HLA associations. Some evidence comes from differing clinical characteristics and sub-phenotypes, which have been shown to have varying genetic associations in systemic lupus erythematosus (SLE)
For example, in MS the HLA-DR15 allele is strongly associated with disease (carriage rate of 60% in cases and 30% in controls)
Original Data | Adjusted Data | ||||
Allele 1 (%) | Allele 2 (%) | Allele 1 (%) | Allele 2 (%) | ||
Cases (n = 1000) | 700 (35) | 1300 (65) | Cases (n = 200) | 200 (50) | 200 (50) |
Controls (n = 1000) | 650 (32.5) | 1350 (67.5) | Controls (n = 1000) | 650 (32.5) | 1350 (67.5) |
OR | 1.12 | OR | 2.08 |
In the example, the overall case allele frequency (35%) masks the full effect of association in the subclass (50%); comparing the subclass alone against the controls, with the remaining cases omitted, gives an OR of 2.08 for the SNP in the subclass. This corresponds to an OR error factor of 1.86, which can also be determined by inspecting the second lowermost curve in
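The figures quoted for this example can be checked directly from the allele counts in the table. The sketch below takes the table at face value, comparing the 200 subclass cases against the unchanged controls; `odds_ratio` is an illustrative helper name.

```python
def odds_ratio(a: float, b: float, c: float, d: float) -> float:
    """OR for cases (a, b) versus controls (c, d): (a / b) / (c / d)."""
    if b == 0 or c == 0 or d == 0:
        raise ValueError("zero count makes the odds ratio undefined")
    return (a * d) / (b * c)


# Original data: all cases versus controls.
or_all = odds_ratio(700, 1300, 650, 1350)
# Adjusted data: subclass cases only, same controls.
or_sub = odds_ratio(200, 200, 650, 1350)

print(round(or_all, 2))           # 1.12
print(round(or_sub, 2))           # 2.08
print(round(or_sub / or_all, 2))  # 1.86, the OR error factor
```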
This corresponds to a
In this example, the increase in samples needed corresponds to the value of γ in
Examining from the reverse perspective illustrates the impact of underlying subgroups on the
Controls | Cases | |||||||
Heterogeneity percentage | MAF% | Allele 1 | Allele 2 | MAF% | Allele 1 | Allele 2 | OR | P-value |
0% | 38% | 760 | 1240 | 30.0% | 600 | 1400 | 1.43 | 9.27E-08 |
10% | 38% | 760 | 1240 | 30.8% | 616 | 1384 | 1.38 | 1.64E-06 |
20% | 38% | 760 | 1240 | 31.6% | 632 | 1368 | 1.33 | 2.15E-05 |
30% | 38% | 760 | 1240 | 32.4% | 648 | 1352 | 1.28 | 0.000209 |
40% | 38% | 760 | 1240 | 33.2% | 664 | 1336 | 1.23 | 0.001524 |
50% | 38% | 760 | 1240 | 34.0% | 680 | 1320 | 1.19 | 0.008408 |
60% | 38% | 760 | 1240 | 34.8% | 696 | 1304 | 1.15 | 0.035452 |
70% | 38% | 760 | 1240 | 35.6% | 712 | 1288 | 1.11 | 0.115551 |
80% | 38% | 760 | 1240 | 36.4% | 728 | 1272 | 1.07 | 0.295186 |
90% | 38% | 760 | 1240 | 37.2% | 744 | 1256 | 1.03 | 0.601475 |
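The rows of this table can be regenerated with a short sketch, under the following assumptions inferred from the tabulated values rather than stated by the authors: the case MAF is a mixture of the subgroup MAF (30%) and the control MAF (38%) weighted by the heterogeneity percentage; the OR is reported as control odds over case odds; and the final column is the P-value of a Pearson χ2 with one degree of freedom and no continuity correction. All function names are illustrative.

```python
import math


def chi2_2x2(a: float, b: float, c: float, d: float) -> float:
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (no correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def diluted_row(h: float, maf_sub=0.30, maf_ctrl=0.38, n_alleles=2000):
    """OR and P-value when a proportion h of cases has the control MAF."""
    maf_cases = (1.0 - h) * maf_sub + h * maf_ctrl
    a = n_alleles * maf_cases             # case allele 1
    b = n_alleles - a                     # case allele 2
    c = n_alleles * maf_ctrl              # control allele 1
    d = n_alleles - c                     # control allele 2
    odds_ratio = (c / d) / (a / b)        # orientation that matches the table
    return odds_ratio, p_value_1df(chi2_2x2(a, b, c, d))


for pct in range(0, 100, 10):
    odds_ratio, p = diluted_row(pct / 100.0)
    print(f"{pct:>3}%  OR={odds_ratio:.2f}  P={p:.3g}")
```

The sketch makes the dilution explicit: each additional 10% of heterogeneity shifts the case MAF by 0.8 percentage points toward the control MAF, and the P-value deteriorates by roughly an order of magnitude per step at the low-heterogeneity end of the table.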
The
In order to determine if increased heritability of complex disorders might exist within subgroups, we conducted a simulation of two subgroups of MS by utilizing data from the 123 reported markers from the published meta-analysis, plus HLA
Based on our results, at least a portion of the “missing heritability” may be explained by incomplete penetrance of associated markers across disease cohorts due to subgroups or phenocopies. While fine mapping and sequencing may detect low frequency and rare variants contributing to disease, better methods to detect variants present within GWAS, but below detection thresholds, are required. Identification of subgroups of disease through promising approaches such as network and pathway analysis may determine interactions otherwise obscured by noise. New methods to combine low effect markers are required to build up subgroup classification across similar phenotypes.