The authors have declared that no competing interests exist.
Conceived and designed the experiments: WES DIB. Performed the experiments: WES. Analyzed the data: WES. Wrote the paper: WES. Wrote scripts: WES.
Genes of the vertebrate major histocompatibility complex (MHC) are of great interest to biologists because of their important role in immunity and disease, and their extremely high levels of genetic diversity. Next generation sequencing (NGS) technologies are quickly becoming the method of choice for high-throughput genotyping of multi-locus templates like MHC in non-model organisms. Previous approaches to genotyping MHC genes using NGS technologies suffer from two problems: 1) a “gray zone” where low frequency alleles and high frequency artifacts can be difficult to disentangle and 2) a similar sequence problem, where very similar alleles can be difficult to distinguish as two distinct alleles. Here we present a new method for genotyping MHC loci – Stepwise Threshold Clustering (STC) – that addresses these problems by taking full advantage of the increase in sequence data provided by NGS technologies. Unlike previous approaches for genotyping MHC with NGS data that attempt to classify individual sequences as alleles or artifacts, STC uses a quasi-Dirichlet clustering algorithm to cluster similar sequences at increasing levels of sequence similarity. By applying frequency and similarity based criteria to clusters rather than individual sequences, STC is able to successfully identify clusters of sequences that correspond to individual or similar alleles present in the genomes of individual samples. Furthermore, STC does not require duplicate runs of all samples, increasing the number of samples that can be genotyped in a given project. We show how the STC method works using a single sample library. We then apply STC to 295 threespine stickleback (
The major histocompatibility complex (MHC) is a genomic region (or set of regions) unique to vertebrates that contains genes crucial for the proper functioning of the adaptive immune system. Of particular interest are the MHC class I and class II loci, which encode cell surface receptors that bind and present antigens (both self and non-self derived) to immune-effector cells
Although high-throughput, locus specific methods for genotyping of human MHC (HLA) loci have recently been made available
Given these challenges, sequencing MHC loci in non-model organisms has, until recently, been accomplished by 1) extensive bacterial cloning and direct sequencing
Given these limitations, researchers have recently taken advantage of next generation sequencing (NGS) technologies to genotype MHC loci
Next-generation sequencing has some drawbacks, however. Most notably, error rates for NGS are higher than those of Sanger sequencing. For example, 454 sequencing is subject to extensive homopolymer over- and underscoring
Recently developed NGS-based approaches for genotyping MHC in non-model organisms
Existing methods suffer from some inadequacies resulting from violations of the two assumptions listed above. The assumption that true allelic sequences will be more frequent than artifactual sequences will not always hold. Some true allelic sequences will be represented by relatively few reads in the sequencing run, either from stochastic sampling from the sample library (i.e. from the larger pool of PCR amplicons subsequently selected for sequencing) or because some alleles amplify at low efficiency relative to other alleles
The second assumption – that artifacts will be more similar to alleles than alleles are to each other – will obviously be violated whenever two alleles are relatively similar (i.e. <2 base pairs different). The first methods to apply next-generation sequencing to MHC genotyping either ignored sequence similarity altogether
Recently, Sommer et al.
To address the problems described above we present a new method for genotyping MHC using next-generation sequencing technologies that we call Stepwise Threshold Clustering (STC). STC takes a fundamentally different approach to genotyping MHC loci than have previously published methods
In this paper, we first outline the STC method in detail, showing how it works by applying it to reads generated from a single sample. We then applied STC to genotype MHC class IIβ loci in 295 stickleback fish (
Although STC relies on the same two assumptions about the frequency and similarity of allelic and artifactual sequences that previous methods do, it does not attempt to ascertain allelic status on a sequence by sequence basis. Rather, STC is based on the proposition that, for a single individual with N alleles, the sequences generated for that individual can be grouped into approximately N clusters of similar sequence reads. This is because, with the exception of PCR chimeras, artifactual sequences will tend to be minor deviations from the sequences of true alleles. The approach taken by STC is to identify those N clusters for each individual sample, and to determine the identities of the true alleles from those N clusters.
At the heart of the STC method is an algorithm that processes reads from each sample through successive rounds of clustering using increasingly stringent levels of sequence similarity. After each round, the resulting clusters are tested against two criteria to determine whether they correspond to one, and only one, of the original N alleles for that sample. First, a cluster must contain enough reads relative to the sample library size (i.e. be large enough), which ensures rare but highly divergent sequences (e.g. PCR chimeras) are not counted as true alleles. Second, because the majority of reads in a given run are expected to be error free (estimated at 82% of total reads for 454 pyrosequencing; Huse et al. 2007), a cluster representing a single true allele should contain a single “dominant” allelic sequence consisting of the majority of reads in the cluster and a much smaller frequency of derived sequences that represent artifacts. A cluster containing more than one dominant sequence likely contains reads derived from more than one true allele, in which case the reads in that cluster are re-entered into the algorithm for further partitioning. Once the clustering rounds are complete, the final result of STC is a set of N clusters representing the N true allelic sequences for each sample.
The clustering algorithm does two things that help to solve the gray-zone and similar allele problems. First, clustering similar sequences together means that the more frequent artifacts are clustered together with the actual alleles from which they are derived, and thus artifacts will not necessarily be mistaken for alleles just because they are relatively common. Similarly, less frequent allelic sequences will end up forming their own distinct clusters with related artifactual sequences, even if the sizes of those clusters are relatively small. Second, by applying criteria to establish whether there is more than one dominant allele in a given cluster, true alleles with similar sequences can be differentiated from one another by refining the clustering until two legitimate clusters are formed. Clustering also has one additional advantage in that clusters that do not meet initial size criteria can be kept for cross-checking with known true alleles after clustering is complete. Such “small” clusters might represent artifacts – e.g. chimeras or divergent sequencing artifacts that appear early enough during PCR to generate many reads – or true alleles whose amplification efficiency is much lower than that of other alleles present in the original sample. If a given small cluster from one sample appears as a large cluster in multiple other samples, it can be assumed that the small cluster represents a true allele whose small cluster size was likely due to stochastic sampling effects. In essence, a strict size criterion can be applied during clustering to reduce the accumulation of false positives (small clusters that do not represent alleles), while cross-checking the resulting small clusters against all samples can substantially reduce the number of false negatives (true alleles not recognized as such).
This study was carried out in accordance with the protocol approved by the University of Texas Institutional Animal Care and Use Committee (permit # 07100201). Fish were collected using permit # SPR-0305-038 issued by the British Columbia Ministry of Environment.
Raw and processed data files and all scripts necessary for running the STC algorithm have been made available at the Dryad Digital Repository (
We collected 364 threespine stickleback (
Pair | UTM coordinates | Samples amplified | Samples genotyped |
Farewell Lake | 314926E, 5564416N | 114 | 90 |
Farewell Stream | 314004E, 5564614N | 50 | 44 |
Roberts Lake | 318053E, 5566856N | 133 | 105 |
Roberts Stream | 316975E, 5567731N | 67 | 56 |
UTM coordinates are zone 10U.
It has been previously estimated that stickleback could have as many as six
Each forward and reverse primer also contained a 15 bp barcode at the 5' end (
PCR reactions were performed in 50 µL total volume containing 25 ng of extracted DNA, 10 µL of 10X (-MgCl2) PCR buffer (Invitrogen), 300 µmol of MgCl2, 10 µmol dNTPs, 20 µmol each of forward and reverse primers, and 1 unit of Platinum Taq DNA Polymerase (Invitrogen). The PCR program used for all samples was: initialize at 94°C for 120 seconds, 25 cycles of denature at 94°C (30 seconds), anneal at 57°C (30 seconds), and extension at 72°C (60 seconds), and a final elongation at 72°C for 240 seconds. Lenz and Becker
STC can be broken down into four phases: 1) sequence preparation, 2) sequence combination, 3) stepwise clustering, and 4) post-clustering processing (
Programs and software used to implement each step are given in italics. The initial sff file was parsed using proprietary Roche software at the University of Texas Genome Sequencing and Analysis Facility. Filtering of reads and parsing samples by barcodes was accomplished using a custom Perl script. Phases 2–4 were implemented in a custom R script. All scripts necessary for running STC have been uploaded to the Dryad Digital Repository (
Term | Definition |
sample | one individual to be sequenced (human, fish, mouse, bird, etc.) |
sample library | all reads produced in a given sequence run for a single sample |
run library | all reads produced by a single sequencing run (includes all samples) |
read | individual, non-unique sequence produced during sequencing corresponding to the sequence of a single PCR amplicon |
sequence | unique read produced during a sequencing run (many reads can correspond to the same sequence), indicated by a # (e.g. #130) |
true allele | sequence inferred to match an allelic sequence in the genome of a given sample |
artifact | sequence inferred to not match an allelic sequence in the genome of a given sample |
cluster | group of similar reads (or a single read) produced during phase 3 of STC |
dominant sequence | sequence with most reads in a given cluster |
subdominant sequence | sequence with second most reads in a given cluster |
dropped allele | an allele originating from a small cluster in phase 3 that was made a true allele during phase 4 |
missing allele | an allele present in the genome of the sample that does not appear as a true allele after STC is complete |
The STC process starts with the raw read data in the form of a multiFASTA (.fna) file derived from the sff file generated by a single 454 sequencing run. We used a custom Perl script to parse this file (available at the Dryad Digital Repository:
Previous approaches to genotyping typically flag sequences containing indels as artifactual and remove them from the analysis
The rationale behind this phase is straightforward: artifactual sequences contain important information about which sequences represent true alleles. This is especially true for reads obtained using 454 pyrosequencing, as the most common sequence errors generated during pyrosequencing are homopolymer insertion/deletion errors
To combine sequences we first divide all possible pairs of sequences in a sample library into five pair types: (I) pairs that differ by only an indel, (II) pairs that differ by one insertion and one deletion, (III) pairs differing by only one indel and one substitution, (IV) pairs that differ by one indel and two substitutions, and (V) all other pairs, which are not combined. In types I, III and IV, the first member of the pair is always the sequence of the correct length. In type II, where the lengths are the same, the first member is always the more common sequence. We then apply three criteria to each pair to determine whether they should be combined. First, the first member of the pair must have the correct number of base pairs for the sequence of interest (e.g. 213 bp for our sequences). Second, the first member of the pair must be more common than the second member. Third, pairs can only be combined if the second, derived member is unique to that pair within that type. If, for example, one type III pair contains sequences X and Z, and another type III pair contains Y and Z, then it is ambiguous whether Z is derived from Y or X and neither pair is combined. Finally, we note that, because every possible pair of sequences is evaluated before combining pairs, the time required for phase 2 increases with the square of the number of unique sequences present in the sample library.
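As an illustration, the pair-typing logic above can be sketched in Python (the published implementation is a custom R script; the edit-distance routine and function names here are our own, and the classification depends only on the counts of insertions, deletions, and substitutions in a minimal alignment):

```python
# Hypothetical sketch of the phase-2 pair classification (types I-V).

def edit_ops(a, b):
    """Count (insertions, deletions, substitutions) in a minimal
    alignment of b relative to a (standard Levenshtein DP)."""
    m, n = len(a), len(b)
    dp = [[None] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = (0, (0, 0, 0))
    for i in range(1, m + 1):
        dp[i][0] = (i, (0, i, 0))          # deletions from a
    for j in range(1, n + 1):
        dp[0][j] = (j, (j, 0, 0))          # insertions into a
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cands = []
            cost, (ins, dele, sub) = dp[i - 1][j]
            cands.append((cost + 1, (ins, dele + 1, sub)))      # delete a[i-1]
            cost, (ins, dele, sub) = dp[i][j - 1]
            cands.append((cost + 1, (ins + 1, dele, sub)))      # insert b[j-1]
            cost, (ins, dele, sub) = dp[i - 1][j - 1]
            mis = 0 if a[i - 1] == b[j - 1] else 1
            cands.append((cost + mis, (ins, dele, sub + mis)))  # match/substitute
            dp[i][j] = min(cands)
    return dp[m][n][1]

def pair_type(a, b):
    """Return the pair type (I-V) for sequences a (first member) and b."""
    ins, dele, sub = edit_ops(a, b)
    indels = ins + dele
    if indels == 1 and sub == 0:
        return "I"    # one indel only
    if ins == 1 and dele == 1 and sub == 0:
        return "II"   # one insertion and one deletion
    if indels == 1 and sub == 1:
        return "III"  # one indel, one substitution
    if indels == 1 and sub == 2:
        return "IV"   # one indel, two substitutions
    return "V"        # anything else: never combined
```

The three combining criteria (correct length, higher frequency, unique derivation) would then be applied to each non-type-V pair.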
The STC algorithm uses a variation on a formal Dirichlet process known colloquially as a Chinese restaurant table process. In the restaurant analogy, imagine 100 customers wish to enter a restaurant that can contain an infinite number of tables. The first customer enters the restaurant and sits at a table. The second customer enters and can choose to start a new table or to sit at the existing table. Every subsequent customer enters the restaurant and makes the same choice—sit at a new or existing table. Whether or not each new customer chooses to sit at an existing table is directly proportional to how many customers are already at the table when the new customer enters. In a formal, discrete-time restaurant table process, objects (customers) start a new group (table) with some constant probability, or join an existing group (table) with a probability proportional to the size of each group (i.e. number of customers already seated at each table). The end result once all objects have been grouped is a set of groups that vary in the number of objects they contain.
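For readers unfamiliar with the restaurant analogy, a minimal simulation of the formal process might look like the following (illustrative only; as described next, STC itself replaces this probabilistic rule with a deterministic similarity rule):

```python
import random

def chinese_restaurant(n_customers, alpha=1.0, rng=None):
    """Seat customers at tables: a new table is opened with probability
    proportional to alpha; an existing table is joined with probability
    proportional to its current occupancy. Returns table occupancies."""
    rng = rng or random.Random()
    tables = []  # occupancy count of each open table
    for _ in range(n_customers):
        weights = tables + [alpha]       # last entry = "open a new table"
        r = rng.uniform(0, sum(weights))
        acc = 0.0
        for t, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if t == len(tables):
            tables.append(1)             # start a new table
        else:
            tables[t] += 1               # join existing table t
    return tables
```

Running this with 100 customers produces a few large tables and a tail of small ones, mirroring the uneven group sizes described above.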
Our clustering process uses a similar mechanism to form clusters by sequentially taking each read (customer) in a sample library and either using it to start a new cluster (table) or allowing it to join an existing cluster (table). The process is quasi-Dirichlet because the reads (customers) are not clustered using probability rules based on cluster size. Rather, reads are added to clusters based on sequence similarity criteria. Specifically, at the point at which a given focal read is introduced, each existing cluster is assigned the sequence of the most frequent read contained within that cluster. For example, if cluster A contains two reads corresponding to sequence X and one corresponding to sequence Y, the cluster takes on the identity of sequence X (i.e. cluster A is assigned the sequence of X). The focal read is then either added to the most similar existing cluster, provided the similarity between the focal read and that cluster is above a predefined similarity threshold (γ, see
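A single clustering pass at a fixed similarity threshold γ can be sketched as follows (a simplified Python illustration; the authors' implementation is an R script, and this toy `similarity` function assumes equal-length, pre-aligned sequences rather than the full pairwise similarity matrix used by STC):

```python
from collections import Counter

def similarity(a, b):
    """Fraction of matching positions (toy stand-in for a real
    pairwise similarity measure; assumes comparable lengths)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def cluster_reads(reads, gamma):
    """Assign reads sequentially: join the most similar existing cluster
    if its dominant sequence is at least gamma similar, else start a
    new cluster. Returns a list of clusters (lists of reads)."""
    clusters = []
    for read in reads:
        best, best_sim = None, -1.0
        for cl in clusters:
            # each cluster is represented by its most frequent sequence
            dominant = Counter(cl).most_common(1)[0][0]
            s = similarity(read, dominant)
            if s > best_sim:
                best, best_sim = cl, s
        if best is not None and best_sim >= gamma:
            best.append(read)   # join the most similar cluster
        else:
            clusters.append([read])  # start a new cluster
    return clusters
```

At a permissive threshold, near-identical reads merge into one cluster; raising γ forces similar-but-distinct sequences into separate clusters, which is the behavior the stepwise rounds exploit.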
Parameter | Name | Definition | Recommended starting value | Value used |
γ | similarity threshold | minimum similarity required between focal read and cluster for read to join cluster, increases in successive clustering rounds | start at minimum similarity between two reads in the given sample library (e.g. 60%) | Varies between samples |
θ | size threshold | minimum ratio of reads in a cluster to reads in a sample library necessary for the cluster to be classified as “good” | 1/(2 × maximum expected number of alleles) | 1/22 |
δ | dominance threshold | minimum ratio of dominant to subdominant reads necessary for the cluster to not be classified as “ambiguous” | 4∶1 | 4.55∶1 |
ε | common allele | an allele must be present in at least this many samples to be considered common for the purposes of cross-checking in phase 4 | 3 | 3 |
Recommended starting values are offered because the optimal values will vary both by data set and the degree to which the user wishes to balance false positives and false negatives. Values used for the data set presented here are given in “value used” column.
STC is designed to identify clusters representing alleles that are very dissimilar from other alleles before gradually breaking apart clusters representing very similar alleles. To accomplish this goal, the sequence similarity threshold (γ) for joining existing clusters is gradually increased over successive rounds of clustering. At the beginning of phase 3, γ is set to the similarity between the two most dissimilar reads for a given sample, and it increases by a set amount (e.g. 1%) during each successive round of clustering. This means that each successive read is more likely to form a new cluster in later rounds than in earlier ones. Clustering can, in theory, continue to γ = 100%. In practice it is much better to end slightly below this level (e.g. γ = 97%) in order to avoid separating artifactual sequences with minor errors from the clusters to which they belong.
Note that whether a focal read joins an existing cluster or starts a new cluster is entirely deterministic and depends only on the predefined sequence similarity threshold (which differentiates STC from a formal Dirichlet process). However, during each round, each focal read is chosen at random from the remaining pool of reads until all reads are clustered, introducing a small amount of stochasticity into the final cluster configuration (sometimes two or three different configurations can occur depending on the order of clustering and value of γ). To account for variation introduced by random ordering of reads, we repeat the clustering 100 times during each round, starting with a randomly chosen sequence each time. The most common cluster assignment (the mode) for each read among the 100 replicates is used to determine to which cluster each read is assigned in each round.
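The replicate-and-mode procedure might be sketched like this (a simplified illustration with our own function names: one pass is repeated on randomly reordered reads, and each read keeps its modal cluster label, with clusters labelled by their dominant sequence so assignments are comparable across replicates):

```python
import random
from collections import Counter

def similarity(a, b):
    """Toy similarity: fraction of matching positions."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def one_pass(reads, gamma, rng):
    """One clustering pass over a random ordering; returns, for each
    read index, the label (dominant sequence) of its final cluster."""
    order = list(range(len(reads)))
    rng.shuffle(order)
    clusters = []  # lists of read indices
    for i in order:
        best, best_sim = None, -1.0
        for cl in clusters:
            dom = Counter(reads[j] for j in cl).most_common(1)[0][0]
            s = similarity(reads[i], dom)
            if s > best_sim:
                best, best_sim = cl, s
        if best is not None and best_sim >= gamma:
            best.append(i)
        else:
            clusters.append([i])
    labels = {}
    for cl in clusters:
        dom = Counter(reads[j] for j in cl).most_common(1)[0][0]
        for j in cl:
            labels[j] = dom
    return labels

def modal_assignment(reads, gamma, n_reps=100, seed=1):
    """Repeat clustering n_reps times and keep each read's modal label."""
    rng = random.Random(seed)
    votes = [Counter() for _ in reads]
    for _ in range(n_reps):
        for i, lab in one_pass(reads, gamma, rng).items():
            votes[i][lab] += 1
    return [v.most_common(1)[0][0] for v in votes]
```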
At the end of each round, every cluster assumes the sequence identity of its most frequent read and is assigned to one of three categories – good, small, or ambiguous – based on the two criteria. The first, which we refer to as the size criterion, states that a cluster must contain a certain proportion, θ, of the sequence reads from the focal sample library. The second, the dominance criterion, states that the frequency of the most common (dominant) read in a cluster divided by the frequency of the second most common (subdominant) read must be greater than the threshold ratio δ. Both θ and δ are set before clustering begins and do not change between rounds of clustering. Clusters that meet both criteria after each round are classified as good clusters and all reads contained therein are exempt from further rounds of clustering. Clusters that meet the dominance criterion but not the size criterion are considered small clusters and are also considered exempt. Clusters that do not meet the dominance criterion (e.g., contain two or more abundant sequences) potentially contain more than one true allele. These clusters are classified as ambiguous and are retained for the next round of clustering at a more stringent threshold whether or not they meet the size criterion.
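The end-of-round classification can be expressed compactly (a sketch using the θ and δ definitions above; the data structures are our own simplifications):

```python
from collections import Counter

def classify_cluster(cluster_reads, library_size, theta, delta):
    """Classify one cluster as 'good', 'small', or 'ambiguous'.
    cluster_reads: list of sequences, one entry per read in the cluster."""
    counts = Counter(cluster_reads).most_common()
    dominant = counts[0][1]
    subdominant = counts[1][1] if len(counts) > 1 else 0
    # dominance criterion: dominant/subdominant must exceed delta
    if subdominant > 0 and dominant / subdominant <= delta:
        return "ambiguous"   # likely contains more than one true allele
    # size criterion: cluster must hold at least theta of the library
    if len(cluster_reads) / library_size >= theta:
        return "good"
    return "small"
```

For example, a cluster whose two top sequences have 73 and 42 reads fails the dominance criterion at δ = 4 (73/42 ≈ 1.74) and is retained as ambiguous for the next round.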
Although the thresholds associated with the two criteria must be set heuristically by the user, they can be adjusted to balance false positives and false negatives. Setting θ too low means some small clusters that actually derive from artifactual sequences may be categorized as good (an increase in false positives). Alternatively, setting θ too high will result in an increase in false negatives, although this increase can mostly be offset during phase 4, meaning it is better to set θ somewhat conservatively. For example, given no stochastic or amplification bias effects, the lowest proportional size of each cluster will simply be one divided by the maximum expected number of alleles (Nmax, assuming no homozygotes). Because there will inevitably be some clusters smaller than the 1/Nmax ratio due to amplification bias or stochastic sampling of reads from the library, an appropriate starting value for θ can be obtained by dividing 1/Nmax again by some constant C to reduce the minimum size a bit further. For example, with 6 different MHC loci, Nmax = 12. An appropriate starting value for θ would be 1 divided by 12, divided again by 2, or 1/24. In this instance, the size criterion states that a cluster must contain at least 1/24 of the total reads for that sample. An appropriate starting point for δ is to consider that experiments have shown that about 18% of reads in a given run represent artifacts
Finally, after clustering is complete, it is possible that some clusters will remain classified as ambiguous. These clusters represent either two very similar alleles, one allele with an additional frequent artifact, or zero true alleles. They are dealt with as follows. Each ambiguous cluster is divided into two sub-clusters whose sizes (numbers of reads) are proportional to the relative frequencies of the two most frequent reads (dominant and subdominant) in the cluster. These two sub-clusters are then checked against the size criterion. If a sub-cluster passes the size criterion, then it is considered a good cluster. Otherwise the sub-cluster is classified as a small cluster. Additionally, it is sometimes the case that one of the top two sequences in a cluster will not be the correct length, in which case the third most frequent sequence is treated as the subdominant sequence and the same rules are applied.
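The splitting rule for a still-ambiguous cluster amounts to simple proportional arithmetic, sketched here (the function name and rounding choice are our own):

```python
def split_ambiguous(cluster_size, dominant_reads, subdominant_reads,
                    library_size, theta):
    """Split an ambiguous cluster's reads between its two top sequences
    in proportion to their frequencies, then re-test each hypothetical
    sub-cluster against the size criterion (theta).
    Returns ((size1, class1), (size2, class2))."""
    top = dominant_reads + subdominant_reads
    sub1 = round(cluster_size * dominant_reads / top)
    sub2 = round(cluster_size * subdominant_reads / top)
    classify = lambda n: "good" if n / library_size >= theta else "small"
    return (sub1, classify(sub1)), (sub2, classify(sub2))
```

With a 140-read cluster whose top sequences hold 73 and 38 reads, the sub-clusters receive 92 and 48 reads; in a 330-read library with θ = 1/22, both pass the size criterion.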
Phase 4 consists of two stages: cross-checking all small clusters against good clusters and checking allele sequences for possible chimeras. As noted by Sommer et al.
Cross-checking small clusters in this way also has an added benefit, because the frequency at which a cluster drops out during phase 3 provides information as to whether the allele likely amplifies with low efficiency. Because such alleles will tend to drop out more frequently than other alleles for a given value of θ, cross-checking clusters also allows the user flexibility in adjusting the value of θ to control the rate of false positives and false negatives. Users can be relatively conservative with assigning the value of θ, knowing that dropped clusters (i.e. potential false negatives) can be cross-checked and included in the final set of good clusters at the end. Once small clusters have been cross-checked against common good clusters, all good and dropped clusters for each sample are officially assigned true allele status, whereby the dominant sequence in each cluster is inferred to represent a true allelic sequence.
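The cross-checking step can be sketched as a simple tally over samples (our own function name and data structures):

```python
def cross_check(small_dominants, good_dominants_by_sample, epsilon):
    """Promote a small cluster's dominant sequence to true-allele status
    if it is the dominant sequence of a good cluster in at least epsilon
    samples; otherwise call it an artifact.
    small_dominants: dominant sequences of one sample's small clusters.
    good_dominants_by_sample: {sample_id: set of good-cluster dominants}.
    Returns (promoted, artifacts)."""
    promoted, artifacts = [], []
    for seq in small_dominants:
        n_samples = sum(seq in doms
                        for doms in good_dominants_by_sample.values())
        (promoted if n_samples >= epsilon else artifacts).append(seq)
    return promoted, artifacts
```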
Lastly, true alleles are checked to see if they are likely to be PCR chimeras. Chimeric sequences can occur during the initial PCR stage when incompletely extended primer sequences subsequently anneal to a different template in a later cycle, or by template switching during extension. In either case, the chimeric daughter strand will resemble one parent sequence over one portion of its length, and a different parent sequence over the other portion. If the number of PCR cycles has been kept to a minimum during read generation
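A minimal chimera screen along these lines checks whether a candidate allele can be reconstructed from a prefix of one confirmed parent allele and a suffix of another (this is our own simplified version; dedicated tools such as UCHIME implement far more sophisticated tests):

```python
def is_possible_chimera(candidate, parents):
    """True if candidate equals parentA[:k] + parentB[k:] for some
    breakpoint k and two distinct, same-length parents, while matching
    neither parent exactly. Ignores sequencing error for simplicity."""
    if candidate in parents:
        return False
    n = len(candidate)
    for pa in parents:
        for pb in parents:
            if pa == pb or len(pa) != n or len(pb) != n:
                continue
            for k in range(1, n):
                if candidate == pa[:k] + pb[k:]:
                    return True
    return False
```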
To check whether variation in the rate at which alleles were dropped during phase 3 was potentially related to amplification efficiency, we estimated the correlation between average relative cluster size (proportion of sample library reads in a cluster) and the rate of dropping (fraction of total occurrences where an allele was initially dropped in phase 3). For each allele, we estimated the relative cluster size both overall and only in samples where an allele was not dropped. Alleles that are amplified at lower efficiencies should have both smaller relative cluster sizes within sample libraries (even in cases where those alleles are not dropped) and elevated probabilities of dropping compared to other alleles. We therefore expected a negative correlation between cluster size and rate of dropping if low amplification efficiency was causing some alleles to drop out more than would be expected due to stochastic effects alone.
Duplicate PCR reactions were run for 21 total samples. When both duplicates achieved the minimum sample library size, we compared the STC output for each duplicate to test the consistency of STC across different libraries for the same samples. We also cloned and sequenced four samples for verification of our genotyping method. We used the same PCR conditions used to amplify samples for pyrosequencing. The PCR products were purified using QIAquick PCR Purification Kits (Qiagen 28014) and cloned into vectors using a pCR 2.1-TOPO TA kit provided by Life Technologies (K450001). After overnight growth, individual clones were amplified using M13 forward and reverse primers. Amplified clone sequences were purified using the same QIAquick kits and sequenced directly on an Applied Biosystems AB 3730 sequencer at the University of Texas ICMB DNA core facility. We originally targeted 100 clones per sample, but found that only 50–60 clones were needed to verify the most diverse sample.
Samples that yield fewer sequence reads than typical are prone to having alleles with zero corresponding sequence reads in their sample library (i.e. missing alleles) due to stochastic under-sampling of reads during sequencing. Moreover, the probability of such missing alleles will be magnified when those alleles also amplify with low efficiency. As a result, allelic diversities may be underestimated for some samples with small library sizes. To test for this possible bias, we estimated the correlation between library size (read number) and allele number within each of the four populations of stickleback. In any population where the correlation was significant, we re-estimated the correlation using increasingly larger minimum sample sizes (up to 800 reads), to ask at what point the correlation became negligible or statistically insignificant (keeping in mind that the power to detect a significant correlation will decrease as more samples are removed from the analysis due to an increased minimum sample size).
In order to help visualize the clustering process for our individual example sample library, we created a two dimensional plot of all the sequences present in that sample library using multidimensional scaling (MDS). Briefly, MDS attempts to project the N-dimensional distances between objects (i.e. sequences) into a two dimensional space. Sequences placed closer together in the space are more similar to each other than sequences placed further away. One advantage of using MDS to view sequence relationships is that it can be based on the same similarity matrix used by STC to cluster sequences, and thus provides a visual representation (though not complete replication) of the STC process. We used the
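Classical (metric) MDS from a pairwise similarity matrix can be sketched in a few lines of NumPy (the similarity-to-distance conversion and function name here are our own choices; any standard MDS routine would serve equally well):

```python
import numpy as np

def classical_mds(similarity, n_dims=2):
    """Project objects into n_dims dimensions from a symmetric
    similarity matrix (1.0 on the diagonal) via classical scaling."""
    D = 1.0 - np.asarray(similarity, dtype=float)  # similarity -> distance
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                    # double-centered Gram matrix
    evals, evecs = np.linalg.eigh(B)
    idx = np.argsort(evals)[::-1][:n_dims]         # top eigenvalues first
    L = np.sqrt(np.clip(evals[idx], 0, None))
    return evecs[:, idx] * L                       # (n, n_dims) coordinates
```

Plotting the first two coordinates places similar sequences close together, as in the sample X visualization.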
We used the results produced by the STC algorithm to make three different comparisons between our four populations. First, we used an ANOVA to determine whether our populations differed in the mean number of alleles per fish. Habitat (lake or stream), population pair (Roberts or Farewell), and their interaction were used as factors in the ANOVA. Second, it has been hypothesized that divergent selection among habitats (due to contrasting parasite communities) could lead to the divergence in MHC genotypes. We therefore tested whether our lake and stream populations differed significantly in overall MHC allele composition using the GLM-based approach advocated by Warton et al.
The most recently proposed method for using NGS to genotype MHC loci requires every sample to be run in duplicate
A single stickleback sample (hereafter, sample X) from the Roberts Lake population (sample ID 490 in
In this section, unique sequences will be referred to by including a # at the beginning of the numerical sequence ID (i.e. #1234). Clusters of sequences (good, small, or ambiguous) are designated by an X at the beginning of a numerical sequence ID (i.e. X1234), where the ID refers to the dominant sequence in the cluster.
Of the 101 unique sequences in the sample library, 59 were of the correct length (213 base pairs). Thirty-three sequences were off by a single base pair (212 or 214 base pairs), 7 by two base pairs, and 2 by three or more base pairs (
We set the size threshold for accepting clusters as good clusters at θ = 1/22 = 0.045 and the dominant to subdominant ratio threshold at δ = 4/1. Note that these thresholds are only slightly different than the baseline thresholds suggested in the methods and were determined heuristically by re-running STC on a subset of samples to minimize false positives. The minimum sequence similarity among all pairs of sequences in the sample library was 60%, so we began clustering at γ = 60%.
Small black dots indicate the 101 unique sequences present in the sample library. Sequences have been plotted on the first two MDS axes generated using the same similarity matrix used during clustering, such that more similar sequences are closer together. The color of the larger circles indicates the final status of each sequence, whereas the size of each circle is proportional to the number of reads in the sample library that match that sequence (see
Rounds | minimum similarity % (γ) | good | small | ambiguous | reads left | sequences left | good clusters added | small clusters added |
1–10 | 60–69 | 0 | 0 | 1 | 330 | 67 | - | - |
11–22 | 70–81 | 0 | 1 | 1 | 326 | 66 | - | #1238 |
23–26 | 82–85 | 0 | 1 | 2 | 326 | 66 | - | - |
27–28 | 86–87 | 0 | 1 | 3 | 326 | 66 | - | - |
29–30 | 88–89 | 0 | 2 | 3 | 323 | 65 | - | #28260 |
31–37 | 90–96 | 2 | 2 | 2 | 252 | 52 | #113, #298 | - |
38 | 97 | 4 | 2 | 1 | 140 | 23 | #103, #195 | - |
Good and small cluster numbers are cumulative. “Ambiguous” indicates the number of ambiguous clusters created during each round that carried over to the next round. Clustering rounds where the results did not change are grouped together. Reads and sequences left are counted after (not before) clustering in a given round.
As expected, clustering all 330 reads at γ = 60% resulted in a single giant cluster. The dominant and subdominant sequences, #3111 and #103, were represented by 73 and 42 reads respectively. This single cluster clearly does not meet the dominance criterion (73/42 = 1.74, which is smaller than δ = 4). The same result was achieved when clustering the reads through γ = 69%. At γ = 70% sequence similarity, two clusters were formed. One cluster (dominant sequence #1238) consisted of a single sequence of 4 reads, which classified it as a small cluster (4/330 = 0.012, which is less than θ = 0.045). The other 326 reads were placed in a single large cluster whose dominant and subdominant sequences were again #3111 and #103. This single ambiguous cluster was produced at similarity thresholds through γ = 81%.
At γ = 82%, two clusters were produced. The #3111/#103 cluster remained ambiguous as before. A second cluster with dominant and subdominant sequences #298 and #113 was also formed. However, this additional cluster did not meet the dominance threshold (30/27<4/1), and was also classified as ambiguous. At γ = 86% the cluster dominated by #3111 split into two clusters, with dominant/subdominant sequences of #3111/#5 and #103/#195 respectively. However, none of the three clusters passed the dominance criterion. At γ = 88%, an additional cluster consisting of 3 reads of a single sequence (#28260) was formed and classified as a small cluster (3/330<0.045).
At γ = 90%, four clusters in total were formed. The aforementioned #3111/#5 and #103/#195 clusters remained ambiguous. The cluster with dominant sequence #298 met both the size (41/330 > 0.045) and dominance (30/3 > 4) criteria and was classified as a good cluster. The cluster dominated by sequence #113 also met both criteria (31/330 > 0.045 and 27/1 > 4) and was likewise classified as a good cluster. In the final round, at γ = 97%, one of the previously ambiguous clusters (#103/#195) split into two distinct clusters, both of which passed the criteria and were considered good clusters. At this point there were 4 good clusters corresponding to sequences #298, #113, #103 and #195, two small clusters corresponding to sequences #1238 and #28260, and one remaining ambiguous cluster with dominant and subdominant sequences #3111 and #5 that differed by only two base pairs (
To determine whether our remaining ambiguous cluster should be considered two separate clusters, we divided the total number of reads in the cluster (140) between the two dominant sequences in proportion to the number of reads for each sequence. The top two sequences accounted for 111 reads, of which sequence #3111 accounted for 73 reads (65.8%) and sequence #5 accounted for 38 reads (34.2%). Thus, the two hypothetical sub-clusters were assigned 65.8% and 34.2% of the total cluster reads, giving them 92 and 48 reads respectively. In this case, both sub-clusters had greater than 4.5% of the total reads in the sample library (330) and were classified as good clusters. Thus at the end of phase 3 we were left with 6 good clusters representing sequences #298, #113, #103, #195, #3111, and #5, and two small clusters consisting of one rare sequence each (#1238 and #28260).
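This proportional split can be written out as a short sketch (the function name and return format are invented for illustration; only the arithmetic follows the text):

```python
# Sketch of the ambiguous-cluster split described above. Total cluster reads
# are divided between the top two sequences in proportion to their own read
# counts, and each hypothetical sub-cluster is re-tested against the size
# threshold theta.

def split_ambiguous(total_reads, dom_reads, sub_reads, library_size, theta=0.045):
    top = dom_reads + sub_reads
    dom_share = round(total_reads * dom_reads / top)
    sub_share = total_reads - dom_share
    both_good = (dom_share / library_size >= theta and
                 sub_share / library_size >= theta)
    return dom_share, sub_share, both_good

# Worked example from the text: 140 cluster reads, #3111 = 73 reads,
# #5 = 38 reads, library of 330 reads.
print(split_ambiguous(140, 73, 38, 330))  # (92, 48, True)
```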
To cross-check our small clusters against common good clusters, we set ε = 3, meaning that a sequence had to be the dominant sequence of a good cluster in at least three other samples to be considered common. Of the two small clusters carried over from phase 3 in our example, good clusters with dominant sequence #1238 were present in 3 other genotyped samples, whereas #28260 was not the dominant sequence of a good cluster in any other sample. Therefore, we added the cluster representing sequence #1238 to the list of good clusters for sample X, while the cluster containing #28260 was considered an artifact. At this point, we inferred that the dominant sequences in our seven good clusters represented true alleles originally present in sample X. None of the seven true alleles was classified as a chimera (
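The ε cross-check reduces to a simple lookup, sketched below with hypothetical data structures (the function name and the dictionary format are invented for this example):

```python
# Sketch of the phase-4 cross-check. A small cluster's dominant sequence is
# rescued as a true allele only if it heads a good cluster in at least
# epsilon other samples; otherwise it is discarded as an artifact.

def rescue_small_clusters(small_seqs, good_seq_counts, epsilon=3):
    """good_seq_counts: sequence -> number of OTHER samples in which it is
    the dominant sequence of a good cluster."""
    rescued = [s for s in small_seqs if good_seq_counts.get(s, 0) >= epsilon]
    artifacts = [s for s in small_seqs if s not in rescued]
    return rescued, artifacts

# Example from the text: #1238 heads a good cluster in 3 other samples,
# #28260 in none.
print(rescue_small_clusters(["#1238", "#28260"], {"#1238": 3}))
```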
allele | total occurrences among all samples | occurrences with possible parent sequences | % of occurrences with parents | Classified as chimera? |
#1176 | 1 | 1 | 100 | yes |
#1487 | 1 | 1 | 100 | yes |
#1692 | 1 | 1 | 100 | yes |
#2720 | 1 | 1 | 100 | yes |
#3584 | 1 | 1 | 100 | yes |
#5875 | 1 | 1 | 100 | yes |
#375 | 2 | 1 | 50 | no |
#1337 | 4 | 4 | 100 | yes |
#562 | 4 | 2 | 50 | no |
#512 | 14 | 4 | 29 | no |
#1238 | 14 | 1 | 7 | no |
#182 | 15 | 7 | 47 | no |
#175 | 17 | 2 | 12 | no |
#262 | 24 | 1 | 4 | no |
#1 | 34 | 28 | 82 | no |
#195 | 38 | 15 | 39 | no |
#717 | 47 | 1 | 2 | no |
#103 | 75 | 2 | 3 | no |
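The per-occurrence parent test underlying the table above — asking whether a candidate sequence can be assembled from a prefix of one co-occurring allele and the suffix of another — can be sketched with a brute-force breakpoint scan. This is an illustrative check only; the authors' actual chimera test and its handling of multiple breakpoints may differ.

```python
# Sketch of a simple PCR-chimera check (hypothetical helper). A candidate is
# flagged as a possible chimera if, for some breakpoint k, it equals the
# first k bases of one putative parent joined to the remaining bases of
# another, with both parents present in the same sample.

def possible_chimera(candidate, parents):
    n = len(candidate)
    for a in parents:
        for b in parents:
            if a == b or len(a) != n or len(b) != n:
                continue
            for k in range(1, n):
                if candidate == a[:k] + b[k:]:
                    return True
    return False

parent1 = "AAAACCCC"
parent2 = "GGGGTTTT"
print(possible_chimera("AAAATTTT", [parent1, parent2]))  # True
print(possible_chimera("ACGTACGT", [parent1, parent2]))  # False
```

In the table, a candidate with possible parents in every occurrence (100%) is classified as a chimera, whereas frequent alleles like #103 (parents in only 3% of occurrences) are retained.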
We obtained 206,453 sequence reads from one quarter of a complete pyrosequencing run. The raw SFF file has been deposited at the NCBI sequence read archive (accession number SRR1177032). After removing reads that had intra-primer sequences of less than 200 base pairs, we were left with 156,841 reads (76% of total reads). Of those, 136,861 (66% of total reads) could be assigned to a specific sample based on barcode sequences. Of the initial 385 individual PCR reactions (364 samples plus 21 duplicates), 359 had at least one read associated with them after initial filtering (
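The initial filtering steps can be sketched as a minimal demultiplexer. The read representation, barcode map, and function name below are all invented for illustration; only the 200 bp length cutoff and the barcode-assignment logic come from the text.

```python
# Sketch of the initial read filtering and demultiplexing described above.
# Reads whose intra-primer region is shorter than 200 bp are discarded;
# surviving reads are assigned to samples by exact barcode match, and reads
# with unknown barcodes remain unassigned.

MIN_LEN = 200

def demultiplex(reads, barcodes):
    """reads: list of (barcode, insert_seq); barcodes: barcode -> sample ID."""
    per_sample = {}
    kept = 0
    for bc, insert in reads:
        if len(insert) < MIN_LEN:
            continue  # fails the 200 bp intra-primer length filter
        kept += 1
        sample = barcodes.get(bc)
        if sample is not None:
            per_sample.setdefault(sample, []).append(insert)
    return kept, per_sample

reads = [("ACGT", "A" * 250), ("ACGT", "A" * 150), ("TTTT", "C" * 220)]
kept, per_sample = demultiplex(reads, {"ACGT": "sample_383"})
print(kept, sorted(per_sample))  # 2 ['sample_383']
```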
After STC was complete, we had identified 244 unique true alleles, of which 101 were present in only one sample (“singletons”,
Of the remaining true alleles, 218 were the correct length of 213 base pairs (
The overall average number of alleles per sample was 6.62 (mode: 6), although population averages ranged from 5.9 to 7.1 (
The red dashed lines indicate the mean number of alleles per sample for each population.
After phase 3 we had identified 9561 total small clusters among all of our samples, most of which were inferred to be sequencing artifacts. However, of those 9561 small clusters, 229 were reclassified as true alleles (hereafter “dropped” alleles) after cross-checking against common good clusters during phase 4. The average number of such dropped alleles was 0.76 per sample (range: 0–4). Of the 301 genotyped samples, 164 (54%) had at least one dropped allele. Not all alleles were equally likely to be dropped. Of the 218 true alleles, 170 were never dropped (
Considering only the 48 alleles that were dropped at least once, we found a significant negative correlation between the percentage of samples in which the allele was dropped and the average cluster size for alleles (r = −0.64, P = 1×10−7,
Each point indicates a single allele. Only alleles that were dropped in at least one sample are plotted (n = 43). Cluster size is calculated as the average of the proportion of reads represented by the allele among all samples containing the allele. The solid line indicates the best-fit linear regression. The 95% confidence band for the regression is indicated in gray. Alleles associated with the “divergent” allele cluster (see
We sequenced bacterial clones from a total of four different samples (cloning results are summarized in
sample ID | good clusters | dropped alleles | total alleles | number of clones | total matches | alleles matched | mismatched clones |
383 | 2 | 0 | 2 | 53 | 53 | 2 | 0 |
468 | 6 | 1 | 7 | 57 | 57 | 7 | 0 |
1024 | 11 | 0 | 11 | 9 | 9 | 5 | 0 |
1368 | 6 | 1 | 7 | 5 | 4 | 4 | 1 |
A mismatched clone sequence is a sequence that did not exactly match one of the alleles identified by STC. The one mismatched clone in sample 1368 was a chimeric sequence of two other alleles.
We amplified 21 of our samples twice, in independent PCR reactions, in order to assess the repeatability of the STC algorithm on the same samples (hereafter duplicates are referred to as “B” samples,
sample ID | # reads (A) | # reads (B) | # of alleles (A) | # of alleles (B) | # of alleles shared | same alleles? | missing allele(s) |
1385 | 270 | 143 | 7 | 7 | 7 | yes | - |
1397 | 765 | 114 | 7 | 7 | 7 | yes | - |
403 | 127 | 196 | 9 | 8 | 8 | no | #83 |
82 | 400 | 440 | 7 | 6 | 6 | no | #655 |
1361 | 285 | 149 | 5 | 4 | 4 | no | #83 |
388 | 1000 | 109 | 6 | 3 | 3 | no | #83, #130, #162 |
1076 | 158 | 176 | 7 | 8 | 1 | no | - |
The correlation between sample library size (number of reads) and allele number ranged from r = 0.12 to r = 0.29 in the four populations (
Each point indicates a genotyped sample. Samples with more than 1000 reads were sub-sampled to 1000 reads. No B (duplicate) samples were included to avoid pseudo-replication. The lines indicate the best-fit linear regressions for each population. Solid lines indicate correlations that are statistically significant at α = 0.05. Dashed lines indicate correlations that are not statistically significant. The 95% confidence bands for each regression are indicated in gray.
To determine whether increasing the minimum sample library size could eliminate the correlations between library size and allele number, we repeated the above correlation estimations for all four populations using a range of minimum sample sizes (80 to 600 reads). In all populations, the correlation coefficient decreased as the minimum sample size increased, although the extent and rate of decrease varied among populations (
The correlation coefficients shown in
The mean number of alleles per individual fish differed both between population pairs (Roberts vs. Farewell; F = 6.99, df = 1, P = 0.008) and between habitats (lake vs. stream; F = 9.53, df = 1, P = 0.002) (
Our populations also differed from each other in their overall MHC allele compositions (
Allele frequency is calculated as the proportion of individuals carrying each allele in each population. Points closer to the dotted line indicate alleles with similar frequencies in the lake and stream populations. Red circles indicate alleles that differed significantly in frequency between lake and stream habitats (within pairs). Blue circles indicate non-significant differences in frequency. Filled black circles indicate that the allele remained significantly different after controlling for multiple comparisons (false discovery rate = 5%). Note that frequencies are plotted on a logit (non-linear) scale.
In addition to the overall significant differences, we also found a number of alleles within each pair that differed significantly in frequency (percentage of individuals where the allele was found) between habitats (
Of the alleles that were significantly different in each pair after controlling for multiple comparisons, our sub-sampling analysis indicated that genotyping only half as many samples would allow us to find the same significant differences between 0 and 91.8% of the time for individual alleles in Roberts (
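The sub-sampling analysis can be sketched as below, using invented carrier data and a permutation test on carrier proportions as a stand-in for whatever exact test was used (the function names, carrier counts, and replicate numbers are all assumptions for this illustration):

```python
# Sketch of the sub-sampling power analysis. We repeatedly genotype a random
# half of the fish from each habitat, re-test the lake-vs-stream carrier
# frequency difference for one allele, and record how often the difference
# remains significant at alpha = 0.05.

import random

def perm_pvalue(lake, stream, n_perm=500, rng=random):
    """lake/stream: lists of 0/1 carrier flags for one allele."""
    observed = abs(sum(lake) / len(lake) - sum(stream) / len(stream))
    pooled = lake + stream
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(lake)], pooled[len(lake):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / n_perm

def half_sample_power(lake, stream, n_rep=100, alpha=0.05):
    rng = random.Random(0)
    sig = 0
    for _ in range(n_rep):
        sub_l = rng.sample(lake, len(lake) // 2)
        sub_s = rng.sample(stream, len(stream) // 2)
        if perm_pvalue(sub_l, sub_s, rng=rng) < alpha:
            sig += 1
    return sig / n_rep

# Invented carrier data: allele present in 62% of lake fish, 12% of stream fish.
lake = [1] * 50 + [0] * 30
stream = [1] * 10 + [0] * 70
print(half_sample_power(lake, stream))
```

For a strong frequency difference like the invented one above, the half-sample power approaches 1; for alleles with small between-habitat differences it falls toward 0, spanning the 0–91.8% range reported in the text.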
Next generation technologies are quickly replacing traditional sequencing and conformation-based approaches as the methods most suitable for sequencing multi-locus genes like those of the MHC
Previous approaches to genotyping MHC genes using next-generation sequencing technologies have typically struggled with two problems. First, there is a range of read frequencies within which there will be a substantial number of both true allelic sequences and artifactual sequences, making it difficult to adequately balance the rate of false positives and false negatives during genotyping. Second, it can be difficult to distinguish whether two similar, but relatively frequent, sequences represent two alleles or one allele and one artifact derived from that allele. One recent previous approach
One concern of previous authors has been ensuring that enough reads are present in a given sample library for accurate genotyping. Previous methods have taken a probabilistic approach based on multinomial distributions to determine the minimum number of reads required to estimate a genotype at some level of confidence
Second, minimum size recommendations assume that identifying all alleles in all samples is necessary for accurately estimating all the parameters or testing all the hypotheses one might be interested in. This is not necessarily true. If the goal is to estimate population-level diversity (i.e. how many alleles are present in the population), having some samples with incomplete genotypes due to fewer reads is unlikely to change the population estimate. In fact, in such situations it would be better to genotype more samples than to increase the number of reads per sample, because additional samples increase the chances of finding rare alleles. Alternatively, if the final goal is to compare individual-level diversity with some other measured phenotype (e.g. parasite burden), missing an allele in a few samples may alter the effect size estimate only slightly. In those cases, one possibility would be to use read number as a covariate in downstream analyses, or to weight individual allelic diversities by the number of reads. If the final goal is to use the presence or absence of a given allele in association tests with other phenotypes (e.g. parasite infection), a small number of false negatives due to incomplete genotyping will, at worst, reduce statistical power but is unlikely to bias associations one way or the other. Our overall results suggest that having at least 200 reads was sufficient to reduce the bias introduced by small library sizes, although even this number may be too large if the highly divergent, very low efficiency alleles are removed from the analysis. Ideally, we would recommend targeting at least 50 reads per allele for a sample with mean allelic diversity (i.e. a target of ∼350 reads per sample if the expected average allele number is 7), or more as funds allow, because there will always be variability in the number of reads per sample in any given run.
Even if library sizes are adequate for genotyping, it is possible that some alleles may amplify with relatively lower efficiency, either because of differences in the primer or amplicon sequence, or because of differences in PCR conditions between reactions and runs. STC allows for the identification of such alleles in two ways. First, alleles that amplify with low efficiency will have, on average, lower relative cluster sizes than other alleles. Second, alleles that tend to amplify with low efficiency will be more likely to be dropped during phase 3 before being added back to genotypes during phase 4. We showed that cluster size and rate of dropping in phase 3 were moderately negatively correlated in our data set, and plotting the relationship suggested that a group of highly divergent alleles were especially likely to be dropped (
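The two diagnostics described above can be sketched over hypothetical per-sample results (the function name, the input tuple format, and the example observations are all invented; only the two metrics — mean relative cluster size and phase-3 drop rate — follow the text):

```python
# Sketch of the low-efficiency-allele diagnostics. For each allele we compute
# (1) its mean relative cluster size among samples that carry it and (2) the
# fraction of those samples in which it was dropped in phase 3 before being
# rescued in phase 4. Low-efficiency alleles should score low on (1) and
# high on (2).

def efficiency_diagnostics(observations):
    """observations: list of (allele, cluster_reads, library_reads, was_dropped)."""
    stats = {}
    for allele, reads, lib, dropped in observations:
        rec = stats.setdefault(allele, {"sizes": [], "dropped": 0, "n": 0})
        rec["sizes"].append(reads / lib)
        rec["dropped"] += dropped
        rec["n"] += 1
    return {a: (sum(r["sizes"]) / r["n"], r["dropped"] / r["n"])
            for a, r in stats.items()}

# Invented observations: a divergent allele (#83) with small clusters that is
# often dropped, and a common allele (#103) with large clusters, never dropped.
obs = [("#83", 8, 400, 1), ("#83", 12, 300, 1), ("#83", 30, 350, 0),
       ("#103", 90, 400, 0), ("#103", 120, 300, 0)]
for allele, (mean_size, drop_rate) in sorted(efficiency_diagnostics(obs).items()):
    print(allele, mean_size, drop_rate)
```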
One of our divergent alleles (#83) has been previously reported by Lenz et al.
Aside from stochastic or amplification bias, cases where alleles are represented by only one or two reads could also result from sequencing errors in the barcodes, such that reads are assigned to the incorrect sample (fairly unlikely given the redundancy of our barcodes). Alternatively, a very small amount of contamination during PCR setup could introduce alleles from other samples into the PCR template. Previous approaches to genotyping have required at least three reads of a given sequence in at least two samples to consider it as a possible true allele
An important feature of STC is that it can be applied, in theory, to any data set of MHC allele sequences (or other multi-allelic amplicon from a gene family) generated by NGS technology, as long as 1) many reads can be generated for individual samples and 2) individual sample libraries can be subset from the entire library using barcoding or other techniques. As other sequencing technologies increase their average read lengths, researchers will likely begin to shift away from pyrosequencing to alternative technologies (e.g. Illumina). There are a number of issues to consider when applying STC to data sets generated using such alternatives. First, Illumina sequencing typically produces many more reads per sequencing run than does pyrosequencing. However, sequencing facilities can often target a very specific number of sequences with barcoding, and thus users should be able to specify a smaller target number of reads based on the number of samples and desired coverage. Second, Illumina sequencing is more prone to substitution errors than to indel errors. In phase 2 of STC, sequences are combined when they are very similar but of different lengths, meaning that with Illumina data few sequences may be combined. Future implementations of STC for Illumina data could skip phase 2 entirely, or phase 2 could be modified to combine sequences where a single substitution disrupts the open reading frame (i.e. creates a stop codon). Third, Illumina error rates tend to be an order of magnitude lower (0.3–4% versus 12%) than those of pyrosequencing. This means that the dominance threshold (δ) will likely need to be increased to around 9:1 rather than the 4:1 currently implemented for pyrosequencing. One of the overall advantages of STC is its flexibility, giving the user the ability to adjust the STC parameters based on the number of samples, the number of amplified loci, and expected error rates.
Although we have not yet implemented STC for non-pyrosequenced data, we see no reason why STC could not be applied directly to Illumina-generated sequence data with minimal modifications to the STC protocol.
One of the advantages of STC is that all the reads present in the sample library contribute to genotyping. This could provide a number of benefits beyond genotyping itself. First, by grouping reads into clusters that derive from true alleles, relative amplification efficiencies could be estimated more accurately than if they were estimated only from reads that match alleles directly
We would like to thank Hans Hofmann for crucial discussion during the early stages of STC development and Scott Hueneke-Smith for his expertise with NGS technologies. We would also like to thank Emily Parham, Graham Segal, and Sam Thompson for performing the vast majority of the cloning.