The authors have declared that no competing interests exist.
Conceived and designed the experiments: VP LG MM. Performed the experiments: VP LG JF. Analyzed the data: VP LG AM. Contributed reagents/materials/analysis tools: VP LG JF. Wrote the paper: VP LG.
In hematopoietic stem cell transplantation, donor selection is based primarily on matching donor and patient HLA genes. These genes are highly polymorphic and their typing can result in exact allele assignment at each gene (the resolution at which patients and donors are matched), but it can also result in a set of ambiguous assignments, depending on the typing methodology used. To facilitate rapid identification of matched donors, registries employ statistical algorithms to infer HLA alleles from ambiguous genotypes. Linkage disequilibrium information encapsulated in haplotype frequencies is used to facilitate prediction of the most likely haplotype assignment. An HLA typing with less ambiguity produces fewer high-probability haplotypes and a more reliable prediction. We estimated ambiguity for several HLA typing methods across four continental populations using an information theory-based measure, Shannon's entropy. We used allele and haplotype frequencies to calculate entropy for different sets of 1,000 subjects with simulated HLA typing. Using allele frequencies we calculated an average entropy in Caucasians of 1.65 for serology, 1.06 for allele family level, 0.49 for a 2002-era SSO kit, and 0.076 for single-pass SBT. When using haplotype frequencies in entropy calculations, we found average entropies of 0.72 for serology, 0.73 for allele family level, 0.05 for SSO, and 0.002 for single-pass SBT. Application of haplotype frequencies further reduces HLA typing ambiguity. We also estimated expected confirmatory typing mismatch rates for simulated subjects. In a hypothetical registry with all donors typed using the same method, the entropy values based on haplotype frequencies correspond to confirmatory typing mismatch rates of 1.31% for SSO versus only 0.08% for SBT. Intermediate-resolution single-pass SBT contains the least ambiguity of the methods we evaluated and therefore the most certainty in allele prediction. 
The presented measure objectively evaluates HLA typing methods and can help define acceptable HLA typing for donor recruitment.
The Human Leukocyte Antigen (HLA) gene system on Chromosome 6 is one of the most polymorphic regions of the human genome and one of the most extensively studied regions due to its importance in transplantation and association with autoimmune, infectious and inflammatory diseases
The high polymorphism in HLA presents a challenge when it comes to typing HLA genes. The typing has historically been performed using serological antibody tests, which are able to identify HLA protein variants on the surface of the cell using antigen-specific antibodies
DNA-based methods identify HLA alleles by interrogating the nuclear DNA sequence and can result in different levels of ambiguity depending on typing methodology or test kits used. HLA typing methods, their corresponding formats and abbreviations used in this paper are given in
Typing Method | Description | Abbreviation | Example Typing
Serology | Identifies HLA protein on the cell surface using antigen-specific antisera; broad antigen typing | SERO | A9, A28
Serology | Split antigen typing | SERO | A24, A68
DNA-based | Identifies HLA alleles by interrogating DNA; allele family level (two-digit resolution) | DNA2 | A
DNA-based | Sequence-specific oligonucleotides, reported as allele codes | SSO | A
DNA-based | Sequence-based typing, single-pass SBT | SBT | A
DNA-based | Exact alleles (high resolution) | | A
Since the expression has not been confirmed for the majority of alleles described in IMGT/HLA, we use high-resolution here to denote the amino acid sequence of the exons encoding the antigen binding domains.
In HSCT the selection of donors for a patient in need of a transplant is based primarily on HLA matching, and the lower the ambiguity of typing the easier it is to determine the probability of allele level match during the donor search
Some previous work has been done in measuring typing ambiguity – first a measure developed by Helmberg et al.
Both previous typing ambiguity studies used allele frequencies in their computations and, as we will later show, fail to demonstrate the advantage of linkage disequilibrium information contained in haplotype frequencies when it comes to reducing ambiguity and improving predictions of patient-donor matching. We use haplotype frequencies and find that ambiguity is reduced considerably compared to using allele frequencies, which indicates that this advance in strategy for identifying matched donors has had a significant positive impact. In addition, we take this methodology further and use it to evaluate several different typing methods, and directly compare them with respect to the inherent ambiguity measured by entropy. To express the impact of typing method ambiguity in more relatable terms, we also developed a measure we call the Confirmatory Typing (CT) Mismatch Rate, which gives the average probability across a set of patients that a mismatch would occur between the patient and donor when high-resolution confirmatory typing is performed on the ambiguously typed donors in a uniformly typed registry.
We show that entropy can be used to objectively compare methods of HLA typing to each other in terms of the information they provide, in the context of each individual population. Our results show that intermediate-resolution single-pass sequence-based typing (SBT) reported in genotype list format contains the least ambiguity and, therefore, the most certainty in allele prediction across all populations. We examine the benefit of using haplotype frequencies in entropy calculations versus allele frequencies. Neighboring HLA and non-HLA genes are highly correlated and major efforts have been directed at describing linkage disequilibrium (LD) across the region
The naming of HLA alleles is standardized and regulated by the World Health Organization (WHO) Nomenclature Committee for Factors of the HLA System
Before DNA-based HLA typing methods were developed, serological testing identified sets of alleles with similar reactivity. Two-digit level resolution is the lowest HLA typing resolution reported by typing laboratories today. For these lower-resolution formats, alleles in the same family as A*01:01 are reported as A1 using serological methods (abbreviated SERO), or as a truncated result of intermediate-level DNA-based typing (referred to as DNA2 in this text) as A*01 or A*01:XX.
A commonly used intermediate-resolution format is the one using NMDP
Sequence-specific oligonucleotides (SSO) typing results are reported in this format for this study, where each allele code represents two or more alleles. For example, an allele reported as A*01:AB can be either A*01:01 or A*01:02, and an allele reported as A*26:JGSJ can be any of the following three: A*25:13, A*26:01, A*26:52. Therefore, the number of combinations for an ambiguous allele pair reported in this format increases multiplicatively; that is, the allele pair (A*01:AB, A*26:JGSJ) will have six possible pairwise combinations (A*01:01, A*25:13 or A*01:01, A*26:01 or A*01:01, A*26:52 or A*01:02, A*25:13 or A*01:02, A*26:01 or A*01:02, A*26:52). Ambiguous sequence-based typing (SBT) is reported in the format of genotype lists for this study, that is, in the form of several possibilities for the pairs of alleles an individual carries (A*24:02,A*68:01 or A*24:03,A*68:01 or A*24:04,A*68:01). Because some ambiguous genotype lists in single-pass SBT results cross several allele families, allele codes could be used to represent all typing results. However, genotype list representation has the advantage of showing that some genotype possibilities, added implicitly when compressing to allele code format, are not possible.
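The multiplicative growth described above can be illustrated with a minimal sketch. The lookup table `ALLELE_CODES` and the helper names are hypothetical; the table holds only the two codes quoted in the text, whereas a real expansion would rely on the full NMDP allele code list.

```python
from itertools import product

# Hypothetical lookup table containing only the two codes quoted in the text.
ALLELE_CODES = {
    "A*01:AB": ["A*01:01", "A*01:02"],
    "A*26:JGSJ": ["A*25:13", "A*26:01", "A*26:52"],
}


def expand_allele(allele):
    """Return the alleles an entry stands for (the entry itself if it is not a code)."""
    return ALLELE_CODES.get(allele, [allele])


def expand_pair(first, second):
    """List every allele pair implied by two (possibly coded) alleles."""
    return list(product(expand_allele(first), expand_allele(second)))


pairs = expand_pair("A*01:AB", "A*26:JGSJ")
for a, b in pairs:
    print(a, b)
print(len(pairs), "possible allele pairs")
```

The number of pairs is the product of the two expansion sizes (here 2 x 3), which is why allele-code typings can become highly ambiguous even when each code stands for only a few alleles.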
We used high-resolution haplotype frequencies generated from unrelated donors from the National Marrow Donor Program (NMDP) database for four principal population categories defined by the United States census: African American (AFA), Caucasian (CAU), Hispanic (HIS) and Asian/Pacific Islander (API)
Population | Description | # of 3-locus Haplotypes | # of HLA-A Alleles | # of HLA-B Alleles | # of HLA-DRB1 Alleles
AFA | African American | 3,049 | 68 | 107 | 59
CAU | Caucasian | 5,214 | 97 | 158 | 70
API | Asian-Pacific Islander | 2,157 | 56 | 102 | 62
HIS | Hispanic | 3,102 | 75 | 138 | 62
This table shows four population groups and their corresponding haplotype frequencies used for the simulation of samples in this study. The data contains frequencies for three-locus haplotypes (A∼B∼DRB1). The table also shows the number of unique HLA-A, HLA-B and HLA-DRB1 alleles present in the haplotypes for each population group.
To generate simulated typings for different HLA typing methods, we first sampled two haplotypes from the high-resolution population haplotype frequency data set
While we use HLA nomenclature Version 3 style formatting to describe HLA alleles in this paper, we simulated the four typing methods for Version 2.28 of the IMGT-HLA database, to more closely match the time in which the typing results used to generate the haplotype frequencies were reported. To generate serologic typing (SERO), we used the HLA dictionary, which allows each HLA allele to be mapped to a serologic equivalent (e.g. B*15:02 maps to B75)
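As a rough illustration of the simulation steps described above, the sketch below draws two haplotypes in proportion to their frequencies, discards phase to obtain an unphased genotype, and then derives a lower-resolution typing. The frequency values, the function names, and the single-entry serologic map are hypothetical placeholders (only the B*15:02 to B75 equivalence comes from the text); it is not the simulation code used in the study.

```python
import random

# Hypothetical three-locus (A~B~DRB1) haplotype frequencies, for illustration only.
HAPLOTYPE_FREQS = {
    ("A*01:01", "B*08:01", "DRB1*03:01"): 0.6,
    ("A*03:01", "B*07:02", "DRB1*15:01"): 0.3,
    ("A*02:01", "B*15:02", "DRB1*04:01"): 0.1,
}

# Single serologic equivalent quoted in the text; a real map would come from the HLA dictionary.
SERO_EQUIVALENT = {"B*15:02": "B75"}


def sample_haplotype_pair(freqs, rng):
    """Draw two haplotypes independently, weighted by their frequencies."""
    haplotypes = list(freqs)
    weights = [freqs[h] for h in haplotypes]
    return tuple(rng.choices(haplotypes, weights=weights, k=2))


def to_unphased_genotype(hap_pair):
    """Group alleles by locus; sorting each pair discards phase information."""
    return [tuple(sorted(locus)) for locus in zip(*hap_pair)]


def to_dna2(allele):
    """Truncate an allele to its allele family, e.g. 'A*01:01' -> 'A*01'."""
    return allele.split(":")[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hap_pair = sample_haplotype_pair(HAPLOTYPE_FREQS, rng)
genotype = to_unphased_genotype(hap_pair)
dna2 = [tuple(to_dna2(a) for a in locus) for locus in genotype]
print(genotype)
print(dna2)
```

Keeping the sampled haplotype pair alongside the derived typing is what later allows the "true" genotype to be compared against the ambiguous result.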
Shannon's entropy quantifies the amount of uncertainty or disorder associated with a particular system, and is widely used in a variety of applications, such as genetics
Entropy can be thought of as the ambiguity or impurity present in a system of interest. If, in a set of typing results for one individual, all of them are equally likely (frequent) then the entropy is the highest as we have the least information to choose the most likely real genotype.
To illustrate the usefulness of entropy in measuring typing ambiguity we show an example of two typing results with the same number of ambiguities (
Typing 1: Ambiguities | Relative Frequency | Entropy Term (-p log2 p) | Typing 2: Ambiguities | Relative Frequency | Entropy Term (-p log2 p)
B*5702/B*5801 | 0.0923 | 0.3172 | DRB1*0301/DRB1*1301 | 0.9989 | 0.0016
B*5702/B*5802 | 0.0832 | 0.2985 | DRB1*0301/DRB1*1327 | 0.0005 | 0.0056
B*5702/B*5804 | 0.0001 | 0.0016 | DRB1*0304/DRB1*1301 | 0.0002 | 0.0024
B*5703/B*5801 | 0.4331 | 0.5229 | DRB1*0304/DRB1*1327 | 0.0000 | 0.0000
B*5703/B*5802 | 0.3907 | 0.5297 | DRB1*0306/DRB1*1301 | 0.0004 | 0.0044
B*5703/B*5804 | 0.0006 | 0.0063 | DRB1*0306/DRB1*1327 | 0.0000 | 0.0000
Shown here are typing results for two simulated subjects. They each have six ambiguous sub-types (allele-pairs), but very different entropies. Typing 1 has entropy H = 1.676 and Typing 2 has entropy H = 0.014. Both of these typings come from the same population sample (African American) in which the mean allele entropy for this simulated sample is H = 0.54.
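The two entropies quoted for this example can be reproduced from the tabulated relative frequencies with a short sketch. It assumes a base-2 logarithm, which is consistent with the per-row values in the table, and renormalizes the rounded frequencies so they sum to one; the function name is our own.

```python
import math


def shannon_entropy(freqs):
    """Shannon entropy (base 2) of a list of relative frequencies."""
    total = sum(freqs)
    entropy = 0.0
    for f in freqs:
        p = f / total  # renormalize in case of rounding
        if p > 0:      # a zero-frequency outcome contributes nothing
            entropy -= p * math.log2(p)
    return entropy


# Relative frequencies of the six allele pairs in each typing, as tabulated.
typing1 = [0.0923, 0.0832, 0.0001, 0.4331, 0.3907, 0.0006]
typing2 = [0.9989, 0.0005, 0.0002, 0.0000, 0.0004, 0.0000]

h1 = shannon_entropy(typing1)
h2 = shannon_entropy(typing2)
print(f"Typing 1: H = {h1:.3f}")
print(f"Typing 2: H = {h2:.3f}")
```

Both typings list six allele pairs, yet Typing 2 is dominated by a single pair and so carries almost no uncertainty. Six equally likely pairs would give the maximum possible value, log2(6), or about 2.585.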
The HLA haplotype frequencies we used in this study were estimated by the expectation-maximization algorithm (EM) described in
To compute locus entropy using haplotype frequencies, or
As a side note, an ambiguous typing with many possible alleles at each locus can result in a large combinatorial number of possible haplotype pairs. Given a fully heterozygous case of three-locus un-phased genotype with
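To make the combinatorial growth concrete, the following sketch enumerates the haplotype pairs consistent with a single unphased genotype. For a genotype that is heterozygous at all n loci this yields 2^(n-1) distinct unordered pairs, so four for three loci. The function names are hypothetical, and the pair probability shown assumes random pairing of haplotypes (Hardy-Weinberg proportions), which is our assumption rather than a statement of the study's exact computation.

```python
from itertools import product


def haplotype_pairs(genotype):
    """Enumerate distinct unordered haplotype pairs for an unphased genotype.

    genotype is a list with one (allele, allele) tuple per locus.
    """
    pairs = set()
    # For each locus, choose which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = tuple(locus[c] for locus, c in zip(genotype, choice))
        hap2 = tuple(locus[1 - c] for locus, c in zip(genotype, choice))
        pairs.add(frozenset((hap1, hap2)))  # frozenset ignores pair order
    return pairs


def pair_probability(pair, freqs):
    """Probability of a haplotype pair, assuming random pairing of haplotypes."""
    haps = tuple(pair)
    if len(haps) == 1:  # both haplotypes identical
        return freqs.get(haps[0], 0.0) ** 2
    return 2 * freqs.get(haps[0], 0.0) * freqs.get(haps[1], 0.0)


genotype = [
    ("A*01:01", "A*03:01"),
    ("B*07:02", "B*08:01"),
    ("DRB1*03:01", "DRB1*15:01"),
]
pairs = haplotype_pairs(genotype)
print(len(pairs), "haplotype pairs")
```

When each locus additionally carries several candidate alleles, this enumeration is repeated for every candidate genotype, which is where the large combinatorial numbers arise.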
Besides objective evaluation of typing methodologies employed in typing the HLA region, using the entropy approach to measure ambiguity has another application from a clinical perspective, namely its direct relationship to confirmatory typing (CT) mismatch rates. For a given patient, high resolution CT is done to confirm the patient-donor match from a selected set of donors. A case where the high resolution typings mismatch is called a CT mismatch. We compute CT mismatch rates on the same simulated donor sample by comparing the ambiguous typing and the exact haplotype pair that was used to generate that ambiguous typing. As described in a previous section, each ambiguous genotype can generate multiple HLA haplotype pairs, the true one being the haplotype pair we used to simulate the ambiguous typing. The CT mismatch rate for each locus is computed as the summation of frequencies of all allele pairs (computed using
Locus entropies obtained for SBT, SSO, allele family level DNA2 and SERO typing methods are shown in
This figure shows locus entropies obtained for SBT, SSO, DNA2 and SERO typing formats, within each population and for three HLA loci using allele frequencies, that is, the
Typing Method | AFA | API | CAU | HIS
SBT | 0.0354 | 0.0668 | 0.0766 | 0.0787
SSO | 0.2974 | 0.5247 | 0.49 | 0.5004
DNA2 | 1.8501 | 2.023 | 1.056 | 2.1709
SERO | 1.7735 | 2.2282 | 1.6548 | 2.9471
This table shows the locus entropy for SBT, SSO, DNA2 and SERO typing methods for all four populations using allele frequencies and averaged over the three loci, HLA-A, -B, -DRB1.
When we used haplotype instead of allele frequencies, we got the same ambiguity ranking (
This figure shows locus entropies obtained for SBT, SSO, DNA2 and SERO typing formats, within each population and for three HLA loci using haplotype frequencies, that is, the
This figure shows the comparison between
Typing Method | AFA | API | CAU | HIS
SBT | 5.30E-04 | 0.0031 | 0.002 | 0.005
SSO | 0.0923 | 0.2074 | 0.0477 | 0.1195
DNA2 | 1.2685 | 1.6219 | 0.7277 | 1.7633
SERO | 1.2488 | 1.3889 | 0.7205 | 1.7495
This table shows the locus entropy for SBT, SSO, DNA2 and SERO typing methods for all four populations using haplotype frequencies and averaged over the three loci, HLA-A, -B, -DRB1.
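The size of the reduction can be read off by comparing the two entropy tables. The short sketch below does this arithmetic for the CAU column, reading the table rows in the order SBT, SSO, DNA2, SERO as listed in the captions; the dictionary and function names are our own.

```python
# Mean locus entropies for the CAU sample, copied from the two tables
# (rows read in the order SBT, SSO, DNA2, SERO).
ALLELE_ENTROPY_CAU = {"SBT": 0.0766, "SSO": 0.49, "DNA2": 1.056, "SERO": 1.6548}
HAPLOTYPE_ENTROPY_CAU = {"SBT": 0.002, "SSO": 0.0477, "DNA2": 0.7277, "SERO": 0.7205}


def relative_reduction(method):
    """Fraction of allele-frequency entropy removed by using haplotype frequencies."""
    return 1.0 - HAPLOTYPE_ENTROPY_CAU[method] / ALLELE_ENTROPY_CAU[method]


for method in ("SBT", "SSO", "DNA2", "SERO"):
    print(f"{method}: {relative_reduction(method):.1%} lower entropy")
```

On these figures the relative reduction is largest for the DNA-based SBT and SSO typings and smallest for allele family level typing.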
To demonstrate the impact of typing method ambiguity in a clinical setting we computed CT mismatch rates, which give the average probability that a mismatch would occur between a patient and donor when high resolution confirmatory typing is performed on the ambiguously typed donors in a uniformly typed registry. CT mismatch rates computed on the same sample of 1000 simulated donors are shown in
Typing Method | AFA | API | CAU | HIS
SBT | 2.0135e-04 | 9.3340e-04 | 8.0012e-04 | 0.0015
SSO | 0.0236 | 0.0530 | 0.0131 | 0.0324
DNA2 | 0.3792 | 0.4286 | 0.4069 | 0.4366
SERO | 0.3755 | 0.3724 | 0.1957 | 0.4334
This table shows the expected confirmatory typing (CT) mismatch rates for SBT, SSO, DNA2 and SERO typing formats and four populations, averaged across all three loci (HLA-A, -B, -DRB1). CT mismatch rates describe the probability that a mismatch would occur between a patient and donor during high-resolution confirmatory typing on the ambiguously typed donors in a uniformly typed registry.
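One way to read the CT mismatch rate is sketched below, under the assumption that the per-subject mismatch probability is the normalized frequency mass of every candidate allele pair other than the true one, and that the rate is the mean of this quantity over subjects. The function names are hypothetical, the candidate frequencies are taken from the two example typings shown earlier, and the choice of which pair is "true" is made up purely for illustration.

```python
def ct_mismatch_probability(candidates, true_pair):
    """Normalized frequency mass of all candidate allele pairs except the true one."""
    total = sum(candidates.values())
    return 1.0 - candidates.get(true_pair, 0.0) / total


def mean_ct_mismatch_rate(subjects):
    """Average mismatch probability over (candidates, true_pair) tuples."""
    return sum(ct_mismatch_probability(c, t) for c, t in subjects) / len(subjects)


typing1 = {
    "B*5702/B*5801": 0.0923, "B*5702/B*5802": 0.0832, "B*5702/B*5804": 0.0001,
    "B*5703/B*5801": 0.4331, "B*5703/B*5802": 0.3907, "B*5703/B*5804": 0.0006,
}
typing2 = {
    "DRB1*0301/DRB1*1301": 0.9989, "DRB1*0301/DRB1*1327": 0.0005,
    "DRB1*0304/DRB1*1301": 0.0002, "DRB1*0304/DRB1*1327": 0.0000,
    "DRB1*0306/DRB1*1301": 0.0004, "DRB1*0306/DRB1*1327": 0.0000,
}

# Hypothetical choice of true allele pairs, for illustration only.
subjects = [(typing1, "B*5703/B*5801"), (typing2, "DRB1*0301/DRB1*1301")]
m1 = ct_mismatch_probability(*subjects[0])
m2 = ct_mismatch_probability(*subjects[1])
rate = mean_ct_mismatch_rate(subjects)
print(f"{m1:.4f} {m2:.4f} {rate:.4f}")
```

Under this reading, a low-entropy typing such as the second example yields a mismatch probability near zero, whereas the high-entropy first example leaves more than half of the probability mass on alternatives.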
We have shown that entropy can be used to objectively compare methods of HLA typing in terms of the information they provide. The calculation of per-locus entropy using haplotype frequencies has a direct application in measuring the impact of using haplotype frequencies to predict the likelihood of allele match for stem cell registry matching algorithms. The LD information contained in haplotype frequencies reduces the entropy considerably compared to using allele frequencies, showing that this strategy for identifying matched donors has a significant positive impact.
No objective quantitative comparisons between SBT and SSO methods have been available to date. Typing laboratories may choose SSO methods over SBT methods primarily based on cost savings achieved through easier set-up, staff training, pre-packaged kits, and automation. However, these apparent cost savings may come at the price of higher typing ambiguity. We have shown that single-pass SBT typing performs far better at distinguishing alleles than 2002-era SSO typing. However, currently available SSO typing kits used for recruitment typing have more oligonucleotide probes and thus are able to distinguish more alleles, which may result in entropy as low as that of single-pass SBT. Given equal cost, registries should utilize laboratories that employ HLA typing methods that achieve lower entropy for their population.
Our objective measure of typing ambiguity can be advantageously applied to the continual improvement of all methods of HLA typing. Design of SSO kits could be done in silico using population haplotype frequencies and sequence information from the IMGT-HLA database
As new alleles are discovered, SSO kits are often altered to add more probes so that typing results do not cross allele families and thus meet current guidelines for acceptable recruitment typing. These new probes will not decrease entropy appreciably as the frequency of a newly discovered allele tends to remain very low.
Because of sample size limitations, many of the rare alleles described in IMGT-HLA were not observed in our samples. However, rare alleles do not have a significant impact on entropy calculations. Owing to the logarithmic nature of Shannon's entropy, an allele with a very small frequency
Our methods of HLA typing method evaluation can also be applied to next-generation sequencing technologies. Recently the Roche 454 sequencing platform has been employed for HLA typing in research rather than recruitment
A consideration specifically related to the HLA typing method is the representation of the ambiguous allele data derived from the HLA typing, which was also measured using entropy. The 2-digit DNA typing resolution is in practice a result of incomplete reporting of SSO, SSP, or SBT typing data. The higher entropy of this type of data shows the value in reporting the complete information available from the HLA typing platform rather than rounding to the allele family level. The genotype list representation yields a slightly lower entropy than the NMDP allele code representation. Genotype list representation allows for the exclusion of some genotypes that have been ruled out by the HLA typing method, but would still be included in the Cartesian product of the alleles listed in the NMDP allele codes.
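The difference between the two representations can be seen in a small sketch. The genotype list below is hypothetical and chosen so that compressing it into per-position allele sets, as an allele code pair would, reintroduces genotypes the typing had ruled out; the function name is our own.

```python
from itertools import product


def cartesian_expansion(genotype_list):
    """Genotypes implied after compressing a genotype list to two allele sets."""
    first = sorted({g[0] for g in genotype_list})
    second = sorted({g[1] for g in genotype_list})
    return set(product(first, second))


# Hypothetical genotype list: only these two allele pairs remain possible.
genotype_list = {("A*01:01", "A*26:01"), ("A*01:02", "A*25:13")}

expanded = cartesian_expansion(genotype_list)
extra = expanded - genotype_list
print(len(genotype_list), "genotypes in the list")
print(len(expanded), "genotypes after compression")
print(sorted(extra))
```

The extra genotypes carry non-zero frequency mass in an entropy calculation, which is consistent with the slightly higher entropy observed for the allele code representation.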
Note that, in some populations, the HLA-B locus presents higher values of entropy when typed using 2-digit DNA methods than when typed at the serological level (
This analysis provides a path for defining acceptable HLA typing for recruitment as minimum requirements for entropy scores as a measure of typing ambiguity and for HLA data representation guidelines as a way to ensure that genotype lists are reported. Single-pass or highly automated SBT can result in HLA typings that cross allele families, which does not meet current minimum standards for recruitment typing at NMDP, yet we show that it provides a high-quality low-entropy HLA typing. In fact, we had to use the genotype list representation for the simulation of single-pass SBT typings because some allele combinations result in HLA typings for which no NMDP allele codes have been created due to required minimum standards that HLA allele codes do not generally cross allele families
We observe variation in entropy for the same HLA typing method between populations and loci. For example,
In addition to absolute differences in entropy between populations, we also observed differences between population groups in the effectiveness of using haplotype frequencies in decreasing haplotype entropy. Having higher levels of LD can improve the predictive capability of haplotype frequencies, and so African population samples with lower LD could have higher entropy than European populations, with higher LD, for this reason. In the same direction, higher HLA diversity would lead to higher entropy in African population samples than in European samples. The API sample may have relatively higher entropy than other population samples because the API frequency distribution constitutes an average of the frequency distributions of multiple distinct populations, and thus may be skewed more towards rare types than other populations in this study. If API entropy were evaluated using more detailed race subcategories (e.g. Japanese, Korean, Filipino, etc.), we would expect lower entropy values because the HLA diversity of each respective sub-region would be lower. The size of the population sample used to generate haplotype frequencies also plays a role in the entropy calculations in that a relatively larger sample, as we had for CAU compared to the other races, would give higher entropy. Because of these multiple confounding factors affecting entropy, we urge caution in using entropy as a measure to compare the HLA characteristics between samples of different ancestry. A further caveat is that the simulation framework implicitly has no sampling error or estimation error in the haplotype frequencies. In practice, uncertainty in the frequency estimates would be expected to lead to higher entropy, so our results are best treated as a practical lower bound.
For interpretation of between-locus entropy differences, we turn to the history of HLA nomenclature in that the allele families and serologic types were defined primarily using European samples. The naming of allele families was based loosely on serologic categories, and at some point in history newly discovered serologic patterns were no longer used to split up allele families. The discovery of new alleles also has an impact on entropy in that some populations have not been well-characterized for HLA and some individuals may have as yet un-described alleles that can result in some hidden entropy. In evaluating entropy at the locus level, we see that at the allele family level, the HLA-DRB1 locus has a higher entropy than HLA-A and HLA-B loci. The number of allele families defined for HLA-A and HLA-B is higher than that of HLA-DRB1, giving a lower entropy for typing resolution at the 2-digit or serologic levels, all else being equal.
Stem cell registries have been accruing HLA typing results for over 25 years, with continual advancement in typing methods during this period. The proportion of donors typed by each method changes over time in a searchable registry due to new donor recruitment, roll-off of donors exceeding the maximum age, reporting of primary HLA data, prospective typing, and high resolution typing on behalf of patients. With analysis of changes in HLA typing data for each donor over their time on the registry, it becomes possible to chart decreasing entropy in HLA typing over time and determine which typing methods were primarily responsible for this decrease. Entropy could also be applied as a selection factor for prospective typing projects in which some donors are upgraded to lower ambiguity typings.
In summary, the application of Shannon's entropy as a measure of HLA typing ambiguity has benefits throughout the lifecycle of HLA typing: in reagent design, lab reporting standards, donor recruitment typing guidelines, and registry matching algorithm performance evaluation.
The authors would like to thank Pedro Cano for his inspiration of this work during a visit to NMDP Bioinformatics Research in June 2011. The authors would also like to thank Michael Wright for his careful proofreading of the manuscript and constructive suggestions that significantly improved the manuscript.