A Complete Analysis of HA and NA Genes of Influenza A Viruses

Weifeng Shi; Fumin Lei; Chaodong Zhu; Fabian Sievers; Desmond G. Higgins

doi:10.1371/journal.pone.0014454

Abstract

Background

More and more nucleotide sequences of type A influenza virus are available in public databases. Although these sequences have been the focus of many molecular epidemiological and phylogenetic analyses, most studies only deal with a few representative sequences. In this paper, we present a complete analysis of all Haemagglutinin (HA) and Neuraminidase (NA) gene sequences available to allow large scale analyses of the evolution and epidemiology of type A influenza.

Methodology/Principal Findings

This paper describes an analysis and complete classification of all HA and NA gene sequences available in public databases using multivariate and phylogenetic methods.

Conclusions/Significance

We analyzed 18975 HA sequences and divided them into 280 subgroups according to multivariate and phylogenetic analyses. Similarly, we divided 11362 NA sequences into 202 subgroups. Compared to previous analyses, this work is more detailed and comprehensive, especially for the bigger datasets. Therefore, it can be used to show the full and complex phylogenetic diversity and provides a framework for studying the molecular evolution and epidemiology of type A influenza virus. More than 85% of the type A influenza HA and NA sequences in GenBank are each assigned to one unambiguous and unique group. Our results therefore serve as a genetic and phylogenetic annotation for influenza HA and NA sequences. In addition, sequences of swine influenza viruses come from 56 HA and 45 NA subgroups. Most of these subgroups also include viruses from other hosts, indicating cross-species transmission of the viruses between pigs and other hosts. Furthermore, the phylogenetic diversity of swine influenza viruses from Eurasia is greater than that of North American strains and both of them are becoming more diverse. Apart from viruses from humans, pigs, birds and horses, viruses from other species show very low phylogenetic diversity. This might indicate that viruses have not become established in these species. Based on current evidence, there is no simple pattern of inter-hemisphere transmission of avian influenza viruses and it appears to happen sporadically. However, for H6 subtype avian influenza viruses, such transmissions might have happened very frequently and multiple and bidirectional transmission events might exist.

Citation: Shi W, Lei F, Zhu C, Sievers F, Higgins DG (2010) A Complete Analysis of HA and NA Genes of Influenza A Viruses. PLoS ONE 5(12): e14454. https://doi.org/10.1371/journal.pone.0014454

Editor: Justin Brown, University of Georgia, United States of America

Received: June 29, 2010; Accepted: November 29, 2010; Published: December 29, 2010

Copyright: © 2010 Shi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This study was mainly supported by Science Foundation Ireland (PI grant 07/IN.1/B1783). FL was funded by CAS Innovation Program (KSCX2-YW-N-063) and NSFC (No. 30925008). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Influenza A virus is one of the most important pathogens that infect humans, other mammals, wild birds and poultry. The viral genomes are composed of eight separate segments encoding at least 11 proteins. Haemagglutinin (HA), encoded by the fourth segment, is an important glycoprotein and a major surface antigen which is responsible for attaching the virions to host cells and which influences pathogenicity and virulence [1]. The sixth segment encodes another glycoprotein, neuraminidase (NA), which is a second major surface antigen associated with the release of newly produced viral particles and with drug resistance [1]. So far, 16 HA and 9 NA subtypes of type A influenza virus have been identified [2] and more than one hundred of the possible 144 HA-NA combinations have been found [3].

Some attempts have been made to analyze the phylogenetic diversity and distribution of type A influenza viruses [4]–[7]. Among them, Liu and colleagues published a “panorama phylogenetic analysis” of all the 16 HA and 9 NA subtypes and they divided them into 68 HA and 49 NA lineages and sublineages [6]. This study provided a comprehensive framework for studying the evolutionary and epidemiological history of type A influenza virus. However, this study, like all previous analyses, selected only a small number of the available representative strains for phylogenetic analysis. In fact, they analyzed 1264 HA sequences and 1154 NA sequences. This limited sampling of sequences could make the results less conclusive and underestimate the real phylogenetic diversity of type A influenza, especially for human H1N1 and H3N2 influenza viruses. In addition, the nomenclature system that they proposed was still ambiguous and less effective in that for some sequences it was hard to find a lineage to which they belonged [5], [6].

The Influenza Virus Resource is a database of influenza sequences and associated information [8]. As of 31 October 2009, there were 22291 HA sequences and 13345 NA sequences of type A influenza viruses available. However, apart from subtype information, there is no other phylogenetic or genetic information available for the sequences. In particular, many sequences have never been analyzed, and others have been analyzed but have no corresponding references in PubMed.

Pigs are regarded as the main intermediate host in which avian influenza viruses acquire the genetic changes needed to infect humans. They have cell receptors for both human and avian influenza viruses [9], [10] and several reports have provided genetic evidence to support this view [11], [12]. Influenza viruses of H1N1, H1N2 and H3N2 subtypes circulate widely in pig populations. Besides these, viruses of other subtypes, such as H3N1, H4N6, H5N1, H5N2 and H9N2, have also been reported to infect pigs [13]. However, the phylogenetic diversity of all swine influenza viruses (SIV) is not clear to date.

Apart from viruses from humans, birds and pigs, viruses have been isolated from other species, such as horses, dogs and mink. For example, highly pathogenic H5N1 avian influenza viruses have been isolated from tiger [14], leopard [14], cat [15], [16], dog [17], [18] and pika [19]. Although a few cases of infection in these additional species have been reported, the phylogenetic diversity of these viruses has seldom been studied as a whole.

Migratory waterfowl of the world are natural reservoirs of avian influenza viruses (AIV) of all known subtypes. It has long been clear that AIV evolved into two separate lineages, the Eurasian and North American clades, due to geographic isolation [1] and there is limited virus exchange between them [20], [21]. Nonetheless, several inter-hemisphere transmission cases of AIV have been reported [22]–[24]. In particular, Olsen et al. proposed the global patterns of occurrence of influenza A virus in wild birds [25]. In addition, migratory birds are often regarded as being responsible for the wide and fast spread of highly pathogenic avian influenza (HPAI) H5N1 viruses in Eurasia and Africa [26]–[28], although this is still controversial [29]–[31]. Therefore, a complete analysis of inter-hemisphere transmission of AIV based on a large number of sequences is needed. To shed light on the above questions, a systematic and complete analysis using all HA and NA sequences available is needed. However, when the number of sequences exceeds a few hundred, it becomes difficult to visualize and analyze a phylogenetic tree, which is a standard device to study the molecular evolution and epidemiology of viral outbreaks.

Principal Coordinates Analysis (PCOORD) [32] has been used by us in the past [33] and the accompanying software has been used to analyze virus sequence variation [34]. PCOORD takes a matrix of Euclidean distances between a set of objects and returns a set of principal axes that try to preserve the distances. Multidimensional Scaling (MDS) methods work by finding a set of axes that minimize the “stress” between the original distance matrix and the distances between the plotted sequences [35]. If the distances are Euclidean, then this is known as classical MDS and is equivalent to PCOORD. The main difference is in the details of how the axes are calculated. MDS has also been used to visualize antigenic variation in influenza viruses [36] and there are fast algorithms and software available for applying this to very large data sets [37].

The main reason for not using PCOORD or MDS is the superior detail that can be seen by close inspection of phylogenetic trees and the desire for cladistic classification schemes. The main reason in favor of using these methods is the ability to analyze and visualize data sets of more or less any size. A further advantage is the ability to visualize relationships in which some sequences are intermediate between others, as happens when divergent virus sequences undergo recombination. Overall, we believe that a combination of methods is probably most useful, where PCOORD is used to find the main groupings and to look for outliers and intermediates and where detailed groupings are either confirmed or discovered by phylogenetic analysis of subsets of sequences.

In this paper, we compiled the largest datasets of HA and NA genes of influenza and analyzed more than 85% of all HA and NA gene sequences available in GenBank by combining PCOORD and traditional phylogenetic trees. This work provides a framework for studying the molecular evolution and epidemiology of type A influenza and sheds light on the phylogenetic diversity of SIV and inter-hemisphere transmission of AIV.

Materials and Methods

Nucleotide sequences were downloaded from the Influenza Virus Resource at the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html) by using each of the 16 HA and 9 NA subtypes as search queries on 31 October 2009. This gave 16 HA datasets and 9 NA datasets. Each dataset was aligned by Muscle [38] first and further adjusted manually in Bioedit [39]. The mature HA protein has two subunits, HA1 and HA2, connected by disulfide linkages [1]. HA1 typically has about 324 amino acids, while HA2 has about 222 amino acids. Many of the sequences of H1 to H9 and H15 were not full length and only HA1 was available. In order to include these sequences in the analysis, we removed the HA2 section from the alignment for those sequences which were full length. For the rest of the HA subtypes (H10 to H14, and H16), both HA1 and HA2 were used. In addition, sequences with more than 10 leading and/or 10 terminal gaps and lower quality sequences with more than 10 ambiguous nucleotide bases in each alignment were excluded from later analysis. This left 18975 out of 22291 HA sequences and 11362 out of 13345 NA sequences to be analyzed. This is more than 85% of all the sequences that were available at that time. Summary details of the HA and NA sequences from each subtype are given in Tables 1 and 2.
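The exclusion rule described above can be illustrated with a minimal Python sketch. The function names and the treatment of any character other than A, C, G, T, U or a gap as "ambiguous" are assumptions made for illustration; this is not the authors' filtering script.

```python
# Hypothetical helper names; thresholds follow the text (more than 10 => excluded).
UNAMBIGUOUS = set("ACGTU-")

def count_leading_gaps(aligned_seq):
    """Number of gap characters before the first non-gap character."""
    return len(aligned_seq) - len(aligned_seq.lstrip("-"))

def count_terminal_gaps(aligned_seq):
    """Number of gap characters after the last non-gap character."""
    return len(aligned_seq) - len(aligned_seq.rstrip("-"))

def count_ambiguous(aligned_seq):
    """Number of characters that are neither A/C/G/T/U nor a gap."""
    return sum(1 for base in aligned_seq.upper() if base not in UNAMBIGUOUS)

def keep_sequence(aligned_seq, max_gaps=10, max_ambiguous=10):
    """Return True if the aligned sequence passes all three quality checks."""
    return (count_leading_gaps(aligned_seq) <= max_gaps
            and count_terminal_gaps(aligned_seq) <= max_gaps
            and count_ambiguous(aligned_seq) <= max_ambiguous)
```

A sequence is kept only if it passes all three checks, which corresponds to the "and/or" wording of the exclusion criteria.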

Table 1. Summary information and geographical origin for all HA sequences.

https://doi.org/10.1371/journal.pone.0014454.t001

Table 2. Summary information and geographical origin for all NA sequences.

https://doi.org/10.1371/journal.pone.0014454.t002

A distance matrix for every dataset was calculated using Kimura's two-parameter model [40]. This was performed using DNADIST in Phylip 3.68 [41]. These distance matrices were first analyzed to produce “ordinations”. These ordinations embed N nucleotide sequences as vectors into an M-dimensional space, where M≪N, such that the distances of the end-points of the vectors resemble the “true” distances d_ij obtained by DNADIST.
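For readers unfamiliar with Kimura's two-parameter model, the following Python sketch computes the standard closed-form distance d = −(1/2) ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of sites showing transitions and transversions. It is an illustration only: the function name is hypothetical, sites with gaps or ambiguous bases are simply skipped (an assumption), and DNADIST may compute its estimate differently in detail.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq_a, seq_b):
    """Closed-form Kimura two-parameter distance between two aligned sequences."""
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1, term2 = 1.0 - 2.0 * p - q, 1.0 - 2.0 * q
    if term1 <= 0.0 or term2 <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))
```

Applying such a function to every pair of sequences in a dataset would give the symmetric matrix of distances d_ij used in the ordination.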

There are different approaches to determining these vectors. Often the ordination is formulated as an optimization problem, which in our case can be solved analytically by a matrix eigendecomposition. Firstly, we transform the distance matrix into an association matrix with elements a_ij = −(1/2)d_ij². This association matrix is then centered by subtracting its row- and column-means and adding the overall mean. The final task is to find the eigenvectors of this centered matrix. As we are looking for a low-dimensional embedding (typically just 2- or 3-dimensional), we only have to find the eigenvectors corresponding to the M eigenvalues of the greatest magnitude.
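The three steps above (association matrix, double centering, leading eigenvectors) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' program: the function name pcoord is hypothetical, numpy.linalg.eigh is used for the symmetric eigenproblem, the axes with the largest (most positive) eigenvalues are kept, and each axis is scaled by the square root of its eigenvalue, as is conventional in classical MDS.

```python
import numpy as np

def pcoord(dist, n_axes=2):
    """Principal coordinates from a symmetric distance matrix (illustrative)."""
    d = np.asarray(dist, dtype=float)
    assoc = -0.5 * d ** 2                         # a_ij = -(1/2) d_ij^2
    row_means = assoc.mean(axis=1, keepdims=True)
    col_means = assoc.mean(axis=0, keepdims=True)
    centered = assoc - row_means - col_means + assoc.mean()
    eigvals, eigvecs = np.linalg.eigh(centered)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_axes]    # keep the largest ones
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale each axis; negative eigenvalues (non-Euclidean data) are clipped to 0.
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return coords, eigvals
```

For distances that are exactly Euclidean in M dimensions, the pairwise distances between the rows of coords reproduce the input matrix; for sequence distances they are only an approximation.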

Eigenvectors corresponding to the largest eigenvalues point in the directions of the greatest variability of the data. If the distances are sufficiently well behaved (a close approximation to Euclidean distances), then these M eigenvalues will be positive. If not all but only the leading eigenvectors are sought, one could employ a power-iteration. Convergence of the power method is geometric in the ratio of the first two eigenvalues. If these eigenvalues are similar in size, then convergence of the power method will be slow.
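The power method mentioned above can be sketched as follows. This is a generic textbook version given for illustration, with a hypothetical function name and arbitrary default tolerances; it was not necessarily used in this study.

```python
import numpy as np

def power_iteration(matrix, n_iter=1000, tol=1e-12, seed=0):
    """Estimate the eigenvalue of largest magnitude of a symmetric matrix."""
    matrix = np.asarray(matrix, dtype=float)
    rng = np.random.default_rng(seed)
    vec = rng.standard_normal(matrix.shape[0])    # random starting vector
    vec /= np.linalg.norm(vec)
    eigval = 0.0
    for _ in range(n_iter):
        new_vec = matrix @ vec
        norm = np.linalg.norm(new_vec)
        if norm == 0.0:
            break                                 # vec lies in the null space
        new_vec /= norm
        new_eigval = new_vec @ matrix @ new_vec   # Rayleigh quotient
        converged = abs(new_eigval - eigval) < tol
        vec, eigval = new_vec, new_eigval
        if converged:
            break
    return eigval, vec
```

Further eigenvectors can be obtained by deflation, that is, by subtracting eigval * outer(vec, vec) from the matrix and iterating again. The slow convergence for eigenvalues of similar size is visible directly: the error shrinks by roughly the ratio of the second to the first eigenvalue at each step.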

Calculating all the distances of N sequences has a time complexity of O(N²) while performing the singular value decomposition (SVD) has a time complexity of O(N³). In our study the largest N was 8662 and scalability is not yet an issue. For this work, therefore, we used the standard SVD implementation from the Python NumPy library and its dependencies. Our program for PCOORD is available on request from the authors. Alternatively, the SPACER program [33], written in Fortran and performing PCOORD, is available on-line from http://www.hiv.lanl.gov/content/sequence/PCOORD/PCOORD.html. However, for problems where N>10,000, alternative algorithms may become necessary. Such methods could be Split-and-Combine MDS (SCMDS) [37] or the Nyström method [42].

In the last step of PCOORD, we visualized the data in two dimensions by simply plotting the first versus the second or the first versus the third axes, that is, the axes with the greatest associated eigenvalues. In each PCOORD figure, each dot represents one sequence. The color of the dot signifies where the virus was isolated and the shape indicates the host.

Next, phylogenetic trees were estimated using the same datasets. For datasets with more than 2500 sequences, Neighbor-Joining (NJ) trees [43] were constructed using the Linux version of PAUP* 4b10 [44]. The distance model was set to HKY85 [45] and all other parameters were set to default. For datasets with 1500 to 2500 sequences, phylogenetic trees were estimated using the Maximum Likelihood (ML) method as implemented in PhyML [46]. The general time-reversible (GTR) model [47] was applied as the model of nucleotide substitution and base frequencies were estimated by maximizing the likelihood of the phylogeny. In addition, the proportion of invariable sites was estimated rather than fixed and four substitution rate categories were used with the gamma distribution parameter estimated to account for variable substitution rates among sites. All other parameters were set to default. For datasets of fewer than 1500 sequences, ML trees were also estimated in PhyML using the same parameters but with 100 bootstrap replicates. All the trees were visualized using Dendroscope [48]. It should be noted that for most of the trees, we did not specify a sequence as an outgroup. However, for some trees whose clusters were not very clear, we selected a sequence from a neighboring group as an outgroup to root these trees.
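The size-dependent choice of tree-building method described above can be summarized as a small lookup function. The function name and the dictionary keys are hypothetical and serve only to restate the thresholds from the text; they are not options of PAUP* or PhyML.

```python
def tree_strategy(n_sequences):
    """Return the tree-building settings described in the text for a dataset size."""
    if n_sequences > 2500:
        # Very large datasets: distance-based Neighbor-Joining in PAUP*.
        return {"program": "PAUP*", "method": "NJ", "model": "HKY85",
                "bootstrap_replicates": 0}
    if n_sequences >= 1500:
        # Medium datasets: Maximum Likelihood in PhyML without bootstrapping.
        return {"program": "PhyML", "method": "ML", "model": "GTR",
                "bootstrap_replicates": 0}
    # Smaller datasets: same ML settings plus 100 bootstrap replicates.
    return {"program": "PhyML", "method": "ML", "model": "GTR",
            "bootstrap_replicates": 100}
```

For example, the largest dataset in this study (N = 8662) falls into the Neighbor-Joining branch.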

The remainder of the analysis consisted of using the ordinations and trees to identify subgroups of sequences with clear separation from the rest. The aim was to identify groups of clearly related sequences which we could use to ask questions about epidemiology. This was mainly done by visual inspection of PCOORD results and by using the bootstrap values from the trees. Sometimes, host and geography information was also used to help to define the groups. In most cases, the groups seen in the ordinations and trees agreed. When they disagreed, the trees were used. This process was repeated iteratively by re-applying PCOORD and phylogenetic analysis to smaller and smaller groups of sequences until the groups showed no clear subgroups and/or bootstrap values for subgroups became too low (Figure 1). The result was a small hierarchy of groupings for each HA and NA subtype. The subtypes were divided into groups and subgroups. The groups were named according to the subtype using the notation Hngm, where n was the HA subtype and m was the group of HA sequences. Similarly, NA sequences were named using the same notation, Nngm. These groups might be subdivided into subgroups using a period “.” and a further integer, e.g. H2g2.3 is subgroup 3 of group 2 of subtype 2 of the HA sequences. This was similar to the notation system of Liu et al. [6]. All the PCOORD figures, phylogenetic trees and other supporting materials are available from http://myosin.ucd.ie/~shiwf/influenzaclassification/index.html.
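As an illustration of the naming scheme, the following Python sketch splits a label such as H2g2.3 into gene, subtype and the hierarchy of group and subgroup numbers. The names GroupLabel and parse_group_label and the regular expression are assumptions made for this example; they are not part of the published database.

```python
import re
from typing import NamedTuple, Tuple

class GroupLabel(NamedTuple):
    gene: str               # "HA" or "NA"
    subtype: int            # e.g. 2 for H2
    path: Tuple[int, ...]   # group followed by nested subgroup numbers

# One letter (H or N), the subtype number, "g", then dot-separated integers.
_LABEL_RE = re.compile(r"^([HN])(\d+)g(\d+(?:\.\d+)*)$")

def parse_group_label(label):
    """Parse a label such as 'H2g2.3' into its components."""
    match = _LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"not a valid group label: {label!r}")
    gene, subtype, path = match.groups()
    return GroupLabel(
        gene="HA" if gene == "H" else "NA",
        subtype=int(subtype),
        path=tuple(int(part) for part in path.split(".")),
    )
```

With this representation, one label is nested within another when both share gene and subtype and the path of the first starts with the path of the second.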

Figure 1. The workflow.

This figure takes HA sequences of H2 subtype as an example to illustrate the workflow. In step I, we carry out a PCOORD (Ai) and a phylogenetic analysis (Aii) using the H2 HA dataset. Results from the two methods support their division into two groups, H2g1 and H2g2. In step II, we repeat step I using two sub-datasets, H2g1 (Bi and Bii) and H2g2 (Ci and Cii). Bi and Bii show the results from PCOORD and the phylogenetic tree using H2g1. Group H2g1 is further divided into 5 subgroups. Similarly, Ci and Cii display the results from PCOORD and the phylogenetic tree using H2g2, and this group can also be further divided into some smaller subgroups. In step III, we summarize the results in a table (D) and in a tree-like figure (E). Panel E summarizes the phylogenetic diversity of HA sequences of H2 influenza A virus, and the values after the underscores indicate the numbers of sequences in the group or subgroup. If there are groups that can be further divided based on the results of step II, we repeat step II until there are no distinctly separated groups or the bootstrap values are too low to support further sub-division.

https://doi.org/10.1371/journal.pone.0014454.g001

We summarized all the groupings in two large tables, one for HA (Table S1) and one for NA (Table S2). These are available from the website: http://myosin.ucd.ie/~shiwf/influenzaclassification/index.html. These tables form the basis of a database with one record for each sequence which stores basic information about each sequence, including its grouping. Table S1 has 16 sections which correspond to 16 HA subtypes, while Table S2 has 9. Each entry records 9 pieces of information. The first column gives the grouping from this analysis. The second column gives the GenBank accession number of the sequence. Columns 3 to 9 give the sequence name, virus host, viral HA and NA subtype, country of isolation, continent and time of isolation. Therefore, for each sequence in the present analysis, if you search using its GenBank accession number, you can find a clear and unambiguous group to which it belongs. In addition, a tree-like figure was drawn to show the simple hierarchy of the groupings for each HA and NA subtype. An example is shown in Figure 1E. The 23 trees for the 23 subtypes are available from http://myosin.ucd.ie/~shiwf/influenzaclassification/index.html.

Results

Phylogenetic diversity of HA and NA genes

In this paper we take 22291 HA sequences and 13345 NA sequences from GenBank. After removing short and low quality sequences, we analyze 18975 HA and 11362 NA sequences (Tables 1 and 2). By far the majority of HA sequences belong to subtypes H1, H3, H5 and H9, which account for ∼88% of the total. Approximately 88% of NA sequences belong to N1 and N2. In addition, most sequences come from viruses isolated in North America and Asia, and only a few come from Africa and South America.

H14 and H15 are only represented by 4 and 9 sequences respectively and are not subdivided. All of the subtypes were divided into a total of 280 HA subgroups (Table S1) and 202 NA subgroups (Table S2). Many subgroups only include a few sequences and subgroups composed of one to five sequences account for 32.5% and 36.1% of all HA and NA subgroups, respectively.

We analyze the association of HA and NA subtypes in isolates in different HA and NA subgroups (Figure 2). 140 HA subgroups only include isolates associated with a single NA subtype, while 123 HA subgroups include sequences of at least two NA subtypes (Figure 2A). Similarly, 97 NA subgroups only include sequences associated with a single HA subtype, while 94 NA subgroups include sequences of at least two HA subtypes (Figure 2B).

Figure 2. Phylogenetic diversity indicated by the association of HA and NA subtypes.

Panel A shows the phylogenetic diversity indicated by the NA subtype distribution among HA subgroups, while panel B shows the phylogenetic diversity indicated by the HA subtype distribution among NA subgroups.

https://doi.org/10.1371/journal.pone.0014454.g002

Generally, most of the results of the classification are consistent with a recent analysis which attempted an overall classification of flu sequences [6]. However, this work is more detailed and comprehensive, especially for the bigger datasets, such as for subtypes H1, H3, H5, H9, N1 and N2.

Phylogenetic diversity of the main subtypes

In detail, H1 is divided into 64 subgroups (Table 3). First, it is divided into four major groups labeled H1g1 to H1g4 (Figure 3). H1g1 is the seasonal human H1N1 influenza group, which largely corresponds to h1.2 in Liu et al. [6]. Apart from the 1918 sequences, Liu et al. further divided this lineage into three sublineages, h1.2.2, h1.2.3 and h1.2.5. However, we further divide this group into five subgroups, largely corresponding to the periods 1933∼1957, 1948∼1984, 1986∼2001, 1994∼2008 and 2004∼2009, respectively. In addition, these five subgroups have been further subdivided in our scheme. In particular, swine influenza viruses of H1N2 subtype isolated from Europe (H1g1.3) also fall within this group, and this is consistent with h1.2.4 reported by Liu et al. [6]. H1g2 is composed of sequences from Eurasian pigs and worldwide birds. We further subdivide H1g2 into two subgroups rather than the three sublineages made by Liu et al. [6]. In detail, H1g2.1, consisting of nine smaller subgroups, is composed of virus sequences from North American birds. H1g2.2 is composed of seven subgroups, with H1g2.2.1 to H1g2.2.3 including viruses from Eurasian birds and H1g2.2.4 to H1g2.2.7 including viruses from Eurasian pigs. This division better clarifies the avian origin of Eurasian swine influenza [49], [50]. H1g3 includes classical swine sequences (H1g3.1 to H1g3.3) and pandemic H1N1 human influenza (H1g1.4). Liu et al. subdivided the classical swine lineage into two sublineages, h1.3.1 and h1.3.2. However, we believe that viruses of h1.3.2 have diversified since the late 1990s and should be divided into two subgroups, H1g3.2 and H1g3.3. In particular, viruses of H1g3.3 are the most likely progenitors of the pandemic 2009 H1N1 virus HA genes [12]. Liu et al. classified the 1918 sequences within the seasonal human H1N1 lineage as h1.2.1 [6]. However, due to the discovery that HA of seasonal human H1N1 is not derived from Spanish flu directly [51], we define the 1918 sequence as an independent group, H1g4.

Figure 3. PCOORD of H1 subtype influenza viruses.

In this figure, each sequence is shown as a dot using shape to signify host and color to indicate geographic region (the continent this virus was isolated from). For Figures 4 to 8 and all PCOORD figures available from our webpage, we use the same shape and color coding.

https://doi.org/10.1371/journal.pone.0014454.g003

Table 3. Number of subgroups defined in this analysis.

https://doi.org/10.1371/journal.pone.0014454.t003

H3 includes two major groups which are, in turn, subdivided into 38 subgroups (Figure 4, Table 3). H3g1 is mainly composed of worldwide avian H3 sequences and human and swine H3N2 influenza sequences, while H3g2 is an H3N8 equine group. Liu et al. divided H3 into three lineages [6]. For human and swine H3N2 influenza, thousands of HA sequences were deposited in GenBank. Liu et al. just classified them into four sublineages, h3.1.3 to h3.1.6. However, we classify them into 13 subgroups H3g1.4 to H3g1.16 along with smaller subdivisions. Meanwhile, for h3.2 which was defined by Liu et al. with two sublineages, we classify them into seven subgroups, largely corresponding to the periods 1971∼1972 (H3g2.1), 1963∼1969 (H3g2.2), 1976∼1984 (H3g2.3), 1985∼1987 (H3g2.4), 1989∼2007 (H3g2.5), 1986∼2006 (H3g2.6) and 1999∼2008 (H3g2.7). The canine influenza viruses fall within H3g2.7 [52]. Apart from H3g1 and H3g2 which we define here, Liu et al. defined a third lineage which they named as h3.3 and further subdivided it into h3.3.1 and h3.3.2. However, h3.3.1 only contains one short sequence and is excluded from our analysis. In addition, h3.3.2 also contains just one sequence and we define it to be within H3g1.13 rather than a single lineage outside the two main groups and this is consistent with a recent report [53].