
Effective Classification and Gene Expression Profiling for the Facioscapulohumeral Muscular Dystrophy

Abstract

The Facioscapulohumeral Muscular Dystrophy (FSHD) is an autosomal dominant neuromuscular disorder whose incidence is estimated at between one in 400,000 and one in 20,000. No effective therapeutic strategies are known to halt progression or reverse muscle weakness and atrophy. It is known that the FSHD is caused by modifications located within a D4Z4 repeat array on chromosome 4q, and recent advances have linked these modifications to the DUX4 gene. Unfortunately, the complete mechanisms responsible for the molecular pathogenesis and progressive muscle weakness remain unknown. Although there are many studies addressing cancer databases from a machine learning perspective, there is no such precedent in the analysis of the FSHD. This study aims to fill this gap by analyzing two specific FSHD databases. A feature selection algorithm is used as the main engine to select genes promoting the highest possible classification capacity. The combination of feature selection and classification aims at obtaining simple models (in terms of very low numbers of genes) capable of good generalization, which may be associated with the disease. We show that the reported method is highly efficient in finding genes that discern between healthy cases (not affected by the FSHD) and FSHD cases, allowing the discovery of very parsimonious models that yield negligible repeated cross-validation error. These models in turn give rise to very simple decision procedures in the form of a decision tree. Current biological evidence regarding these genes shows that they are linked to skeletal muscle processes concerning specific human conditions.

Introduction

The Facioscapulohumeral Muscular Dystrophy (FSHD) is an autosomal dominant neuromuscular disorder and the third most common inherited muscular dystrophy [1], [2]. Its incidence may vary across regions and probably across racial groups, but recent estimates range from about one in 400,000 to one in 20,000 [3]. FSHD patients show progressive weakening and atrophy of the muscles of the face, slowly progressing to the shoulder, upper arm muscles and shoulder girdle, and down to the abdomen and lower limbs. Inability to flex the foot upward, foot weakness, and an onset of right/left asymmetry are also common symptoms [4], [5].

Although the FSHD is considered a relatively benign dystrophy, about 20% of the patients presenting this disorder are eventually confined to a wheelchair. The age of onset is variable, the second decade of life being the most common stage at which patients become symptomatic. In some cases, however, symptoms never develop even when the individual carries the mutation associated with the FSHD.

No effective therapeutic strategies are known to either halt progression or reverse muscle weakness and atrophy in the FSHD [6]. However, there are a number of actions that can provide symptomatic and functional improvement in many patients. In particular, the use of assistive devices (such as braces, standing frames, or walkers) is of great help. Physical therapies such as exercises in water, complemented by psychological support and speech therapy, may also alleviate especially difficult life conditions.

It is known that the FSHD is caused by deletion of a subset of D4Z4 macrosatellite repeat units in the subtelomere of chromosome 4q [7]. The D4Z4 modification needs to occur on a specific chromosomal background to cause the FSHD. More than 95% of patients with clinical FSHD have an associated D4Z4 deletion on the 4q35 chromosome. However, a small number of kindreds with clinically typical FSHD do not present this deletion, and a second FSHD locus has not yet been identified [8]. Recent advances involve the DUX4 gene, a retrogene sequence within D4Z4 that encodes a double homeodomain protein whose exact function is not entirely known. Although the precise mechanisms responsible for the progressive muscle weakness remain unknown, the study of this gene could offer a possible therapeutic avenue [7].

It is generally believed that monitoring the expression levels of thousands of genes simultaneously may lead to a more complete understanding of the molecular variations among different cell conditions. In the machine learning literature, contributions concerning the analysis of FSHD gene expression data are very scarce, probably because of limited awareness of such rare diseases. The situation is aggravated by the scarcity of data outside purely medical domains with which to attack the problem from a different point of view. In contrast, there is now a vast body of available microarray gene expression datasets focused on cancer. Specifically, microarray gene expression databases have been used to discriminate between tumours or tumour subtypes, and to study biological properties of tumours; see, e.g., [9].

Over the last decade, Machine Learning (ML) has made significant inroads in the fields of bioinformatics and biomedicine [10]. Specifically, cancer research has applied a variety of ML algorithms for tumor prediction by associating expression patterns with clinical outcomes for patients with tumors [11]. The majority of this research has focused on building accurate classification models from reduced sets of features. Some of these analyses also aim to gain understanding of the differences between normal and malignant cells and to identify genes that are differentially regulated during cancer development. The importance of the validity and reproducibility of statistical analysis and reporting cannot be stressed enough [12].

Typically, a gene expression data set may consist of dozens of samples described by thousands or even tens of thousands of genes (acting as features, in ML terminology). Predictive model construction under this very high ratio between the number of features and the number of samples is a delicate undertaking, prone to yielding unreliable results. As a consequence, dimensionality reduction, and in particular feature selection techniques, may be very useful as a way to reduce the problem complexity and ease expert medical diagnosis.

Of special importance in a practical medical setting is the interpretability of the obtained solutions, something that limits the applicability of methods such as PCA or ICA (whose solutions involve weighted combinations of genes, instead of individual genes). Moreover, in a medical context, data visualization in a low-dimensional representation space may become extremely important, as it would help doctors to gain insights into this complex and highly sensitive domain. The development of predictive models able to discern between healthy and FSHD samples with minimal error rate and amenable to direct interpretation is thus a clear research goal. When predictive models use very low numbers of relevant genes, these genes are likely to be associated with the disease, and can be used as a starting mechanism for further dedicated study from a biological point of view.

The present study addresses all these issues in two FSHD databases (named, just for reference in this paper, as FSHD-DB1 and FSHD-DB2) to discern between healthy and FSHD samples (clinical cases). We report experimental results supporting the practical advantage of combining robust feature selection and classification in the analyzed FSHD datasets. The described method is able to unveil two groups of genes that yield very low mean cross-validation error. These genes can be used to build very simple decision procedures in the form of a decision tree.

Results and Discussion

FSHD-DB1 Database

The feature selection process in Algorithm 1 arrives at a final solution in the form of a subset of only three genes with 100% mean 5×5 cv accuracy. This final subset is presented in Table 1, including its gene IDs and full names; it will hereafter be referred to as the FSHD-DB1 model. In comparison, PAMR delivers 96.8% mean 5×5 cv accuracy with 2 genes (Table 2), and SVM-RFE delivers a comparable 99.4% mean 5×5 cv accuracy using 5 genes (Table 3). As a further comparison, if we consider the two genes singled out as most relevant in the literature (DUX4 [7] and FRG1 [13]), the corresponding mean 5×5 cv accuracy of these two genes (taken together) is 84.65%.

Table 1. Best gene subset found using the proposed method and LDA as performance measure in FSHD-DB1 (the FSHD-DB1 model).

https://doi.org/10.1371/journal.pone.0082071.t001

Table 3. Best gene subset found using SVM-RFE in FSHD-DB1.

https://doi.org/10.1371/journal.pone.0082071.t003

Visualization.

Data visualization in a low-dimensional representation space is extremely important to gain a better understanding of the solution delivered by the process. To visualize the result, the data corresponding to the FSHD-DB1 model are plotted using the three selected genes as axes, without any pre-processing or projection technique (Fig. 1). In addition, the LDA decision boundary fitted on the whole data set is shown. The FSHD group presents a less compact distribution, while the Healthy group is clustered around a specific region of the representation space given by the three genes found. It can be seen that the two conditions are neatly separated.

Figure 2 shows a box plot for each gene in the FSHD-DB1 model. LAMP1 shows a higher mean expression level in FSHD samples than in Healthy ones; DPF3 shows a more even expression level across the two conditions; KPNA2 tends to up-regulate heavily in FSHD compared to Healthy.

Figure 2. Box plots for the expression levels of the genes in the FSHD-DB1 model.

https://doi.org/10.1371/journal.pone.0082071.g002

Figure 3 depicts a dendrogram of cases and standardized gene expression levels for the FSHD-DB1 model. Each case is identified with an ID number, prefixed by a letter indicating class membership: H for Healthy and F for FSHD. It is apparent that LAMP1 shows an up-regulation in most of the FSHD cases, as does KPNA2; DPF3 shows a slightly more diffuse expression level.

Figure 3. Clustering of the expression levels of the genes in the FSHD-DB1 model.

Left: by genes; Top: by samples.

https://doi.org/10.1371/journal.pone.0082071.g003

Nonetheless, it can be noticed in Fig. 3 that the natural clusters do not necessarily correspond to the labeled samples, and thus supervised information is needed to create accurate prediction models, even in this low-dimensional representation. Three clusters are discovered: a first one (H1 to H17), in which most (but certainly not all) of the samples belong to the Healthy class; a second group (H18 to H22), containing three Healthy and two FSHD samples; and finally a third group (from F23 on) in which the two conditions are thoroughly mixed. This result, although certainly dependent on the limitations of clustering methods, warns against using unsupervised feature extraction methods like PCA.

An interesting point to be emphasized in these graphic representations is that the FSHD-DB1 model clusters the two conditions neatly (Fig. 1). We were therefore interested in ascertaining to what extent this result is stable and may thus constitute a good departing point for future studies. To this end, we performed two further investigations:

  1. The first action was to change the resampling method to 10 times 10-fold cross validation (10×10 cv). This form of resampling entails a much higher computational cost; however, it has been suggested as adequate for small sample situations [14].
  2. The second action was to analyze the statistical differences between FSHD vs. Healthy samples in the expression levels for the genes in the model. In addition, we explored the possibility that a single gene is able to (almost) perfectly separate the two classes by mere chance.

Statistical analysis.

We were interested in exploring the effect of changing the resampling method, keeping the same classifier (LDA in this case), in order to exclude this source of variation from the analysis. Remarkably, using 10×10 cv instead of 5×5 cv in Algorithm 1, it was found that the final result fully coincided with the FSHD-DB1 model.

In order to assess the statistical significance of expression levels, the Mann-Whitney U-test (MWU) was used to compare FSHD and Healthy samples for the genes in the model. This is a non-parametric hypothesis test for assessing whether one of the two conditions (FSHD in this case) tends to have larger values than the other.

For KPNA2, the distributions of expression levels in the Healthy and FSHD groups differed significantly under the MWU test.

For DPF3, the distributions in the two groups did not differ significantly.

For LAMP1, the distributions in the two groups differed significantly.

Therefore, both KPNA2 and LAMP1 present markedly different expression levels in the two conditions. Although these two genes are not identical in behavior, they present notable similarities, as reflected in a high Spearman's rank correlation coefficient between them. This fact will be used to simplify the FSHD-DB1 model.

One may still wonder about the probability of finding a single gene like KPNA2 (one that separates the two conditions with one exception) by mere chance (Fig. 2). If a gene bears no relation with the disease, we could expect an arbitrary pattern for the distribution of the two conditions (healthy vs. FSHD cases) across the expressed values of the gene. The probability that one or more genes out of 22,283 separates the two conditions (14 FSHD; 18 healthy) with only one exception is found to be very small.
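
This probability can be approximated with a simple combinatorial argument; the sketch below (in MATLAB, as used elsewhere in this work) assumes that, for a gene unrelated to the disease, all orderings of the 32 class labels along the gene's sorted expression values are equally likely, and counts the orderings separable by a single threshold with at most one exception:

    % Chance of a single irrelevant gene separating 14 FSHD and 18 Healthy
    % samples with at most one exception, assuming all label orderings
    % along the sorted expression values are equally likely.
    nF = 14; nH = 18; nGenes = 22283;
    total   = nchoosek(nF + nH, nF);  % equally likely label orderings
    perfect = 2;                      % FSHD all-low or FSHD all-high
    % exactly one exception, for either polarity: one F strayed into the
    % H run (nH placements) or one H strayed into the F run (nF placements)
    oneErr  = 2 * (nH + nF);
    pGene   = (perfect + oneErr) / total;
    pAny    = 1 - (1 - pGene)^nGenes; % at least one gene by mere chance
    fprintf('p(single gene) = %.3g, p(any gene) = %.3g\n', pGene, pAny);

Under these assumptions, the per-gene probability is on the order of 10^-7, and even over all 22,283 genes the chance of such a separation arising spuriously remains well below 1%.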

A final interpretable model.

Even though the LDA decision boundary in Fig. 1 depicts a clean separation between the two patient conditions, its application as a decision tool may not be straightforward. In this sense, decision trees are among the tools preferred by experts in decision-making processes. Moreover, the final selection of a gene subset may still provide few clues about the structure of the two conditions with respect to their expression levels. Some accuracy may be sacrificed for increased interpretability of the model.

Figure 4 shows a CART decision tree [15] built with the FSHD-DB1 model. The main question is on the expression level of gene KPNA2: the right branch corresponds to 13 (all but one) of the FSHD patients; the left branch corresponds to all of the 18 healthy ones plus the remaining FSHD patient. Moreover, one may wonder whether there is a second gene whose expression separates this specific patient from the 18 healthy ones, and indeed there is one: precisely DPF3. Whether this last patient is an outlier in a medical sense we cannot know, but it deserves further clinical investigation. Therefore, although LAMP1 shows a markedly differential expression, it may be excluded from the decision flow.
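
A tree of this kind can be reproduced with a few lines of MATLAB; the sketch below assumes an expression matrix X (32 samples by 2 columns, KPNA2 and DPF3) and a cell array y of 'Healthy'/'FSHD' labels, and uses fitctree from recent releases of the Statistics and Machine Learning Toolbox (the original experiments used a CART implementation):

    % Fit and display a CART-style tree on the two retained genes.
    tree = fitctree(X, y, 'PredictorNames', {'KPNA2', 'DPF3'});
    view(tree, 'Mode', 'graph');   % shows the two-question decision flow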

Figure 4. Classification tree for the simplified model in the FSHD-DB1 database.

The boxes are leaves indicating the prediction, the numbers of cases for each condition, and the overall percentage of covered cases.

https://doi.org/10.1371/journal.pone.0082071.g004

Biological evidence.

In this section, we compile scientific knowledge about the two genes in the final subset, including their known primary functions in cellular processes.

KPNA2.

KPNA2 is Karyopherin alpha 2 (RAG cohort 1, importin alpha 1). It is known that muscle function depends on the spatial and temporal control of gene expression in myofibers. These are multinucleated cells that contain hundreds of nuclei spread across the length of the cell in a common cytoplasm. Regulation of nucleocytoplasmic transport plays a very important role in controlling the transcriptional activity of these several nuclei in a common cytoplasm [16].

Analyses of the role of karyopherin alpha (KPNA) proteins and of the paralog-specific roles of KPNA1 and KPNA2 during myogenesis have found that these two genes regulate myoblast proliferation; in particular, KPNA2 regulates myotube size and myocyte migration [17]. Therefore, both may be involved in the nuclear transport of proteins [18], which has a key role in controlling gene expression in skeletal muscles.

DPF3.

DPF3 is D4, zinc and double PHD fingers, family 3. This gene belongs to the neuron-specific chromatin remodeling complex (nBAF complex), acting as a tissue-specific anchor between histone acetylation and methylation marks and chromatin remodeling [18], [19]. Experiments in human cardiac samples and in mouse embryonic and adult hearts showed that it plays a role in heart and skeletal muscle development [20]. It also presents an up-regulated expression in patients with Tetralogy of Fallot, a congenital heart defect partially characterized by muscular hypertrophy.

FSHD-DB2 Database

The feature selection process in Algorithm 1 arrives at a final solution with six genes and 99.6% mean 5×5 cv accuracy. This final subset is presented in Table 4, including its gene IDs and full names (two of which are as yet unknown); it will hereafter be referred to as the FSHD-DB2 model. In comparison, PAMR delivers 70.4% mean 5×5 cv accuracy with 3 genes (Table 5), and SVM-RFE delivers 85.2% mean 5×5 cv accuracy using 5 genes (Table 6), of which three are unknown. This database contains DUX4 entries, corresponding to 4 isoforms. If we consider the most informative model, including the 4 sequences of DUX4 and FRG1 together, the corresponding 5×5 cv accuracy is found to be a disappointing 39.60%.

Table 4. Best gene subset found using the proposed method and LDA as performance measure in FSHD-DB2 (the FSHD-DB2 model).

https://doi.org/10.1371/journal.pone.0082071.t004

Table 6. Best gene subset found using SVM-RFE in FSHD-DB2.

https://doi.org/10.1371/journal.pone.0082071.t006

Visualization.

Figure 5 shows a box plot for each gene in the FSHD-DB2 model. The first three genes in the model (Unknown-7905039, GDNF and EXTL1) tend to up-regulate heavily, this time in Healthy samples. The other three seem to contain complementary information in the variance rather than in the central tendency. Figure 6 depicts a dendrogram of cases and standardized gene expression levels for the FSHD-DB2 model. Each case is identified with an ID number, prefixed by a letter indicating class membership, H for Healthy and F for FSHD. It is apparent that the natural clusters are less homogeneous than those obtained for the FSHD-DB1 database. Nonetheless, the group of central clusters (formed only by Healthy cases, H4 to H36) is clearly identified by GDNF and EXTL1, both genes showing a definite up-regulation in all cases.

Figure 5. Box plots for the expression levels of the genes in the FSHD-DB2 model.

https://doi.org/10.1371/journal.pone.0082071.g005

Figure 6. Clustering of the expression levels of the genes in the FSHD-DB2 model.

Left: by genes; Top: by samples.

https://doi.org/10.1371/journal.pone.0082071.g006

Statistical analysis.

Again, the statistical significance of individual expression levels in the FSHD-DB2 model is assessed with a Mann-Whitney U-test (MWU) comparing FSHD and Healthy samples.

For Unknown-7905039, the distributions of expression levels in the two groups (FSHD and Healthy) differed significantly under the MWU test.

For GDNF, the distributions in the two groups differed significantly.

For EXTL1, the distributions in the two groups differed significantly.

For the other three genes, the medians for the two groups are very close and the test is non-significant at the 95% level. This seems to confirm the previous interpretation of a first subgroup of three genes (Unknown-7905039, GDNF and EXTL1) that contain highly discriminant information in their means (or medians) and a second subgroup of another three genes (RPL36AP40, IGHMBP2 and Unknown-8147750) that complement the first group. Interestingly, this split fully coincides with the order in which the genes were discovered by the feature selection process in Algorithm 1. The second-ranked gene, GDNF, is also chosen by the PAMR method (Table 5).

In contrast to the previous database, the genes in the FSHD-DB2 model seem quite different from one another and, this time, no single gene can separate the two conditions neatly; rather, they collaborate to reach a very high classification accuracy. Indeed, the absolute value of Spearman's rank correlation coefficient is lower than 0.5 in all cases, and especially low in the first subgroup of relevant genes.

A final interpretable model.

As for the previous database, accuracy may be sacrificed for increased interpretability of the model. Figure 7 shows a CART decision tree built with the FSHD-DB2 model. The interpretation of the tree is as follows: patients showing a value of GDNF lower than 6.8 are all classified (correctly) as having the FSHD condition, and this group constitutes 28% of the total; patients showing a value of GDNF greater than 6.8 and a value of EXTL1 greater than 7.2 are all classified (correctly) as not having the FSHD condition, and this group constitutes 34% of the total; in the final group (38% of the total), 12 patients are correctly identified as having the FSHD condition, while the remaining 7 are incorrectly identified as having it. The tree thus makes 7 false positives and no false negatives.

Figure 7. Classification tree for the simplified model in the FSHD-DB2 database.

The boxes are leaves indicating the prediction, the numbers of cases for each condition, and the overall percentage of covered cases.

https://doi.org/10.1371/journal.pone.0082071.g007

Biological evidence.

In this section, we compile scientific knowledge about the two genes retained in the simplified model, including their known primary functions in cellular processes.

GDNF.

GDNF is glial cell derived neurotrophic factor: a gene encoding a highly conserved neurotrophic factor. The recombinant form of the protein has been shown to promote the survival and differentiation of dopaminergic neurons in culture, and is able to prevent apoptosis of motor neurons induced by axotomy [18]. GDNF is also associated with Hirschsprung disease (HSCR), a congenital disorder characterized by the absence of intramural ganglia along part or all of the large intestine, typically leading to intestinal obstruction [21].

EXTL1.

EXTL1 is exostoses (multiple)-like 1. This gene is a member of the multiple exostoses (EXT) family of glycosyltransferases. The encoded protein is involved in chain elongation of some acidic complex polysaccharides found on the cell surface and in the extracellular matrix [18]. Mutation in EXT1 is associated with hereditary multiple exostoses, a human disorder characterized by the formation of cartilage-capped bony outgrowths at the epiphyseal growth plates [22].

Concluding Remarks

The Facioscapulohumeral Muscular Dystrophy, or FSHD, is a highly rare muscle disease for which there is currently no known cure. Two databases presenting samples of both healthy and FSHD patients have been analyzed with machine learning (ML) methods. There is hardly any precedent in the literature addressing this disease with these techniques.

The fact that the FSHD data analyzed in this study are scarce and of high dimensionality makes their computer-based automated classification a difficult undertaking. Most importantly, this high dimensionality precludes a straightforward interpretation of the obtained results, limiting their usability in a practical medical setting. In this vein, computational solutions like the one reported here should recognize the need to report not only highly accurate models, but also low-complexity, interpretable solutions amenable to further analysis by experts.

We have devised an approach to prediction of the FSHD condition from gene expression profiling, comprising an effective algorithm for gene selection enhanced with a mechanism for tie-breaking and based on a fairly standard classifier. To demonstrate its effectiveness, we show that the method was highly efficient in identifying two subsets of genes that best characterize each class. In both cases, the discrimination process is shown very conveniently as a two-question decision tree. We have also provided evidence for the statistical significance and stability of the result. Our method delivers highly interpretable solutions that are more accurate than competing methods. The technique is general and could be used in other similar scenarios.

However, in small sample scenarios, there is a high risk of overfitting the data: small samples will appropriately support only simple models with few parameters (acting as the coefficients of the features). Moreover, the use of a classifier having one or more hyper-parameters (these are parameters that the classifier cannot determine in its training process, and must be determined externally) is unaffordable, since this would require an additional resampling loop, for which there would almost be no data left. As a consequence, the determination of these parameters would be subject to a very high degree of uncertainty. We have selected Linear Discriminant Analysis (LDA) as the target classifier, using equal-covariance Gaussians to approximate class conditional probability densities. This choice corresponds to a linear, stable and parameter-free classifier. The LDA recognition rate was resampled using 5 times 5-fold cross validation.

One should bear in mind that the excellent reported results do not –by themselves– entail a medical solution to the disease, a situation that is faced by all statistical and ML solutions. On the contrary, a main goal of exploratory studies of this kind should be aimed towards understanding how the variables selected by the model fit in relation to prior knowledge from the medical domain.

Materials and Methods

The FSHD Databases

The first database used in this contribution was obtained from the EMBL-EBI repository of the European Bioinformatics Institute [23]. Specifically, Experiment E-GEOD-3307 uses the Affymetrix GeneChip Human Genome HG-U133A and HG-U133B designs to analyse a range of muscle diseases for gene expression comparative profiling purposes. A total of 121 muscle samples covering 11 muscle pathologies (plus several healthy samples) integrate the data: acute quadriplegic myopathy, juvenile dermatomyositis, amyotrophic lateral sclerosis, spastic paraplegia, facioscapulohumeral muscular dystrophy, Emery-Dreifuss muscular dystrophy, Becker muscular dystrophy, Duchenne muscular dystrophy, and the dystrophies linked to calpain 3, dysferlin, and FKRP, using the U133A and U133B array designs. These are diseases with an extremely low incidence rate in the general population. The Facioscapulohumeral Muscular Dystrophy (FSHD), the targeted group in this work, consists of 14 FSHD samples and 18 healthy samples described by 22,283 genes or features (HG-U133A version).

The second database was obtained from the GEO (Gene Expression Omnibus) repository, a publicly available site in the National Center for Biotechnology Information (NCBI). The Experiment GSE36398, “Transcriptional profiling in facioscapulohumeral muscular dystrophy to identify candidate biomarkers” is a very recent database containing FSHD information only. Using the Affymetrix Human Gene 1.0 ST Array, the experiment analyses RNA extracted from both biceps and deltoids of FSHD subjects (26 samples) and unaffected first-degree relatives (24 samples), rendering a dataset that consists of 50 samples, described by 33,297 genes or features [24].

There are no missing data in either of the two datasets, and both contain a mixture of positive and negative examples, as necessary for learning. Moreover, in both cases the whole datasets were used.

Linear Discriminant Analysis

Linear and quadratic discriminant analyses (LDA/QDA) [35] are widely used parametric methods which assume that the class distributions are multivariate Gaussians. With LDA, all classes are assumed to have the same covariance matrix. QDA does not need such an assumption; however, the number of parameters to be estimated from the data available for each class is much higher, entailing lower statistical significance.

In both methods, classification is achieved by assigning an example $x$ to the class $\omega_k$ for which the posterior probability $P(\omega_k \mid x)$ is greatest, or equivalently for which $p(x \mid \omega_k)\,P(\omega_k)$ is greatest.

These methods are attractive because they need no parameter tuning, and their limited complexity (quadratic at most) may be a solid guard against overfitting the data. Moreover, for LDA fast updating procedures exist for the computation of certain forms of the cross-validation error [25]. The discriminant function for class $\omega_k$ is expressed as:

$g_k(x) = \ln \left( p(x \mid \omega_k)\,P(\omega_k) \right)$

which, for Gaussian class-conditional densities and dropping terms constant across classes, simplifies to:

$g_k(x) = \ln P(\omega_k) - \frac{1}{2} \ln |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)$

If we assume that all class-conditional distributions have the same covariance matrix $\Sigma$, we get:

$g_k(x) = \ln P(\omega_k) + \mu_k^T \Sigma^{-1} x - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k$

These are linear discriminant functions (linear in $x$) and the decision boundaries $g_k(x) = g_l(x)$ are hyperplanes in $d$-dimensional space.

In practical situations, only an i.i.d. data sample $S = \{(x_i, y_i)\}_{i=1}^{N}$ is available. When the means, covariances and priors for every class are not available, maximum-likelihood estimates on $S$ can be used, although in this case the Bayesian optimality properties are no longer valid. Let $S_k$ be the subset of samples known to belong to class $\omega_k$; then $\{S_1, \ldots, S_K\}$ is a partition of $S$. Unbiased estimates for the vector means and for the class priors can be obtained as:

$\hat{\mu}_k = \frac{1}{|S_k|} \sum_{x \in S_k} x, \qquad \hat{P}(\omega_k) = \frac{|S_k|}{N}$

The following pooled covariance matrix is then used:

$\hat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^{K} (|S_k| - 1)\, \hat{\Sigma}_k, \quad \text{where} \quad \hat{\Sigma}_k = \frac{1}{|S_k| - 1} \sum_{x \in S_k} (x - \hat{\mu}_k)(x - \hat{\mu}_k)^T$
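
A minimal MATLAB sketch of these estimates for the two-class case (assuming X1 and X2 hold the samples of each class as rows) mirrors the formulas above:

    % Pooled-covariance LDA estimates for two classes.
    [n1, d] = size(X1); n2 = size(X2, 1); N = n1 + n2;
    mu1 = mean(X1)'; mu2 = mean(X2)';
    Sigma = ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (N - 2);  % pooled
    w = Sigma \ (mu1 - mu2);                      % direction of g1 - g2
    b = -0.5 * (mu1 + mu2)' * w + log((n1/N) / (n2/N));
    % a new sample x (column vector) is assigned to class 1 iff w'*x + b > 0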

Linear Support Vector Machines

The support vector machine (SVM) is a machine learning method solidly based on statistical learning theory [26]. Intuitively, given a set of examples labeled into one of two classes, the linear SVM finds their optimal linear separation: this is the hyperplane that maximizes the minimum orthogonal distance to a point of either class (this distance is called margin of the separation).

Consider again an i.i.d. data sample $S = \{(x_i, y_i)\}_{i=1}^{N}$ of training patterns $x_i \in \mathbb{R}^d$, labelled into two classes by $y_i \in \{-1, +1\}$, with $y_i = +1$ if $x_i \in \omega_1$ and $y_i = -1$ if $x_i \in \omega_2$. If we set up an affine function $f(x) = w^T x + b$, then we have a linear discriminant as $\operatorname{sign}(f(x))$, for which we would like $f(x_i) > 0$ when $y_i = +1$ and $f(x_i) < 0$ when $y_i = -1$.

In short, $y_i f(x_i) > 0$, or $y_i (w^T x_i + b) > 0$, for all $i$. Given the hyperplane $f(x) = 0$, the perpendicular distance from $x_i$ to it is $|f(x_i)| / \|w\|$. The support vectors are those $x_i$ closest to the hyperplane. Rescaling $w$ and $b$ such that $|f(x_i)| = 1$ for these closest points, one obtains $y_i (w^T x_i + b) \geq 1$ for all $i$. The support vectors are now those $x_i$ for which $y_i f(x_i) = 1$.

The margin of a plane can now be written as twice its distance to any support vector: $2 / \|w\|$, where $\|w\| = \sqrt{w^T w}$. To maximize the margin, we should minimize $\frac{1}{2} \|w\|^2$ subject to $y_i (w^T x_i + b) \geq 1$, for all $i$.

In the case where no hyperplane exists that can correctly separate the points in the data sample, a set of non-negative slack variables $\xi_i$ is introduced to allow for small margin violations, leading to a soft margin:

$y_i (w^T x_i + b) \geq 1 - \xi_i, \quad i = 1, \ldots, N \quad (1)$

where $\xi_i \geq 0$. For an error to occur, the corresponding $\xi_i$ must exceed unity, and so $\sum_i \xi_i$ is an upper bound on the number of training errors. The optimal separating hyperplane can be found as the solution of the 1-norm Quadratic Programming problem:

$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to (1) and } \xi_i \geq 0$

The solution to this optimization problem corresponds to the saddle point of its associated Lagrangian:

$L(w, b, \xi; \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i (w^T x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{N} \beta_i \xi_i$

where $\alpha_i, \beta_i \geq 0$ for $i = 1, \ldots, N$.

Once this QP problem is solved, the solution vector $w$ can be expressed as a linear expansion over the support vectors:

$w = \sum_{i=1}^{N} \alpha_i y_i x_i \quad (2)$

The support vectors are precisely those $x_i$ for which $\alpha_i > 0$.
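
In MATLAB, the expansion in eq. (2) can be recovered from a trained linear SVM; the sketch below assumes the fitcsvm interface of recent Statistics and Machine Learning Toolbox releases (the original experiments used Gunn's toolbox instead) and labels y coded as +1/-1:

    % Train a linear soft-margin SVM and recover w from eq. (2).
    mdl = fitcsvm(X, y, 'KernelFunction', 'linear', 'BoxConstraint', 1);
    w = (mdl.Alpha .* mdl.SupportVectorLabels)' * mdl.SupportVectors;
    b = mdl.Bias;            % for a linear kernel, mdl.Beta also holds w
    relevance = abs(w);      % per-gene weight magnitudes, used for ranking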

Resampling Methods

Model selection is concerned with the process of finding the optimal model for a set of samples among a set of candidate models. Resampling methods aim at making a better use of the available data. These methods are very useful for assessing how a predictive model that can be the result of a complex modeling process will perform in practice.

The generic goal of cross-validation (CV) is to estimate the expected error of a model on data that are independent of the data used to train the model. One round of $k$-fold CV (or $k$-CV) involves partitioning the sample into $k$ complementary subsets, systematically performing the modeling on the union of $k-1$ such subsets and checking the obtained model on the remaining subset (acting as a validation set). The result of $k$-CV is an estimation of the error if only a fraction $(k-1)/k$ of the available data is used. This error is expected to be conservative (larger than the error obtained if the entire sample were used). To reduce variability, multiple rounds can be performed using different partitions, and the results averaged over the rounds.
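
The mechanics of repeated $k$-fold CV can be sketched in a few lines of MATLAB with cvpartition; here someModelError is a hypothetical helper returning the validation error of whatever model is being assessed:

    % Repeated (5 times) 5-fold cross-validation skeleton.
    k = 5; reps = 5; err = zeros(reps, k);
    for r = 1:reps
        c = cvpartition(y, 'KFold', k);   % fresh stratified partition
        for i = 1:k
            tr = training(c, i); te = test(c, i);
            err(r, i) = someModelError(X(tr,:), y(tr), X(te,:), y(te));
        end
    end
    cvErr = mean(err(:));                 % averaged over all rounds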

The Feature Selection Algorithm

Feature selection can be seen as a search problem, where each state in the search space corresponds to a subset of the features. In the ML literature, a wide family of suboptimal algorithms depart from an initial solution and iteratively add or delete features by locally optimizing the error function. In forward selection, features are progressively incorporated into larger subsets; in backward selection (or elimination) one starts with the full set of features and progressively eliminates elements from it.

Wrappers are often criticized because they are computationally very expensive. Moreover, feature selection is badly affected by small sample sizes, producing overly optimistic results and introducing an excess of variance in the readings. This is aggravated in the presence of very sophisticated search algorithms [27]. On the other hand, greedy search strategies seem to be particularly advantageous computationally and may alleviate the problem of overfitting [28]. Nevertheless, traditional pure forward selection and backward elimination search algorithms are ill-advised in that they cannot rectify their decisions and may end up delivering poor solutions both in terms of quality and size.

To reduce the number of genes and obtain small subsets of highly relevant genes, we use a simple but effective forward-backward feature selection algorithm. This algorithm follows the wrapper idea, i.e., the feature selection algorithm uses a learner as a subroutine in the search for good subsets [29]. In this general setting, when features are added or removed from the current subset the algorithm resorts to some performance measure –commonly the resampled rate of recognition.

An interleaved forward-backward search is developed, looking for improvements in the chosen performance measure. The algorithm is described in the listing Algorithm 1. Given a performance measure to be maximized (in this case, the resampled evaluation of a classifier in a data sample), the algorithm searches the space of subsets by adding/removing features in a hill-climbing fashion.

Specifically, in every iteration of the outer loop, one feature is added to the current best solution BEST, as long as this step improves on the current performance. Then a variable number of feature removal steps is carried out, as long as the same condition of improved performance is met. This scheme is oriented to favour solutions with low numbers of features. The outer iteration also ends when no further improvement is observed. This strategy bears some resemblance to a floating search algorithm in its forward version [30]; however, it has a far lower computational cost given that discarded features are not considered again for another inclusion round. Note also that current subset performance is not compared specifically against the best performance achieved for the same size of the current subset (as floating methods do). It should be mentioned that the algorithm itself needs no parameter specification, although the chosen performance measure could have some.

Algorithm 1 Forward-Backward gene feature selection.

1: Input: F: full feature set;

Y: class feature (Healthy, FSHD)

P: performance measure, to be maximized

2: f* ← argmax_{f ∈ F} P({f})

3: BEST ← {f*}

4: p* ← P(BEST)

5: repeat

6: ***Forward Stage***

7: f ← argmax_{f ∈ F \ BEST} P(BEST ∪ {f})

8: p ← P(BEST ∪ {f})

9: if p > p* then

10: BEST ← BEST ∪ {f}

11: p* ← p

12: F ← F \ {f}

13: end if

14: ***Backward Stage***

15: repeat

16: g ← argmax_{g ∈ BEST} P(BEST \ {g})

17: p ← P(BEST \ {g})

18: if p ≥ p* then

19: BEST ← BEST \ {g}

20: p* ← p

21: end if

22: until BEST does not change

23: until BEST does not change

24: Output: BEST: Optimized feature subset

As explained in the introduction, we are interested in a solution that combines high predictive performance, very small size (i.e., a very low number of useful genes), admits visualization and interpretation, and hopefully may bear biological relevance.

To this end, we now make explicit how the methods previously described fit together. The performance measure to be maximized in Algorithm 1 is the accuracy rate of LDA. This recognition rate is resampled using 5 times 5-fold cross validation (5×5 cv for short), following common practices in the literature [31].
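
Gluing the pieces together, the measure P(S) evaluated throughout Algorithm 1 can be sketched in MATLAB as the mean 5×5 cv accuracy of LDA restricted to a candidate gene subset S (a vector of column indices into the expression matrix X); the helper name subsetPerformance is ours, and classify is the Statistics Toolbox LDA routine:

    % P(S): mean 5x5 cv accuracy of LDA on the gene subset S.
    function p = subsetPerformance(X, y, S)
        reps = 5; k = 5; acc = zeros(reps, k);
        for r = 1:reps
            c = cvpartition(y, 'KFold', k);
            for i = 1:k
                tr = training(c, i); te = test(c, i);
                pred = classify(X(te, S), X(tr, S), y(tr), 'linear');
                acc(r, i) = mean(strcmp(pred, y(te)));
            end
        end
        p = mean(acc(:));
    end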

Due to the low number of samples, ties in the performance measure can happen easily. As a consequence, the gene subset selection process may end up in different final solutions, something that is not desirable in general [32]. How these ties are broken is non-trivial and should be addressed specifically and explicitly; however, the literature does not seem to offer any formal solution or procedure. Univariate methods such as entropy-based measures [33], [34], the Fisher score [35], or other statistical measures might be preferred for their simplicity; see, e.g., [36], [37]. Instead, a multivariate feature ranking method seems much more adequate to measure the relevance of a group of tied features.

As explained above, linear support vector machines (SVMs) can be seen as linear discriminant classifiers. Indeed, the magnitudes of the weight vector components in eq. (2) have been used as surrogates for the relevance of the corresponding genes since the pioneering work of [38]. Notice that our approach is different in that predictive performance is the main criterion for optimization; only in case of ties is the magnitude of the SVM weight vector used, because the relation between this magnitude and final performance is rather indirect. This margin-based tie-breaking procedure has been incorporated into the feature selection algorithm. It is used every time an evaluation of the performance measure may incur one or more ties (lines 2, 7 and 16 in Algorithm 1).
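
A sketch of this tie-break, under our reading of the procedure (svmWeights is a hypothetical helper returning the linear-SVM weight vector of eq. (2) for a given gene subset):

    % Break a tie among candidate genes by SVM weight magnitude.
    % tied: indices of the candidates tied on the resampled accuracy.
    wmag = zeros(size(tied));
    for j = 1:numel(tied)
        w = svmWeights(X(:, [BEST tied(j)]), y);  % subset plus candidate
        wmag(j) = abs(w(end));                    % weight of the candidate
    end
    [~, idx] = max(wmag);
    winner = tied(idx);                           % candidate kept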

Other Methods

Prediction analysis for microarrays.

PAMR (Prediction Analysis for Microarrays) performs sample classification from gene expression data, via the nearest shrunken centroid method [39]. Similarly to the proposed method, PAMR estimates prediction error via cross-validation and provides a list of significant genes whose expression characterizes each diagnostic class.

Support vector machine for recursive feature elimination.

SVM-RFE (Support Vector Machine - Recursive Feature Elimination) [38] has been widely used with great success in microarray data analysis, particularly for disease gene finding. It largely eliminates redundant genes and usually yields very compact gene subsets. The genes are eliminated according to a ranking related to the weight magnitudes in the SVM solution; this is the same criterion used for tie-breaking as described in the previous section.

Software Implementation

Algorithm 1 was implemented entirely in the MATLAB language, version 2012a. The computer codes were run on an Ubuntu Linux server version 11.10 with an Intel(R) Xeon(R) CPU E5620 @ 2.40 GHz and 8 cores. The deployed solution to Algorithm 1 takes advantage of the possibility of parallelizing parts of the code, particularly lines 2, 7 and 16. In an 8-core scenario, eight genes or features can be evaluated at the same time. The complete software and instructions to reproduce the experiments described in this paper (or to conduct new ones) is available at http://nova.mxl.uabc.mx/fernando/PO/ for the interested reader.
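
The candidate evaluations in those lines are independent of one another, so they parallelize naturally; a sketch with a parfor loop (Parallel Computing Toolbox), reusing the subsetPerformance sketch above:

    % Evaluate all candidate genes in parallel (8 workers -> 8 at a time).
    cand = setdiff(1:size(X, 2), BEST);
    perf = zeros(1, numel(cand));
    parfor j = 1:numel(cand)
        perf(j) = subsetPerformance(X, y, [BEST cand(j)]);
    end
    [p, idx] = max(perf);   % best candidate for the forward stage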

The LDA classification algorithm and the resampling methods involved in Algorithm 1 were implemented using existing MATLAB functions. The only part that uses an external toolbox is the tie-breaking procedure of eq. (2), for which the well-known Steve Gunn's MATLAB Support Vector Machine Toolbox [40] was used. The full specification of parameters is described at the URL given above.

It is important to clarify that the data sets were used without any pre-processing step. The learning algorithms and the complete experimental setting were fed with the original downloaded E-GEOD-3307 and GSE36398 data. Complete details about the E-GEOD-3307 data set can be found at http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-3307/ and, for the GSE36398 data set, at http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/sites/GDSbrowser?acc=GDS4404. The two datasets differ in the number of columns given that they correspond to different technologies or gene chip versions. Although it is possible to map genes from one technology to another, this process requires a considerable effort that goes beyond the scope of this paper.

The PAMR experiments were conducted through a specific R implementation [39] and run on the same Ubuntu Linux server described above. Specifically, the Nearest Shrunken Centroid classification algorithm works by shrinking each of the class centroids toward the overall centroid by a certain amount called the threshold. We used an adaptive computation of this value as provided in the PAMR package.

The SVM-RFE experiments were implemented with the Spider v1.7 software, a MATLAB Machine Learning package popular for feature selection tasks –see http://people.kyb.tuebingen.mpg.de/spider/main.html.

Author Contributions

Conceived and designed the experiments: FFGN LABM. Performed the experiments: FFGN. Analyzed the data: FFGN LABM. Wrote the paper: FFGN LABM KASC.

References

  1. Tawil R (2008) Facioscapulohumeral muscular dystrophy. Neurotherapeutics 5: 601–606.
  2. Engel A, Franzini-Armstrong C (2004) Facioscapulohumeral muscular dystrophy and scapuloperoneal disorders. In: Myology, McGraw Hill. 1123–1133.
  3. MDC. Muscular Dystrophy Campaign. Available: http://www.muscular-dystrophy.org/. Accessed 2012 Apr 4.
  4. van der Maarel S, Frants R, Padberg G (2007) Facioscapulohumeral muscular dystrophy. Biochimica et Biophysica Acta (BBA) - Molecular Basis of Disease 1772: 186–194.
  5. Tawil R, Figlewicz D, Griggs R, Weiffenbach B (1998) Facioscapulohumeral dystrophy: A distinct regional myopathy with a novel molecular pathogenesis. Annals of Neurology 43: 279–282.
  6. Rose M, Tawil R (2004) Drug treatment for facioscapulohumeral muscular dystrophy. Cochrane Database of Systematic Reviews 2.
  7. van der Maarel S, Tawil R, Tapscott SJ (2011) Facioscapulohumeral muscular dystrophy and DUX4: breaking the silence. Trends in Molecular Medicine 17: 252–258.
  8. Tim R, Gilbert J, Stajich J, Rampersaud E, Viles K, et al. (2001) Clinical studies in nonchromosome 4-linked facioscapulohumeral muscular dystrophy. Journal of Clinical Neuromuscular Disease 3.
  9. van 't Veer L, Dai H, van de Vijver M, He Y, Hart A, et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530–536.
  10. Schölkopf B, Tsuda K, Vert JP (2004) Kernel Methods in Computational Biology. Cambridge, Mass.: MIT Press.
  11. Lukas L, Devos A, Suykens J, Vanhamme L, Howe F, et al. (2004) Brain tumor classification based on long echo proton MRS signals. Artificial Intelligence in Medicine 31: 73–89.
  12. Dupuy A, Simon RM (2007) Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. Journal of the National Cancer Institute 99: 147–157.
  13. Pistoni M, Shiue L, Cline MS, Bortolanza S, Neguembor MV, et al. (2013) Rbfox1 downregulation and altered calpain 3 splicing by FRG1 in a mouse model of facioscapulohumeral muscular dystrophy (FSHD). PLoS Genet 9(1).
  14. Braga-Neto UM, Dougherty ER (2004) Is cross-validation valid for small-sample microarray classification? Bioinformatics 20(3).
  15. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and Regression Trees. CRC Press.
  16. Hall M, Corbett A, Pavlath G (2011) Regulation of nucleocytoplasmic transport in skeletal muscle. In: Myogenesis, Elsevier. 274–293.
  17. Hall MN, Griffin CA, Simionescu A, Corbett AH, Pavlath GK (2011) Distinct roles for classical nuclear import receptors in the growth of multinucleated muscle cells. Developmental Biology 357: 248–258.
  18. GeneCards. Weizmann Institute of Science. Available: http://www.genecards.org/. Accessed 2012 Jun.
  19. GeneAtlas. Université René Descartes - Paris. Available: http://www.dsi.univ-paris5.fr/genatlas/. Accessed 2012 Jun.
  20. Lange M, Kaynak B, Forster UB, Tönjes M, Fischer JJ, et al. (2008) Regulation of muscle development by DPF3, a novel histone acetylation and methylation reader of the BAF chromatin remodeling complex. Genes & Development 22: 2370–2384.
  21. Hofstra RM, Wu Y, Stulp RP, Elfferich P, Osinga J, et al. (2000) RET and GDNF gene scanning in Hirschsprung patients using two dual denaturing gel systems. Human Mutation 15: 418–429.
  22. Busse M, Feta A, Presto J, Wilén M, Grönning M, et al. (2007) Contribution of EXT1, EXT2, and EXTL3 to heparan sulfate chain elongation. Journal of Biological Chemistry 282: 32802–32810.
  23. NCBI (2012). National Center for Biotechnology Information. Available: http://www.ncbi.nlm.nih.gov/. Accessed 2012 Aug.
  24. Rahimov F, King O, Leung D, Bibat G, Emerson C, et al. (2012) Transcriptional profiling in facioscapulohumeral muscular dystrophy to identify candidate biomarkers. Proceedings of the National Academy of Sciences 109: 16234–16239.
  25. Ripley BD (1996) Pattern Recognition and Neural Networks. Cambridge University Press.
  26. Vapnik V (1998) Statistical Learning Theory. John Wiley and Sons.
  27. Reunanen J, Guyon I, Elisseeff A (2003) Overfitting in making comparisons between variable selection methods. Journal of Machine Learning Research 3: 1371–1382.
  28. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. Journal of Machine Learning Research 3: 1157–1182.
  29. John G, Kohavi R, Pfleger K (1994) Irrelevant features and the subset selection problem. In: Proceedings of the International Conference on Machine Learning.
  30. Pudil P, Ferri F, Novovicova J, Kittler J (1994) Floating search methods for feature selection. Pattern Recognition Letters 15: 1119–1125.
  31. Hastie T, Tibshirani R, Friedman JH (2001) The Elements of Statistical Learning. Springer-Verlag, New York.
  32. Zhou X, Mao KZ (2006) The ties problem resulting from counting-based error estimators and its impact on gene selection algorithms. Bioinformatics 22: 2507–2515.
  33. Bell DA, Wang H (2000) A formalism for relevance and its application in feature subset selection. Machine Learning 41: 175–195.
  34. Furlanello C, Serafini M, Merler S, Jurman G (2003) Entropy-based gene ranking without selection bias for the predictive classification of microarray data. BMC Bioinformatics 4.
  35. Duda R, Hart P, Stork D (2001) Pattern Classification. John Wiley and Sons.
  36. Liu H, Li J, Wong L (2002) A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. Genome Informatics 13: 51–60.
  37. Liu H, Motoda H (1998) Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer Academic Publishers.
  38. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Machine Learning 46: 389–422.
  39. Tibshirani R, Hastie T, Narasimhan B, Chu G (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences 99: 6567–6572.
  40. Gunn SR (1997) Support vector machines for classification and regression. Technical report, Image Speech and Intelligent Systems Research Group, University of Southampton.