
Comprehensive Decision Tree Models in Bioinformatics

  • Gregor Stiglic ,

    gregor.stiglic@uni-mb.si

    Affiliations Faculty of Health Sciences, University of Maribor, Maribor, Slovenia, Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia

  • Simon Kocbek,

    Affiliation Faculty of Health Sciences, University of Maribor, Maribor, Slovenia

  • Igor Pernek,

    Affiliation Faculty of Health Sciences, University of Maribor, Maribor, Slovenia

  • Peter Kokol

    Affiliations Faculty of Health Sciences, University of Maribor, Maribor, Slovenia, Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia

Abstract

Purpose

Classification is an important and widely used machine learning technique in bioinformatics. Researchers and other end-users of machine learning software often prefer to work with comprehensible models where knowledge extraction and explanation of reasoning behind the classification model are possible.

Methods

This paper presents an extension to an existing machine learning environment and a study on visual tuning of decision tree classifiers. The motivation for this research comes from the need to build effective and easily interpretable decision tree models with a so-called one-button data mining approach where no parameter tuning is needed. To avoid bias in classification, no classification performance measure is used during tuning of the model; the model is constrained exclusively by the dimensions of the produced decision tree.

Results

The proposed visual tuning of decision trees was evaluated on 40 datasets containing classical machine learning problems and 31 datasets from the field of bioinformatics. Although we did not expect significant differences in classification performance, the results demonstrate a significant increase in accuracy for the less complex, visually tuned decision trees. In contrast to the classical machine learning benchmarking datasets, we observed higher accuracy gains on the bioinformatics datasets. Additionally, a user study was carried out to confirm the assumption that the tree tuning times are significantly lower for the proposed method in comparison to manual tuning of the decision tree.

Conclusions

The empirical results demonstrate that by building simple models constrained by predefined visual boundaries, one not only achieves good comprehensibility, but also very good classification performance that does not differ from usually more complex models built using default settings of the classical decision tree algorithm. In addition, our study demonstrates the suitability of visually tuned decision trees for datasets with binary class attributes and a high number of possibly redundant attributes that are very common in bioinformatics.

Introduction

Decision trees are one of the most popular classification techniques in data mining [1]. One of the main reasons for this is their ability to represent the results in a simple decision tree format which is easy for experts to interpret, as they can see the structure of decisions in the classification process. The basic idea of the decision tree format is to construct a tree whose leaves are labeled with a particular value of the class attribute and whose inner nodes represent descriptive attributes. Given an inner node N, the children of N correspond to the different possible values of the associated descriptive attribute. Once a decision tree is built, determining the class value for a new instance is achieved by following a path from the root to a leaf according to the values of the descriptive attributes of the instance. The class value assigned is the one labeling the leaf. Following this process, one can easily extract classification rules that can readily be expressed so that humans can understand them. In addition to their simplicity, building decision trees is often a less time-consuming process than other classification techniques [2], and decision tree rules can be used directly as statements in a database access language (e.g. SQL).
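
As an illustration of this traversal, and of how a single root-to-leaf path maps onto a database query, the following toy Java sketch classifies one instance and prints the equivalent SQL predicate. The attribute names and thresholds are hypothetical and are not taken from any dataset in this study.

// Toy illustration (not from the paper): following a root-to-leaf path to
// classify an instance, and the same path expressed as an SQL predicate.
public class RuleAsSqlDemo {
    // Hypothetical descriptive attributes of one instance.
    static String classify(double molecularWeight, double gravyIndex) {
        if (molecularWeight <= 45000) {            // inner node 1
            if (gravyIndex <= -0.2) {              // inner node 2
                return "soluble";                  // leaf
            }
            return "insoluble";                    // leaf
        }
        return "insoluble";                        // leaf
    }

    public static void main(String[] args) {
        System.out.println(classify(30000, -0.5)); // prints "soluble"
        // The same leaf expressed as a database query:
        String sql = "SELECT * FROM proteins "
                   + "WHERE molecular_weight <= 45000 AND gravy_index <= -0.2";
        System.out.println(sql);
    }
}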

Decision trees can be built with several different approaches where the most popular are C4.5 [3] and CART [4]. Due to their popularity, decision trees have been applied to different research fields including bioinformatics [5], [6], medicine [7] and image classification [8]. In addition, several commercial products use decision trees for knowledge discovery, predictive analysis and other purposes. For instance, KnowledgeSeeker [9] offers business intelligence software for customer analytics and marketing analytics.

From the knowledge discovery perspective, the ability to track and evaluate every step in the decision-making process is one of the most important factors for trusting the decisions gained from data-mining methods. Decision trees are an example of such a technique and possess an important advantage over competing classification methods - i.e., the symbolic representation of the extracted knowledge. Decision trees, along with rule-based classifiers, represent a group of classifiers that perform classification by a sequence of simple, easy-to-understand tests whose semantics are intuitively clear to domain experts [10]. Although current state-of-the-art classifiers (e.g. Support Vector Machines [11]) or ensembles of classifiers (e.g. Random Forest [12] or Rotation Forest [13]) significantly outperform classical decision tree models in terms of classification accuracy or other classification performance metrics, they are not well suited to the knowledge discovery process.

When decision trees are used in knowledge discovery, one should usually include domain experts in the analysis process. Therefore, in most cases the final decision trees will be presented to domain experts for evaluation of extracted knowledge – i.e., rules that can be derived from a decision tree. In such cases the complexity of decision trees, which is usually measured as the number of nodes or the number of rules that can be extracted from a tree, is of high importance and can influence the evaluation of the discovered knowledge by domain experts [14]. Decision tree complexity has been studied in terms of reducing the complexity and maintaining or improving the accuracy at the same time. Bohanec and Bratko [15] studied the difference between pruning a decision tree for better approximation of the target concept and pruning the decision tree to make it practical for communication and understanding by the user. Their study focused on developing algorithms for obtaining the smallest pruned decision trees that represent concepts within some chosen accuracy. Oates and Jensen [16] studied the influence of database size on decision tree complexity. They demonstrated that the tree size strongly depends on the training set size. Therefore, many approaches that are based on removing training instances prior to tree construction [17], [18], [19] could result in smaller trees just because of the training set reduction.

Different visual representations of decision trees, like the classical node-link diagrams [20], [21], treemaps [22], [23], concentric circles [24], [25], and many others, have been proposed in the past. A major consideration in the evaluation of decision trees is also how efficiently they use screen space to communicate the tree information [26]. Through the application of decision trees to different fields of research and their use in open source and commercial software for machine learning and data mining, it has been demonstrated that end-users still prefer node-link diagrams, although their use of space is not optimal. Huysmans et al. [27] observe that most current research focuses on improving the accuracy or precision of predictive models, and comparatively little research has been undertaken to increase their comprehensibility to the analyst or end-user. They empirically investigated the suitability of decision tables, (binary) decision trees, propositional rules, and oblique rules in environments where interpretability of models is of high importance. The results showed that users preferred decision tables, followed by decision trees, over the other knowledge representations, although the authors acknowledged that only inexperienced users were included in the study.

A multi-criteria approach to the evaluation of decision trees that also includes the size of the built decision trees was proposed by Osei-Bryson [28]. It aims to make the data mining process simpler for data mining project teams, especially when they have to evaluate a significant number of decision trees. The proposed approach uses three measures to evaluate the appropriateness of decision trees: stability, simplicity and discriminatory power. Simplicity (or, equivalently, complexity) is further divided into the number of rules that can be extracted from the tree and the average length of the extracted rules.

Due to their popularity and the need to build simple decision trees with as little effort as possible, this paper proposes a novel method called Visual Tuning of Decision Trees (VTDT). The method helps data analysts build effective decision tree representations while spending less time on setting and tuning the parameters of the decision tree induction algorithm compared to classical methods. From the analyst's perspective it is very important that the produced representation of the decision tree allows effective communication with end-users (i.e. customers) or, in research applications, with domain experts. In addition, from our own experience and that of our colleagues, we know that, although we live in a digital age, many domain experts still prefer to have the final decision tree printed on a sheet of paper. The result of the VTDT method is a decision tree that can be printed on a single page or displayed on a computer screen without the need for scrolling or zooming. It is also important to handle decision trees that would be over-pruned when using the default parameters of the decision tree induction method. One could also call this type of decision tree induction “one-button decision trees”, as there is no longer any need to tune the parameters and build multiple decision trees.

Methods

The proposed method in this paper presents an automated tuning process for the widely used C4.5 decision tree, which was developed by Quinlan [3]. More precisely, it focuses on C4.5's implementation in the Weka machine learning framework [29], where it is referred to as J48.

2.1 Tuning the Parameters

There are multiple settings that can influence the size of the generated decision tree. Two types of pruning are available - i.e., subtree replacement and subtree raising. Subtree raising uses a technique where a node may be moved upwards towards the root of the tree, replacing other nodes along the way during pruning. In general, subtree raising is computationally more complex than subtree replacement, where nodes in a decision tree are simply replaced by leaves. Another setting influencing the pruning process is the confidence factor, which represents a threshold of allowed inherent error in the data while pruning the decision tree. By lowering this threshold, more pruning is applied and consequently more general models are generated. To obtain simpler models, where leaves contain a higher number of samples, it is possible to set the minimal number of objects in a single leaf. This setting can also be used in tuning to achieve simpler and smaller decision trees. The final setting that can be used to tune the visual appearance of the tree is binary splits, which forces nodes to split into only two branches instead of multiple branches. The default J48 decision tree in Weka uses pruning based on subtree raising, a confidence factor of 0.25, a minimal number of objects of 2, and multiway splits. To allow automated tuning in Weka, a package called Visually Tuned J48 (VTJ48, available at http://ri.fzv.uni-mb.si/vtj48/) was developed during this study.
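
The following Java sketch shows how these four settings would be set explicitly to the defaults listed above through the Weka J48 API; the dataset path is a placeholder.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: the four J48 settings discussed above, set explicitly to the Weka
// defaults mentioned in the text. The dataset path is a placeholder.
public class J48Defaults {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("some-dataset.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setSubtreeRaising(true);     // pruning by subtree raising (default)
        tree.setConfidenceFactor(0.25f);  // default confidence factor
        tree.setMinNumObj(2);             // minimal number of objects per leaf
        tree.setBinarySplits(false);      // multiway splits allowed

        tree.buildClassifier(data);
        System.out.println("Leaves: " + tree.measureNumLeaves()
                         + ", nodes: " + tree.measureTreeSize());
    }
}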

All parameters mentioned in the previous paragraph are automatically tuned in VTJ48 to allow the so-called “one-button data mining”. However, it is possible to change the default values for the dimensions of the resulting window that represent the boundaries of the VTJ48 decision tree. The default maximal dimensions of the decision tree are set to 1280×800 pixels, corresponding to the Widescreen eXtended Graphics Array (WXGA) video standard. The aspect ratio of this resolution is 16:10 (1.60) and comes very close to the aspect ratio of A4 paper (approximately 1.41). The chosen dimensions can also be displayed on most computer monitors in use today.

Although it would be possible to use the original Weka source code to display decision trees, some adaptations of the original decision tree visualization had to be made to allow better use of the space for nodes and leaves. In comparison to the classical Weka decision tree visualization, we changed the shape of the internal nodes to allow more space on both sides of the nodes. Additionally, we reduced the height of the trees by reducing the vertical distance between nodes by 50%.

Tuning of the parameters in VTJ48 is done using an adapted binary search, where the pruning confidence factor is optimized until the highest acceptable value of this parameter is found. The boundaries for confidence factor optimization are set at 0 and 0.5 (the starting value in VTJ48 and the maximal allowed setting in J48). In cases where the initial confidence factor tuning cannot build an acceptable decision tree, binary splits are turned on. This step usually significantly reduces the horizontal dimensions of the tuned decision tree. Tuning of the confidence factor is then repeated. In rare cases, where binary splits are not enough, VTJ48 tries to increase the minimal number of objects in leaves. This parameter (m) is increased from 2 until m < n in steps of m², where n is the number of samples in the training set. A more extensive search could have been chosen, but in that case one should expect a significant increase in the time complexity of the tuning process. The pseudocode in Figure 1 describes the tree size reduction process as implemented in VTJ48.

Figure 1. Pseudocode of decision tree reduction in Visually Tuned J48.

https://doi.org/10.1371/journal.pone.0033812.g001
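
Beyond the pseudocode in Figure 1, the sketch below illustrates the same tuning strategy in Java against the Weka J48 API. It is a minimal approximation, not the VTJ48 source: the check that the rendered tree fits within 1280×800 pixels depends on VTJ48's drawing code, so it is replaced here by a hypothetical leaf-count budget, and the growth schedule for the minimal leaf size simply follows the m² steps described above.

import weka.classifiers.trees.J48;
import weka.core.Instances;

// Minimal sketch of the tuning strategy described above (not the VTJ48 source):
// binary search on the confidence factor, then binary splits, then a larger
// minimum leaf size. fitsWithinBounds() stands in for VTJ48's check that the
// rendered tree fits the target pixel dimensions.
public class VisualTuningSketch {
    static final int MAX_LEAVES = 40;   // stand-in for the 1280x800 px boundary

    static boolean fitsWithinBounds(J48 t) {
        return t.measureNumLeaves() <= MAX_LEAVES;
    }

    static J48 build(Instances data, float cf, boolean binary, int minObj) throws Exception {
        J48 t = new J48();
        t.setConfidenceFactor(cf);
        t.setBinarySplits(binary);
        t.setMinNumObj(minObj);
        t.buildClassifier(data);
        return t;
    }

    // Binary search for the highest confidence factor whose tree still fits.
    static J48 tuneConfidence(Instances data, boolean binary, int minObj) throws Exception {
        float lo = 0f, hi = 0.5f;
        J48 best = null;                 // may stay null if no tree fits
        for (int i = 0; i < 10; i++) {
            float cf = (lo + hi) / 2f;
            J48 t = build(data, cf, binary, minObj);
            if (fitsWithinBounds(t)) { best = t; lo = cf; } else { hi = cf; }
        }
        return best;
    }

    static J48 tune(Instances data) throws Exception {
        J48 t = tuneConfidence(data, false, 2);           // step 1: confidence factor
        if (t == null) t = tuneConfidence(data, true, 2); // step 2: binary splits
        int n = data.numInstances();
        for (int m = 2; t == null && m < n; m = m * m) {  // step 3: grow min leaf size (m²)
            t = tuneConfidence(data, true, m);
        }
        return t;                                         // null if nothing fits
    }
}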

In rare cases the default settings of the VTJ48 algorithm will produce an extremely small tree consisting of just one splitting node, or even none at all. Therefore, in cases of decision trees with only one or two leaves, an approach using an unpruned decision tree is used. With the confidence factor set to 0.5, such a tree will usually grow beyond the predefined boundaries. In this case a linear hill-climbing approach is used to increase the minimal number of objects in leaves, because there is no need to tune the confidence factor of an unpruned decision tree.
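
A corresponding sketch of this fallback, under the same assumptions as the previous sketch (a leaf-count stand-in for the visual boundary), could look as follows.

import weka.classifiers.trees.J48;
import weka.core.Instances;

// Sketch of the fallback described above: if tuning collapses the tree to one
// or two leaves, switch to an unpruned tree and raise the minimum leaf size
// linearly until the tree fits the visual boundary again.
public class DegenerateCaseFallback {
    static J48 growUnpruned(Instances data, int maxLeaves) throws Exception {
        for (int m = 2; m < data.numInstances(); m++) {   // linear hill climbing
            J48 t = new J48();
            t.setUnpruned(true);      // no confidence-factor pruning at all
            t.setMinNumObj(m);
            t.buildClassifier(data);
            if (t.measureNumLeaves() <= maxLeaves) return t;
        }
        return null;  // could not fit within bounds
    }
}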

2.2 Experimental Settings

By reducing the size and complexity of decision trees to fit a predefined screen resolution or paper size, one would expect significantly lower classification accuracy, especially for initially very large decision trees. We used several different datasets to test this assumption.

2.2.1. UCI Datasets.

Forty datasets from the UCI repository [30], retrieved from the Weka website, were used to evaluate the classification performance of the VTDTs. Basic information on all datasets, including the attribute characteristics that can influence the size of a decision tree, is presented in Table 1.

Table 1. Basic information on 40 datasets from UCI repository used in this study including information about number of instances, attributes, classes, length of longest attribute name (LAN) and length of the longest nominal attribute value (LAV).

https://doi.org/10.1371/journal.pone.0033812.t001

2.2.2. Protein Solubility Datasets.

In addition to the datasets from the UCI repository, we tested our method on datasets in the field of bioinformatics.

Protein solubility is an important protein property, since low protein solubility can lead to several diseases [31] or affect the isolation of proteins from complex mixtures [32]. Several attempts to classify and predict protein solubility have been made [33]–[36]. To assess our method, we used the eSol database (available at http://tp-esol.genes.nig.ac.jp/), which includes information about protein solubility for the entire ensemble of E. coli proteins. The database contains 1,625 proteins, out of which 782 are insoluble and 843 are soluble. We computed 21 feature datasets for these proteins, as shown in Table 2. These numeric features have been shown to be influential in protein solubility prediction in previous work, where:

  1. the feature datasets 1–18 contain mono-, di- and tri-mers using 7 different alphabets (see the k-mer sketch after Table 2),
  2. the feature dataset 19 contains 4 sequence-computed features, i.e., molecular weight, sequence length, isoelectric point and GRAVY index,
  3. the feature dataset 20 contains features used in [33], and
  4. the feature dataset 21 combines all features from the previous datasets.
Table 2. Feature datasets used in protein solubility classification.

https://doi.org/10.1371/journal.pone.0033812.t002
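
As a rough illustration of how such k-mer features can be computed, the following sketch counts relative mono-, di- or tri-mer frequencies over a reduced amino-acid alphabet. The three-letter grouping used here is hypothetical and only stands in for one of the seven alphabets mentioned above; it is not the exact feature definition used for the eSol datasets.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch (not the paper's exact feature definitions): k-mer
// relative frequencies over a reduced amino-acid alphabet.
public class KmerFeatures {
    // Map each amino acid to a reduced-alphabet symbol (hypothetical grouping).
    static char reduce(char aa) {
        if ("AVLIMFWC".indexOf(aa) >= 0) return 'H';   // hydrophobic
        if ("STNQGPY".indexOf(aa) >= 0)  return 'N';   // neutral
        return 'P';                                    // polar/charged
    }

    static Map<String, Double> kmerFrequencies(String sequence, int k) {
        StringBuilder reduced = new StringBuilder();
        for (char aa : sequence.toUpperCase().toCharArray()) reduced.append(reduce(aa));
        Map<String, Double> counts = new LinkedHashMap<>();
        int total = reduced.length() - k + 1;
        for (int i = 0; i < total; i++) {
            String kmer = reduced.substring(i, i + k);
            counts.merge(kmer, 1.0, Double::sum);
        }
        for (Map.Entry<String, Double> e : counts.entrySet())
            e.setValue(e.getValue() / total);          // relative frequency
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(kmerFrequencies("MKTAYIAKQR", 2)); // dimer features
    }
}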

2.2.3. Gene Expression Datasets.

Comprehensible classifiers can provide important insights in gene expression analysis studies. In this study we used 9 Gene Expression Machine Learning Repository (GEMLeR) datasets [37]. Altogether 1,545 samples are divided into the following groups by tumor type: breast (344 samples), colon (286), kidney (260), ovary (198), lung (126), uterus (124), omentum (77), prostate (69) and endometrium (61). The GEMLeR datasets used in this study were created by selecting one of the 9 groups of samples in a so-called one-versus-all binary classification setting. An unsupervised highest-variance filter was chosen to avoid the so-called “selection bias” when reducing the number of attributes; it eliminates the measurements with extremely low variance. Samples consisting of the original 54,681 expression measurements from the Human Genome U133 Plus 2.0 Array GeneChip were reduced to 10,935 (20%) gene expression measurements that represent the attributes of the 9 datasets.
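
A minimal sketch of such a highest-variance filter, written with standard Weka classes (and not the original GEMLeR preprocessing code), could keep the top 20% of attributes as follows.

import java.util.Arrays;
import java.util.Comparator;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

// Sketch of an unsupervised highest-variance filter as described above: keep
// the top fraction of numeric attributes by variance (plus the class attribute).
public class VarianceFilter {
    static Instances keepTopVariance(Instances data, double fraction) throws Exception {
        int classIdx = data.classIndex();
        double[] var = new double[data.numAttributes()];
        for (int i = 0; i < var.length; i++)
            var[i] = (i == classIdx) ? Double.NEGATIVE_INFINITY : data.variance(i);

        // Attribute indices sorted by decreasing variance (class attribute last).
        Integer[] order = new Integer[data.numAttributes()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> -var[i]));

        int keep = (int) Math.round(fraction * (data.numAttributes() - 1));
        int[] selected = new int[keep + 1];
        selected[0] = classIdx;
        for (int i = 0; i < keep; i++) selected[i + 1] = order[i];

        Remove remove = new Remove();                 // Weka attribute filter
        remove.setAttributeIndicesArray(selected);    // 0-based indices to keep
        remove.setInvertSelection(true);              // keep, rather than remove
        remove.setInputFormat(data);
        return Filter.useFilter(data, remove);
    }
}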

2.2.4. Performance Evaluation.

Different measures were recorded for the J48 decision tree using default settings and for the VTJ48 decision tree on all datasets. Basic size-related measures such as the width and height of the decision tree in pixels, the number of leaves and the number of nodes were calculated for each decision tree on each dataset. Additionally, classification accuracy (ACC) and the area under the ROC curve (AUC) were calculated using 20 runs of 10-fold cross-validation on all datasets to observe differences in classification performance.
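
A sketch of this evaluation protocol using Weka's Evaluation class is shown below. The call to weightedAreaUnderROC() assumes a reasonably recent Weka version, and averaging over runs is our simplification of how the repeated cross-validation results may be aggregated.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

// Sketch of the evaluation protocol described above: 20 runs of 10-fold
// cross-validation, averaging accuracy and (weighted) AUC across runs.
public class RepeatedCrossValidation {
    static double[] evaluate(Instances data) throws Exception {
        double acc = 0, auc = 0;
        int runs = 20, folds = 10;
        for (int run = 0; run < runs; run++) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, folds, new Random(run));
            acc += eval.pctCorrect();           // classification accuracy in %
            auc += eval.weightedAreaUnderROC(); // AUC, weighted over classes
        }
        return new double[] { acc / runs, auc / runs };
    }
}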

Results

To evaluate the proposed method we compared the classification performance and size of the classical C4.5 trees (J48) with the visually tuned C4.5 trees (VTJ48). Initially, the tests were performed on 40 datasets from the well-known UCI machine learning repository. In addition, the tests were done on two types of datasets where decision trees can be applied in the field of bioinformatics - i.e., 21 protein solubility datasets and 9 gene expression analysis datasets.

As expected, in most cases the original J48 decision tree vastly exceeded the predefined display resolution of 1280×800 pixels (Table 3 and Table 4). In some extreme cases the width of the decision tree exceeded the predefined dimension by more than 10-fold (letter, audiology, soybean). Decision trees of this size, with a high number of classes, are inappropriate for extraction of rules and presentation to the end-user. Altogether, in the UCI datasets evaluation, there are 30 datasets where VTJ48 optimized a decision tree by reducing the number of leaves to fit into the predefined dimensions. In 8 cases VTJ48 produced decision trees with more leaves than the original J48 method. An increase in tree size occurred when the default settings of J48 produced only one or two leaves; in these cases pruning was automatically turned off in VTJ48, resulting in more complex decision trees. In the case of the protein solubility datasets, there were 20 datasets where the complexity of the tree was reduced and only one case where it increased. Similar changes in tree complexity were observed in the gene expression problems, where complexity increased only in 2 out of 10 datasets. When observing the complexity of the built decision trees, one should also note that VTJ48 starts tuning the confidence factor at 0.5, whereas J48 uses 0.25 by default, which can result in more complex VTJ48 decision trees that still fit into the predefined visual boundaries.

Table 3. Comparison of decision tree dimensions on 40 UCI datasets including the number of leaves.

https://doi.org/10.1371/journal.pone.0033812.t003

Table 4. Comparison of decision tree dimensions on the protein feature datasets including the number of leaves.

https://doi.org/10.1371/journal.pone.0033812.t004

3.1 Classification Performance on UCI Data

Accuracy and AUC (Table 5) were used to evaluate the classification performance, although, due to the high number of multiclass datasets, it is debatable whether accuracy is the right measure of classification performance. As suggested in [38], the Wilcoxon signed-rank test was used to assess the statistical significance of differences in performance and decision tree complexity. Comparing accuracy using the win/draw/lose record, one can observe that J48 wins on 21 datasets, while VTJ48 outperforms J48 on 16 datasets. Statistical significance testing shows that J48 significantly outperforms VTJ48 in accuracy (p = 0.022), while there is no significant difference in AUC (p = 0.766). As already mentioned, one should be cautious when interpreting these results, since accuracy is not a well-suited performance measure for unbalanced multi-class datasets. Therefore, we repeated the test using only the 16 binary-class datasets and found no statistically significant difference (p = 0.320). Table 3 demonstrates a large difference in decision tree size (number of leaves) between J48 and VTJ48 decision trees.

Table 5. Comparison of classification performance (20 runs of 10-fold cross-validation) on 40 UCI datasets.

https://doi.org/10.1371/journal.pone.0033812.t005
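
The comparison procedure itself can be sketched as follows: a per-dataset win/draw/lose count and a Wilcoxon signed-rank test on the paired scores. The use of Apache Commons Math and the accuracy values below are illustrative assumptions; the paper does not state which implementation of the test was used.

import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

// Sketch of the comparison used above: per-dataset win/draw/lose counting and
// a Wilcoxon signed-rank test on paired scores. Library and data are
// illustrative assumptions, not taken from the paper.
public class PairedComparison {
    public static void main(String[] args) {
        // Hypothetical per-dataset accuracies for the two methods.
        double[] j48   = { 0.81, 0.74, 0.93, 0.66, 0.88 };
        double[] vtj48 = { 0.83, 0.75, 0.91, 0.70, 0.90 };

        int win = 0, draw = 0, lose = 0;
        for (int i = 0; i < j48.length; i++) {
            if (vtj48[i] > j48[i]) win++;
            else if (vtj48[i] < j48[i]) lose++;
            else draw++;
        }
        System.out.printf("VTJ48 win/draw/lose: %d/%d/%d%n", win, draw, lose);

        double p = new WilcoxonSignedRankTest()
                .wilcoxonSignedRankTest(vtj48, j48, false); // normal approximation
        System.out.println("Wilcoxon signed-rank p-value: " + p);
    }
}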

3.2 Classification Performance on Bioinformatics Data

Table 4 shows the average decision tree dimensions for the protein datasets including the average number of leaves. It can be noticed that the size was reduced on the majority of the feature datasets. The only exceptions are the DimersClustEm14 and TrimersHydro datasets, on which the tree size increased.

Table 6 shows accuracy and AUC for the evaluation of classification performance on the protein datasets. Since all these datasets present a binary classification problem, accuracy and AUC are more appropriate measures here than in the case of the UCI datasets. Again, the Wilcoxon signed-rank test was used to assess the statistical significance of differences in performance and decision tree complexity. When observing the accuracy win/draw/lose record, one can notice that J48 wins on 5 datasets, while VTJ48 outperforms J48 on 15 datasets. The results are similar for AUC, where J48 wins on 5 datasets, while VTJ48 wins on 16 datasets.

Table 6. Comparison of classification performance (20 runs of 10-fold cross-validation) on the protein datasets.

https://doi.org/10.1371/journal.pone.0033812.t006

Table 7 shows the average decision tree dimensions for the 9 GEMLeR datasets, including the average number of leaves. In comparison to the protein and UCI datasets, it is evident that the gene expression problems do not produce very large trees; therefore, the reduction in size when VTJ48 is used is less pronounced.

Table 7. Comparison of decision tree dimensions on the GEMLeR datasets including the number of leaves.

https://doi.org/10.1371/journal.pone.0033812.t007

Table 8 shows accuracy and AUC for the evaluation of classification performance for the GEMLeR datasets and also demonstrates that the performance actually increases if we use simpler (i.e., smaller) decision tree models.

Table 8. Comparison of classification performance (20 runs of 10-fold cross-validation) on the GEMLeR datasets.

https://doi.org/10.1371/journal.pone.0033812.t008

Statistical significance testing was done on all 20 cross-validation run results for the 30 bioinformatics datasets together, using the Wilcoxon signed-rank test. For both accuracy (p = 0.002) and AUC (p = 0.001), the VTJ48 trees significantly outperformed the J48 trees. Although we did not expect such significant differences in favor of VTJ48, it is obvious that VTJ48 is well suited for datasets with binary class attributes and a high number of possibly redundant attributes.

3.3 Examples of Large Decision Trees

In this section we demonstrate two examples, each with two decision trees built on a single dataset. The first tree in each example is the result of J48 using default settings, and the second tree is the result of VTJ48. The dataset in the first example is the letter dataset from the UCI repository. This dataset contains 26 class values which represent the 26 capital letters of the English alphabet. The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce 20,000 instances with 16 attributes. Figure 2 shows the original tree and the visually tuned tree. One can notice the extremely complex original decision tree, which is the result of the high number of classes. Since the visually tuned tree does not cover all the possible classes, it cannot achieve competitive classification accuracy.

Figure 2. Comparison of the original J48 decision tree (upper image) and the visually tuned version from VTJ48 (lower image) on the letter dataset.

https://doi.org/10.1371/journal.pone.0033812.g002

The second example presents two decision trees built on the protein solubility dataset with all features (Figure 3). In this case the accuracy and AUC were both improved significantly when the size of the decision tree model was reduced. This is possible due to the binary class attribute, which still allows effective trees that are much smaller than the original pruned J48 trees.

Figure 3. Comparison of the original J48 decision tree and the visually tuned version from VTJ48 on the All Features dataset.

https://doi.org/10.1371/journal.pone.0033812.g003

In addition, to demonstrate the most significant rules from both decision trees in Figure 3, we extracted the top 5 rules according to their support in the training set (Table 9). The numbers at the end of the mono-, di- and tri-mer attribute names (e.g. the number 34 in MonomersClustEm17_34) distinguish attributes within different alphabets. It can be observed that J48 produces more complex rules with a higher number of conditions, which use attributes from a larger number of alphabets. On the other hand, the top 5 rules from the VTJ48 tree cover many more samples (70.6%) than the top 5 rules derived from J48 (36.6%). It is evident that low error rates on the training set do not guarantee good classification performance on the test set. We can once again conclude that in most cases, at least in the protein solubility domain, more complex trees result in overfitting to the training samples.

Table 9. Top 5 rules with the highest support in All Features extracted from J48 and VTJ48 decision trees.

https://doi.org/10.1371/journal.pone.0033812.t009
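
For completeness, support values such as those reported in Table 9 can, in principle, be computed by counting how many training instances satisfy all conditions of a rule. The following sketch, with rule conditions supplied as predicates over Weka instances, shows one way to do this; it is an illustration, not the code used in this study.

import java.util.List;
import java.util.function.Predicate;
import weka.core.Instance;
import weka.core.Instances;

// Sketch: rule support as the fraction of training instances that satisfy
// every condition of a rule (conditions supplied as predicates).
public class RuleSupport {
    static double support(Instances train, List<Predicate<Instance>> conditions) {
        int covered = 0;
        for (int i = 0; i < train.numInstances(); i++) {
            Instance inst = train.instance(i);
            if (conditions.stream().allMatch(c -> c.test(inst))) covered++;
        }
        return (double) covered / train.numInstances();
    }
}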

3.4 User Study

To test the effectiveness of the VTJ48 method in terms of usability, a Weka package implementing the visually constrained tree building algorithm was developed. An experiment was set up to compare the duration of building decision trees using the J48 and VTJ48 Weka packages in the Weka Explorer. Three different datasets from the UCI repository (balance-scale, credit-g, and splice) were chosen based on their complexity, for which tuning of the tree models is more likely to be necessary. Fourteen master's students, all enrolled in a Bioinformatics program, were recruited to take part in the experiment. After a brief introduction to the VTJ48 Weka package, the participants were given the datasets and asked to build a comprehensible decision tree from each dataset using both the J48 and VTJ48 methods. Additionally, the participants were instructed to optimize each decision tree to fit on a single computer screen to allow optimal comprehensibility. In the case of the J48 classifier, this meant tuning the binary splits, minimal number of objects, and pruning parameters. In the case of VTJ48, this simply meant setting the desired resolution parameters. The duration from the start of the tree building process to the point when the decision tree was displayed on a single screen was recorded for further analysis. Figure 4 clearly shows that the tree building times were shorter for the VTJ48 method on all datasets.

Figure 4. Comparison of tree building durations for the different datasets.

https://doi.org/10.1371/journal.pone.0033812.g004

In order to test the statistical significance of the obtained results, the Wilcoxon signed-rank test was chosen to compare the distributions of tree building times and accuracies for the different tree building methods. Each test was assessed at a 95% confidence level. The medians of the tree building times for the J48 and VTJ48 methods were significantly different for two datasets (balance-scale: p = 0.002, credit-g: p = 0.020). Tree building times were not significantly different for the splice dataset (p = 0.396); however, the mean tree building time was still 33.14 seconds shorter for the VTJ48 method.

Discussion

This study focused on evaluation of decision tree performance when useful and comprehensible decision trees are needed. The evaluation was done on 30 datasets from the following two areas in bioinformatics: protein solubility classification and gene expression analysis.

More precisely, strict boundaries for the width and height of the built decision trees were set to produce more comprehensible trees. It is important to note that the VTDT approach only helps the end-user in tuning the decision tree building parameters and does not propose a novel decision tree building algorithm. Although this paper presents automated visual tuning of C4.5 decision trees, it would be possible to adapt the VTDT principles to any other decision tree building algorithm that requires tuning of parameters to achieve optimal results. By tuning the parameters without interfering with the internal decision tree building process, and by constraining the tuning only by the dimensions of the decision tree, any bias toward a particular classification performance measure is avoided.

The results of our study confirmed that there is no statistically significant difference in predictive performance between the decision trees built using default values and the ones built using the proposed process of visual tuning. Moreover, when AUC is observed, the visually tuned models, which are usually also much simpler than the large default models, performed better on the majority of datasets. This is especially true for most of the protein and gene expression datasets, where the performance improvements were significant. However, it has to be noted that a larger sample of datasets would be needed to draw more reliable conclusions. Based on these results, one could conclude that simpler models usually produce at least comparable, if not better, results. This has also been shown in many other studies related to Occam's razor [39], [40]. However, there are also studies that demonstrate the contrary - i.e., that growing the trees will improve the classification performance [41]. In the end, it all depends on how a simple model is defined. In the case of VTDTs, we should probably state that if the model is simple enough (i.e., fits into our predefined visual boundaries), it will produce results as good as or better than most of the more complex models. Unfortunately, the proposed decision tree tuning suffers from higher time complexity in comparison to a classical decision tree that is built only once. However, as shown by the user study, it still saves a lot of time in comparison to manual tuning and fitting of the decision tree to the desired dimensions.

In this paper we evaluated the visual tuning strategy only on C4.5 decision trees. From the research and also from the practical usability point of view, it would be important to extend this study and consequently develop a Weka package that would allow simultaneous tuning of different decision tree models (e.g. CART [4]). Some of the areas where visual tuning could also be applied are comprehensible ensembles of classifiers or variations of decision tree models (e.g., alternating decision trees [42]). These models combine boosted decision stumps in a structure where visual constraints could be beneficial for the end-user in different areas of bioinformatics.

Acknowledgments

The authors would like to thank Geoffrey I. Webb for his helpful discussion on decision trees and binary classification problems.

Author Contributions

Conceived and designed the experiments: GS SK IP PK. Performed the experiments: GS. Analyzed the data: GS SK IP. Wrote the paper: GS SK IP PK.

References

  1. Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, et al. (2008) Top 10 algorithms in data mining. Knowledge and Information Systems 14(1): 1–37.
  2. Shafer JC, Agrawal R, Mehta M (1996) SPRINT: A scalable parallel classifier for data mining. In: Proc. 22nd Intl. Conf. on Very Large Databases (VLDB '96), Bombay, India. pp. 544–555.
  3. Quinlan JR (1993) C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
  4. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and Regression Trees. Belmont, CA, USA: Wadsworth International Group.
  5. Darnell SJ, Page D, Mitchell JC (2007) An automated decision-tree approach to predicting protein interaction hot spots. Proteins.
  6. Xiaojing Y, Xiaohui Y, Fan Y, Jing P, Buckles BP (2003) Gene expression classification: Decision trees vs. SVMs. In: FLAIRS Conference. pp. 92–97.
  7. Serrano-Aguilar P, Abreu R, Antón-Canalís L, Guerra-Artal C, Ramallo-Fariña Y, et al. (2011) Development and validation of a computer-aided diagnostic tool to screen for age-related macular degeneration by optical coherence tomography. Br J Ophthalmol.
  8. Marée R, Geurts P, Piater J, Wehenkel L (2004) A generic approach for image classification based on decision tree ensembles and local sub-windows. 2: 860–865.
  9. KnowledgeSeeker website. Available: http://www.angoss.com/business-intelligence-software/products/data-analysis-software. Accessed 2011 Oct 10.
  10. Smith SK (1998) Automatic construction of decision trees from data: A multidisciplinary survey. Data Mining and Knowledge Discovery 2: 345–389.
  11. Vapnik VN (1999) The Nature of Statistical Learning Theory. Information Science and Statistics. London, UK: Springer.
  12. Breiman L (2001) Random forests. Machine Learning 45: 5–32.
  13. Rodríguez JJ, Kuncheva LI, Alonso CJ (2006) Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(10): 1619–1630.
  14. Babic A, Krusinska E, Stromberg JE (1992) Extraction of diagnostic rules using recursive partitioning systems: A comparison of two approaches. Artificial Intelligence in Medicine 4: 373–387.
  15. Bohanec M, Bratko I (1994) Trading accuracy for simplicity in decision trees. Machine Learning 15: 223–250.
  16. Oates T, Jensen D (1997) The effects of training set size on decision tree complexity. pp. 254–262.
  17. John GH (1995) Robust decision trees: Removing outliers from databases. pp. 174–179.
  18. Brodley CE, Friedl MA (1999) Identifying mislabeled training data. J Artif Intell Res 11: 131–167.
  19. Cano JR, Herrera F, Lozano M (2007) Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability. Data & Knowledge Engineering 60(1): 90–108.
  20. Reingold EM, Tilford JS (1981) Tidier drawings of trees. IEEE Transactions on Software Engineering SE-7(2): 223–228.
  21. Buchheim C, Jünger M, Leipert S (2002) Improving Walker's algorithm to run in linear time. In: Proceedings of Symposium on Graph Drawing (GD). pp. 344–353.
  22. Johnson B, Shneiderman B (1991) Tree-maps: A space-filling approach to the visualization of hierarchical information structures. pp. 284–291.
  23. Shneiderman B (1992) Tree visualization with tree-maps: 2-d space-filling approach. ACM Transactions on Graphics (TOG) 11(1): 92–99.
  24. Andrews K, Heidegger H (1998) Information slices: Visualising and exploring large hierarchies using cascading, semi-circular discs. In: Proceedings of IEEE Symposium on Information Visualization (InfoVis), Late Breaking Hot Topics. pp. 9–12.
  25. Stasko J, Zhang E (2000) Focus+context display and navigation techniques for enhancing radial, space-filling hierarchy visualizations. In: Proceedings of IEEE Symposium on Information Visualization (InfoVis). pp. 57–65.
  26. McGuffin MJ, Robert JM (2010) Quantifying the space-efficiency of 2D graphical representations of trees. Information Visualization 9: 115–140.
  27. Huysmans J, Dejaeger K, Mues C, Vanthienen J, Baesens B (2010) An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems.
  28. Osei-Bryson KM (2004) Evaluation of decision trees: A multicriteria approach. Computers and Operations Research 31: 1933–1945.
  29. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: An update. SIGKDD Explorations 11(1).
  30. Asuncion A, Newman DJ (2007) UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science.
  31. Baneyx F (1999) Recombinant protein expression in Escherichia coli. Current Opinion in Biotechnology.
  32. Chow MK, Amin AA, Fulton KF, Whisstock JC, Buckle AM, et al. (2006) REFOLD: An analytical database of protein refolding methods. Protein Expression and Purification.
  33. Niwa T, Ying BW, Saito K, Jin WZ, Takada S, et al. (2009) Bimodal protein solubility distribution revealed by an aggregation analysis of the entire ensemble of Escherichia coli proteins. Proc Natl Acad Sci USA 106: 4201–4206.
  34. Smailowski P, Martin-Galiano AJ, Mikolajka A, Girchick T, Holak TA, et al. (2007) Protein solubility: Sequence based prediction and experimental verification. Bioinformatics.
  35. Wilkinson DL, Harrison RG (1991) Predicting the solubility of recombinant proteins in Escherichia coli. Biotechnology.
  36. Magnan CN, Randall A, Baldi P (2009) SOLpro: Accurate sequence-based prediction of protein solubility. Bioinformatics.
  37. Stiglic G, Kokol P (2010) Stability of ranked gene lists in large microarray analysis studies. Journal of Biomedicine and Biotechnology 616358.
  38. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7: 1–30.
  39. Mingers J (1989) An empirical comparison of pruning methods for decision tree induction. Machine Learning 4: 227–243.
  40. Murphy P, Pazzani M (1994) Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction. Journal of Artificial Intelligence Research 1: 257–275.
  41. Webb G (1996) Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research 4: 397–417.
  42. Freund Y, Mason L (1999) The alternating decision tree learning algorithm. pp. 124–133.