Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

A Novel Artificial Bee Colony Based Clustering Algorithm for Categorical Data

  • Jinchao Ji ,

    Contributed equally to this work with: Jinchao Ji, Wei Pang, Yanlin Zheng, Zhe Wang, Zhiqiang Ma

    Affiliations School of Computer Science and Information Technology, Northeast Normal University, Changchun, China, Key Lab of Intelligent Information Processing of Jilin Universities, Northeast Normal University, Changchun, China, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China

  • Wei Pang ,

    Contributed equally to this work with: Jinchao Ji, Wei Pang, Yanlin Zheng, Zhe Wang, Zhiqiang Ma

    pang.wei@abdn.ac.uk (WP); mazq0431@gmail.com (ZM)

    Affiliation School of Natural and Computing Sciences, University of Aberdeen, Aberdeen, United Kingdom

  • Yanlin Zheng ,

    Contributed equally to this work with: Jinchao Ji, Wei Pang, Yanlin Zheng, Zhe Wang, Zhiqiang Ma

    Affiliations School of Computer Science and Information Technology, Northeast Normal University, Changchun, China, Key Lab of Intelligent Information Processing of Jilin Universities, Northeast Normal University, Changchun, China

  • Zhe Wang ,

    Contributed equally to this work with: Jinchao Ji, Wei Pang, Yanlin Zheng, Zhe Wang, Zhiqiang Ma

    Affiliations Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China, College of Computer Science and Technology, Jilin University, Changchun, China

  • Zhiqiang Ma

    Contributed equally to this work with: Jinchao Ji, Wei Pang, Yanlin Zheng, Zhe Wang, Zhiqiang Ma

    pang.wei@abdn.ac.uk (WP); mazq0431@gmail.com (ZM)

    Affiliations School of Computer Science and Information Technology, Northeast Normal University, Changchun, China, Key Lab of Intelligent Information Processing of Jilin Universities, Northeast Normal University, Changchun, China

Abstract

Data with categorical attributes are ubiquitous in the real world. However, existing partitional clustering algorithms for categorical data are prone to fall into local optima. To address this issue, in this paper we propose a novel clustering algorithm, ABC-K-Modes (Artificial Bee Colony clustering based on K-Modes), based on the traditional k-modes clustering algorithm and the artificial bee colony approach. In our approach, we first introduce a one-step k-modes procedure, and then integrate this procedure with the artificial bee colony approach to deal with categorical data. In the search process performed by scout bees, we adopt the multi-source search inspired by the idea of batch processing to accelerate the convergence of ABC-K-Modes. The performance of ABC-K-Modes is evaluated by a series of experiments in comparison with that of the other popular algorithms for categorical data.

Introduction

As an important technique in data mining, clustering analysis has been used in many fields [1,2], such as information retrieval [3], social media analysis [4], privacy preserving [5], image analysis [6], text analysis [7], and bioinformatics [8]. The aim of clustering is to group those data objects with similar characteristics into the same clusters, and the ones with dissimilar characteristics into different clusters. Most existing clustering algorithms in the literature belong to one of the following two types: hierarchical and partitional. Hierarchical clustering algorithms allocate a group of data objects into a dendrogram of the nested partitions according to a divisive or agglomerative strategy [9]. While partitional clustering algorithms partition a set of data objects into a pre-defined number of clusters by optimizing an objective cost function.

Center-based clustering algorithms are the most popular partitional clustering algorithms. The k-means algorithm is a widely used center-based partitional clustering algorithm due to its simplicity and high efficiency [10]. Considering the uncertainty of data objects, the fuzzy k-means algorithm [11] is also developed. The k-means algorithm and the fuzzy k-means algorithm can only deal with numeric data. However, categorical data are frequently encountered in real world applications, and especially in the emerging social media analysis. For instance, clustering Twitter users based on their profiles described by categorical attributes. For clustering categorical data, Huang extended these two classical algorithms and introduced the well-known k-modes algorithm and fuzzy k-modes algorithm [1214]. However, one issue associated with (fuzzy) k-means and (fuzzy) k-modes algorithms is that they may fall into local optima. To address this issue, many heuristic clustering algorithms, which adopt the optimization procedures in the clustering process, have been proposed. By introducing genetic algorithms (GAs), the GA-based clustering approaches [15], including the genetic k-means algorithm [16], the fast genetic k-means algorithm [17], and the genetic k-modes algorithm [18] have been developed. Among these GA-based clustering algorithms, the genetic k-modes algorithm [18] is suitable for categorical data. In addition, the following heuristic clustering algorithms are used to cluster numeric data: Selim and Al-Sultan introduced a simulated annealing algorithm for the clustering problem [19]. Maulik and Mukhopadhyay introduced a novel fuzzy clustering approach by integrating the simulated annealing heuristic with artificial neural networks [20]. Sung and Jin presented a tabu search-based clustering approach by combining the packing and releasing procedures [21].

Over the last decade, a few approaches have been developed to model the intelligent foraging behavior of social animals, such as birds and ants, for optimization problems, and these approaches have been successfully applied to clustering. Shelokar, Jayaraman, and Kulkarni proposed an ant colony clustering algorithm which simulates the way real ants look for an optimal path from their nest to a food source [22]. Kao, Zahara, and Kao integrated the particle swarm optimization (PSO) approach, which mimics the way birds find the optimal food sources in search space, with the k-means procedure and Nelder–Mead simplex search method for improving the performance of clustering [23]. Unlike Kao's approach, Tunchan proposed a pure PSO approach for clustering [24]. Chuang, Hsiao, and Yang presented an accelerated chaotic map particle swarm optimization (ACPSO) for clustering by integrating the chaotic map particle swarm optimization (CPSO) with an accelerated convergence rate strategy [25]. Wan et al. introduced a clustering algorithm on the basis of the optimization property of bacterial foraging behavior [26].

In recent years, investigating the foraging behavior of honeybees, including the learning, memorising, and information sharing mechanism, has emerged as an interesting research direction in swarm intelligence [27]. Inspired by the foraging behavior of bee swarms in the real world, Lucic and Teodorović introduced the bee colony optimization heuristic [28], which has been used for solving various engineering and management problems. Karaboga and Basturk presented an artificial bee colony (ABC) algorithm [29] to deal with numerical optimization problems. By using the ABC optimization strategy, Karaboga and Ozturk proposed an artificial bee colony clustering approach [30]. Almost at the same time, Zhang, Ouyang and Ning also introduced an artificial bee colony clustering approach, in which Deb’s rules were used to direct the search direction of each candidate food source [27]. However, most of these heuristic approaches are designed for numeric data, and therefore they are not suitable to deal with categorical data. Considering the ubiquity of categorical data in real-world applications, it is necessary to develop an ABC-based clustering algorithm for categorical data.

In this paper, we propose a novel artificial bee colony clustering approach for categorical data. In our approach, we first introduce the one-step k-modes procedure, and then integrate this procedure with the artificial bee colony heuristic to cluster categorical data. The time and space complexity of the proposed approach is analysed, and a comparison with the other popular approaches demonstrates the effectiveness of our approach.

The remainder of this paper is organised as follows: we first review some related work. This is followed by the presentation of our proposed method. Then, we report the experimental results, which demonstrate the advantages of the proposed method. Finally, we draw conclusions and explore future work.

Related Work

In this section, we first review the k-modes algorithm, and then describe the idea of artificial bee colony optimization.

The k-modes algorithm

The k-modes algorithm was first introduced by Huang in [31] for clustering categorical data. Let X = {x1, x2,…,xn} denote a dataset consisting of n data objects and xi (1 ≤ in) be a data object characterised by m categorical attributes A1, A2,…, Am. Each categorical attribute Aj has a domain of values denoted by , where t is the number of categorical values for the attribute Aj. A data object xi is generally represented in the form of a vector [xi1, xi2,…, xim].

The aim of the k-modes algorithm is to divide a dataset X into k clusters by minimizing the following cost function:

(1)

Here Ql is the set of the most frequent value for each attribute in a cluster l, and it is called the mode of the cluster l; uil (0 ≤ uil ≤ 1) is an element of the partition matrix Un×k; k is the number of clusters, and dis(xi, Ql) is the distance measure as given below:

(2)

In Eq (2), α(xij, qlj) is defined as: (3) where qlj is the most frequent value of the jth categorical attribute in the cluster l. The process of the k-modes algorithm is depicted as follows:

  1. Step 1. Randomly pick up k data objects from the dataset X as the initial modes of clusters.
  2. Step 2. For each data object in X, assign it to the cluster the mode of which is the nearest one to this data object compared to the modes of other clusters in terms of Eq (2). After all data objects have been assigned to clusters, update the modes of all clusters.
  3. Step 3. Re-evaluate the dissimilarity between the data objects and the current modes after all data objects have been assigned to clusters. If it is found that a data object’s nearest mode belongs to another cluster rather than the current one, reassign this data object to that cluster and update the modes of both clusters.
  4. Step 4. After a full circle test of X, if no data objects have changed clusters, terminate the algorithm; otherwise go to Step 3.

The artificial bee colony algorithm

The artificial bee colony (ABC) algorithm proposed by Karaboga and Basturk [29] is well-known for its simplicity and robustness for optimising numeric problems. In the ABC algorithm, the artificial bee swarm consists of three types of bees: employed bees, onlookers, and scouts. The employed bee takes a particular food source to exploit and shares the information about the food source with onlookers in the nest; a scout looks for a new food source in the search space, and an onlooker waits in the nest and finds a food source through the information shared by employed bees. The artificial bee colony has two parts: the first half are the employed bees and the second half are the onlookers. In the model of forage selection, three essential components (food sources, employed foragers, and unemployed foragers) and two modes of the behavior (recruitment to a food source and abandonment of a food source) are given. The value of food source is associated with many factors such as its proximity to the nest, nectar amount and the ease of gathering this nectar. The unemployed forgers contain two types of bees: scouts and onlookers. There is only one employed bee on a food source. Thus, the number of employed bees is equal to the number of food sources. Onlookers move onto a food source according to a probability-based selection strategy. When the nectar of a food source is exhausted, the corresponding employed bee becomes a scout. In ABC algorithm, the exploitation and exploration processes are performed together. Specifically, the employed bees and onlookers implement the exploitation process, and the scouts execute the exploration process. The bee colony explores and exploits the food sources in a way to maximize the nectar being stored in the nest. For an optimisation problem, a food source means a possible solution, the nectar amount of a food source measures the quality of the corresponding solution, and the goal is to obtain the optimal value of the objective function. The procedure of ABC algorithm is given as follows:

  1. Step 1. Initialize the population of food sources.
  2. Step 2. Send the employed bees onto the food sources and evaluate the corresponding nectar amounts.
  3. Step 3. Evaluate the probabilities of all food sources to be chosen by the onlooker bees, and the probability value of each food source is determined by its nectar amount (i.e., the quality of the corresponding solution): the bigger the nectar amount of the food source, the higher the probability value is;
  4. Step 4. Send the onlookers onto the food sources: each onlooker will chose its food source based on the probabilities calculated from Step 3, exploit its food source, evaluate the nectar amount of the obtained food source, and apply greedy selection process;
  5. Step 5. Terminate the exploitation process of an employed bee if its food source becomes exhausted, and this employed bee becomes a scout bee;
  6. Step 6. Send the scouts into the search space for finding new food sources randomly;
  7. Step 7. Memorise the best food source found so far;
  8. Step 8. If the requirements are met, output the best food source; otherwise go to Step 2.

Our Proposed ABC Clustering Algorithm

In this section, we first describe our proposed ABC clustering approach, and then discuss the complexity and convergence of this approach.

The proposed approach

In this subsection, we propose a novel clustering algorithm on the basis of artificial bee colony and the k-modes approach. As mentioned above, there are three types of artificial honeybees: employed bees, onlookers, and scouts. A food source corresponds to a possible solution of the problem to be optimised, and the nectar amount of a food source characterises the quality of the corresponding solution. In the clustering, the clustering results depend on the cluster centers. When the cluster centers are fixed, the clustering results are determined. Therefore, the clustering issue can be seen as the optimisation of the cluster centers, and a set of cluster centers correspond a possible solution. For categorical data clustering, let fi = {Q1, Q2,…, Qk} denote a food source, where Ql is the mode of cluster l. E(fi) = E(U, fi) is the objective cost function, and , where the symbols have the same meaning as in Eq (1). Then, the nectar amount of a food source fi is given by:

(4)

Similar to the ABC approach, the colony of artificial bees in our algorithm has two parts: the first half of the artificial bees are the employed bees, and the second half of the artificial bees are the onlookers. There exists only one employed bee for a food source, and the number of the employed bees is equal to the number of solutions in the population. Let Pfs = {f1, f2,…, fH} denote the population of food sources, where H is the number of the food sources, and fi is the ith food source. Then the probability of the ith food source being picked up by an onlooker is given by:

(5)

For deriving a candidate food source from the current one in memory, we introduce the one-step k-modes procedure, called OKM, in our algorithm. The OKM procedure is essentially one iteration step in the search process of the k-modes algorithm, and it is used to search the neighbor food source based on the current food source in the exploitation process performed by employed bees and onlookers. Let fi be the current food source, then the OKM consists of the following two steps:

  1. Allocate each data object to the cluster with the nearest mode, and then form a partition matrix U; specifically, if the ith data object belongs to the lth cluster uil = 1; otherwise uil = 0, where uil is one element of U;
  2. Calculate the new modes on the basis of the partition matrix U, and thus form a candidate food source .

For the colony of bees, an employed bee becomes a scout when its food source is exhausted. In our algorithm we adopt the parameter L, which is a predetermined number of trials to control the abandonment of a food resource. If a food source cannot be improved further through L trials, this food source is assumed to be abandoned, and the corresponding employed bee becomes a scout. Let the abandoned food source be fi, and then the search operation of a scout finding a new food source is given by: (6) where i ∈ {1, 2,…, H}, and Rand(Dom(X)) is the operation of randomly selecting k data objects from the data set X. In our algorithm, the multi-source search, which is inspired by the idea of batch processing [32], is adopted to accelerate the convergence of the proposed algorithm. The idea of the multi-source search is described as follows: a scout bee searches T candidate food sources at a time, and then picks up the best one as the new food source.

Having introduced the detailed calculation formula for relevant variables, the proposed ABC-K-Modes clustering algorithm for categorical data is given as follows:

Input: The size of bee colony N, the maximum cycle number MCN, the number of clusters k, and L.

Output: The best food source.

  1. Initialise the population of food sources Pfs = {f1, f2,…, fH} randomly; specifically, for each food source, select k data objects randomly from the dataset X as the modes of clusters; set the exploitation numbers of food sources En1 = 0, En2 = 0,…, EnH = 0.
  2. Evaluate the nectar amounts of the food sources NA(f1), NA(f2), … NA(fH), according to Eq (4);
  3. set CN(the cycles number) to 1;
  4. For each employed bee
    1. Generate a new food source fi' from the current food source fi by using the one-step k-modes procedure OKM, and set Eni = Eni+1;
    2. Evaluate the nectar amount NA(fi') for the food source fi' according to Eq (4);
    3. If NA(fi') > NA(fi), the current food source fi is replaced by the new food source; otherwise the current food source fi is retained.
  5. Evaluate the probability proi for each food source fi according to Eq (5);
  6. For each onlooker bee
    1. Pick up one food source fi as the current food source according to the calculated probabilities;
    2. Generate a new food source fi' from the current food source fi by using OKM, and set Eni = Eni+1;
    3. Evaluate the nectar amount of fi', that is, NA(fi');
    4. If NA(fi') > NA(fi), the current food source fi is replaced by the new food source fi'; otherwise the current food source fi is retained;
    5. Update the probability proi for each food source fi according to Eq (5).
  7. For each food source fi, if the exploitation number Eni is no less than L, this food source is abandoned, and the corresponding employed bee becomes a scout.
  8. If there exists an abandoned food source fi,
    1. Send the scout in the search space to find T candidate food sources according to Eq (6);
    2. Evaluate the nectar amounts of the food sources ;
    3. Choose the food source with the highest nectar amount as the new food source fi', and set Eni = 0;
    4. If NA(fi') > NA(fi) the current food source fi is replaced by the new food source fi'; otherwise the current food source fi is retained.
  9. CN = CN+1;
  10. If CN = MCN, terminate the algorithm and output the best food source; otherwise go to step 4).

Complexity analysis

In this subsection, we discuss the complexity of the proposed ABC-K-Modes approach. The time complexity of the proposed method mainly consists of five parts: the initialisation, the search operation of employed bees, the calculation of the probability of food sources, and the search operation of scouts and onlookers. The computational cost of these five parts are O(Hknm), O(H(nkm + nkC)), O(H), O(Tnkm), and O(H(H + (nkm + nkC) + nkm)), respectively. Here n is the number of data objects in the dataset X; m is the number of attributes; k is the number of clusters; H is the number of employed bees or food sources; is the total number of categories for all attributes. Therefore, the overall time complexity of the proposed approach is O(Hkmn + s(Tnkm + H(H + nkm + nkC))). Here, s is the number of iterations. For space complexity, it requires O(mn) to store the dataset X, O((H + T)km) to store the food sources, and O(nk) to store the partition matrix. Thus, the overall space complexity of our method is O(mn + (H + T)km + nk). The time complexity and the space complexity of the k-modes algorithm are O(nkm + s(nkC + nkm)) and O(m(n + k) + km), respectively. For genetic k-modes algorithm, the time complexity and space complexity are O(Nn + S(N2nkm + N(n2kC + nkm) + nkC + nkm)) and O(mn + Hn + Hkm), respectively. Here N is the size of population, and S is the maximal number of generations. Generally, when H, m, k << n, the complexity of our algorithm is higher than k-modes algorithm, and lower than genetic k-modes algorithm.

Convergence analysis

In this subsection, we discuss the convergence of the proposed approach. In our approach, the exploration and exploitation are both executed by ABC. For a categorical dataset, the number of different values for an attribute is finite, and the number of attributes is finite as well. It is noted that a candidate solution is a set of cluster centers, and a cluster center is a set of attributes values. Therefore, the number of candidate solutions is finite. Specifically, the number of candidate solutions is , where , and k is the number of clusters. Here, |Ai| is the number of different categorical values for the attribute Ai. In the process of exploration or exploitation, the current solution will be replaced by a new solution if the new one is better. Thus each possible solution appears at most once in the current solution list. If the value of MCN (maximum number of iterations) is large enough, the global optimal solution will be very likely to be found; otherwise, the algorithm will be converged to a local optimum. In other words, the larger the value of MCN, the greater the possibility that ABC-K-Modes will converge is. When MCN tends to be infinite, the possibility of convergence for our proposed approach approaches to 100%. Therefore the convergence of our algorithm to a global/local optimal solution is guaranteed as long as MCN is big enough. However, due to different characteristics of the search spaces to be explored, for each dataset a different value of MCN may be required for the algorithm to converge.

Experimental Results and Discussion

In this section, for evaluating the performance of our proposed clustering algorithm ABC-K-Modes, we run the proposed approach on six real-world categorical datasets: Zoo, Breast cancer, Soybean, Lung cancer, Mushroom, and Dermatology, all of which can be downloaded from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html). In this research, we adopt Yang’s accuracy measure [33] and the Rand Index [34] to assess the obtained clustering results. In Yang's method, the definitions of accuracy (AC), precision (PR), and recall (RE) are given as follows: (7) (8) (9) where ai is the number of data objects that are correctly allocated to class Ci, bi is the number of data objects that are incorrectly allocated to class Ci, ci is the number of data objects that are incorrectly denied from class Ci, k is the total number of classes contained in a dataset, and n is the total number of data objects in a dataset. In the above measures, the AC has the same meaning as the clustering accuracy r defined in [12]. Given a dataset X = {x1, x2,…, xn} as well as two partitions of this dataset: and , the Rand Index (RI) [34] is given by (10) where

(11)

The RI is calculated by using the true clustering and the clustering obtained from a clustering algorithm. According to these measures, the higher values of AC, PR, RE, and RI indicate a better clustering result. In the performance analysis, we run our proposed ABC-K-Modes algorithm, the k-modes algorithm, the fuzzy k-modes algorithm, and the genetic k-modes approach on six different datasets, and for each dataset we run twenty trials. We then compare the clustering result of the proposed ABC-K-Modes algorithm with that of the other three well-known algorithms in terms of the best (Best), average (Avg.), and standard deviation (Std.) of AC, PR, RE, and RI, respectively. All algorithms are implemented in Java language and executed on an Intel(R) Core(TM) i7, 3.4GHz, 8GB RAM computer. In all experiments, the parameters of the proposed ABC-K-Modes algorithm are set as follows: N = 20, MCN = 1000, which are the typical values used in the original ABC algorithm [30]; L = 5 and T = 5 are set by the rule of thumb. The cluster number k in all four algorithms is set according to the number of classes provided by the class information of the dataset. We remark that other class information is not used in the clustering process apart from the number of classes. The other parameters of the k-modes algorithm, fuzzy k-modes algorithm, and the genetic k-modes are set the same as those stated in their original papers.

The Zoo dataset consists of 101 data objects, each of which has 17 Boolean-valued attributes. According to the class attributes, all data objects belong to one of the seven classes. Tables 14 list the comparison of clustering results of ABC-K-Modes, the k-modes, fuzzy k-modes, and the genetic k-modes on the Zoo dataset according to AC, PR, RE, and RI, respectively.

thumbnail
Table 1. The AC of the four algorithms on the Zoo dataset.

https://doi.org/10.1371/journal.pone.0127125.t001

thumbnail
Table 2. The PR of the four algorithms on the Zoo dataset.

https://doi.org/10.1371/journal.pone.0127125.t002

thumbnail
Table 3. The RE of the four algorithms on the Zoo dataset.

https://doi.org/10.1371/journal.pone.0127125.t003

thumbnail
Table 4. The RI of the four algorithms on the Zoo dataset.

https://doi.org/10.1371/journal.pone.0127125.t004

The Breast Cancer dataset contains 699 data objects, each of which is described by 10 categorical attributes. According to the class attribute, the data objects belong to one of the two classes: Benign and Malignant. Tables 58 summarise the comparison of the clustering results of ABC-K-Modes and the other three well-known algorithms on the Breast Cancer dataset according to AC, PR, RE, and RI, respectively.

thumbnail
Table 5. The AC of the four algorithms on the Breast Cancer dataset.

https://doi.org/10.1371/journal.pone.0127125.t005

thumbnail
Table 6. The PR of the four algorithms on the Breast Cancer dataset.

https://doi.org/10.1371/journal.pone.0127125.t006

thumbnail
Table 7. The RE of the four algorithms on the Breast Cancer dataset.

https://doi.org/10.1371/journal.pone.0127125.t007

thumbnail
Table 8. The RI of the four algorithms on the Breast Cancer dataset.

https://doi.org/10.1371/journal.pone.0127125.t008

The Soybean dataset is composed of 47 data objects, each of which has 36 categorical attributes. In terms of the class attribute, the data objects belong to one of the four diseases: Diaporthe Stem Canker, Charcoal Rot, Rhizoctonia Root Rot, and Phytophthora Rot. Tables 912 list the comparison of the clustering results of ABC-K-Modes and the other three well-known algorithms on the Soybean dataset according to AC, PR, RE, and RI, respectively.

thumbnail
Table 9. The AC of the four algorithms on the Soybean dataset.

https://doi.org/10.1371/journal.pone.0127125.t009

thumbnail
Table 10. The PR of the four algorithms on the Soybean dataset.

https://doi.org/10.1371/journal.pone.0127125.t010

thumbnail
Table 11. The RE of the four algorithms on the Soybean dataset.

https://doi.org/10.1371/journal.pone.0127125.t011

thumbnail
Table 12. The RI of the four algorithms on the Soybean dataset.

https://doi.org/10.1371/journal.pone.0127125.t012

The Lung Cancer dataset has 32 data objects, each of which is described by 57 categorical attributes. According to the class attribute, the dataset has three classes. Tables 1316 summarise the comparison of the clustering results of the ABC-K-Modes algorithm and the other three well-known algorithms on the Lung Cancer dataset according to AC, PR, RE, and RI, respectively.

thumbnail
Table 13. The AC of the four algorithms on the Lung Cancer dataset.

https://doi.org/10.1371/journal.pone.0127125.t013

thumbnail
Table 14. The PR of the four algorithms on the Lung Cancer dataset.

https://doi.org/10.1371/journal.pone.0127125.t014

thumbnail
Table 15. The RE of the four algorithms on the Lung Cancer dataset.

https://doi.org/10.1371/journal.pone.0127125.t015

thumbnail
Table 16. The RI of the four algorithms on the Lung Cancer dataset.

https://doi.org/10.1371/journal.pone.0127125.t016

Mushroom dataset contains 8,124 data objects, each of which has 23 categorical attributes. According to the class attribute, each data object falls into one of the two classes: edible and poisonous. Tables 1720 list the comparison of the clustering results of ABC-K-Modes and the other three well-known algorithms on the Mushroom dataset according to AC, PR, RE, and RI, respectively.

thumbnail
Table 17. The AC of the four algorithms on the Mushroom dataset.

https://doi.org/10.1371/journal.pone.0127125.t017

thumbnail
Table 18. The PR of the four algorithms on the Mushroom dataset.

https://doi.org/10.1371/journal.pone.0127125.t018

thumbnail
Table 19. The RE of the four algorithms on the Mushroom dataset.

https://doi.org/10.1371/journal.pone.0127125.t019

thumbnail
Table 20. The RI of the four algorithms on the Mushroom dataset.

https://doi.org/10.1371/journal.pone.0127125.t020

Dermatology dataset has 366 data objects, each of which is described 34 categorical attributes. In terms of the class attribute, each data object belongs to one of the six classes: psoriasis, seborrheic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis, and pityriasis rubra pilaris. Tables 2124 summarize the comparison of the clustering results of ABC-K-Modes and the other three well-known algorithms on the Dermatology dataset according to AC, PR, RE, and RI, respectively.

thumbnail
Table 21. The AC of the four algorithms on the Dermatology dataset.

https://doi.org/10.1371/journal.pone.0127125.t021

thumbnail
Table 22. The PR of the four algorithms on the Dermatology dataset.

https://doi.org/10.1371/journal.pone.0127125.t022

thumbnail
Table 23. The RE of the four algorithms on the Dermatology dataset.

https://doi.org/10.1371/journal.pone.0127125.t023

thumbnail
Table 24. The RI of the four algorithms on the Dermatology dataset.

https://doi.org/10.1371/journal.pone.0127125.t024

From the experimental results shown in Tables 124, we can see that our proposed ABC-K-Modes achieves higher Best, Avg., and lower Std. values in AC, PR, RE, and RI in most cases, and therefore ABC-K-Modes in general outperforms the other three algorithms in terms of AC, PR, RE, and RI, respectively. The reason for the success of ABC-K-Modes is due to its effective combination of global search (exploration) and local search (exploitation). This is achieved by the adoption of the OKM operator and the ABC optimisation framework. Therefore, the proposed ABC-K-Modes can obtain optimal or near optimal results. In Table 25, we list the average running time over twenty trials for the proposed ABC-K-Modes and the other three popular algorithms on the six different datasets. The results in Table 25 show that the size/dimension has a direct effect on the running time of these four algorithms. Specifically, the larger the size/dimension of data set is, the more time it is for these algorithms to find the satisfactory solution. This is consistent with the analysis of time complexity in the complexity analysis section. Compared to the k-modes algorithm, the ABC-K-Modes takes more time to execute due to the introduction of the ABC optimization strategy. However, we also notice that the running time difference between our ABC-K-Modes approach and traditional K-Modes approach decreases with the increase of the size and dimensions of the dataset, and this seems promising. For instance, for the mushroom dataset, which contains the largest number of records, the performance of ABC-K-Modes is closest to that of K-Modes compared to the situation on the other datasets. Finally, we will further explore the acceleration issue of the ABC-K-Modes in our future work.

thumbnail
Table 25. The average running time of the four algorithms on different datasets.

https://doi.org/10.1371/journal.pone.0127125.t025

Conclusions and Future Work

In real-world applications, data objects characterised by categorical attributes are frequently encountered. The k-modes type algorithms are well known for their high efficiency to clustering categorical data. However, it is acknowledged that this type of algorithms is prone to fall into local optima.

To address this issue, in this research we proposed a novel clustering algorithm ABC-K-Modes on the basis of the traditional k-modes algorithm and ABC optimisation procedure. In our algorithm, the search process of employee bees and onlookers is implemented by introducing a specific procedure named OKM, and the search process of scouts are performed by random exploration. To accelerate the convergence of the ABC-K-Modes, we adopt the idea of multi-source search for the search of scout bees. Moreover, we analysed the time and space complexity of the proposed algorithm ABC-K-Modes, and tested ABC-K-Modes on six real-world categorical datasets derived from the UCI Machine Learning Repository. The experimental results demonstrated that our proposed algorithm was superior to the other three well-known algorithms according to the evaluation measures AC, PR, RE, and RI, respectively.

In the near future, we will explore the acceleration issue of the ABC-K-Modes, and extend this approach to cluster mixed data containing both numeric and categorical attributes. We will investigate the potential of ABC-K-Modes when applied to social media data. Furthermore, we would also like to explore other swarm intelligent algorithms for clustering categorical data as well as mixed data.

Acknowledgments

The authors are very grateful to the editor and anonymous reviewers for their valuable comments and suggestions.

Author Contributions

Conceived and designed the experiments: JJ WP ZM. Performed the experiments: JJ YZ. Analyzed the data: ZW ZM. Contributed reagents/materials/analysis tools: YZ. Wrote the paper: JJ WP ZM.

References

  1. 1. Celebi ME, Kingravi HA, Vela PA. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications. 2013; 40: 200–210.
  2. 2. Jain AK, Murty MN, Flynn PJ. Data clustering: A review. ACM Computing Surveys. 1999; 31: 264–323. pmid:10614516
  3. 3. Bordogna G, Pasi G. A quality driven hierarchical data divisive soft clustering for information retrieval. Knowledge-Based Systems. 2012; 26: 9–19.
  4. 4. Luo C, Pang W, Wang Z. Semi-supervised clustering on heterogeneous information networks. In Proc of the 18th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'14). Taiwan. 2014; pp. 548–559.
  5. 5. Islam MZ, Brankovic L. Privacy preserving data mining: A noise addition framework using a novel clustering technique. Knowledge-Based Systems. 2011; 24: 1214–1223.
  6. 6. Bogner C, Widemann BTY, Lange H. Characterising flow patterns in soils by feature extraction and multiple consensus clustering. Ecological Informatics. 2013; 15: 44–52.
  7. 7. Zhang W, Yoshida T, Tang XJ, Wang Q. Text clustering using frequent itemsets. Knowledge-Based Systems. 2010; 23: 379–388.
  8. 8. Saeed F, Salim N, Abdo A. Information theory and voting based consensus clustering for combining multiple clusterings of chemical structures. Molecular Informatics. 2013; 32: 591–598.
  9. 9. Han J, Kamber M, Pei J. Data mining concepts and techniques. 3rd ed. Waltham: Morgan Kaufmann; 2012.
  10. 10. Jain AK, Dubes RC. Algorithms for clustering data. New Jersey: Prentice Hall; 1988.
  11. 11. Bezdek JC, Ehrlich R, Full W. FCM: the fuzzy c-means clustering algorithm. Computers & Geosciences. 1984; 10: 191–203.
  12. 12. Huang Z. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery. 1998; 2: 283–304.
  13. 13. Huang Z. A fast clustering algorithm to cluster very large categorical data dets in data mining. In Research Issues on Data Mining and Knowledge Discovery. 1997; pp. 1–8.
  14. 14. Huang ZX, Ng MK. A fuzzy k-modes algorithm for clustering categorical data. IEEE Transactions on Fuzzy Systems. 1999; 7: 446–452.
  15. 15. Maulik U, Bandyopadhyay S. Genetic algorithm-based clustering technique. Pattern Recognition. 2000; 33: 1455–1465.
  16. 16. Krishna K, Murty MN. Genetic k-means algorithm. IEEE Transactions on Systems Man and Cybernetics Part B-Cybernetics. 1999; 29: 433–439.
  17. 17. Lu Y, Lu S, Fotouhi F, Deng Y, Brown SJ. FGKA: a fast genetic k-means clustering algorithm. In Proceedings of the 2004 ACM symposium on Applied computing. Nicosia, Cyprus: ACM. 2004; pp. 622–623.
  18. 18. Gan G, Yang Z, Wu J. A genetic k-modes algorithm for clustering categorical data. Advanced Data Mining and Applications. In: Li X, Wang S, Dong Z, editors. Berlin: Springer-Verlag; 2005. pp. 195–202.
  19. 19. Selim SZ, Alsultan K. A simulated annealing algorithm for the clustering problem. Pattern Recognition. 1991; 24: 1003–1008.
  20. 20. Maulik U, Mukhopadhyay A. Simulated annealing based automatic fuzzy clustering combined with ANN classification for analyzing microarray data. Computers & Operations Research. 2010; 37: 1369–1380.
  21. 21. Sung CS, Jin HW. A tabu-search-based heuristic for clustering. Pattern Recognition. 2000; 33: 849–858.
  22. 22. Shelokar PS, Jayaraman VK, Kulkarni BD. An ant colony approach for clustering. Analytica Chimica Acta. 2004; 509: 187–195.
  23. 23. Kao Y-T, Zahara E, Kao IW. A hybridized approach to data clustering. Expert Systems with Applications. 2008; 34: 1754–1762.
  24. 24. Cura T. A particle swarm optimization approach to clustering. Expert Systems with Applications. 2012; 39: 1582–1588.
  25. 25. Chuang L-Y, Hsiao C-J, Yang C-H. Chaotic particle swarm optimization for data clustering. Expert Systems with Applications. 2011; 38: 14555–14563.
  26. 26. Wan M, Li L, Xiao J, Wang C, Yang Y. Data clustering using bacterial foraging optimization. Journal of Intelligent Information Systems. 2012; 38: 321–341.
  27. 27. Zhang C, Ouyang D, Ning J. An artificial bee colony approach for clustering. Expert Systems with Applications. 2010; 37: 4761–4767.
  28. 28. Teodorović D. Bee Colony Optimization (BCO). In: Lim C, Jain L, Dehuri S, editors. Innovations in Swarm Intelligence. Berlin: Springer-Verlag; 2009. pp. 39–60.
  29. 29. Karaboga D, Basturk B. On the performance of artificial bee colony (ABC) algorithm. Applied Soft Computing. 2008; 8: 687–697.
  30. 30. Karaboga D, Ozturk C. A novel clustering approach: artificial bee colony (ABC) algorithm. Applied Soft Computing. 2011; 11: 652–657.
  31. 31. Huang Z. Clustering large data sets with mixed numeric and categorical values. In the first Pacific-Asia Conference on Knowledge Discovery and Data Mining. 1997; pp. 21–34.
  32. 32. Ikura Y, Gimple M. Efficient scheduling algorithms for a single batch processing machine. Operations Research Letters. 1986; 5: 61–65.
  33. 33. Yang Y. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval. 1999; 1: 67–88.
  34. 34. Rand WM. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971; 66: 846–850.