Oleg Okun and Giorgio Valentini (Eds.) Supervised and Unsupervised Ensemble Methods and their Applications
Studies in Computational Intelligence, Volume 126
Editor-in-chief: Prof. Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447 Warsaw, Poland
E-mail: [email protected]
Oleg Okun · Giorgio Valentini (Eds.)
Supervised and Unsupervised Ensemble Methods and their Applications
With 50 Figures and 46 Tables
Dr. Oleg Okun
Giorgio Valentini
Machine Vision Group Infotech Oulu & Department of Electrical and Information Engineering University of Oulu P.O. Box 4500 FI-90014 Oulu Finland
[email protected].fi
Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
[email protected]
ISBN 978-3-540-78980-2
e-ISBN 978-3-540-78981-9
Studies in Computational Intelligence ISSN 1860-949X
Library of Congress Control Number: 2008924367
© 2008 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Cover design: Deblik, Berlin, Germany
Printed on acid-free paper
9 8 7 6 5 4 3 2 1
springer.com
To my parents, Raisa and Gregory, and to my wife, Helen – Oleg Okun
To my dear little mouse, whom I love so much – Giorgio Valentini
Preface
The rapidly growing amount of data available from different technologies in the fields of bio-sciences, high-energy physics, economics, climate analysis, and several other scientific disciplines requires a new generation of machine learning and statistical methods to deal with their complexity and heterogeneity. As data collection becomes easier, data analysis must become more sophisticated in order to extract useful information from the available data.

Although data can be represented in several ways according to their structural characteristics, ranging from strings, lists, and trees to graphs and other more complex data structures, in most applications they are represented as a matrix whose rows correspond to measurable characteristics called features, attributes, or variables, depending on the discipline, and whose columns correspond to examples (cases, samples, patterns). To avoid confusion, we will talk about features and examples. In real-world tasks, there can be many more features than examples (cancer classification based on gene expression levels in bioinformatics) or many more examples than features (intrusion detection in computer/network security). In addition, each example can be either labeled or not. Attaching labels makes it possible to distinguish members of one class or group from members of other classes or groups. Hence, one can talk about supervised and unsupervised tasks that can be solved by machine learning methods.

Since it is widely accepted that no single classifier or clustering algorithm can be superior to all others, ensembles of supervised and unsupervised methods are gaining popularity. A typical ensemble includes a number of classifiers/clusterers whose predictions are combined according to a certain rule, e.g. majority vote. Statistical, algorithmic, representational, computational, and practical reasons can explain the success of ensemble methods. In particular, several empirical results have demonstrated that ensembles often provide a better solution to a problem than any single method.
This book was inspired by the last argument, and it resulted from the workshop on Supervised and Unsupervised Ensemble Methods and their Applications (briefly, SUEMA) organized on June 4, 2007 in Girona, Spain. The workshop was held in conjunction with the 3rd Iberian Conference on Pattern Recognition and Image Analysis and was intended to encompass the progress made in ensemble applications by Iberian and international scholars. Despite its small format, SUEMA attracted researchers from Spain, Portugal, France, the USA, Italy, and Finland, who presented interesting ideas about using ensembles in various practical cases. Encouraged by this enthusiastic response, we decided to publish the workshop papers in an edited book, since CD proceedings were the only media distributed among the workshop participants at that time.

The book includes nine chapters divided into two parts, assembling contributions on the applications of supervised and unsupervised ensembles. Chapter 1 serves a tutorial purpose as an introduction to unsupervised ensemble methods. Chapter 2 concerns ensemble clustering of categorical data, where symbolic names are assigned to examples as labels. Chapter 3 describes fuzzy ensemble clustering applied to gene expression data for cancer classification and the discovery of subclasses of pathologies at the bio-molecular level. Chapter 4 introduces collaborative multi-strategical clustering, where individual algorithms attempt to find a consensus with other ensemble members while grouping the data into clusters on remotely sensed images of urban and coastal areas. Chapter 5 presents the application of ensembles combining one-class classifiers to computer network intrusion detection. Chapter 6 deals with ensembles of nearest neighbor classifiers for gene expression based cancer classification. Chapter 7 applies the two-level ensemble scheme called stacking to multivariate time series classification for industrial process diagnosis and speaker recognition. Chapter 8 concentrates on the analysis of heteroskedastic financial time series by means of boosting-like ensembles utilizing neural networks as the base classifiers. Chapter 9 explores three two-level ensemble schemes – stacking, grading, and cascading – when working with nominal data.

The book is intended primarily as a reference work. It can serve as a good complement to two excellent books on ensemble methodology – "Combining Pattern Classifiers: Methods and Algorithms" by Ludmila Kuncheva (John Wiley & Sons, 2004) and "Decomposition Methodology for Knowledge Discovery and Data Mining: Theory and Applications" by Oded Maimon and Lior Rokach (World Scientific, 2005). Additional primary sources of information are the proceedings of the biannual International Workshop on Multiple Classifier Systems (MCS), published by Springer-Verlag, and the proceedings of the International Conference on Information Fusion (FUSION), organized by the International Society of Information Fusion (http://www.isif.org/). Among other conferences of interest are the International Conference on Machine Learning (ICML), the European Conference on Machine Learning (ECML), and the International Conference on Machine Learning and Data Mining (MLDM) (the proceedings of the latter two are published by Springer-Verlag). Two international journals largely devoted to
the topic of our book are Information Fusion, published by Elsevier, and the Journal of Advances in Information Fusion, published by the International Society of Information Fusion; in addition, most machine learning journals, such as Machine Learning, the Journal of Machine Learning Research, and the IEEE Transactions on Pattern Analysis and Machine Intelligence, devote considerable space to papers on ensemble methods. These recommended sources do not, of course, constitute a complete list, since ensemble methods are gaining increasing popularity and are therefore among the topics of many scientific meetings and journal issues. Our book would grow dramatically in size if we tried to list all events and publications related to ensemble methods. We therefore apologize to all researchers and organizations whose valuable contributions to the exciting field of ensembles we unintentionally missed.

We are grateful to the many people who helped this book appear. We would like to thank Prof. Joan Martí and Dr. Joaquim Salvi for providing us with the great opportunity to hold the abovementioned workshop in Girona. We are also thankful to all the authors who spent their time and effort to contribute to this book. Prof. Janusz Kacprzyk and Dr. Thomas Ditzinger from Springer-Verlag deserve our special acknowledgment for the warm welcome given to our book, their support, and a great deal of encouragement.
Oulu (Finland) and Milan (Italy), January 2008
Oleg Okun Giorgio Valentini
Contents
Part I Ensembles of Clustering Methods and Their Applications

Cluster Ensemble Methods: from Single Clusterings to Combined Solutions
Ana Fred and André Lourenço

Random Subspace Ensembles for Clustering Categorical Data
Muna Al-Razgan, Carlotta Domeniconi, and Daniel Barbará

Ensemble Clustering with a Fuzzy Approach
Roberto Avogadri, Giorgio Valentini

Collaborative Multi-Strategical Clustering for Object-Oriented Image Analysis
Germain Forestier, Cédric Wemmert, and Pierre Gançarski

Part II Ensembles of Classification Methods and Their Applications

Intrusion Detection in Computer Systems Using Multiple Classifier Systems
Igino Corona, Giorgio Giacinto, and Fabio Roli

Ensembles of Nearest Neighbors for Gene Expression Based Cancer Classification
Oleg Okun and Helen Priisalu

Multivariate Time Series Classification via Stacking of Univariate Classifiers
Carlos Alonso, Óscar Prieto, Juan José Rodríguez, and Aníbal Bregón

Gradient Boosting GARCH and Neural Networks for Time Series Prediction
José M. Matías, Manuel Febrero, Wenceslao González-Manteiga, and Juan C. Reboredo

Cascading with VDM and Binary Decision Trees for Nominal Data
Jesús Maudes, Juan J. Rodríguez, and César García-Osorio

Index
List of Contributors
Carlos Alonso, University of Valladolid, Spain ([email protected])
Muna Al-Razgan, George Mason University, USA ([email protected])
Roberto Avogadri, University of Milan, Italy ([email protected])
Daniel Barbará, George Mason University, USA ([email protected])
Aníbal Bregón, University of Valladolid, Spain ([email protected])
Igino Corona, University of Cagliari, Italy ([email protected])
Carlotta Domeniconi, George Mason University, USA ([email protected])
Manuel Febrero, University of Santiago de Compostela, Spain ([email protected])
Germain Forestier, University of Strasburg, France ([email protected])
Ana Fred, Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal ([email protected])
Pierre Gançarski, University of Strasburg, France ([email protected])
César García-Osorio, University of Burgos, Spain ([email protected])
Giorgio Giacinto, University of Cagliari, Italy ([email protected])
Wenceslao González-Manteiga, University of Santiago de Compostela, Spain ([email protected])
André Lourenço, Instituto de Telecomunicações, Instituto Superior de Engenharia de Lisboa, Portugal ([email protected])
José Matías, University of Vigo, Spain ([email protected])
Jesús Maudes, University of Burgos, Spain ([email protected])
Oleg Okun, University of Oulu, Finland ([email protected])
Óscar Prieto, University of Valladolid, Spain ([email protected])
Helen Priisalu, Teradata, Finland ([email protected])
Juan Reboredo, University of Santiago de Compostela, Spain ([email protected])
Juan José Rodríguez, University of Burgos, Spain ([email protected])
Fabio Roli, University of Cagliari, Italy ([email protected])
Giorgio Valentini, University of Milan, Italy ([email protected])
Cédric Wemmert, University of Strasburg, France ([email protected])
Cluster Ensemble Methods: from Single Clusterings to Combined Solutions

Ana Fred¹ and André Lourenço²

¹ Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal, [email protected]
² Instituto de Telecomunicações, Instituto Superior de Engenharia de Lisboa, Portugal, [email protected]
Summary. Cluster ensemble methods attempt to find better and more robust clustering solutions by fusing information from several data partitionings. In this chapter, we address the different phases of this recent approach: from the generation of the partitions (the clustering ensemble) to the combination and validation of the combined result. While giving an overall review of the state of the art in the area, we focus on our own work on the subject. In particular, the Evidence Accumulation Clustering (EAC) paradigm is detailed and analyzed. For the validation/selection of the final partition, we focus on metrics that can quantitatively measure the consistency between partitions and combined results, thus enabling the choice of the best results without the use of additional information. Information-theoretic measures, in conjunction with a variance analysis using bootstrapping, are detailed and empirically evaluated. Experimental results throughout the chapter illustrate the various concepts and methods addressed, using synthetic and real data and involving both vectorial and string-based data representations. We show that the clustering ensemble approach can be used in very distinct contexts with state-of-the-art quality results.

Key words: cluster ensemble, evidence accumulation clustering, string representation, k-means, hierarchical clustering, normalized mutual information, consistency index
1 Introduction

The clustering problem can be formulated as follows: given a data set, find a partitioning of the data into groups (clusters), such that the patterns in a cluster are more similar to each other than to patterns in different clusters [23]. This very general definition reflects the most common view of clusters, assuming or expressing some concept of similarity between patterns. Important issues in clustering are: Which similarity measure should be used? How many clusters are present in the data? Which is the "best" clustering method? How
to choose algorithmic parameters? Are the individual clusters and the partition valid?

Clustering is a challenging research problem. Clusters can have different shapes, different sizes, different data sparseness, and may be corrupted by noise. Figure 1(a) illustrates some of these difficulties, presenting a synthetic data set. The same data set is represented in Fig. 1(b) in different colors/symbols, according to a cluster shape decomposition or data generation model type, leading to five clusters: rings (plotted as red circles); a star (pink '*' symbols); bars (blue '+' signs); gaussians (symbol '♦' in green); and a circle (black outer dots). Figure 1(c) presents the nine-cluster solution provided by the algorithm in [16], a hierarchical agglomerative method using a cluster isolation criterion based on the computation of statistics of dissimilarity increments between nearest-neighbor patterns.

In general, each clustering algorithm addresses issues of cluster validity differently and imposes a different structure on the data. With so many available algorithms, one can obtain very different clustering results for a given data set. Without objective criteria, it is difficult to select the most adequate method for a given problem [25]. In addition, many of the existing algorithms require the specification or fine-tuning of many parameters to obtain a possible grouping of the data.

Instead of choosing a particular algorithm, similarity measure, algorithmic parameters, or some clustering configuration that best suits a given problem, recent work on clustering has focused on the problem of how to interpret and how to combine different partitions produced by different clustering algorithms [11, 12, 41]. This framework, known as Combination of Clustering Ensembles or Ensemble methods, aims at obtaining better clustering results by combining information from different partitionings of the data. The input space for such a problem consists of a set of N data partitions, referred to as a clustering ensemble. This research topic has gained increased interest over the past years, and a fair number of clustering combination techniques have been proposed [1, 2, 3, 7, 8, 11, 12, 17, 31, 40, 41, 42, 44].

Along with the theoretical, methodological and algorithmic contributions, the clustering ensemble approach has also played an increasingly important role in practical application domains. This framework has been successfully applied in very different contexts, namely: Document Clustering and Web Document Clustering [21], Medical Diagnosis [8, 20], Gene Expression Microarray Data Clustering [5, 9, 36], Protein Clustering [4], Image Segmentation [40], Retail Market-Basket Analysis [43], etc.

Clustering ensemble methods can be decomposed into a cluster generation mechanism and a partition integration process, both of which influence the quality of the combination results. In this chapter, we address the following three steps in cluster ensemble methods: (1) generation of the clustering ensemble; (2) the partition integration process, leading to a combined data partition, P*; and (3) validation of the result. We present an overview of these topics together
Fig. 1. Challenging issues: What is the appropriate algorithm to cope with different cluster shapes and densities? How many clusters are present in this data set? (a) How many clusters? (b) 5 clusters? (c) 9 clusters? [16]
with illustrative application examples, with a focus on our own contributions to these topics.

The chapter is organized as follows. The two main data sets used for illustration purposes are presented in Sect. 2. Section 3 addresses the production of the clustering ensemble, involving different proximity measures and algorithms. Section 4 focuses on the clustering combination methods. The validation of the combined partitions is presented in Sect. 5.
2 Data Sets

The following two data sets are used throughout the chapter to illustrate the different topics and methods. Both vector and string representations are addressed. The first data set, d3, defined in a vector space, exemplifies situations of arbitrarily shaped clusters in noise; the second, the contour images data set, is used to illustrate how the ensemble clustering approach can easily be applied to structural representations, in this particular case using a string description.

2.1 D3 Data Set

The d3 data set is composed of 200 samples in a 2-dimensional vector space, organized in 4 well-separated clusters over a background noise cluster, as shown in Fig. 2. This synthetic data set exhibits clusters with different shapes.
2.2 Contour Images of Hardware Tools

The second data set concerns string descriptions of objects from a database with images of 15 types of hardware tools (24 classes if different poses of tools with moving parts are counted), in a total of 634 images [10]. Typical samples of each tool are presented in Fig. 3(a). Object boundaries were detected and sampled at 50 equally spaced points; the objects' contours were then encoded using an 8-directional differential chain code [24], yielding the string descriptions. Figure 3(b) illustrates the contour descriptions, showing prototypical images of tool types t1 and t14 and the string descriptions of two randomly chosen images within each class, using the alphabet {0, 1, . . . , 7}.
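As an illustration of this kind of encoding, the sketch below is a minimal, hypothetical implementation of an 8-directional differential chain code for a closed contour given as equally spaced (x, y) samples. The direction-numbering convention used here (counter-clockwise, starting from the positive x axis) is an assumption and need not match the convention of [24].

```python
import numpy as np

def differential_chain_code(points):
    """8-directional differential chain code of a closed, sampled contour.

    points: (n, 2) array of equally spaced (x, y) contour samples.
    Each step between consecutive samples is quantized to one of eight
    directions (0..7); the returned code stores the change of direction
    between consecutive steps, modulo 8.
    """
    pts = np.asarray(points, dtype=float)
    steps = np.roll(pts, -1, axis=0) - pts              # closed contour
    angles = np.arctan2(steps[:, 1], steps[:, 0])       # step orientations
    codes = np.round(angles / (np.pi / 4)).astype(int) % 8
    return np.diff(codes, append=codes[:1]) % 8

# A square sampled at its four corners yields a constant "turn left" code.
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]])
print("".join(map(str, differential_chain_code(square))))   # -> 2222
```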
3 Generation of the Clustering Ensemble

The input space for clustering ensemble techniques consists of a set of N data partitions, a clustering ensemble, P = {P^1, P^2, . . . , P^i, . . . , P^N}. Each partition, P^i = {C_1^i, C_2^i, . . . , C_{k_i}^i}, is formed by cluster labels, symbolic and categorical by nature, and may have a different number of clusters, k_i.
Fig. 2. d3 data set
The data partitions can be available from previous clustering experiments, in a knowledge reuse framework [42], or may be built explicitly for the purpose of clustering combination. While in the first case the goal is to combine existing data partitions in order to get a single clustering solution, expressing consensus associations amongst the previously applied clustering algorithms, in the second case the building of the clustering ensemble is part of the design of the clustering ensemble approach, and should take into consideration the need for diversity in the partitioning solutions within the ensemble in order to get robust and best performing solutions. The issue of diversity in cluster ensembles has been addressed by several authors [7, 27], showing how diversity is correlated with improvements in clustering ensemble methods.

The generation of the clustering ensemble can be achieved in different ways [17], working either at the data level, by manipulation of the data set or by using different proximity measures between patterns; at the algorithm level, exploring distinct algorithms or a single algorithm with different parameter values; or combinations of both. The various possibilities are explored next.

3.1 Clustering Algorithms

Different clustering algorithms lead, in general, to distinct clustering solutions. A single clustering algorithm, however, can also lead to a diversity of clustering
Fig. 3. Typical samples of the database of images of hardware tools; string descriptions are used to represent the contours of images. (a) Hardware tools. (b) Tool types t1 and t14: examples of string descriptions:
t1:  1- 00000000050000000010005030500000000000050000000003
     2- 10000000057000000007107007702700000000050000000012
t14: 1- 00310600000000000000000121000000000000000006022000
     2- 00210700000000000000000121000000000000000006022000
solutions, for example, by using different initializations and/or distinct parameter values, such as choosing a different number of clusters. Both situations are illustrated in Figs. 4 and 5, with synthetic data (the d3 data set), showing clustering results with the k-means and the spectral clustering [37] algorithms. In the figures, each cluster is represented by a specific symbol/color pair. Figures 4(a) and 4(b) refer to the application of the k-means algorithm with different numbers of clusters, k = 26 and k = 20, respectively. Application of the spectral clustering algorithm with the same number of clusters (k = 7) and different scaling parameter values may lead to drastically different clustering results, as shown in Figs. 5(a) and 5(b).

Single Algorithm Clustering Ensembles

In the literature, we can find many examples of clustering ensembles built from a single clustering algorithm [2, 13, 30, 45]. Diversity of data
Fig. 4. Generating different partitionings of the d3 synthetic data set with the k-means algorithm [13] for two values of k: (a) k-means (k = 26); (b) k-means (k = 20)
Fig. 5. Generating different partitionings of the d3 synthetic data set with the spectral clustering algorithm, maintaining the number of clusters and varying the scaling parameter σ [30]: (a) spectral clustering (k = 7, σ = 0.05); (b) spectral clustering (k = 7, σ = 10)
partitions is obtained by random initialization of the algorithms, and/or by the usage of different (in general, randomly selected) algorithmic parameters. One of the most used and most studied clustering algorithms is k-means [23]. In the literature, this was one of the first techniques used to produce clustering ensembles, achieving excellent combination results [11, 13]. Good combination solutions have been achieved with clustering ensembles with a fixed number of clusters, k = k1, diversity being obtained solely from the random initialization of the k-means algorithm. It is well known that the k-means algorithm, based on a minimum squared error criterion, imposes structure on the data, being sensitive to cluster dimensions and eventually decomposing the data into comparably sized, hyper-spherically shaped clusters. In order to cope with arbitrary shapes and arbitrary sizes of clusters, k1 should be greater than the "true" (unknown) number of clusters, leading in general to highly fragmented data partitions, but preventing the formation of artificial clusters gathering samples that belong to distinct "natural groupings" of the data. This approach follows a split-and-merge strategy, according to which "natural" clusters are split into smaller clusters in the partitions of the clustering ensemble and are then recovered during the ensemble combination phase, which accomplishes a merging mechanism.

It has been shown that clustering ensembles obtained by randomizing both the initialization and the selection of the parameter k of the k-means algorithm lead to more robust combination results [15, 17]. The latter approach, while producing greater diversity in the clustering ensembles and, in general, better combination results, also requires less a priori information about the data (used for the selection of k), being a flexible and robust technique based on the simple choice of wide intervals for the parameter k ∈ [kmin, kmax], with kmax − kmin representing the interval width (a minimal sketch of this construction is given at the end of Sect. 3.1). Other single-criterion clustering ensembles (meaning the usage of a single clustering algorithm, and therefore a single underlying clustering criterion) have been successfully explored, an example of which is presented in [30], using a spectral clustering algorithm.

Multiple Algorithms Clustering Ensembles

Instead of using only one algorithm, one can generate heterogeneous clustering ensembles by creating partitions with more than one clustering method, each using a different optimization criterion and thus adapted to different types of cluster structures. The use of very diversified clustering strategies is a natural way to achieve high-diversity clustering ensembles. Experimental evaluation has shown that this leads in general to better combined data partitions, as compared to single-algorithm clustering ensembles [31].
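The single-algorithm, random-k construction described above can be sketched in a few lines. The snippet below uses scikit-learn's KMeans; the function name and the interval defaults are illustrative assumptions, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_clustering_ensemble(X, n_partitions=100, k_min=10, k_max=30, seed=0):
    """Single-algorithm ensemble: N k-means partitions with random
    initialization and a randomly chosen k in [k_min, k_max]."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_partitions):
        k = int(rng.integers(k_min, k_max + 1))
        km = KMeans(n_clusters=k, n_init=1,
                    random_state=int(rng.integers(2**31 - 1)))
        ensemble.append(km.fit_predict(X))
    return ensemble   # list of N label vectors, one per partition
```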
3.2 Manipulation of the Data Set

Different data representations, or different sets of features from a single data representation, may be used to create the clustering ensemble. The random projections technique is explored in [7] for constructing clustering ensembles for high-dimensional data, the effectiveness of the technique being experimentally evaluated in comparison with other dimension reduction techniques, such as principal component analysis; the success of the approach is once again explained by the resulting diversity of the base clusterings. Another type of data manipulation includes different forms of data resampling, such as bootstrapping or bagging, where a single clustering algorithm, or several, processes different perturbed versions of the data set. Experimental results show improved stability and accuracy for consensus partitions obtained via a bootstrapping technique [35].

3.3 Proximity Measures

Clustering algorithms intrinsically assume some measure of similarity or proximity between patterns. Some clustering methods explicitly require the definition of inter-pattern relationships in terms of a distance or a proximity matrix; examples include hierarchical methods. By using diverse proximity measures between patterns, one can produce diversity in the base clusterings.

3.4 Examples of Clustering Ensembles

Consider the problem of building clustering ensembles for the string patterns from the contour images data set presented in Sect. 2.2. Clustering algorithms for string patterns are typically extensions of conventional clustering methods (assuming vector representations) to handle string descriptions, most of them adopting a convenient measure of similarity between patterns [14, 19]. The most common similarity measures adopt a string matching paradigm, typically based on string editing operations, such as the Levenshtein and the weighted Levenshtein distances [14]. Let us assume, for simplicity, zero cost for the symbol maintenance operation and unit cost for the remaining editing operations of substitution, insertion and deletion of symbols, and refer to the corresponding Levenshtein distance by the acronym SEO. We additionally consider two normalized versions of this distance: NSED – classical normalization by the string length; and NSEDL – normalization by the length of the editing path (normalized string edit distance [34]). A distinct view of string similarity adopts a structural resemblance notion, looking for similar substring structures. We consider three such measures of string proximity, namely: ECP – a measure of dissimilarity between strings based on the concept of error-correcting parsing; SOLO – similarity based on the notion of compressibility of sequences and algorithmic complexity using Solomonoff's coding; and RDGC – combining grammatical inference with
compressibility concepts in a similarity measure accounting for the ratio of decrease in grammar complexity obtained by joint descriptions as compared to isolated string descriptions. For details on how to compute these measures consult, for instance, [14] and the references therein.

Different clusterings are obtained by different combinations of similarity measures, clustering algorithms, and value assignments of the associated algorithmic parameters. Tables 1 and 2 specify the construction conditions of several clustering ensembles, named CE1 to CE6.

Table 1. Clustering ensembles based on the spectral clustering algorithm and different string proximity measures (parameters: σ ∈ [0.08 : 0.02 : 0.5], k = 24)

Ensemble   Proximity measure
CE1        NSEDL
CE2        SOLO
CE3        RDGC
CE4        ECP
CE5        ALL of the above
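The ensembles of Table 1 sweep the scaling parameter σ of the spectral algorithm over a precomputed string-distance matrix. The following is a minimal sketch of that construction, using scikit-learn's SpectralClustering as a stand-in for the algorithm of [37] and assuming a Gaussian affinity exp(−d²/(2σ²)); the exact affinity construction used by the authors is not specified here, so that choice is an assumption.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def spectral_string_ensemble(dist, sigmas, k=24):
    """One partition per sigma, all with k clusters, obtained from a
    precomputed pairwise string-distance matrix `dist`."""
    ensemble = []
    for sigma in sigmas:
        affinity = np.exp(-dist ** 2 / (2.0 * sigma ** 2))
        model = SpectralClustering(n_clusters=k, affinity="precomputed",
                                   assign_labels="kmeans", random_state=0)
        ensemble.append(model.fit_predict(affinity))
    return ensemble

# e.g. sigmas = np.arange(0.08, 0.50 + 1e-9, 0.02) mirrors Table 1
```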
The clustering ensembles in Table 1 use a single clustering algorithm, the spectral clustering method described in [37], enforcing the formation of k = 24 clusters, corresponding to the number of classes of hardware tools with differentiated poses. Ensembles CE1 to CE4 each use a particular combination of clustering algorithm and proximity measure, diversity being obtained by using different values for the scaling parameter σ. The total number of clusterings in each of these clustering ensembles is 22. Ensemble CE5 is formed by the union of the previous clustering ensembles, thus combining a single clustering algorithm with different proximity measures and different algorithmic parameters.

Table 2 presents a heterogeneous clustering ensemble, CE6, exploring different clustering paradigms and the several proximity measures between strings described previously. The clustering algorithms used are as follows: k-means – an adaptation of the k-means algorithm using the cluster median string as the cluster prototype; NN-StoS-Fu and NN-ECP-Fu – two partitional clustering approaches proposed by Fu [19, 32, 33], the first using proximity measures between strings, and the second modeling cluster structure by grammars and using error-correcting parsing [14]; Hier-SL, Hier-CL, Hier-AL and Hier-WL – four hierarchical clustering algorithms, respectively the Single Link (SL), Complete Link (CL), Average Link (AL), and Ward's Link (WL) [23]; spectral – the previously mentioned spectral clustering algorithm.
Table 2. Clustering ensemble based on several clustering algorithms and several string proximity measures

Ensemble  Algorithm   Proximity measure  Parameters          CI
CE6       NN-StoS-Fu  NSEDL              th = 0.3            25.4
          NN-StoS-Fu  SEO                th = 8              69.7
          NN-ECP-Fu   SEO                th = 4              25.4
          NN-ECP-Fu   SEO                th = 5              27.4
          NN-ECP-Fu   NSED               th = 0.09           27.4
          k-means     NSEDL              k = 15              48.3
          k-means     SEO                k = 15              47.3
          Hier-SL     NSEDL              k = 24              21.5
          Hier-SL     SOLOM              k = 24              15.9
          Hier-SL     RDGC               k = 24              24.3
          Hier-SL     ECP                k = 24              16.6
          Hier-CL     NSEDL              k = 24              39.3
          Hier-CL     SOLOM              k = 24              54.9
          Hier-CL     RDGC               k = 24              42.4
          Hier-CL     ECP                k = 24              41.8
          Hier-AL     SOLOM              k = 24              57.3
          Hier-AL     NSED               k = 24              90.7
          Hier-WL     SOLOM              k = 24              60.6
          Hier-WL     RDGC               k = 24              51.7
          Hier-WL     ECP                k = 24              55.2
          Spectral    NSEDL              σ = 0.08; k = 24    76.5
          Spectral    NSEDL              σ = 0.16; k = 24    67.4
          Spectral    NSEDL              σ = 0.44; k = 24    82.6
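Several rows of Tables 1 and 2 rely on the plain edit-distance family (SEO and its normalized variants). For reference, the following is a minimal dynamic-programming sketch; the normalization by the longer string's length used in nsed is one common reading of "normalization by the string length" and may differ from the exact definitions of [14] and [34].

```python
def seo(s, t):
    """String edit (Levenshtein) distance with unit operation costs."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def nsed(s, t):
    """Edit distance normalized by string length (one common convention)."""
    return seo(s, t) / max(len(s), len(t), 1)
```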
4 Combination of Clusterings

The goal of a clustering combination technique is to 'combine' the information in the clustering ensemble P = {P^1, P^2, . . . , P^i, . . . , P^N} into a single partitioning solution, P*. Ideally, the combined data partition P* should: (1) agree with the clustering ensemble P; (2) be robust, in the sense of not changing significantly with small perturbations of either the data or of the partitions in P; and (3) be consistent with external cluster labels (ground truth information or perceptual evaluation of the data), if available [15].

Fred and Jain [11, 12, 17] proposed a method, Evidence Accumulation Clustering (EAC), for finding consistent data partitions, where the combination of a clustering ensemble is performed by transforming the partitions into a co-association matrix, which maps the coherent associations and represents a new similarity measure between patterns. Any clustering algorithm can be applied to this new similarity matrix in order to extract the combined partition. In [17] agglomerative hierarchical methods are used, namely Single Link (SL), Average Link (AL), and Ward's Link (WL). To automatically find the
number of clusters without using a priori information, the cluster lifetime criterion has been proposed [17].

Strehl and Ghosh [41, 42] formulated the clustering ensemble problem as an optimization problem based on the maximization of the average mutual information between the combined clustering and the clustering ensemble. Three heuristics were presented to solve it, exploring graph-theoretical concepts: (a) CSPA – a fully connected graph is created, with vertices representing the samples and edge weights obtained from the same matrix as described before (the co-association matrix); the final combined partition is produced using a graph partitioning algorithm (METIS) [26]; (b) MCLA – a fully connected graph is also created, but now the vertices represent clusters, and the weights correspond to the similarity between clusters (computed using the Jaccard measure) [8]; (c) HGPA – a hypergraph is created, modeling clusters as hyper-edges and samples as vertices, and a hypergraph partitioning algorithm (HMETIS) [26] is applied to produce the combined partition. Topchy, Jain and Punch [44] proposed to solve the combination problem based on a probabilistic model of the consensus partition in the space of clusterings: the clustering ensemble is represented as a new set of features, corresponding to cluster labels, and the final combined partition is obtained by applying the EM (expectation-maximization) method for multinomial mixture decomposition. More recent work includes an efficient cluster combination technique based on consensus cumulative voting, proposed by Ayad and Kamel [2, 3], and graph-based partitioning by Fern and Brodley [8].

In most of these clustering combination methods, each partition is given an equal weight in the combination process, and all clusters in each partition contribute to the combined solution. However, researchers have already started to question this equitable combination strategy, in a quest for better performing and more robust methods. Weighting of the clustering ensembles has been proposed in [1, 6]. A new research trend proposes differentiated combination rules over different regions of the feature space, based on local assessment of the performance of multiple clustering criteria. Law et al. proposed a multiobjective data clustering method based on the selection of individual clusters produced by several clustering algorithms, through an optimization procedure [28]; they choose the best set(s) of objective functions for different parts of the feature space from the results of different clustering algorithms. Fred and Jain [18] explore a clustering ensemble approach combined with cluster stability criteria to selectively learn the similarity from a collection of different clustering algorithms with various parameter configurations. The new method, named Multi-EAC, has shown promising experimental results, better exposing the underlying clustering structure of the data.

In this chapter, we focus on the Evidence Accumulation Clustering (EAC) approach, detailed next.
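Before detailing EAC in Sect. 4.1, a minimal sketch of the pipeline just outlined may be useful: accumulate pairwise co-associations over the ensemble, treat them as a new similarity, and extract the combined partition with a hierarchical method (Single Link here), optionally choosing the number of clusters by the largest lifetime. SciPy's hierarchy module is used for the dendrogram; this is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def co_association(ensemble):
    """C(i, j): fraction of the N partitions in which i and j co-occur."""
    n = len(ensemble[0])
    C = np.zeros((n, n))
    for labels in ensemble:
        labels = np.asarray(labels)
        C += (labels[:, None] == labels[None, :])
    return C / len(ensemble)

def eac_combine(ensemble, k=None, method="single"):
    """Recluster the induced similarity; if k is None, pick the partition
    with the largest lifetime (gap between consecutive merge levels)."""
    C = co_association(ensemble)
    D = 1.0 - C                              # co-association -> dissimilarity
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method=method)
    if k is None:
        gaps = np.diff(Z[:, 2])              # lifetimes of the k-cluster partitions
        k = len(C) - (np.argmax(gaps) + 1)   # clusters at the largest gap
    return fcluster(Z, t=k, criterion="maxclust")
```

Combined with the k-means ensemble sketch of Sect. 3.1, eac_combine reproduces the general split-and-merge behavior discussed in Sect. 4.2, although not, of course, the exact figures reported there.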
4.1 Evidence Accumulation Clustering – EAC

The idea of evidence accumulation clustering is to combine the results of multiple clusterings into a single data partition by viewing each clustering result as independent evidence of data organization. A clustering algorithm l, by organizing the n patterns into clusters according to the partition P^l, expresses relationships between objects in the same cluster; these are mapped into a binary n × n co-association matrix, C^l(i, j), where non-zero pairwise relations, C^l(i, j) = 1, express the co-existence of patterns i and j in the same cluster of P^l. Assuming that patterns belonging to a "natural" cluster are very likely to be co-located in the same cluster in different clusterings, we take the co-occurrences of pairs of patterns in the same cluster as votes for their association; the clustering ensemble P is mapped into an n × n co-association matrix as follows:

    C(i, j) = \frac{\sum_{l=1}^{N} C^l(i, j)}{N} = \frac{n_{ij}}{N} ,        (1)

where n_{ij} is the number of times the pattern pair (i, j) is assigned to the same cluster among the N clusterings.

Evidence accumulated over the N clusterings, according to (1), induces a new similarity measure between patterns, which is then used to recluster the patterns, yielding the combined data partition P*. Different clustering algorithms can be applied to the induced similarity matrix C in order to retrieve the combined data partition. Clustering algorithms that assume a similarity/dissimilarity matrix as input can be applied directly to the co-association matrix C to obtain the combined partition. Taking C as a feature matrix, or using some embedded-space mapping technique [38], any feature-based clustering algorithm can also be applied. In [11, 17], hierarchical agglomerative algorithms are used. We define the lifetime of a k-cluster partition as the absolute difference between its birth and merge thresholds in the dendrogram produced by the hierarchical method; for well separated clusters, the final data partition can be automatically chosen as the one with the highest lifetime. For data exhibiting touching clusters, several combination results can be tested, using, for instance, the validation criteria described in Sect. 5 for a final decision.

4.2 Clustering Ensembles Combination Results

D3 Data Set

We illustrate the effects of the EAC strategy on the d3 data set (Sect. 2.1). Consider a clustering ensemble composed of N = 100 k-means partitions, produced by randomly selecting the number of clusters k of each partition in the interval k ∈ [20, 30]. Figure 6 plots the similarity matrix between the patterns before and after the application of the EAC combination technique. In these figures, similarity values are represented in a gradient of grey levels
from white (low similarity) to black (corresponding to the highest similarity values, predominantly found on the matrix diagonal). By direct comparison of Figs. 6(a) and 6(b), it can be seen that the induced similarity better reveals the intrinsic structure of the data, enhancing the inter-pattern relationships within "natural" groups/clusters and diminishing the similarity between patterns belonging to different clusters. Note that the block-diagonal form of the matrix is obtained because the samples are arranged so that samples belonging to the same cluster are adjacent to each other (with the same index order in both rows and columns of the matrix).

Figure 7 shows the combined partition P* obtained when applying the SL method to the co-association matrix and forcing a 5-cluster solution. We can see that the four well separated clusters are identified, the noise cluster being dispersed into neighboring clusters, with the exception of one sample that gets isolated in a cluster. By using the lifetime criterion, this spurious singleton cluster is merged into the nearest cluster.

It is important to note that the k-means algorithm, imposing a spherical structure on the 2-D data, cannot adequately address the given d3 data set. With the clustering ensemble approach, however, by combining the results of the multiple runs of the k-means algorithm, the underlying clustering structure of the data is recovered. This is achieved by the split-and-merge strategy accomplished by the EAC method when associated with k-means based clustering ensembles, with the ensemble approach thus profiting from strong local connections produced by the granular k-means clusters, and the merging of these local similarities eventually unveiling the overall structure of the data.

Contour Images Data Set

We now address the real data presented in Sect. 2.2, applying the EAC method to the clustering ensembles previously defined. In order to evaluate the clustering results, we compare individual and combined partitioning solutions, P, with ground truth information, P^o, taken as the 24 class labels for the data (differentiating between poses within hardware tools with moving parts), which naturally are not used in the unsupervised learning process. We quantify the performance using two indices:

• normalized mutual information, NMI(P, P^o), measuring the consistency between two partitions in an information-theoretic framework [41]. Given two partitions P^a = {C_1^a, C_2^a, . . . , C_{k_a}^a} and P^b = {C_1^b, C_2^b, . . . , C_{k_b}^b}, with k_a and k_b clusters, respectively, the normalized mutual information is defined as [15]

    NMI(P^a, P^b) = \frac{-2 \sum_{i=1}^{k_a} \sum_{j=1}^{k_b} n_{ij}^{ab} \log\left(\frac{n_{ij}^{ab}\, n}{n_i^a\, n_j^b}\right)}{\sum_{i=1}^{k_a} n_i^a \log\left(\frac{n_i^a}{n}\right) + \sum_{j=1}^{k_b} n_j^b \log\left(\frac{n_j^b}{n}\right)} ,        (2)
Fig. 6. D3 original pairwise similarity and EAC induced similarity: (a) similarity in the original feature space; (b) induced similarity (co-association matrix)
Fig. 7. Combined partition (P*) obtained using the SL algorithm with the co-association matrix
where n_i^a represents the number of patterns in cluster C_i^a ∈ P^a, n_{ij}^{ab} denotes the number of patterns shared between clusters C_i^a and C_j^b, C_i^a ∈ P^a and C_j^b ∈ P^b, and n is the total number of patterns in the data set.

• consistency index, CI(P, P^o), based on finding the best cluster matching between the two partitions and counting the percentage of samples that consistently belong to these matching clusters [11]. When the number of clusters in both partitions is equal, the consistency index corresponds to the percentage of correct labelings, i.e. CI(P, P^o) = (1 − P_e) × 100, with P_e denoting the error probability.

Clustering ensembles CE1 to CE5 use a single clustering algorithm (spectral clustering), the first four exploring distinct proximity measures between strings (CE1–NSEDL, CE2–SOLO, CE3–RDGC, and CE4–ECP), and the last being the union of all these partitions. Table 3 presents sample clustering results taken from the several clustering ensembles, showing the corresponding consistency index CI(P, P^o). We included in this table the best and the worst individual clustering results, respectively 80.1 (shown in bold) and 24.0.

Ensemble CE6 is a heterogeneous clustering ensemble, as specified in Table 2. The last column of that table contains the corresponding consistency index, CI(P, P^o), for the individual clusterings. As indicated in bold, the best
Table 3. Spectral clustering of contour images: sample clustering results from CE1 to CE4, indicating the corresponding consistency index CI(P, P^o)

Proximity measure  σ     k   CI
NSEDL              0.46  24  80.1
NSEDL              0.14  24  45.7
SOLO               0.32  24  63.2
SOLO               0.38  24  54.1
RDGC               0.48  24  47.6
RDGC               0.14  24  42.0
ECP                0.08  24  49.1
ECP                0.50  24  24.0
of the individual clustering results in this ensemble gives CI = 90.7, achieved with the average link method using the normalized string edit distance.

Table 4. Results of the application of the EAC technique to the several clustering ensembles, using different algorithms for the extraction of the combined partition, P*, and either setting the final number of clusters to k = 24 (columns "fix") or using the lifetime criterion (columns "LT"; the number of clusters found is given in columns "k"). Columns CI correspond to the accuracy index CI(P*, P^o)
                SL                 CL                 AL                 WL
Ensemble  fix    LT           fix    LT           fix    LT           fix    LT
          CI     CI     k     CI     CI     k     CI     CI     k     CI     CI     k
CE1       77.8   62.9   12    69.4   69.7   23    76.2   70.3   20    82.6   84.4   18
CE2       50.5    9.3    2    65.3   65.5   23    63.9   63.6   22    63.6   62.0   20
CE3       37.2   30.4   11    43.2   42.4   25    43.5   45.0   22    43.7   12.1    2
CE4       13.2    7.9    1    36.3    7.9    1    47.0   14.4  447    45.9   30.9    8
CE5       66.4   14.5    2    79.8   69.1   20    86.0   78.7   18    84.1   14.5    2
CE6       61.7   14.5    2    73.3   21.1    9    73.3   21.1    9    91.7   80.4   16
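For reference, the two agreement indices used in Tables 3 and 4 can be computed from the contingency table of two labelings. The sketch below implements (2) directly; for CI it uses the Hungarian algorithm (SciPy's linear_sum_assignment) as one concrete way of realizing the "best cluster matching", which is not prescribed in [11], so that choice is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(a, b):
    """n_ij: number of samples shared by cluster i of `a` and cluster j of `b`."""
    _, ai = np.unique(a, return_inverse=True)
    _, bj = np.unique(b, return_inverse=True)
    table = np.zeros((ai.max() + 1, bj.max() + 1))
    np.add.at(table, (ai, bj), 1)
    return table

def nmi(a, b):
    """Normalized mutual information, as in (2)."""
    nij = contingency(a, b)
    n = nij.sum()
    ni = nij.sum(axis=1, keepdims=True)
    nj = nij.sum(axis=0, keepdims=True)
    mask = nij > 0
    num = -2.0 * np.sum(nij[mask] * np.log((nij * n / (ni * nj))[mask]))
    den = np.sum(ni * np.log(ni / n)) + np.sum(nj * np.log(nj / n))
    return num / den

def consistency_index(p, p_ref):
    """CI: percentage of samples falling in best-matched cluster pairs."""
    nij = contingency(p, p_ref)
    rows, cols = linear_sum_assignment(-nij)     # maximize shared samples
    return 100.0 * nij[rows, cols].sum() / nij.sum()
```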
Table 4 presents the results of the application of the EAC method to the several clustering ensembles, using different hierarchical clustering algorithms to extract the combined data partition, as indicated in the columns SL, CL, AL, and WL. The number of clusters in the combined partitions P* was either set to the fixed value 24 (columns labeled "fix") or automatically determined using the lifetime criterion (columns "LT"), in which case the number of clusters found is indicated in the columns "k". The columns "CI" report the accuracy index CI(P*, P^o). As shown, extraction of the combined partition using the Average Link or Ward's Link methods leads in general to better results. The best result with spectral-based clustering ensembles and a single string proximity measure is CI = 84.4, a value higher than the corresponding best individual clustering result (CI = 80.1). An even better performance,
CI = 86.0, is obtained with CE5, where greater ensemble diversity is obtained by joining clusterings based on the distinct string proximity measures. The overall best combination result was obtained with CE6, corresponding to CI = 91.7; this value is also higher than the best individual clustering in CE6 (90.7). It is also worth noting that, although the combination results have high variability, depending strongly on the clustering ensemble, on the combination method and on the final partition extraction algorithm, they are on average much better than the individual clusterings, the combination outperforming the best individual results. Overall, these results also corroborate previous findings of a strong positive correlation between the accuracy of combination results and the diversity of the clustering ensemble.

Figure 8 shows how EAC induces a better discriminating similarity between patterns, by plotting the original and induced similarities. Comparison of Figs. 8(a) and 8(b) shows the strengthening of intra-cluster similarities and the fading of inter-cluster similarities obtained by applying the EAC technique based on the same underlying string distance between patterns. The bottom plot corresponds to the similarity induced from CE5, thus combining different string proximity measures. The refined induced similarities underpin the ability of ensemble methods to obtain combination results better than the individual clusterings from which they are derived.

From the results in Table 4, we conclude that defining the final number of clusters a priori (k = 24) leads in general to better results than using the lifetime criterion. This is justified by the fact that the clusters are not well separated, the lifetime criterion tending to select a smaller number of clusters to organize the data (see the columns "k" in Table 4). This is illustrated in Fig. 9, showing a 2D representation of the co-association matrix for CE5, obtained by means of a multidimensional scaling (MDS) technique. Notice that some clusters are mixed with others.

In the previous analysis, a single combination result is obtained by each combination procedure and for each clustering ensemble. In order to assess the robustness of the technique, we produced more observations of the combination process by applying bootstrapping over the clustering ensembles. For each of the previous clustering ensembles, CE1 to CE5, 100 bootstrap ensembles were produced, the EAC technique was applied to each bootstrap clustering ensemble, and statistics of the performance of the combination results were computed. These are summarized in Tables 5 and 6, corresponding to the performance indices NMI(P_b*, P^o) and CI(P_b*, P^o), respectively, where P_b* represents a combination result from a bootstrap clustering ensemble. In these tables, NMI (CI) and σNMI (σCI) represent the mean values and standard deviations of the performance indices, computed from the 100 bootstrap estimates. From these tables, the best average results are obtained with CE5, using Ward's link algorithm for extracting the final clusters, corresponding to an 86.4% accuracy. The lifetime criterion leads to great variance in the combination results, being outperformed by the fixed-k combination solutions. From
Fig. 8. Clustering of contour images – similarities before and after the EAC technique: (a) original normalized string edit distances between patterns; (b) induced similarity (co-association matrix) obtained by applying the EAC method to CE1, also based on the normalized string edit distance; (c) induced similarity by EAC combining partitions based on the various string proximity measures (CE5)
Fig. 9. MDS representation of the induced similarity after the application of the EAC technique to CE5
the tables, it is also apparent that better combination results are achieved using the AL and WL methods, the weakest method being SL. Overall, both indices, NMI and CI, lead to similar conclusions.

Table 5. Spectral clustering – combined results for the bootstrap versions of the ensembles in terms of NMI(P_b*, P^o). The first half of the table corresponds to fixing the final number of clusters to k = 24, and the second half to using the lifetime criterion

              SL               CL               AL               WL
Ensemble  NMI     σNMI     NMI     σNMI     NMI     σNMI     NMI     σNMI
k = 24:
CE1       0.574   0.008    0.851   0.004    0.897   0.005    0.942   0.003
CE2       0.488   0.026    0.845   0.010    0.895   0.005    0.937   0.003
CE3       0.569   0.009    0.852   0.004    0.899   0.003    0.943   0.003
CE4       0.138   0.051    0.779   0.030    0.833   0.033    0.932   0.006
CE5       0.309   0.073    0.852   0.005    0.899   0.003    0.945   0.003
Lifetime criterion:
CE1       0.314   0.124    0.771   0.211    0.264   0.026    0.866   0.185
CE2       0.040   0.145    0.686   0.334    0.897   0.004    0.934   0.010
CE3       0.671   0.014    0.848   0.007    0.897   0.008    0.936   0.013
CE4       0.002   0.005    0.093   0.083    0.376   0.197    0.856   0.037
CE5       0.592   0.098    0.813   0.094    0.899   0.009    0.941   0.009
Table 6. Spectral Clustering – combined results for the bootstrap versions of the ensembles in terms of CI(P_b^*, P^o), reported as mean (standard deviation) over the 100 bootstrap estimates. The first half of the table corresponds to fixing the final number of clusters to k = 24, and the second half to using the lifetime criterion

Ensemble        SL              CL              AL              WL
k = 24
CE1             46.5 (1.0)      74.9 (0.6)      82.2 (1.7)      85.4 (1.1)
CE2             38.2 (3.0)      74.4 (1.6)      83.2 (1.4)      84.1 (1.1)
CE3             47.7 (1.1)      76.2 (0.6)      84.6 (0.6)      85.9 (1.1)
CE4             15.1 (2.2)      60.2 (5.3)      71.6 (4.8)      82.2 (1.8)
CE5             25.1 (5.0)      76.4 (0.6)      84.7 (0.6)      86.4 (0.9)
Lifetime criterion
CE1             25.6 (9.8)      68.6 (18.4)     25.6 (0.7)      75.8 (19.7)
CE2             13.0 (8.3)      61.8 (26.3)     83.2 (1.9)      83.1 (2.8)
CE3             19.1 (2.7)      76.1 (1.2)      84.0 (1.8)      83.9 (3.4)
CE4             10.8 (0.2)      10.4 (2.9)      32.8 (9.6)      65.3 (6.0)
CE5             38.5 (7.1)      67.2 (16.2)     84.4 (1.8)      85.2 (2.8)
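Before turning to validation, the evidence accumulation mechanism discussed above can be summarized in a short sketch. The snippet below is our own minimal illustration, not the authors' implementation: it accumulates co-associations over a set of partitions, each given as a vector of cluster labels, and extracts a final partition with a chosen linkage method for a fixed number of clusters k; the function names and the SciPy-based extraction step are our assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassociation_matrix(partitions):
    """Fraction of partitions in which each pair of patterns shares a cluster."""
    n = len(partitions[0])
    C = np.zeros((n, n))
    for labels in partitions:
        labels = np.asarray(labels)
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(partitions)

def eac_extract(partitions, k, method="average"):
    """Combine partitions via evidence accumulation and extract k clusters."""
    C = coassociation_matrix(partitions)
    D = 1.0 - C                                   # induced dissimilarity
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method=method)
    return fcluster(Z, t=k, criterion="maxclust")

# toy usage: three partitions of six patterns, combined into two clusters
parts = [[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2], [1, 1, 1, 0, 0, 0]]
print(eac_extract(parts, k=2))
```

Roughly speaking, the lifetime criterion mentioned above would instead choose the dendrogram cut with the largest cluster lifetime in Z, rather than forcing k clusters.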
5 Clustering Validation

Different clustering algorithms lead in general to different partitionings of the data set. The problem of evaluating and comparing clustering results, as well as of deciding the number of clusters that best fits the data, is fundamental in clustering analysis, and has been the subject of many research efforts [22, 23, 25, 29]. In the context of clustering combination approaches, the problem of clustering validation is still central. The selection/weighting of the best partitions or clusters of the clustering ensemble determines the performance of the clustering combination algorithms. Moreover, the several clustering ensemble techniques, and their possible variations, lead to diverse combination results to choose from. Different approaches can be followed. Stability analysis, measuring the reproducibility of clustering solutions when either the data set or the clustering ensemble is perturbed, offers an interesting solution [15, 29, 39]. We will focus on the stability analysis proposed in [15], where the robustness of the EAC algorithm was assessed by variance analysis of the normalized mutual information between the combined solution and the clustering ensemble, based on bootstrapping of the latter. The basic analysis procedure is detailed next.

5.1 Bootstrap Variance Analysis

The clustering ensemble P is perturbed using bootstrapping, producing B bootstrap versions of the clustering ensemble: P^B = {P_{b_1}, ..., P_{b_i}, ..., P_{b_B}},
where P_{b_i} is a clustering ensemble that, when combined, generates the combined data partition denoted by P_{b_i}^*. Using the normalized mutual information defined in Sect. 4.2, (2), we define the average normalized mutual information between the combined partition P_{b_i}^* and the bootstrap clustering ensemble P_{b_i} as

NMI(P_{b_i}^*, P_{b_i}) = \frac{1}{N} \sum_{j=1}^{N} NMI(P_{b_i}^*, P_{b_i}^j) ,   (3)

where P_{b_i}^j is the j-th of the N data partitions in P_{b_i}. Averaging over all bootstrap combination results, we get

NMI(P_b^*, P_b) = \frac{1}{B} \sum_{i=1}^{B} NMI(P_{b_i}^*, P_{b_i})   (4)

and the corresponding standard deviation

std{NMI(P_b^*, P_b)} = \sqrt{ \frac{1}{B-1} \sum_{i=1}^{B} \left( NMI(P_{b_i}^*, P_{b_i}) - NMI(P_b^*, P_b) \right)^2 } .   (5)

These last two measures enable the verification of the consistency of the different combined partitions P_{b_i}^* with each perturbed bootstrap version of the clustering ensemble P_{b_i}. The std{NMI(P_b^*, P_b)} measures the variability, as a result of ensemble perturbation, of the consistency between combined partitions and clustering ensembles, based on mutual information. It is also a measure of the stability of the combination process. Experimental results have shown that minimum std{NMI(P_b^*, P_b)} values are in general associated with better combination results [15], thus providing a criterion for deciding amongst different combined partitions. For the measures presented in (4) and (5), similar definitions are used for CI(P_b^*, P_b) and std{CI(P_b^*, P_b)}, where CI represents the consistency index presented previously [11], which finds the best match between partitions, counting the percentage of agreement between the labelings.

5.2 Validation Results

We applied the previous analysis methodology to the EAC results over the contour images data set. Table 7 summarizes these results for combinations forcing k = 24 clusters. From this table, we can see a clear correlation between a low variance of the consistency measures between combined partitions and the clustering ensembles, and the best performing combination results in Table 6. Taking the normalized mutual information as reference, std{NMI(P_b^*, P_b)} values of 0.003 and 0.004 lead to combination results with performance indices CI ≥ 80, thus providing a plausible tool for deciding amongst different combination results.
Table 7. Bootstrap variance analysis of combination results based on different cluster ensembles. The first column indicates the clustering ensemble, and the second column the clustering method used for extracting the final partition

Ensemble  Method  NMI(P_b^*, P_b)  std{NMI(P_b^*, P_b)}  CI(P_b^*, P_b)  std{CI(P_b^*, P_b)}
CE1       SL      0.932            0.006                 82.200          1.800
CE1       CL      0.937            0.003                 84.100          1.100
CE1       AL      0.943            0.003                 85.900          1.100
CE1       WL      0.942            0.003                 85.400          1.100
CE2       SL      0.779            0.030                 60.200          5.300
CE2       CL      0.845            0.010                 74.400          1.600
CE2       AL      0.852            0.004                 76.200          0.600
CE2       WL      0.851            0.004                 74.900          0.600
CE3       SL      0.833            0.033                 71.600          4.800
CE3       CL      0.895            0.005                 83.200          1.400
CE3       AL      0.899            0.003                 84.600          0.600
CE3       WL      0.897            0.005                 82.200          1.700
CE4       SL      0.138            0.051                 15.100          2.200
CE4       CL      0.488            0.026                 38.200          3.000
CE4       AL      0.569            0.009                 47.700          1.100
CE4       WL      0.574            0.008                 46.500          1.000
CE5       SL      0.650            0.027                 46.600          4.000
CE5       CL      0.693            0.009                 54.200          1.200
CE5       AL      0.701            0.003                 55.900          0.400
CE5       WL      0.704            0.003                 54.300          0.500
Minimum-variance selection based on the consistency index leads to a more refined selection of potential solutions. Although the minimum value of std{CI(P_b^*, P_b)} (0.4) does not point to the best performing solution (CI = 86.4, assuming that the ground-truth labeling constitutes the best possible partitioning of the data) but to a CI = 84.7 solution, enlarging the selection criterion to the two lowest variance values, std{CI(P_b^*, P_b)} = 0.4 and 0.5, leads to the selection of two combination solutions, one of which is the best in the combination set according to the corresponding consistency index CI(P_b^*, P^o) read from Table 6.
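The bootstrap variance analysis of Sect. 5.1 can be sketched as follows. This is our own illustration, not the chapter's code: it assumes two user-supplied callables, combine(ensemble), returning a combined partition for a list of partitions (e.g. an EAC-style consensus), and nmi(p, q), computing the normalized mutual information between two partitions. It resamples the clustering ensemble with replacement, recombines each bootstrap ensemble, and returns the mean and standard deviation of the average NMI between each combined partition and its ensemble, the stability score whose minimum variance is used above to select among combination results.

```python
import random
import statistics

def bootstrap_variance(ensemble, combine, nmi, n_boot=100, seed=0):
    """Mean and std of NMI(P*_bi, P_bi) over bootstrap ensembles (eqs. (3)-(5))."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        boot = [rng.choice(ensemble) for _ in ensemble]            # bootstrap P_bi
        p_star = combine(boot)                                     # combined P*_bi
        avg_nmi = sum(nmi(p_star, p) for p in boot) / len(boot)    # eq. (3)
        scores.append(avg_nmi)
    return statistics.mean(scores), statistics.stdev(scores)       # eqs. (4)-(5)
```

Running this for each clustering ensemble and each extraction method, and keeping the solutions with the smallest standard deviation, reproduces the selection rule discussed above.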
6 Conclusion

Cluster ensemble methods are a recent but very active research topic of utmost importance in the area of unsupervised learning. They aim at obtaining better and more robust clustering solutions by combining information from multiple data partitions, exploring the multiple criteria underlying different clustering algorithms, and exploiting multiple views of the data. Cluster ensemble methods can be decomposed into a cluster generation mechanism and a partition integration process, both influencing the quality
of the combination results. In this chapter, we addressed these different phases, from the generation of the partitions (the clustering ensemble) to the combination and validation of the combined result. We presented an overview of these topics together with illustrative application examples, with a focus on our own methodological and algorithmic contributions. Different strategies for building clustering ensembles were presented. The importance of ensemble diversity was emphasized, and further evidenced through an experimental evaluation in a real-world application of clustering of contour images. The Evidence Accumulation Clustering paradigm for combining data partitions was detailed and analyzed. Experimental results were discussed, trying to highlight the underlying basic mechanisms, in particular how this method "learns" a new similarity between patterns by accumulating the similarities induced by the individual clusterings. Extensions of this paradigm using weighted data partitions and, more recently, learning similarity selectively on the feature space were mentioned, with links to the literature on state-of-the-art methods in this and other cluster-ensemble-related topics. Finally, the problem of cluster validity was addressed, focusing on information-theoretic approaches and stability analysis. A series of open questions regarding algorithm selection and cluster validity was posed, to which solutions were proposed under the cluster ensemble framework. Still, much research work is required in order to properly address these challenging problems. In particular, theoretical studies are important to better characterize and understand these new approaches, possibly giving rise to more effective and efficient methods.
References

1. Al-Razgan M, Domeniconi C (2006) Weighted clustering ensembles. In: Ghosh J, Lambert D, Skillicorn D, Srivastava J (eds) Proc the 6th SIAM Int Conf Data Mining, Bethesda, Maryland. SIAM, Philadelphia, pp 258–269 2. Ayad H, Kamel MS (2003) Finding natural clusters using multi-clusterer combiner based on shared nearest neighbors. In: Windeatt T, Roli F (eds) Proc the 4th Int Workshop Multiple Classifier Syst, Guildford, UK. Springer, Berlin/Heidelberg, pp 166–175 3. Ayad H, Kamel M (2007) Cumulative voting consensus method for partitions with variable number of clusters. IEEE Trans Pattern Analysis Mach Intell 30:160–173 4. Caruana R, Elhawary M, Nguyen N, Smith C (2006) Meta clustering. In: Proc the 6th IEEE Int Conf Data Mining, Hong Kong, China. IEEE Computer Society, Los Alamitos, pp 107–118 5. de Souto MCP, Silva SCM, Bittencourt VG, de Araujo DSA (2005) Cluster ensemble for gene expression microarray data. In: Proc IEEE Int Joint Conf Neural Networks, Montréal, QB, Canada. IEEE Computer Society, pp 487–492 6. Duarte FJ, Fred AL, Lourenço A, Rodrigues MF (2005) Weighted evidence accumulation clustering. In: Simoff SJ, Williams GJ, Galloway J, Kolyshkina I
(eds) Proc the 4th Australasian Conf Knowl Discovery Data Mining, Sydney, NSW, Australia. University of Technology, Sydney, pp 205–220 7. Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Fawcett T, Mishra N (eds) Proc the 20th Int Conf Mach Learn, Washington, DC, USA. AAAI Press, Menlo Park, pp 186–193 8. Fern XZ, Brodley CE (2004) Solving cluster ensemble problems by bipartite graph partitioning. In: Brodley CE (ed) Proc the 21st Int Conf Mach Learn, Banff, AL, Canada. ACM, New York, pp 281–288 9. Filkov V, Skiena S (2003) Integrating microarray data by consensus clustering. In: Proc the 15th IEEE Int Conf Tools with Artif Intell, Sacramento, CA, USA. IEEE Computer Society, Los Alamitos, pp 418–426 10. Fred AL, Marques JS, Jorge PM (1997) Hidden Markov models vs syntactic modeling in object recognition. In: Proc the Int Conf Image Proc, Santa Barbara, CA, USA. IEEE Computer Society, Los Alamitos, pp 893–896 11. Fred A (2001) Finding consistent clusters in data partitions. In: Kittler J, Roli F (eds) Proc the 2nd Int Workshop Multiple Classifier Syst, Cambridge, UK. Springer, Berlin/Heidelberg, pp 309–318 12. Fred A, Jain AK (2002) Data clustering using evidence accumulation. In: Proc the 16th Int Conf Pattern Recognition, Quebec, QB, Canada. IEEE Computer Society, Washington, pp 276–280 13. Fred A, Jain AK (2002) Evidence accumulation clustering based on the k-means algorithm. In: Caelli T, Amin A, Duin RPW, Kamel MS, de Ridder D (eds) Proc Joint IAPR Int Workshop Structural, Syntactic, and Statistical Pattern Recognition, Windsor, Canada. Springer, London, pp 442–451 14. Fred A (2002) Similarity measures and clustering of string patterns. In: Chen D, Cheng X (eds) Pattern recognition and string matching. Springer-Verlag, New York, pp 155–194 15. Fred A, Jain AK (2003) Robust data clustering. In: Proc IEEE Computer Society Conf Comp Vision and Pattern Recognition, Madison, WI, USA. IEEE Computer Society, Los Alamitos, pp 128–133 16. Fred ALN, Leitão JMN (2003) A new cluster isolation criterion based on dissimilarity increments. IEEE Trans Pattern Analysis Mach Intell 25:944–958 17. Fred A, Jain AK (2005) Combining multiple clustering using evidence accumulation. IEEE Trans Pattern Analysis Mach Intell 27:835–850 18. Fred AL, Jain AK (2006) Learning pairwise similarity for data clustering. In: Proc the 18th Int Conf Pattern Recognition, Hong Kong, China. IEEE Computer Society, Washington, pp 925–928 19. Fu KS (1986) Syntactic pattern recognition. In: Handbook of pattern recognition and image processing. Academic Press, Orlando, pp 85–117 20. Greene D, Tsymbal A, Bolshakova N, Cunningham P (2004) Ensemble clustering in medical diagnostics. In: Long R, Antani S, Lee DJ, Nutter B, Zhang M (eds) Proc the 17th IEEE Symp Comp-Based Medical Syst, Bethesda, MD, USA. IEEE Computer Society, Los Alamitos, pp 576–581 21. Greene D, Cunningham P (2006) Efficient ensemble methods for document clustering. Tech Rep CS-2006-48, Trinity College Dublin 22. Halkidi M, Batistakis Y, Vazirgiannis M (2002) Cluster validity methods: part I. SIGMOD Record 31:40–45 23. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, Upper Saddle River
24. Jain AK (1989) Fundamentals of digital image processing. Prentice-Hall, Upper Saddle River 25. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Computing Surveys 31:264–323 26. Karypis G (2002) Multilevel hypergraph partitioning. Tech Rep 02-25, University of Minnesota 27. Kuncheva L, Hadjitodorov ST (2004) Using diversity in cluster ensembles. In: Proc the IEEE Int Conf Syst, Man and Cybernetics, The Hague, The Netherlands. IEEE Computer Society, Los Alamitos, pp 1214–1219 28. Law M, Topchy A, Jain AK (2004) Multiobjective data clustering. In: Proc the 2004 IEEE Computer Society Conf Comp Vision and Pattern Recognition, Washington, DC, USA. IEEE Computer Society, Los Alamitos, pp 424–430 29. Levine E, Domany E (2000) Resampling method for unsupervised estimation of cluster validity. Neural Computation 13:2573–2593 30. Louren¸co A, Fred A (2004) Comparison of combination methods using spectral clustering. In: Fred A (ed) Proc the 4th Int Workshop Pattern Recognition in Inf Syst, Porto, Portugal. INSTICC Press, Set´ ubal, pp 222–234 31. Louren¸co A, Fred A (2007) String patterns: from single clustering to ensemble methods and validation. In: Fred A, Jain AK (eds) Proc the 7th Int Workshop Pattern Recognition in Inf Syst, Funchal, Madeira, Portugal. INSTICC Press, Set´ ubal, pp 39–48 32. Lu SY, Fu KS (1977) Stochastic error-correcting syntax analysis for the recognition of noisy patterns. IEEE Trans Computers 26:1268–1276 33. Lu SY, Fu KS (1978) A sentence-to-sentence clustering procedure for pattern analysis. IEEE Trans Syst, Man and Cybernetics 8:381–389 34. Marzal A, Vidal E (1993) Computation of normalized edit distance and applications. IEEE Trans Pattern Analysis Mach Intell 2:926–932 35. Minaei-Bidgoli B, Topchy A, Punch W (2004) Ensembles of partitions via data resampling. In: Proc Inf Tech: Coding and Computing, Las Vegas, NV, USA. IEEE Computer Society, pp 188–192 36. Monti S, Tamayo P, Mesirov JP, Golub TR (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52:91–118 37. Ng AY, Jordan MI, Weiss Y (2002) On spectral clustering: analysis and an algorithm. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in Neural Inf Proc Syst 14. MIT Press, Cambridge, pp 849–856 38. Pekalska E, Duin RPW (2005) The dissimilarity representation for pattern recognition: foundations and applications. World Scientific, Singapore 39. Roth V, Lange T, Braun M, Buhmann J (2002) A resampling approach to cluster validation. In: Hrdle W (ed) Proc th 15th Symp in Computational Statistics, Berlin, Germany. Physica-Verlag, Heidelberg, pp 123–128 40. Singh V, Mukherjee L, Peng J, Xu J (2008) Ensemble clustering using semidefinite programming. In: Platt JC, Koller D, Singer Y, Roweis S (eds) Advances in Neural Inf Proc Syst 20, MIT Press, Cambridge 41. Strehl A, Ghosh J (2002) Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J Mach Learn Research 3:583–617 42. Strehl A, Ghosh J (2002) Consensus clustering – a knowledge reuse framework to combine clusterings. In: Proc Conf Artif Intell, Edmonton, AL, Canada. AAAI/MIT Press, pp 93–98
43. Strehl A, Ghosh J (2003) Relationship-based clustering and visualization for high-dimensional data mining. INFORMS J Computing 15:208–230 44. Topchy A, Jain AK, Punch W (2004) A mixture model of clustering ensembles. In: Proc the 4th SIAM Int Conf Data Mining, Lake Buena Vista, FL, USA. SIAM, Philadelphia 45. Villanueva WJP, Bezerra GBP, Lima CADM, Von Zuben FJ (2005) Improving support vector clustering with ensembles. In: Proc Workshop Achieving Functional Integration of Diverse Neural Models, Montréal, QB, Canada, pp 13–15
Random Subspace Ensembles for Clustering Categorical Data

Muna Al-Razgan, Carlotta Domeniconi, and Daniel Barbará

Department of Computer Science, George Mason University, Fairfax, Virginia 22030, USA,
[email protected],
[email protected],
[email protected] Summary. Cluster ensembles provide a solution to challenges inherent to clustering arising from its ill-posed nature. In fact, cluster ensembles can find robust and stable solutions by leveraging the consensus across multiple clustering results, while averaging out spurious structures that arise due to the various biases to which each participating algorithm is tuned. In this chapter we focus on the design of ensembles for categorical data. Our techniques build upon diverse input clusterings discovered in random subspaces, and reduce the problem of defining a consensus function to a graph partitioning problem. We experimentally demonstrate the efficacy of our approach in combination with the categorical clustering algorithm COOLCAT. Key words: clustering, clustering ensembles, categorical data, high dimensionality
1 Introduction

Recently, cluster ensembles have emerged as a technique for overcoming problems with clustering algorithms. It is well known that clustering methods may discover different patterns in a given set of data. This is because each clustering algorithm has its own bias resulting from the optimization of different criteria. Furthermore, there is no ground truth against which the clustering result can be validated. Thus, no cross-validation technique can be carried out to tune input parameters involved in the clustering process. As a consequence, the user is equipped with no guidelines for choosing the proper clustering method for a given dataset. An orthogonal issue related to clustering is high dimensionality. High dimensional data pose a difficult challenge to the clustering process. Various clustering algorithms can handle data with low dimensionality, but as the dimensionality of the data increases, these algorithms tend to break down. A cluster ensemble consists of different partitions. Such partitions can be obtained from multiple applications of any single algorithm with different initializations, or from the application of different algorithms to the same dataset. Cluster ensembles offer a solution to challenges inherent to clustering
arising from its ill-posed nature: they can provide more robust and stable solutions by leveraging the consensus across multiple clustering results, while averaging out emergent spurious structures that arise due to the various biases to which each participating algorithm is tuned. To enhance the quality of clustering results, clustering ensembles have been explored. Nevertheless, clustering ensembles for categorical data have not received much attention in the literature. In this chapter we focus on the design of ensembles for categorical data. Our techniques can be used in combination with any categorical clustering approach. Here we apply our method to the partitions provided by the COOLCAT algorithm [4].
2 Related Work

We briefly describe relevant work in the literature on categorical clustering and cluster ensembles. Clustering of categorical data has recently attracted the attention of many researchers. The k-modes algorithm [14] is an extension of k-means for categorical features. To update the modes during the clustering process, the authors used a new distance measure based on the number of mismatches between two points. Squeezer [23] is a categorical clustering algorithm that processes one point at a time. At each step, a point is either placed in an existing cluster, or it is rejected by all clusters and starts a new one. The decision is based on a given similarity function. ROCK (Robust Clustering using links) [9] is a hierarchical clustering algorithm for categorical data. It uses the Jaccard coefficient to compute the distance between points. Two points are considered neighbors if their Jaccard similarity exceeds a certain threshold. A link between two points is computed by considering the number of common neighbors. An agglomerative approach is then used to construct the hierarchy. To enhance the quality of clustering results, clustering ensemble approaches have been explored. A cluster ensemble technique is characterized by two components: the mechanism to generate diverse partitions, and the consensus function to combine the input partitions into a final clustering. One popular methodology utilizes a co-association matrix as a consensus function [7, 18]. Graph-based partitioning algorithms have been used with success to generate the combined clustering [3, 13, 22]. Clustering ensembles for categorical data have not received much attention in the literature. The approach in [10] generates a partition for each categorical attribute, so that points in each cluster share the same value for that attribute. The resulting clusterings are combined using the consensus functions presented in [22]. The work in [11] constructs cluster ensembles for data with mixed numerical and categorical features.
3 The COOLCAT Algorithm

In this section, we briefly present the clustering algorithm COOLCAT [4]. This algorithm has been proven effective for clustering categorical data. Thus, we use the clusterings provided by COOLCAT as components of our ensembles to further improve the grouping of data. The COOLCAT algorithm [4] is a scalable clustering algorithm that discovers clusters with minimal entropy in categorical data. COOLCAT uses categorical, rather than numerical, attributes, enabling the mining of real-world datasets offered by fields such as psychology and statistics. The algorithm is based on the idea that a cluster containing similar points has an entropy smaller than a cluster of dissimilar points. Thus, COOLCAT uses entropy to define the criterion for grouping similar objects. Formally, the entropy measures the uncertainty associated with a random variable. Let X be a random variable with values in S(X), and let p(x) be the corresponding probability function of X. The entropy of X is defined as follows:

H(X) = -\sum_{x \in S(X)} p(x) \log(p(x)) .

The entropy of a multivariate vector X = (X_1, X_2, \ldots, X_n) is defined as:

H(X) = -\sum_{x_1 \in S(X_1)} \cdots \sum_{x_n \in S(X_n)} p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n) .

To minimize the entropy associated with clusters, COOLCAT proceeds as follows. Given n points x_1, x_2, \ldots, x_n, where each point is represented as a vector of d categorical values, x_i = (x_i^1, \ldots, x_i^d), COOLCAT partitions the points into k clusters so that the entropy of the clustering is minimized. Let \hat{C} = \{C_1, \ldots, C_k\} represent the clustering. Then, the entropy associated with \hat{C} is:

H(\hat{C}) = \sum_{j=1}^{k} \frac{|C_j|}{n} H(C_j)

where H(C_j) is the entropy of cluster C_j:

H(C_j) = -\sum_{x_i^1 \in S(X^1)} \cdots \sum_{x_i^d \in S(X^d)} P(x_i^1, \ldots, x_i^d \mid C_j) \log(P(x_i^1, \ldots, x_i^d \mid C_j)) .
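As an illustration of this entropy criterion, the following sketch (ours, not the COOLCAT code) computes the expected entropy of a candidate clustering of categorical records. To keep it short, each cluster's entropy is estimated as the sum of per-attribute entropies, i.e. under an attribute-independence assumption, rather than from the full joint distribution.

```python
from collections import Counter
from math import log2

def cluster_entropy(records):
    """Entropy of one cluster of categorical records (attribute independence assumed)."""
    if not records:
        return 0.0
    h = 0.0
    for attr in zip(*records):                  # iterate over attribute columns
        counts = Counter(attr)
        total = len(attr)
        h -= sum(c / total * log2(c / total) for c in counts.values())
    return h

def clustering_entropy(clusters):
    """Expected entropy of a clustering: sum_j (|C_j| / n) * H(C_j)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)

# toy usage: two clusters of 3-attribute categorical points
c1 = [("a", "x", "low"), ("a", "x", "low"), ("a", "y", "low")]
c2 = [("b", "y", "high"), ("b", "z", "high")]
print(clustering_entropy([c1, c2]))
```

A candidate assignment of a point can then be scored by the increase in clustering entropy it causes, which is the quantity COOLCAT minimizes in its incremental phase described next.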
COOLCAT uses a heuristic to incrementally build clusters based on the entropy criterion. It consists of two main phases: an initialization step and an incremental step. During the initialization phase, COOLCAT bootstraps the algorithm by selecting a sample of points. Out of this sample, it selects the two points that have the maximum pairwise entropy, and so are most dissimilar. COOLCAT places these points in two different clusters. It then proceeds incrementally: at
each step, it selects the point that maximizes the minimum pairwise entropy with the previously chosen points. At the end, the k selected points are the initial seeds of the clusters. During the incremental phase, COOLCAT constructs the k clusters. It processes the data in batches. For each data point, it computes the entropy resulting from placing the point in each cluster, and then assigns the point to the cluster that gives the minimum entropy. The final clustering depends on the order in which points are assigned to clusters, thus there is a danger of obtaining a poor-quality result. To circumvent this problem, the authors of COOLCAT propose a reprocessing step. After clustering a batch of points, a fraction of the set is reconsidered for clustering, where the size of the fraction is an input parameter. The fraction of points that least fit the corresponding clusters is reassigned to more fitting clusters. To assess which points least match their clusters, COOLCAT counts the number of occurrences of each point's attribute values in the cluster to which it is assigned. This number of occurrences is then converted into a probability value by dividing it by the cluster size. The point with the lowest probability in each cluster is then removed and reprocessed according to the entropy criterion, as before. By performing this assessment at the conclusion of each incremental step, COOLCAT alleviates the risk imposed by the order in which points are presented. COOLCAT requires four input parameters: the number of clusters k, the sample size used in the initialization step, the buffer size, and the number of points considered for reprocessing. While COOLCAT is an effective method for clustering categorical data, it still suffers from limitations inherent to clustering algorithms. The solution identified by COOLCAT depends on the initial seeding of the clusters. The greedy sequential strategy used by the algorithm affects the result as well (although the reprocessing step alleviates this problem in part). Furthermore, the sparsity of the data in high dimensional spaces can severely compromise the ability to discover meaningful clustering solutions. In the following we address these issues by constructing ensembles of clusterings resulting from multiple runs of COOLCAT in random feature subspaces.
4 Clustering Ensemble Techniques

Consider a set S = \{x_1, x_2, \ldots, x_n\} of n points. A clustering ensemble is a collection of m clustering solutions: C = \{C_1, C_2, \ldots, C_m\}. Each clustering solution C_L, for L = 1, \ldots, m, is a partition of the set S, i.e. C_L = \{C_L^1, C_L^2, \ldots, C_L^{K_L}\}, where \cup_K C_L^K = S. Given a collection of clustering solutions C and the desired number of clusters k, the objective is to combine the different clustering solutions and compute a new partition of S into k disjoint clusters.
The challenge in cluster ensembles is the design of a proper consensus function that combines the component clustering solutions into an "improved" final clustering. In our ensemble techniques we reduce the problem of defining a consensus function to a graph partitioning problem. This approach has shown good results in the literature [5, 6, 22]. In the next sections, we introduce our two consensus functions for categorical features, using the clustering results produced by COOLCAT.

4.1 Motivation

To enhance the accuracy of the clustering given by COOLCAT, we construct an ensemble of clusterings obtained by multiple runs of the algorithm. A good accuracy-diversity trade-off must be achieved to obtain a consensus solution that is superior to the components. To improve the diversity among the ensemble components, each run of COOLCAT operates within a random subspace of the feature space, obtained by randomly sampling a fixed number of attributes from the given ones. Thus, diversity is guaranteed by providing the components with different views (or projections) of the data. Since such views are generated randomly from a (typically) large pool of attributes, it is highly likely that each component receives a different perspective of the data, which leads to the discovery of diverse (and complementary) structures within the data. The rationale behind our approach finds its justification in classifier ensembles, and in the theory of Stochastic Discrimination [16, 17]. The advantages of a random subspace method, in fact, are well known in the context of ensembles of classifiers [12, 21]. Furthermore, we observe that performing clustering in random subspaces should be advantageous when the data present redundant features and/or the discrimination power is spread over many features, which is often the case in real life scenarios. Under these conditions, in fact, redundant/noisy features are less likely to appear in random subspaces. Moreover, since discriminant information is distributed across several features, we can generate multiple meaningful (for cluster discrimination) subspaces. This is in agreement with the assumptions made by the theory of stochastic discrimination [17] for building effective ensembles of classifiers; that is, there exist multiple sets of features able to discern between training data in different classes, and unable to discern training and testing data in the same class.

4.2 Categorical Similarity Partitioning Algorithm (CSPA)

Our aim here is to generate robust and stable solutions via a consensus clustering method. We can generate contributing clusterings by running the COOLCAT algorithm multiple times within random subspaces. Thus, each ensemble component has access to a random sample of f features drawn from the original d dimensional feature space. The objective is then to find a consensus
partition from the output partitions of the contributing clusterings, so that an "improved" overall clustering of the data is obtained. In order to derive our consensus function, for each data point x_i and each cluster C_l we want to define the probability associated with cluster C_l given that we have observed x_i. Such a probability value must conform to the information provided by a given component clustering of the ensemble. The consensus function will then aggregate the findings of each clustering component utilizing such probabilities. COOLCAT partitions the data into k distinct clusters. In order to compute distances between data points and clusters, we represent clusters using modes. The mode of a cluster is the vector of the most frequent attribute values in the given cluster. In particular, when different values for an attribute have the same frequency of occurrence, we consider the whole data set, and choose the value that has the lowest overall frequency. Ties are broken randomly. We then compute the distance between a point x_i and a cluster C_l by considering the Jaccard distance [19] between x_i and the mode c_l of cluster C_l, defined as follows:

d_{il} = 1 - \frac{|x_i \cap c_l|}{|x_i \cup c_l|}   (1)

where |x_i \cap c_l| represents the number of matching attribute values in the two vectors, and |x_i \cup c_l| is the number of distinct attribute values in the two vectors. Let D_i = \max_l \{d_{il}\} be the largest distance of x_i from any cluster. We want to define the probability associated with cluster C_l given that we have observed x_i. At a given point x_i, the cluster label C_l is assumed to be a random variable from a distribution with probabilities \{P(C_l \mid x_i)\}_{l=1}^{k}. We provide a nonparametric estimation of such probabilities based on the data and on the clustering result. In order to embed the clustering result in our probability estimations, the smaller the distance d_{il} is, the larger the corresponding probability credited to C_l should be. Thus, we can define P(C_l \mid x_i) as follows:

P(C_l \mid x_i) = \frac{D_i - d_{il} + 1}{k D_i + k - \sum_l d_{il}}   (2)
where the denominator serves as a normalization factor to guarantee \sum_{l=1}^{k} P(C_l \mid x_i) = 1. We observe that P(C_l \mid x_i) > 0 for all l = 1, \ldots, k and all i = 1, \ldots, n. In particular, the added value of 1 in (2) allows for a non-zero probability P(C_L \mid x_i) when L = \arg\max_l \{d_{il}\}. In this last case P(C_l \mid x_i) assumes its minimum value P(C_L \mid x_i) = 1/(k D_i + k - \sum_l d_{il}). For smaller distance values d_{il}, P(C_l \mid x_i) increases proportionally to the difference D_i - d_{il}: the larger the deviation of d_{il} from D_i, the larger the increase. As a consequence, the corresponding cluster C_l becomes more likely, as is reasonable to expect based on the information provided by the clustering process. Thus, (2) provides a nonparametric estimation of the posterior probability associated with each cluster C_l.
We can now construct the vector P_i of posterior probabilities associated with x_i:

P_i = (P(C_1 \mid x_i), P(C_2 \mid x_i), \ldots, P(C_k \mid x_i))^t   (3)

where t denotes the transpose of a vector. The transformation x_i \rightarrow P_i maps the d dimensional data points x_i onto a new space of relative coordinates with respect to cluster centroids, where each dimension corresponds to one cluster. This new representation embeds information from both the original input data and the clustering result. We then define the similarity between x_i and x_j as the cosine similarity between the corresponding probability vectors:

s(x_i, x_j) = \frac{P_i^t P_j}{\|P_i\| \|P_j\|} .   (4)

We combine all pairwise similarities (4) into an (n \times n) similarity matrix S, where S_{ij} = s(x_i, x_j). We observe that, in general, each clustering may provide a different number of clusters, with different sizes and boundaries. The size of the similarity matrix S is independent of the clustering approach, thus providing a way to align the different clustering results onto the same space, with no need to solve a label correspondence problem. After running the COOLCAT algorithm m times for different feature samples, we obtain the m similarity matrices S_1, S_2, \ldots, S_m. The combined similarity matrix \Psi defines a consensus function that can guide the computation of a consensus partition:

\Psi = \frac{1}{m} \sum_{l=1}^{m} S_l .   (5)
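The computations in (1)-(5) can be sketched as follows. This is our own illustration, not the authors' code, and the assumed input layout is ours: each run is a pair (data, modes), with data the records viewed in that run's subspace and modes its k cluster modes.

```python
import numpy as np

def jaccard_distance(x, mode):
    """Eq. (1): 1 - |matching values| / |distinct values|, counted attribute-wise."""
    m = sum(a == b for a, b in zip(x, mode))
    return 1.0 - m / (2 * len(x) - m)

def posterior_vector(x, modes):
    """Eq. (2): nonparametric posteriors P(C_l | x) from distances to the k modes."""
    d = np.array([jaccard_distance(x, mo) for mo in modes])
    D = d.max()
    p = D - d + 1.0
    return p / (len(modes) * D + len(modes) - d.sum())

def consensus_matrix(runs):
    """Eqs. (4)-(5): average cosine similarity of posterior vectors across runs."""
    n = len(runs[0][0])
    psi = np.zeros((n, n))
    for data, modes in runs:
        P = np.array([posterior_vector(x, modes) for x in data])   # eq. (3), row-wise
        Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
        psi += Pn @ Pn.T                                           # S^nu
    return psi / len(runs)                                         # Psi
```

The matrix returned by consensus_matrix then plays the role of the edge-weight matrix of the complete graph described next.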
\Psi_{ij} reflects the average similarity between x_i and x_j (through P_i and P_j) across the m contributing clusterings. We now map the problem of finding a consensus partition to a graph partitioning problem. We construct a complete graph G = (V, E), where |V| = n and the vertex V_i identifies x_i. The edge E_{ij} connecting the vertices V_i and V_j is assigned the weight value \Psi_{ij}. We run METIS [15] on the resulting graph to compute a k-way partitioning of the n vertices that minimizes the edge weight-cut (in our experiments we also apply spectral clustering to compute a k-way partitioning of the n vertices). This gives the consensus clustering we seek. The size of the resulting graph partitioning problem is n^2. The steps of the algorithm, which we call CSPA (Categorical Similarity Partitioning Algorithm), are summarized in the following.

Input: n points x \in \mathbb{R}^d, number of features f, and number of clusters k.

1. Produce m subspaces by randomly sampling f features without replacement from the d dimensional original feature space.
2. Run COOLCAT m times, each time using a different sample of f features.
3. For each partition \nu = 1, \ldots, m:
   a) Obtain the mode c_l for each cluster (l = 1, \ldots, k).
   b) Compute the Jaccard distance d_{il}^\nu between each data point x_i and each mode c_l: d_{il}^\nu = 1 - |x_i \cap c_l| / |x_i \cup c_l|.
   c) Set D_i^\nu = \max_l \{d_{il}^\nu\}.
   d) Compute P(C_l^\nu \mid x_i) = (D_i^\nu - d_{il}^\nu + 1) / (k D_i^\nu + k - \sum_l d_{il}^\nu).
   e) Set P_i^\nu = (P(C_1^\nu \mid x_i), P(C_2^\nu \mid x_i), \ldots, P(C_k^\nu \mid x_i))^t.
   f) Compute the similarity s^\nu(x_i, x_j) = (P_i^\nu)^t P_j^\nu / (\|P_i^\nu\| \|P_j^\nu\|), \forall i, j.
   g) Construct the matrix S^\nu where S_{ij}^\nu = s^\nu(x_i, x_j).
4. Build the consensus function \Psi = \frac{1}{m} \sum_{\nu=1}^{m} S^\nu.
5. Construct the complete graph G = (V, E), where |V| = n and V_i \equiv x_i. Assign \Psi_{ij} as the weight value of the edge E_{ij} connecting the vertices V_i and V_j.
6. Run METIS (or spectral clustering) on the resulting graph G.
Output: The resulting k-way partition of the n vertices.

4.3 Categorical Bipartite Partitioning Algorithm (CBPA)

Our second approach (CBPA) maps the problem of finding a consensus partition to a bipartite graph partitioning problem. This mapping was first introduced in [6]. In [6], however, 0/1 weight values are used. Here we extend the range of weight values to [0,1]. The graph in CBPA models both instances (e.g., data points) and clusters, and the graph edges can only connect an instance vertex to a cluster vertex, forming a bipartite graph. In detail, we proceed as follows for the construction of the graph. Suppose, again, that we run the COOLCAT algorithm m times for different sets of f random features. For each instance x_i, and for each clustering \nu = 1, \ldots, m, we can compute the vector of posterior probabilities P_i^\nu, as defined in (3) and (2). Using the P vectors, we construct the following matrix A:

A = \begin{pmatrix} (P_1^1)^t & (P_1^2)^t & \cdots & (P_1^m)^t \\ (P_2^1)^t & (P_2^2)^t & \cdots & (P_2^m)^t \\ \vdots & \vdots & & \vdots \\ (P_n^1)^t & (P_n^2)^t & \cdots & (P_n^m)^t \end{pmatrix}   (6)

Note that the (P_i^\nu)^t are row vectors (t denotes the transpose). The dimensionality of A is therefore n \times km, under the assumption that each of the m clusterings produces k clusters. (We observe that the definition of A can be
easily generalized to the case where each clustering may discover a different number of clusters.) Based on A we can now define a bipartite graph to which our consensus partition problem maps. Consider the graph G = (V, E), with V and E constructed as follows. V = V^C \cup V^I, where V^C contains km vertices, each representing a cluster of the ensemble, and V^I contains n vertices, each representing an input data point. Thus |V| = km + n. The edge E_{ij} connecting the vertices V_i and V_j is assigned a weight value defined as follows. If the vertices V_i and V_j both represent clusters or both represent instances, then E(i, j) = 0; otherwise, if vertex V_i represents an instance x_i and vertex V_j represents a cluster C_j^\nu (or vice versa), then the corresponding entry of E is A(i, k(\nu - 1) + j). Note that the dimensionality of E is (km + n) \times (km + n), and E can be written as follows:

E = \begin{pmatrix} 0 & A^t \\ A & 0 \end{pmatrix} .

A partition of the bipartite graph G partitions the cluster vertices and the instance vertices simultaneously. The partition of the instances can then be output as the final clustering. Due to the special structure of the graph G (a sparse graph), the size of the resulting bipartite graph partitioning problem is kmn. Assuming that km \ll n, this complexity is much smaller than the size n^2 of CSPA. The steps of the algorithm, which we call CBPA (Categorical Bipartite Partitioning Algorithm), are summarized in the following.

Input: n points x \in \mathbb{R}^d, number of features f, and number of clusters k.

1. Produce m subspaces by randomly sampling f features without replacement from the d dimensional original feature space.
2. Run COOLCAT m times, each time using a different sample of f features.
3. For each partition \nu = 1, \ldots, m:
   a) Obtain the mode c_l for each cluster (l = 1, \ldots, k).
   b) Compute the Jaccard distance d_{il}^\nu between each data point x_i and each mode c_l: d_{il}^\nu = 1 - |x_i \cap c_l| / |x_i \cup c_l|.
   c) Set D_i^\nu = \max_l \{d_{il}^\nu\}.
   d) Compute P(C_l^\nu \mid x_i) = (D_i^\nu - d_{il}^\nu + 1) / (k D_i^\nu + k - \sum_l d_{il}^\nu).
   e) Set P_i^\nu = (P(C_1^\nu \mid x_i), P(C_2^\nu \mid x_i), \ldots, P(C_k^\nu \mid x_i))^t.
4. Construct the matrix A as in (6).
5. Construct the bipartite graph G = (V, E), where V = V^C \cup V^I, |V^I| = n and V_i^I \equiv x_i, |V^C| = km and V_j^C \equiv C_j (a cluster of the ensemble). Set E(i, j) = 0 if V_i and V_j are both clusters or both instances. Set E(i, j) = A(i - km, j) = E(j, i) if V_i and V_j represent an instance and a cluster.
6. Run METIS (or spectral clustering) on the resulting graph G.
Output: The resulting k-way partition of the n vertices in V^I.
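For illustration, a minimal construction of the matrices A and E of (6) is sketched below. The code and its input layout are our assumptions: it expects a list of n x k posterior matrices, one per run, whose rows are the vectors P_i^ν.

```python
import numpy as np

def bipartite_adjacency(posteriors):
    """Stack per-run posterior matrices into A (eq. (6)) and build the
    (km+n) x (km+n) bipartite weight matrix E = [[0, A^t], [A, 0]]."""
    A = np.hstack(posteriors)        # n x (k*m): row i is (P_i^1)^t ... (P_i^m)^t
    n, km = A.shape
    E = np.zeros((km + n, km + n))
    E[km:, :km] = A                  # instance-to-cluster weights
    E[:km, km:] = A.T                # symmetric cluster-to-instance weights
    return A, E
```

A k-way partition of this graph (with METIS or spectral clustering) then groups instance and cluster vertices simultaneously; the labels of the last n vertices give the consensus partition of the data.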
5 Experimental Design and Results

In our experiments, we used four real datasets. The characteristics of all datasets are given in Table 1. The Archeological dataset is taken from [1], and was used in [4] as well. Soybeans, Breast-cancer, and Congressional Votes are from the UCI Machine Learning Repository [20]. The Soybeans dataset consists of 47 samples and 35 attributes. Since some attributes have only one value, we removed them and selected the remaining 21 attributes for our experiments, as has been done in other research [8]. For the Breast-cancer data, we sub-sampled the most populated class from 444 to 239, as done in our previous work, to obtain balanced data [2]. The Congressional Votes dataset contains attributes which consist of either 'yes' or 'no' responses; we treat missing values as an additional domain value for each feature, as done in [4].

Table 1. Characteristics of the data

Dataset         k   d    n (points-per-class)
Archeological   2   8    20 (11-9)
Soybeans        4   21   47 (10-10-10-17)
Breast-cancer   2   9    478 (239-239)
Vote            2   16   435 (267-168)
Evaluating the quality of clustering is in general a difficult task. Since class labels are available for the datasets used here, we evaluate the results by computing the error rate and the normalized mutual information (NMI). The error rate is computed according to the confusion matrix. The NMI provides a measure that is impartial with respect to the number of clusters [22]. It reaches its maximum value of one only when the result completely matches the original labels. The NMI is computed according to the average mutual information between every pair of cluster and class [22]:

NMI = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} n_{i,j} \log\left(\frac{n \cdot n_{i,j}}{n_i \, n_j}\right)}{\sqrt{\left(\sum_{i=1}^{k} n_i \log\frac{n_i}{n}\right)\left(\sum_{j=1}^{k} n_j \log\frac{n_j}{n}\right)}}   (7)

where n_{i,j} is the number of points shared by cluster i and class j, n_i is the number of data points in cluster i, n_j is the number of data points in class j, and n is the total number of points.
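A direct transcription of (7), assuming a cluster-by-class contingency matrix as input, is given below (our helper function, not part of the chapter):

```python
import numpy as np

def nmi_from_confusion(M):
    """Eq. (7): NMI from a (clusters x classes) contingency matrix."""
    M = np.asarray(M, dtype=float)
    n = M.sum()
    ni = M.sum(axis=1)                      # cluster sizes
    nj = M.sum(axis=0)                      # class sizes
    outer = np.outer(ni, nj)
    nz = M > 0                              # treat 0*log(0) as 0
    num = (M[nz] * np.log(M[nz] * n / outer[nz])).sum()
    den = np.sqrt((ni * np.log(ni / n)).sum() * (nj * np.log(nj / n)).sum())
    return num / den

# perfect agreement on a 2-class toy problem gives NMI = 1
print(nmi_from_confusion([[10, 0], [0, 15]]))
```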
5.1 Analysis of the Results

For each dataset, we ran COOLCAT 10 times with different sets of random features. The number f of selected features was set to half the original dimensionality for each data set: f = 4 for Archeological, f = 11 for Soybeans, f = 5 for Breast-cancer, and f = 8 for Vote. The clustering results of COOLCAT are then given as input to the consensus clustering techniques being compared. As the value of k, we gave both COOLCAT and the ensemble algorithms the actual number of classes in the data. Figures 1-8 plot the error rate (%) achieved by COOLCAT in each random subspace, and the error rates of our categorical clustering ensemble methods (CSPA-Metis, CSPA-SPEC, CBPA-Metis, and CBPA-SPEC, where SPEC is short for spectral clustering). We also plot the error rate achieved by COOLCAT over multiple runs in the entire feature space. The figures show that we were able to obtain diverse clusterings within the random subspaces. Furthermore, the unstable performance of COOLCAT in the original space shows its sensitivity to the initial random seeding process, and to the order according to which data are processed. We kept the input parameters of COOLCAT fixed in all runs: the sample size was set to 8, and the reprocessing size was set to 10. Detailed results for all data are provided in Tables 2-5, where we report the NMI and error rate (ER) of the ensembles, as well as the maximum, minimum, and average NMI and error rate values for the input clusterings. In general, our ensemble techniques were able to filter out spurious structures identified by individual runs of COOLCAT, and performed quite well. Our techniques produced error rates comparable with, and sometimes better than, COOLCAT's minimum error rate. CSPA-Metis provided the lowest error rate among the methods being compared on three data sets. For the Archeological and Breast-cancer data, the error rate provided by the CSPA-Metis technique is as good as or better than that of the best individual input clustering. It is worth noticing that for these two datasets, CSPA-Metis gave an error rate which is lower than that of the best individual input clustering on the entire feature space (see Figs. 2 and 6). In particular, on the Breast-cancer data all ensemble techniques provided excellent results. For the Soybeans dataset, the error rate of CSPA-Metis is still well below the average of the input clusterings, and for Vote it is very close to the average. CBPA (both with Metis and SPEC) also performed quite well. In general, it produced error rates comparable with the other techniques. CBPA produced error rates well below the average error rates of the input clusterings, with the exception of the Vote dataset. For the Vote data, all ensemble methods gave error rates close to the average error rate of the input clusterings. In this case, COOLCAT on the full space gave a better performance. Overall, our categorical clustering ensemble techniques are capable of boosting the performance of COOLCAT, and achieve more robust results.
Given the competitive behavior previously shown by COOLCAT, the improvement obtained by our ensemble techniques is a valuable achievement.

Table 2. Results on Archeological data

             EnsNMI  EnsER  MaxER  MinER  AvER  MaxNMI  MinNMI  AvNMI
CSPA-METIS   1       0      45.0   0      24.0  1       0.033   0.398
CSPA-SPEC    0.210   36.0   45.0   0      24.0  1       0.033   0.398
CBPA-METIS   0.528   10.0   45.0   0      24.0  1       0.033   0.398
CBPA-SPEC    0.603   18.0   45.0   0      24.0  1       0.033   0.398
Table 3. Results on Soybeans data

             EnsNMI  EnsER  MaxER  MinER  AvER  MaxNMI  MinNMI  AvNMI
CSPA-METIS   0.807   10.6   52.1   0      24.4  1       0.453   0.689
CSPA-SPEC    0.801   12.3   52.1   0      24.4  1       0.453   0.689
CBPA-METIS   0.761   12.8   52.1   0      24.4  1       0.453   0.689
CBPA-SPEC    0.771   15.3   52.1   0      24.4  1       0.453   0.689
Table 4. Results on Breast cancer data

             EnsNMI  EnsER  MaxER  MinER  AvER  MaxNMI  MinNMI  AvNMI
CSPA-METIS   0.740   4.3    9.4    6.1    7.9   0.699   0.601   0.648
CSPA-SPEC    0.743   4.4    9.4    6.1    7.9   0.699   0.601   0.648
CBPA-METIS   0.723   4.8    9.4    6.1    7.9   0.699   0.601   0.648
CBPA-SPEC    0.743   4.4    9.4    6.1    7.9   0.699   0.601   0.648
Table 5. Results on Vote data

             EnsNMI  EnsER  MaxER  MinER  AvER  MaxNMI  MinNMI  AvNMI
CSPA-METIS   0.473   14.0   17.7   6.9    13.7  0.640   0.345   0.447
CSPA-SPEC    0.449   13.5   17.7   6.9    13.7  0.640   0.345   0.447
CBPA-METIS   0.473   14.0   17.7   6.9    13.7  0.640   0.345   0.447
CBPA-SPEC    0.439   14.2   17.7   6.9    13.7  0.640   0.345   0.447
Fig. 1. Archeological data: error rates of cluster ensemble methods, and COOLCAT in random subspaces
Fig. 2. Archeological data: error rates of cluster ensemble methods, and COOLCAT using all features
Fig. 3. Soybeans data: error rates of cluster ensemble methods, and COOLCAT in random subspaces
Fig. 4. Soybeans data: error rates of cluster ensemble methods, and COOLCAT using all features
Fig. 5. Breast-cancer data: error rates of cluster ensemble methods, and COOLCAT in random subspaces
Fig. 6. Breast-cancer data: error rates of cluster ensemble methods, and COOLCAT using all features
Fig. 7. Vote data: error rates of cluster ensemble methods, and COOLCAT in random subspaces
Fig. 8. Vote data: error rates of cluster ensemble methods, and COOLCAT using all features
6 Conclusion and Future Work

We have proposed two techniques to construct clustering ensembles for categorical data. A number of issues remain to be explored: (1) determining which specific ensemble method is best suited for a given dataset; (2) achieving more accurate clustering components while maintaining high diversity (e.g. by exploiting correlations among features); (3) testing our techniques on higher dimensional categorical data. We will address these questions in our future work.
Acknowledgments This work was in part supported by NSF CAREER Award IIS-0447814.
References

1. Aldenderfer MS, Blashfield RK (1984) Cluster analysis. Sage Publications, Thousand Oaks 2. Al-Razgan M, Domeniconi C (2006) Weighted clustering ensembles. In: Ghosh J, Lambert D, Skillicorn DB, Srivastava J (eds) Proc 6th SIAM Int Conf Data Mining, Bethesda, MD, USA. SIAM, Philadelphia, pp 258–269 3. Ayad H, Kamel M (2003) Finding natural clusters using multi-clusterer combiner based on shared nearest neighbors. In: Windeatt T, Roli F (eds) Proc 4th Int Workshop Multiple Classifier Systems, Guildford, UK. Springer, Berlin/Heidelberg, pp 166–175 4. Barbará D, Li Y, Couto J (2002) COOLCAT: an entropy-based algorithm for categorical clustering. In: Proc 11th Int Conf Inf Knowl Manag, McLean, VA, USA. ACM Press, New York, pp 582–589 5. Dhillon I (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proc 7th SIGKDD Int Conf Knowl Discov Data Mining, San Francisco, CA, USA. ACM Press, New York, pp 269–274 6. Fern X, Brodley C (2004) Solving cluster ensemble problems by bipartite graph partitioning. In: Proc 21st Int Conf on Mach Learn, Banff, AL, Canada. ACM, New York, pp 281–288 7. Fred A, Jain A (2002) Data clustering using evidence accumulation. In: Proc 16th Int Conf Pattern Recognition, Quebec, QB, Canada. IEEE Computer Society, Washington, pp 276–280 8. Gan G, Wu J (2004) Subspace clustering for high dimensional categorical data. ACM SIGKDD Explorations Newsletter 6:87–94 9. Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Proc 15th Int Conf Data Engineering, Sydney, NSW, Australia. IEEE Computer Society, Washington, pp 512–521 10. He Z, Xu X, Deng S (2005) A cluster ensemble method for clustering categorical data. Inf Fusion 6:143–151 11. He Z, Xu X, Deng S (2005) Clustering mixed numeric and categorical data: a cluster ensemble approach. ArXiv Computer Science e-prints
12. Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Analysis Mach Intell 20:832–844 13. Hu X (2004) Integration of cluster ensemble and text summarization for gene expression analysis. In: Proc 4th IEEE Symp Bioinformatics and Bioengineering, Taichung, Taiwan, ROC. IEEE Computer Society, Washington, pp 251–258 14. Huang Z (1998) Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining Knowl Discov 2:283–304 15. Karypis G, Kumar V (1995) Multilevel k-way partitioning scheme for irregular graphs. Technical report, University of Minnesota, Department of Computer Science and Army HPC Research Center 16. Kleinberg EM (1990) Stochastic discrimination. Annals Math Artif Intell 1:207– 239 17. Kleinberg EM (1996) An overtraining-resistant stochastic modeling method for pattern recognition. The Annals of Stat 24:2319–2349 18. Kuncheva L, Hadjitodorov S (2004) Using diversity in cluster ensembles. In: Proc IEEE Int Conf Systems, Man and Cybernetics, The Hague, The Netherlands. IEEE Computer Society, Washington, pp 1214–1219 19. Mei Q, Xin D, Cheng H, Han J, Zhai C (2006) Generating semantic annotations for frequent patterns with context analysis. In: Proc 12th SIGKDD Int Conf Knowl Discov Data Mining, Philadelphia, PA, USA. ACM Press, New York, pp 337–346 20. Newman D, Hettich S, Blake C, Merz, C (1998) UCI repository of machine learning databases 21. Skurichina M, Duin RPW (2001) Bagging and the random subspace method for redundant feature spaces. In: Kittler J, Roli, F (eds) Proc 2nd Int Workshop Multiple Classifier Systems, Cambridge, UK. Springer, London, pp 1–10 22. Strehl A, Ghosh J (2002) Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J Mach Learn Research 3: pp 583–617 23. Zengyou H, Xiaofei X, Shengchun D (2002) Squeezer: an efficient algorithm for clustering categorical data. J Comput Sci Technol 17:611–624
Ensemble Clustering with a Fuzzy Approach

Roberto Avogadri and Giorgio Valentini

DSI, Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italia, avogadri,
[email protected]
Summary. Ensemble clustering is a novel research field that extends to unsupervised learning the approach originally developed for classification and supervised learning problems. In particular, ensemble clustering methods have been developed to improve the robustness and accuracy of clustering algorithms, as well as the ability to capture the structure of complex data. In many clustering applications an example may belong to multiple clusters, and the introduction of fuzzy set theory concepts can provide the level of flexibility needed to model the uncertainty underlying real data in several application domains. In this chapter, we propose an unsupervised fuzzy ensemble clustering approach that combines the flexibility of fuzzy sets with the robustness of ensemble methods. Our algorithmic scheme can generate different ensemble clustering algorithms that allow the final consensus clustering to be obtained in both crisp and fuzzy formats.

Key words: ensemble clustering, k-means, fuzzy sets, triangular norm, random projections, Johnson-Lindenstrauss lemma, gene expression
1 Introduction

Ensemble clustering methods have been recently proposed to improve the accuracy, stability and robustness of clustering algorithms [1, 2, 3, 4]. They are characterized by many desirable qualities, such as scalability and parallelism, the ability to capture complex data structures, and robustness to noise [5]. Ensemble methods can combine both different data and different clustering algorithms. For instance, ensemble algorithms have been used in data-mining to combine heterogeneous data or to combine data in a distributed environment [6]. Other research lines proposed to combine heterogeneous clustering algorithms to generate an overall consensus ensemble clustering, in order to exploit the different characteristics of the clustering algorithms [7]. In another general approach to ensemble clustering, multiple instances of the data are obtained
through "perturbations" of the original data: a clustering algorithm is applied to the multiple perturbed data sets and the results are combined to achieve the overall ensemble clustering. In this context several techniques have been proposed, such as noise injection, bagging, and random projections [8, 9]. These methods try to improve both the accuracy and the diversity of each component (base) clustering. In fact, several works showed that the diversity among the solutions of the components of the ensemble is one of the crucial factors in developing robust and reliable ensemble clustering algorithms [1, 10]. In many real world clustering applications it may occur that an example belongs to more than one cluster, and in these cases traditional clustering algorithms are not able to capture the real nature of the data. Consider, for instance, general clustering problems in bioinformatics, such as the discovery of functional classes of genes: it is well-known that a single gene may participate in different biological processes, and thus it may belong to multiple functional classes of genes. Sometimes it is enough to use hard clustering algorithms and to relax the condition that the final clustering has to be a partition, or to apply a probabilistic approach [3]. In this chapter we propose an ensemble clustering algorithmic scheme designed for problems where we need to capture and manage the possibility for an element to belong to more than one class, with different degrees of membership. To achieve this objective we use fuzzy-set theory to express the uncertainty of the data ownership, and other fuzzy tools to transform the fuzzy clusterings into crisp clusterings. To perturb the data we apply random projections with low distortion [9], a method well-suited to managing high dimensional data (a high number of attributes or features), reducing the computational time and at the same time improving the diversity of the data. Combining ensemble clustering techniques and fuzzy set theory, on the one hand we can improve the accuracy and the robustness of the consensus ensemble clustering, and on the other hand we can deal with the uncertainty and the fuzziness underlying real data. In the following sections we introduce the random projections and the fuzzy operators that characterize our proposed unsupervised ensemble methods. Then we describe the fuzzy ensemble clustering algorithmic scheme and the algorithms that can be obtained from it. In Sect. 5 we present some results of their application to synthetic and real data sets. Discussion and conclusion end the chapter.
2 Random Projections
Our proposed method applies random projections with low distortion to perturb the data. The objective is to reduce the dimensionality (number of features) of the data while approximately preserving their structure. Consider a pair of Euclidean spaces, the original high d-dimensional space and the target d′-dimensional space, with d > d′. A random projection θ is a
randomized function θ : R^d → R^{d′} such that, for all p, q ∈ R^d and 0 < ε < 0.5, with high probability the following inequalities hold:

1 − ε ≤ ||θ(p) − θ(q)||² / ||p − q||² ≤ 1 + ε .
An example of random projection is the Plus-Minus-One (PMO) projection [11], θ(p) = Rp, represented by a matrix R with elements R_ij = (1/√d′) A_ij, where A is a d′ × d matrix with A_ij ∈ {−1, 1} such that Prob(A_ij = 1) = Prob(A_ij = −1) = 1/2.
A key problem consists in finding d′ such that, for every pair of data points p, q ∈ R^d, the distances between the projections θ(p) and θ(q) are approximately preserved with high probability. A natural measure of the approximation is the distortion dist_θ:

dist_θ(p, q) = ||θ(p) − θ(q)||² / ||p − q||² .   (1)
If dist_θ(p, q) = 1, the distances are preserved. If 1 − ε ≤ dist_θ(p, q) ≤ 1 + ε, we say that an ε-distortion level is introduced. The main result on random projections is due to the Johnson-Lindenstrauss (JL) lemma [12]: given n vectors {x1, . . . , xn} ⊂ R^d, if d′ ≥ c (log n)/ε², where c is a suitable constant, then there exists a projection θ : R^d → R^{d′} that, with high probability, preserves the distances for all pairs (xi, xj), i, j ∈ {1, . . . , n}, i.e.:

(1 − ε) d(xi, xj) ≤ d(θ(xi), θ(xj)) ≤ (1 + ε) d(xi, xj) .

If we choose a value of d′ according to the JL lemma, we may perturb the data introducing only bounded distortions, approximately preserving the metric structure of the original data (see [13] for more details). Examples of random projections that obey the JL lemma can be found in [9, 13]. A key point is that the original dimension d must not be “too small”; otherwise the original space and the reduced space obtained through the JL lemma end up having similar dimensions, even though the target dimension d′ does not depend on d but only on the number of examples n in the data set and on the chosen distortion ε. In fact our main target applications are characterized by high dimensionality, such as DNA microarray data [14], where usually few examples (samples) of high dimensionality (number of features/genes) are available. If d′ ≪ d, we can save considerable computational time, working with a data set that approximately preserves the metric characteristics of the original space. The perturbation of the data is obtained by randomly choosing the d′ projected features for every base learner of the ensemble. However, different perturbation methods can in principle be used.
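As a rough illustration of the PMO projection and of the distortion measure defined above, the following minimal numpy sketch (ours, not the authors' code) generates a PMO matrix, projects a toy d × n data matrix and checks the empirical distortion for one pair of examples. The constant c = 4 in the JL bound is only an illustrative choice, not a value prescribed by the chapter (with c = 4, n = 60 and ε = 0.2 it happens to give d′ = 410, the dimension used in Sect. 5.1).

```python
# Minimal sketch (not the authors' code) of a PMO random projection applied to a
# d x n data matrix, as in the chapter; the constant c below is illustrative only.
import numpy as np

def pmo_projection_matrix(d_prime, d, rng):
    """d' x d matrix with entries +/- 1/sqrt(d'), each sign with probability 1/2."""
    signs = rng.choice([-1.0, 1.0], size=(d_prime, d))
    return signs / np.sqrt(d_prime)

def jl_dimension(n, eps, c=4.0):
    """Target dimension d' >= c * log(n) / eps^2 (c is an assumed, chapter-unspecified constant)."""
    return int(np.ceil(c * np.log(n) / eps ** 2))

rng = np.random.default_rng(0)
d, n, eps = 5000, 60, 0.2
D = rng.normal(size=(d, n))                 # toy data: n examples of dimension d
d_prime = jl_dimension(n, eps)              # 410 for n = 60, eps = 0.2
R = pmo_projection_matrix(d_prime, d, rng)
Dt = R @ D                                  # projected d' x n data matrix

# Empirical distortion dist_theta(p, q), eq. (1), for one pair of examples
p, q = 0, 1
dist = np.sum((Dt[:, p] - Dt[:, q]) ** 2) / np.sum((D[:, p] - D[:, q]) ** 2)
print(d_prime, dist)                        # dist should typically lie near [1 - eps, 1 + eps]
```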
3 Fuzzy Sets and Fuzzy Set Methods
3.1 The Membership Functions
In classic set theory (also called “crisp set theory”), introduced by Cantor, the membership value of an object in a set can only be 0 (FALSE) or 1 (TRUE). The characteristic function of a crisp set can be defined as follows:

I_crisp sets : {(elem, crisp set)} → {0, 1} .

Fuzzy set theory [15] is a generalization of this theory: an object (elem) can belong only partially to a set (fuzzy set). A so-called “membership function” is defined:

µ_fuzzy sets : (elem, fuzzy set) → [0, 1] .

In general, the domain of the membership function of a fuzzy set U can be any set, but usually it is a discrete set (U = {u1, u2, . . . , um}) or an interval [lower_value, higher_value] ⊂ R, where lower_value and higher_value can be any real numbers belonging to [0, 1], with lower_value < higher_value. If we consider a fuzzy set A with a membership function defined on it, we can rewrite the membership function definition as µ_A : U → [0, 1], so that the membership value µ_A(u_i) describes the degree of ownership of the element u_i to the set A.
3.2 Fuzzy Methods
In several data clustering applications, it is useful to have a method that can capture, with a certain approximation, the real structure of the data in order to obtain the best clustering. Through the fuzzy ensemble clustering algorithm we propose, it is possible to manage not only the possibility of overlapping among the clusters, but also the degree of membership of every example of the data set to the different clusters. In some applications, however, the initial problem does not admit a strictly fuzzy answer, but at the same time it is generally useful to have an evaluation method that can exploit all the available information (such as the degrees of membership). We use two classical methods to “defuzzify” the results:
1. alpha-cut,
2. hard-clustering.
The alpha-cut function can be defined as follows: ∀α ∈ [0, 1], the alpha-cut [A]_α (or simply A_α) of A is:

[A]_α = {u ∈ U | µ_A(u) ≥ α} .

Setting a threshold value for the membership function makes it possible to obtain from a fuzzy set a crisp set A_α, which contains every element of U whose membership to A is at least α. Hard-clustering is not a proper fuzzy function: it is a method to obtain a crisp clustering from the original fuzzy clustering. The role of both functions in the algorithm design will be described in Sect. 4.
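As a small, hedged illustration of these notions (a toy example of ours, not taken from the chapter), the snippet below encodes a membership function over a discrete universe and computes an alpha-cut:

```python
# Illustrative sketch: a fuzzy set A over a discrete universe U, its membership
# function mu_A, and the alpha-cut [A]_alpha (names and values are hypothetical).
U = ["u1", "u2", "u3", "u4"]
mu_A = {"u1": 0.9, "u2": 0.55, "u3": 0.2, "u4": 0.0}   # degrees of ownership in [0, 1]

def alpha_cut(mu, alpha):
    """Crisp set of elements whose membership is at least alpha."""
    return {u for u, m in mu.items() if m >= alpha}

print(alpha_cut(mu_A, 0.5))   # -> {'u1', 'u2'} (order may vary)
```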
3.3 Triangular Norms
To generalize the “classical” intersection operator in fuzzy logic, the so-called triangular norms (t-norms) are often used [16]. A t-norm T is a function T : [0, 1] × [0, 1] → [0, 1] that satisfies:
1. the boundary conditions: T(0, 0) = T(0, 1) = T(1, 0) = 0
2. the identity condition: T(a, 1) = a, ∀a ∈ (0, 1]
3. the commutative property: T(a, b) = T(b, a), ∀a, b ∈ [0, 1]
4. the monotonicity property: T(a, b) ≤ T(c, d) if a ≤ c and b ≤ d, ∀a, b, c, d ∈ [0, 1]
5. the associative property: T(a, T(b, c)) = T(T(a, b), c), ∀a, b, c ∈ [0, 1]
Four basic t-norms have been proposed in the literature:

T_M(x, y) = min(x, y)   (minimum)   (2)
T_P(x, y) = x y   (algebraic product)   (3)
T_L(x, y) = max(x + y − 1, 0)   (Lukasiewicz t-norm)   (4)
T_D(x, y) = 0 if (x, y) ∈ [0, 1)², min(x, y) otherwise   (drastic product)   (5)

The following order relations hold among these t-norms: T_D < T_L < T_P < T_M. In our algorithmic scheme we used the algebraic product as aggregation operator (see Sect. 4).
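The four basic t-norms are straightforward to implement; the sketch below (ours, for illustration only) encodes (2)–(5) and shows the pointwise order relation on one pair of arguments:

```python
# Illustrative implementations of the four basic t-norms (2)-(5).
def t_min(x, y):          # minimum, T_M
    return min(x, y)

def t_prod(x, y):         # algebraic product, T_P (used as aggregation operator in this chapter)
    return x * y

def t_lukasiewicz(x, y):  # Lukasiewicz t-norm, T_L
    return max(x + y - 1.0, 0.0)

def t_drastic(x, y):      # drastic product, T_D
    return min(x, y) if (x == 1.0 or y == 1.0) else 0.0

# The order relation T_D <= T_L <= T_P <= T_M holds pointwise, e.g.:
x, y = 0.7, 0.6
print(t_drastic(x, y), t_lukasiewicz(x, y), t_prod(x, y), t_min(x, y))
# approximately: 0.0 0.3 0.42 0.6
```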
4 The Algorithmic Scheme
4.1 General Structure
The general structure of the algorithm is similar to the Randclust algorithm proposed in [9]: the data are perturbed through random projections to lower dimensional subspaces, and multiple clusterings are performed on the projected data (note that different clusterings are likely to be obtained, since the clustering algorithm is applied to different “views” of the data). Then the clusterings are combined, and a consensus ensemble clustering is computed. The main difference of our proposed method consists in using a fuzzy k-means algorithm as the base clustering algorithm and in applying a fuzzy approach to the combination and consensus steps of the ensemble algorithm. In particular, we can apply different crisp and fuzzy approaches both to the aggregation and consensus steps, thus obtaining the following fuzzy ensemble clustering algorithmic scheme:
1. Random projections. Generation of multiple instances (views) of compressed data through random projections (different types of data perturbation methods, like resampling or noise injection, can also be used).
2. Generation of multiple fuzzy clusterings. The fuzzy k-means algorithm is applied to the compressed data obtained in the previous step. The output of the algorithm is a membership matrix where each element represents the membership of an example to a particular cluster.
3. “Crispization” of the base clusterings. This step is executed if a “crisp” aggregation is performed: the fuzzy clusterings obtained in the previous step can be “defuzzified” through one of the following techniques: a) hard-clustering; b) α-cut.
4. Aggregation. If a fuzzy aggregation is performed, the base clusterings are combined using a square similarity matrix [8] M^C whose elements are generated through fuzzy t-norms applied to the membership functions of each pair of examples. If a crisp aggregation is performed, the similarity matrix is built using the product of the characteristic functions of each pair of examples.
5. Clustering in the “embedded” similarity space. The similarity matrix induces a new representation of the data based on the pairwise similarity between examples: the fuzzy k-means clustering algorithm is applied to the rows (or equivalently to the columns) of the similarity matrix.
6. Consensus clustering. The consensus clustering can be represented by the overall consensus membership matrix, resulting in a fuzzy representation of the consensus clustering. Alternatively, we may apply the same crispization techniques used in step 3 to transform the fuzzy consensus clustering into a crisp one.
The two classical “crispization” techniques used in steps 3 and 6 can be described as follows:
hard-clustering:

χ^H_ri = 1 ⇔ arg max_s U_si = r, and χ^H_ri = 0 otherwise;   (6)

α-cut:

χ^α_ri = 1 ⇔ U_ri ≥ α, and χ^α_ri = 0 otherwise,   (7)
where χ_ri is the characteristic function for cluster r, that is, χ_ri = 1 if the ith example belongs to the rth cluster and χ_ri = 0 otherwise; 1 ≤ s ≤ k, 1 ≤ i ≤ n, 0 ≤ α ≤ 1, and U is the fuzzy membership matrix obtained by applying the fuzzy k-means algorithm. Note that two different types of membership matrices are considered: in step 3 multiple membership matrices U are obtained through the application of the fuzzy k-means algorithm to multiple instances of the projected data; in step 6 another membership matrix U^C (where the superscript C stands for “consensus”) is obtained by applying the fuzzy k-means algorithm to the rows of the similarity matrix.
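A possible vectorized reading of the two crispization operators (6) and (7), applied to a k × n membership matrix U, is sketched below; this is an illustrative implementation of ours, not the authors' code:

```python
# Sketch of the "crispization" operators (6) and (7) for a membership matrix U (k x n).
import numpy as np

def hard_clustering(U):
    """chi[r, i] = 1 iff r = argmax_s U[s, i] (each example assigned to one cluster)."""
    chi = np.zeros_like(U)
    chi[np.argmax(U, axis=0), np.arange(U.shape[1])] = 1.0
    return chi

def alpha_cut(U, alpha):
    """chi[r, i] = 1 iff U[r, i] >= alpha (examples may belong to 0, 1 or several clusters)."""
    return (U >= alpha).astype(float)

U = np.array([[0.7, 0.2, 0.45],
              [0.3, 0.8, 0.55]])        # toy matrix: k = 2 clusters, n = 3 examples
print(hard_clustering(U))
print(alpha_cut(U, 0.5))
```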
We may observe that, considering the possibility of applying crisp or fuzzy methods in steps 3 and 6, we can obtain 9 different algorithms, exploiting different combinations of aggregation and consensus clustering techniques. For instance, combining a fuzzy aggregation with a consensus clustering obtained through α-cut, we obtain from the algorithmic scheme a fuzzy-alpha ensemble clustering algorithm, while using a hard-clustering crispization technique for aggregation and a fuzzy consensus we obtain a max-fuzzy ensemble clustering. In the next two sections we discuss the algorithms based on the fuzzy aggregation step (fuzzy-* clustering ensemble algorithms) and the ones based on the crisp aggregation of the base clusterings (crisp-* clustering ensembles).
4.2 Fuzzy Ensemble Clustering with Fuzzy Aggregation of the Base Clusterings
The pseudo-code of the fuzzy-* ensemble clustering algorithm is given below:
Input:
- a data set X = {x1, x2, . . . , xn}, stored in a d × n matrix D
- an integer k (number of clusters)
- an integer c (number of clusterings)
- an integer v (used for the normalization of the final clustering)
- the fuzzy k-means clustering algorithm C_f with fuzziness m
- a procedure that realizes the randomized map µ
- an integer d′ (dimension of the projected subspace)
- a function τ that defines the t-norm
begin algorithm
(1) For each i, j ∈ {1, . . . , n} do M_ij = 0
(2) For t = 1 to c do:
(3)   R_t = Generate projection matrix(d′, µ)
(4)   D_t = R_t D
(5)   U^(t) = C_f(D_t, k, m)
(6)   For each i, j ∈ {1, . . . , n} do
        M^(t)_ij = Σ_{s=1}^{k} τ(U^(t)_si, U^(t)_sj)
(7) M^C = (Σ_{t=1}^{c} M^(t)) / v
(8) <A1, A2, . . . , Ak> = C_f(M^C, k, m)
end algorithm
Output:
- the final clustering C = <A1, A2, . . . , Ak>
- the cumulative similarity matrix M^C.
Note that the dimension d′ of the projected subspace is an input parameter of the algorithm, but it may be computed according to the JL lemma (Sect. 2) to approximately preserve the distances between the examples. Inside the loop (steps 2–6) the procedure Generate projection matrix produces a d′ × d matrix R_t according to a given randomized map µ [9], which is used to randomly
project the original data matrix D into a d′ × n projected data matrix D_t (step 4). In step 5 the fuzzy k-means algorithm C_f with a given fuzziness m is applied to D_t, and a k-clustering represented by its membership matrix U^(t) is obtained. Then the corresponding similarity matrix M^(t) is computed, using a given t-norm (step 6). Note that U^(t) is a fuzzy membership matrix (where the rows are clusters and the columns are examples). A similar approach has also been proposed in [17]. In step 7 the “cumulative” similarity matrix M^C is obtained by averaging across the similarity matrices computed in the main loop. Note the normalization factor 1/v: it is easy to show that, having chosen the algebraic product as t-norm, a suitable choice of v is the number of clusterings c. Finally, the consensus clustering is obtained by applying the fuzzy k-means algorithm to the rows of the similarity matrix M^C (step 8).
4.3 Fuzzy Ensemble Clustering with Crisp Aggregation of the Base Clusterings
In agreement with our algorithmic scheme, there are two different methods to “defuzzify” the base clusterings:
• hard-clustering;
• α-cut.
Below we provide the pseudo-code for the max-* ensemble clustering with “crispization” through hard-clustering.
Input:
- a data set X = {x1, x2, . . . , xn}, stored in a d × n matrix D
- an integer k (number of clusters)
- an integer c (number of clusterings)
- an integer v (used for the normalization of the final clustering)
- the fuzzy k-means clustering algorithm C_f with fuzziness m
- a procedure that realizes the randomized map µ
- an integer d′ (dimension of the projected subspace)
- a “crispization” algorithm Crisp
begin algorithm
(1) For each i, j ∈ {1, . . . , n} do M_ij = 0
(2) For t = 1 to c do:
(3)   R_t = Generate projection matrix(d′, µ)
(4)   D_t = R_t D
(5)   U^(t) = C_f(D_t, k, m)
(5.1) χ^(t) = Crisp(U^(t))
(6)   For each i, j ∈ {1, . . . , n} do
        M^(t)_ij = Σ_{s=1}^{k} χ^(t)_si χ^(t)_sj
(7) M^C = (Σ_{t=1}^{c} M^(t)) / c
(8) <A1, A2, . . . , Ak> = C_f(M^C, k, m)
end algorithm
Output:
- the final clustering C = <A1, A2, . . . , Ak>
- the cumulative similarity matrix M^C.
Several observations can be made about the proposed algorithm:
1. With respect to the previously proposed fuzzy-* ensemble clustering, a new step (step 5.1) has been introduced after the creation of the membership matrix of each single clustering, in order to transform the fuzzy data into crisp ones. After this new step, a characteristic matrix χ^(t) is created for every clustering.
2. After that, the data can be managed like “natural” crisp data. In fact, in step 6 the final similarity matrix is obtained through the methods used in [8, 9].
3. As a consequence of the “hard-clustering crispization” (step 5.1), the consensus clustering is a partition, that is, each example may belong to one and only one cluster:

χ^(t)_ij = 1 ⇔ arg max_s U_sj = i, and χ^(t)_ij = 0 otherwise, ∀i, j.

Hence, the normalization of the values of the similarity matrix can be performed using the factor 1/c (step 7).
4.4 Fuzzy Ensemble Clustering with Crisp Aggregation and α-Cut Defuzzification
Another algorithm that can be derived from the algorithmic scheme is based on the crisp aggregation of the base clusterings using the α-cut defuzzification method (see (7)). In this case the results strongly depend on the choice of the value of α. Indeed, for large values of α several base clusterings may have many unassigned examples, while, on the contrary, for small values of α it is likely that some examples belong to multiple clusters. The pseudo-code of the alpha-* ensemble clustering algorithm with “crispization” through α-cut is given below.
Input:
- a data set X = {x1, x2, . . . , xn}, stored in a d × n matrix D
- an integer k (number of clusters)
- an integer c (number of clusterings)
- a real value α (α-cut value)
- the fuzzy k-means clustering algorithm C_f with fuzziness m
- a procedure that realizes the randomized map µ
- an integer d′ (dimension of the projected subspace)
- a “crispization” algorithm Crisp
begin algorithm
(1) For each i, j ∈ {1, . . . , n} do M_ij = 0
(2) For t = 1 to c do:
(3)   R_t = Generate projection matrix(d′, µ)
(4)   D_t = R_t D
(5)   U^(t) = C_f(D_t, k, m)
(5.1) χ^(t) = Crisp_α(U^(t))
(6)   For each i, j ∈ {1, . . . , n} do
        M^(t)_ij = Σ_{s=1}^{k} χ^(t)_si χ^(t)_sj
(7) M^C = (Σ_{t=1}^{c} M^(t)) / (k · c)
(8) <A1, A2, . . . , Ak> = C_f(M^C, k, m)
end algorithm
Output:
- the final clustering C = <A1, A2, . . . , Ak>
- the cumulative similarity matrix M^C.
Comparing this algorithm with the max-* ensemble clustering (Sect. 4.3), we may note that the main changes are in steps 5.1 and 7. Indeed, in step 5.1 the Crisp algorithm has a new parameter: α, the α-cut threshold value. In particular, in the χ^(t) = Crisp_α(U^(t)) operation, the assignment of an example to a specific cluster depends on the value of α (a parameter that is given as input to the algorithm):

χ^(t)_ij = 1 ⇔ U_ij ≥ α, and χ^(t)_ij = 0 otherwise.

Note that the normalization method in step 7 comes from the following observations:
1. The proposed algorithm uses only one clustering function, which works with a fixed number k of clusters for each clustering.
2. For a fixed α:

   χ_si = 1 ⇔ U_si ≥ α, and χ_si = 0 otherwise,

   with 1 ≤ s ≤ k, hence

   0 ≤ Σ_{s=1}^{k} χ_si ≤ k .

Considering that each k-clustering is repeated c times, we can observe that kc is the total number of clusters across the multiple clusterings. We may use base clusterings with a different number of clusters for each execution; in this case the normalization factor v becomes:

v = Σ_{t=1}^{c} k_t   (8)
where kt is the number of clusters of the tth clustering.
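To make the whole scheme concrete, the following sketch (ours, under the stated assumptions) strings together the steps of the fuzzy-* algorithm: PMO projections, a fuzzy k-means base clusterer C_f, aggregation with the algebraic product t-norm, and a final fuzzy k-means on the cumulative similarity matrix M^C. The fuzzy_kmeans function below is only a compact stand-in used to keep the sketch self-contained and runnable; it is not the implementation used in the experiments, and the toy data only mimic the synthetic setting of Sect. 5.1.

```python
# Hedged sketch of the fuzzy-* ensemble clustering scheme (not the authors' code).
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, n_iter=100, rng=None):
    """Compact stand-in for the fuzzy k-means C_f; X is (features x n), returns U (k x n)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = X.shape[1]
    U = rng.random((k, n))
    U /= U.sum(axis=0)                                    # columns sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (X @ W.T) / W.sum(axis=1)               # (features x k)
        d = np.linalg.norm(X[:, None, :] - centers[:, :, None], axis=0) + 1e-12  # (k x n)
        U = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0)), axis=1)
    return U

def fuzzy_star_ensemble(D, k, c, d_prime, m=2.0, seed=0):
    """Fuzzy aggregation with the algebraic product t-norm; D is the d x n data matrix."""
    rng = np.random.default_rng(seed)
    d, n = D.shape
    M = np.zeros((n, n))
    for _ in range(c):
        R = rng.choice([-1.0, 1.0], size=(d_prime, d)) / np.sqrt(d_prime)  # PMO map
        Dt = R @ D                                        # projected "view" of the data
        U = fuzzy_kmeans(Dt, k, m, rng=rng)               # base fuzzy clustering U^(t)
        M += U.T @ U                                      # M^(t)_ij = sum_s U_si U_sj
    M_C = M / c                                           # cumulative similarity matrix (v = c)
    U_C = fuzzy_kmeans(M_C, k, m, rng=rng)                # M_C is symmetric: columns = rows
    return U_C, M_C

rng = np.random.default_rng(1)
D = np.hstack([rng.normal(0.0, 3.0, (5000, 20)),          # toy data resembling Sect. 5.1
               rng.normal(0.5, 3.0, (5000, 20)),
               rng.normal(-0.5, 3.0, (5000, 20))])
U_C, _ = fuzzy_star_ensemble(D, k=3, c=10, d_prime=410)
labels = np.argmax(U_C, axis=0)                           # hard consensus (fuzzy-max variant)
print(labels)
```

A max-* or alpha-* variant is obtained, as described above, by replacing U^(t) with its hard-clustering or α-cut crispization before the aggregation step and by adjusting the normalization factor (c, kc, or Σ_t k_t) accordingly.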
5 Experimental Results
In this section we test our proposed fuzzy ensemble algorithms on both synthetic and real data. For all the experiments we used high-dimensional data to test the effectiveness of random projections with this kind of complex data.
5.1 Experiments with Synthetic Data
Experimental Environment
To test the performance of the proposed algorithms, we used a synthetic data generator [9]. Every synthetic data set is composed of 3 clusters with 20 samples each. Every example is 5000-dimensional. Each cluster is distributed according to a spherical Gaussian probability distribution with a standard deviation of 3. The first cluster is centered at the null vector 0. The other two clusters are centered at 0.5e and −0.5e, where e is a vector with all components equal to 1. We tested 4 of the 9 algorithms developed, two with the hard-clustering method applied to the consensus clustering and two using the α-cut approach:
• max-max: hard-clustering applied to both the base clusterings and the consensus clustering;
• fuzzy-max: the aggregation step is fuzzy, the consensus step crisp through hard-clustering;
• max-alpha: crisp aggregation by hard-clustering, and crisp consensus through α-cut;
• fuzzy-alpha: fuzzy aggregation and crisp consensus through α-cut.
We repeated the four clustering ensemble algorithms 20 times, using data sets randomly projected to a 410-dimensional feature space (corresponding to a distortion ε = 0.2, see Sect. 2). As a baseline clustering algorithm we used the classical fuzzy k-means, executed on the original data set (without data compression). Since clustering does not uniquely associate a label to the examples, but only provides a set of clusters, we evaluated the error by choosing for each clustering the permutation of the classes that best matches the “a priori” known true classes. More precisely, consider the following clustering function:
f : R^d → P(Y), with Y = {1, . . . , k}   (9)

where x is the sample to classify, d its dimension, k the number of classes, and P(Y) is the power set of Y. The error function we applied is the following:

L_{0/1}(Y, t) = 0 if (|Y| = 1 ∧ t ∈ Y) ∨ Y = {λ}, and L_{0/1}(Y, t) = 1 otherwise,   (10)
with t the “real” label of the sample x, Y ∈ P(Y), and {λ} denoting the empty set. Other loss functions or measures of the performance of clustering algorithms may be applied, but we chose this modification of the 0/1 loss function to take into account the multi-label output of fuzzy k-means algorithms, and considering that our target examples belong to a single cluster only.
Results
The error boxplots of Fig. 1 show that our proposed fuzzy ensemble clustering algorithms perform consistently better than the single fuzzy k-means algorithm, with the exception of the fuzzy-max algorithm when a relatively high level of fuzziness is chosen. More precisely, in Fig. 1 we can observe how different degrees of fuzziness of the “component” k-means clusterings can change the performance of the ensemble, if the “fuzzy” information is preserved from the “crispization” operation. In fact, while the performances of the single k-means and of the ensemble max-max algorithm (which uses the hard-clustering operation at both the component and consensus level) are similar in both plots of Fig. 1, the result of the fuzzy-max ensemble algorithm (where the “defuzzification” operation is performed only on the final result) changes drastically. The good performance of the max-max algorithm with both degrees of fuzziness could depend on the high level of “crispness” of the data: indeed each example is assumed to belong exactly to one cluster. A confirmation of this hypothesis is given by the improvement of the performance of the fuzzy-max algorithm when the level of fuzziness of the base clusterings is lowered (from 2.0 to 1.1). A different consideration can be proposed regarding the fuzzy-max algorithm, in which the capacity to express “pure” fuzzy results at the base learner level can improve its degree of flexibility (the possibility to adapt the algorithm to the nature of the clusterings). The analysis of the fuzzy-alpha and max-alpha algorithms (Figs. 2, 3) shows how the reduction of the fuzziness reduces the number of unclassified samples and the number of errors, especially for α ≤ 0.5; for higher levels of α the error rate goes to 0, but the number of unclassified samples rises quickly. Note that fuzzy-alpha ensembles achieve an error rate very close to 0 with a small amount of unclassified examples for a large range of α values (Fig. 2, bottom plot). Fig. 3 shows that the max-alpha algorithm obtains inversely related error and unclassified rates while varying α, with an “optimal” value close to 0.5. Table 1 summarizes the results, showing that our proposed fuzzy-max and fuzzy-alpha ensemble methods outperform the other compared ensemble algorithms on these high-dimensional synthetic data.
Fig. 1. Max-max and fuzzy-max algorithms (PMO data reduction and ε = 0.2) versus the single fuzzy k-means, followed by hard-clustering: (top plot) fuzziness = 2.0, (bottom plot) fuzziness = 1.1
Fig. 2. Fuzzy-alpha ensemble clustering error and unclassified rate with respect to α: (top plot) fuzziness = 2.0, (bottom plot) fuzziness = 1.1
Fig. 3. Max-alpha ensemble clustering error and unclassified rate with respect to α: (top plot) fuzziness = 2.0, (bottom plot) fuzziness = 1.1
Table 1. Results of fuzzy and non-fuzzy ensemble clustering methods and single clustering algorithms. The last column reports the rate of unclassified examples

Algorithms              Mean error   Std.Dev.   % Uncl.
Fuzzy-Max               0.0058       0.0155     0
Fuzzy-Alpha             0.0058       0.0155     0.0008
Max-Max                 0.0758       0.1104     0
Max-Alpha               0.0573       0.0739     0.0166
Rand-Clust              0.0539       0.0354     0
Fuzzy (single)          0.3916       0.0886     0
Hierarchical (single)   0.0817       0.0298     0
5.2 Experiments with Real Data
Experimental Environment
We performed experiments with high-dimensional DNA microarray data. In this context each example corresponds to a patient, and the features associated with the patients are gene expression levels measured through DNA microarrays [14]. These high-throughput bio-technologies allow the parallel measurement of the mRNA levels of thousands of genes (and now of entire genomes) of the cells or tissues of a given patient, thus providing a sort of snapshot of the functional status of a given cell or tissue in a certain condition. In this way we can obtain the molecular portrait of a given phenotype in a given condition and at a given time. Among the different applications of this technology, here we consider the analysis of gene expression data of patients to reconstruct known phenotypes using bio-molecular data (DNA microarray measurements). These data are characterized by a high dimension (high number of analyzed genes) and relatively low cardinality (number of patients), thus resulting in a challenging unsupervised problem. In our experiments we used the Leukemia data set, which contains the expression levels of 7129 genes, in Affymetrix's scaled average difference units, relative to 47 patients with Acute Lymphoblastic Leukemia (ALL) and 25 cases of Acute Myeloid Leukemia (AML) [18]. We also analyzed the Melanoma data set (described in [19]), which is composed of 31 melanoma samples and 7 control samples with 6971 genes. For both data sets we applied the same pre-processing procedures described respectively in [18] and [19]. We tested the performance of the fuzzy ensemble algorithms fuzzy-max, fuzzy-alpha, max-max and max-alpha using the previously described data sets. We compared the results with Randclust, the corresponding “crisp” version of our proposed ensemble methods [9], with Bagclust1, based on an unsupervised version of bagging [8], and with the single fuzzy k-means clustering algorithm. The Randclust ensemble method is similar to the algorithm
presented in this paper, but it uses a hierarchical clustering algorithm to produce the base clusterings, and a crisp approach to combine the resulting clusters. Bagclust1 generates multiple instances of perturbed data through bootstrap techniques, and then applies the base clustering algorithm (in our experiments, k-means) to each instance; the final clustering is obtained by majority voting. Each ensemble is composed of 50 base clusterings, and each ensemble method has been repeated 30 times. Regarding the ensemble methods based on random projections, we chose projections with bounded 1 ± 0.2 distortion, according to the JL lemma, while for Bagclust1 we randomly drew with replacement a number of examples equal to the number of available data.
Results
The boxplots in Fig. 4 represent the distribution of the error across multiple repetitions of the fuzzy ensemble algorithms compared with the single fuzzy k-means, the Randclust and the Bagclust1 ensemble methods for the two data sets used in the experiments. With the Leukemia data set the results obtained with the different methods are quite comparable (Fig. 4, top), while with the Melanoma DNA microarray data (Fig. 4, bottom) the proposed fuzzy ensemble methods largely outperform all the other compared methods. The results show that ensembles based on random projections to lower dimensional spaces, using projection matrices that obey the JL lemma, are well-suited to high-dimensional data. Nevertheless, note that in the experiments with the Melanoma data set both our proposed fuzzy ensemble clustering method and Randclust applied random projections to perturb the data, but our proposed fuzzy approach significantly outperforms the crisp Randclust ensemble method. In order to understand in which way the base fuzzy clustering may affect the overall results with respect to the base hierarchical clustering algorithm used in Randclust, we performed an analysis of the relationships between accuracy and similarity of each pair of base clusterings used in each ensemble, using measures based on the normalized mutual information (NMI), according to the approach originally proposed in [20] and [1]. In this approach the accuracy of each pair of base clusterings of the ensemble is measured by averaging the NMI of each base clustering with respect to the a priori known “true” clustering, while their diversity is measured by the NMI computed directly between the two base clusterings. The results are plotted in Fig. 5. Interestingly enough, the base clusterings of our proposed fuzzy ensemble approach are both more accurate (in the figure their NMI ranges approximately between 0.25 and 0.30, while in Randclust the accuracy is below 0.20) and more diverse (indeed their NMI values on the y axis lie between 0.55 and 0.80, while in Randclust they are above 0.80: recall that a low value of NMI between base clusterings reveals a high diversity, and vice versa).
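For readers who wish to reproduce this kind of accuracy/diversity analysis, a hedged sketch (ours, not the authors' code) based on scikit-learn's NMI implementation is given below; base_clusterings and true_labels are assumed to be integer label vectors and are not part of the original chapter:

```python
# Sketch of the pairwise accuracy/diversity analysis of Fig. 5 using NMI.
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score as nmi

def accuracy_diversity_pairs(base_clusterings, true_labels):
    """For each pair of base clusterings, return (accuracy, similarity):
    accuracy = averaged NMI w.r.t. the known true labels, similarity = NMI between the pair
    (a low pairwise NMI indicates a high diversity)."""
    points = []
    for a, b in combinations(range(len(base_clusterings)), 2):
        acc = 0.5 * (nmi(true_labels, base_clusterings[a]) +
                     nmi(true_labels, base_clusterings[b]))
        sim = nmi(base_clusterings[a], base_clusterings[b])
        points.append((acc, sim))
    return points
```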
Fig. 4. Results of gene expression data analysis: (top plot) Leukemia data set, (bottom plot) Melanoma data set. (1)–(7) on the abscissa refer to the results obtained respectively with Randclust (1), Bagclust1 (2), single fuzzy k-means (3), max-max (4), fuzzy-max (5), max-alpha (6) and fuzzy-alpha (7) fuzzy ensemble algorithms
It is well-known in the literature that a highly desirable property of ensembles consists in a high accuracy and diversity of the base learners: there is a trade-off between accuracy and diversity, and the performance of ensembles partly depends on the relationship between these quantities [4]. Our results show that our proposed ensemble clustering approach improves both the accuracy and the diversity of the base learners.
Fig. 5. Melanoma data set: analysis of the relationships between accuracy and similarity between the base learners of the fuzzy max* ensemble clustering (triangles) and Randclust ensemble clustering (circles)
6 Conclusion
In this paper we proposed an algorithmic scheme that combines a fuzzy approach with random projections to obtain clustering ensembles well suited to the analysis of complex high-dimensional data. On the one hand, the proposed approach exploits the accuracy and the effectiveness of ensemble clustering techniques based on random projections, and on the other hand, the expressive capacity of fuzzy sets, to obtain clustering algorithms that are both reliable and able to express the uncertainty of the data. From the algorithmic scheme several ensemble algorithms can be derived, by combining different fuzzy and defuzzification methods in the aggregation
and consensus steps of the general algorithmic scheme. Our preliminary results with both synthetic and DNA microarray data are quite encouraging, showing that the fuzzy approach achieves a good compromise between the accuracy and the diversity of the base learners. Moreover, these results have also been confirmed by other recent experiments with DNA microarray data [21]. Several open problems need to be considered in future research work. For instance, we may consider the choice of the t-norm to be used in the fuzzy aggregation of multiple clusterings. In our experiments we applied the algebraic product, but we need to experiment with other t-norms and to analyze their properties in order to understand which could be the better choice with respect to the characteristics of the data. Moreover, we experimented with PMO random projections, but we also need to experiment with other random projections, such as normal or Achlioptas random projections [11, 13], and to get more theoretical insights into the reasons why random projections work on high-dimensional spaces. Another interesting development of this work consists in studying whether it is possible to embed recently proposed stability-based methods based on random projections [22, 23] into ensemble clustering methods, to steer the construction of the consensus clustering. In the experiments we used “crisp” data, showing that the proposed method can be successfully applied to analyze this kind of data. We are planning new experiments with examples that may belong to multiple clusters (e.g. unsupervised analysis of functional classes of genes) to show more clearly the effectiveness of the proposed approach. Moreover, we plan experiments to analyze the structure of unlabeled data when the boundaries of the clusters are highly uncertain, with very partial memberships of the examples to the clusters.
References
1. Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Fawcett T, Mishra N (eds) Proc 20th Int Conf Mach Learning, Washington, DC, USA. AAAI Press, Menlo Park
2. Monti S, Tamayo P, Mesirov J, Golub T (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learning 52:91–118
3. Topchy A, Jain A, Puch W (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Analysis Mach Intell 27:1866–1881
4. Kuncheva L, Vetrov D (2006) Evaluation of stability of k-means cluster ensembles with respect to random initialization. IEEE Trans Pattern Analysis Mach Intell 28:1798–1808
5. Kuncheva L (2004) Combining pattern classifiers: methods and algorithms. Wiley-Interscience, New York
6. Strehl A, Ghosh J (2002) Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J Mach Learn Research 3:583–618
7. Hu X, Yoo I (2004) Cluster ensemble and its applications in gene expression analysis. In: Chen YPP (ed) Proc 2nd Asia-Pacific Bioinformatics Conf, Dunedin, New Zealand. Australian Computer Society, Darlinghurst, pp 297–302
8. Dudoit S, Fridlyand J (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19:1090–1099
9. Bertoni A, Valentini G (2006) Ensembles based on random projections to improve the accuracy of clustering algorithms. In: Apolloni B, Marinaro M, Nicosia G, Tagliaferri R (eds) Proc 16th Italian Workshop Neural Nets, Vietri sul Mare, Italy. Springer, Berlin/Heidelberg, pp 31–37
10. Hadjitodorov S, Kuncheva L, Todorova L (2006) Moderate diversity for better cluster ensembles. Inf Fusion 7:264–275
11. Achlioptas D (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J Comp Sys Sci 66:671–687
12. Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz mapping into Hilbert space. Contemporary Math 26:189–206
13. Bertoni A, Valentini G (2006) Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses. Artif Intell Medicine 37:85–109
14. Eisen M, Spellman P, Brown P, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. PNAS 95:14863–14868
15. Zadeh L (1965) Fuzzy sets. Inf Control 8:338–353
16. Klement EP, Mesiar R, Pap E (2000) Triangular norms. Kluwer Academic Publishers, Dordrecht
17. Yang L, Lv H, Wang W (2006) Soft cluster ensemble based on fuzzy similarity measure. In: IMACS Multiconf Comp Eng Systems Appl, Beijing, China, pp 1994–1997
18. Golub T, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531–537
19. Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M, Simon R, Yakhini Z, Ben-Dor A, Sampas N, Dougherty E, Wang E, Marincola F, Gooden C, Lueders J, Glatfelter A, Pollock P, Carpten J, Gillanders E, Leja D, Dietrich K, Beaudry C, Berens M, Alberts D, Sondak V (2000) Molecular classification of malignant melanoma by gene expression profiling. Nature 406:536–540
20. Dietterich T (2000) An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization. Mach Learn 40:139–158
21. Avogadri R, Valentini G (2007) Fuzzy ensemble clustering for DNA microarray data analysis. In: Proc 4th Int Conf Bioinformatics and Biostatistics, Portofino, Italy. Springer, Berlin/Heidelberg, pp 537–543
22. Valentini G (2006) Clusterv: a tool for assessing the reliability of clusters discovered in DNA microarray data. Bioinformatics 22:369–370
23. Bertoni A, Valentini G (2007) Model order selection for bio-molecular data clustering. BMC Bioinformatics 8:S7
Collaborative Multi-Strategical Clustering for Object-Oriented Image Analysis
Germain Forestier, Cédric Wemmert, and Pierre Gançarski
Image Sciences, Computer Sciences and Remote Sensing Laboratory (UMR 7005)
forestier, wemmert, [email protected]
Summary. In this chapter, the use of a collaborative multi-strategy clustering method applied to image analysis is presented. This method integrates different kinds of clustering algorithms and produces, for each algorithm, a result built according to the results of all the other methods: each method tries to make its result converge towards the results of the other methods by using consensus operators. This chapter highlights how clustering methods can collaborate and presents results in the paradigm of object-oriented classification of a very high resolution remotely sensed image of an urban area.
Key words: collaborative clustering, multi-strategy clustering, agreement between clusterings, image analysis, k-means
1 Introduction
In the last decade automatic interpretation of remotely sensed images has become an increasingly active domain. Sensors are now able to acquire images with a Very High spatial Resolution (VHR) (i.e. 1 meter resolution) and a high spectral resolution (up to 100 spectral bands). This increase of precision generates a significant amount of data. These new kinds of data are more and more complex, and a challenging task is to design algorithms and systems able to process all these data in reasonable time. In these VHR images, the abundance of noisy, correlated and irrelevant bands disturbs the classical per-pixel classification procedures. Moreover, another important problem in these images is that the interesting objects (houses, parks, . . . ) are represented by a set of many different heterogeneous pixels, showing all the parts of the object (for example, the cars in a parking lot). That is why, in the paradigm of object-oriented classification [21, 14], the image is segmented into regions (or segments) and the segments are classified using spectral and spatial attributes (e.g. shape index, texture, . . . ). These new heterogeneous types of data need specific algorithms to discover relevant groups of objects in order to present interesting information to geographers and geoscientists.
Within the framework of our research, our team has studied many unsupervised classification methods. All have advantages, but also some limitations which sometimes appear to be complementary. From this study was born the idea of combining, in a multi-strategical collaborative classifier [9], many clustering algorithms and making them collaborate in order to propose a single result that combines their various results. For that, we define a generic model of an unsupervised classifier. Our system allows several instances of clusterers derived from this model to collaborate without having to know their type (neural networks, conceptual classifiers, probabilistic clusterers, . . . ). In this chapter, we present how this collaborative classification process can be used in the context of automatic image analysis. We first present the multi-strategical algorithm and highlight how the methods collaborate to find a unified result. Then we present experiments carried out with this approach in the domain of automatic object-oriented classification of remote sensing images. Finally we conclude on the perspective of integrating high-level knowledge about the data in the different steps of the collaboration.
2 Clustering Combination
Many works focus on combining different clustering results, a problem commonly called clustering aggregation [10], multiple clusterings [2] or cluster ensembles [5, 17]. Topchy, Jain and Punch [18] introduced a unified representation for multiple clusterings and formulated the corresponding categorical clustering problem. To compute the combined partition, the EM algorithm is applied to the maximum-likelihood problem corresponding to the probabilistic model they proposed, consisting of a finite mixture of multinomial distributions in a space of clusterings. They also define a new consensus function that is related to the classical intra-class variance criterion, using the generalized mutual information definition. In [17], Strehl and Ghosh considered three different consensus functions for ensemble clustering: the Cluster-based Similarity Partitioning Algorithm (CSPA), which induces a graph from a co-association matrix and clusters it using the METIS algorithm; the Hypergraph Partitioning Algorithm (HGPA), which tries to find the best hypergraph partition using the HMETIS algorithm (each cluster is represented by an edge in a graph where the nodes correspond to a given set of objects); and lastly the Meta-Clustering Algorithm (MCLA), which uses hyperedge collapsing operations to determine soft cluster membership values for each object. Fred and Jain [7] proposed the Evidence Accumulation Clustering method (EAC), which tries to summarize various clustering results in a co-association matrix. Co-association values represent the strength of association between objects, obtained by analyzing how often each pair of objects appears in the
same cluster. The final clustering is formed by applying a hierarchical (single-link) clustering using the co-association matrix as a similarity matrix for the data items. Zhou and Tang [22] presented four methods for building ensembles of k-means clusterers, based on an alignment of the clusters found by the different methods. The alignment is realized by counting their overlapping data items. More recently, Ayad and Kamel [1] presented a new cumulative voting consensus method for partitions with a variable number of clusters. The main idea is to find a compressed summary of the ensemble of clusterings that preserves maximum relevant information. This is done by using an agglomerative algorithm that minimizes the average generalized Jensen-Shannon divergence within the clusters. Whereas these kinds of approaches try to combine different clustering results in a final step, our approach makes the methods collaborate during the clustering process to make them converge, so that they obtain relatively similar solutions. After that, only these solutions are unified into a unique clustering.
3 Collaborative Multi-Strategy Clustering
It is often difficult to compute a unique result from a heterogeneous set of clustering results representing all the knowledge given by the various methods, unless there is a trivial correspondence between the clusters of the different results. To tackle this problem, we propose to carry out a collaborative process which consists in an automatic and mutual refinement of the clustering results. Each clustering method produces a result which is built according to the results of all the other methods: they search for an agreement about all their proposals. The goal of these refinements is to make the results of the various clustering methods converge, so that they have almost the same number of clusters and so that all these clusters are statistically similar. Thus we obtain very similar results for all the method occurrences, together with links between them representing the correspondences between their clusters. It is then possible to apply a unifying technique, such as a multi-view voting method [20]. The entire clustering process is presented in Fig. 1. It is decomposed into three main phases:
1. Initial clusterings - First a phase of initial clusterings is performed: a clustering is computed by each method with its parameters.
2. Results refinement - An iterative phase of convergence of the results, which corresponds to alternations between two steps, as long as the convergence and the quality of the results improve:
2.1 A step of evaluation of the similarity between the results and mapping of the clusters.
2.2 A step of refinement of the results.
3. Unification - The refined results are unified using a voting algorithm.
3.1 Initial Clustering The first step of the process is the computation of the result for each instance of clustering method, each using its own strategy and parameters. 3.2 Results Refinement The mechanism we propose for refining the results is based on the concept of distributed local resolution of conflicts. It consists in the iteration of four phases: • Detection of the conflicts by evaluating the dissimilarities between pairs of results • Choice of the conflicts to solve • Local resolution of these conflicts (concerning the two results implied in the conflict) • Management of these local modifications in the global result (if they are relevant)
Evaluation and Mapping
In the remainder of this chapter we assume that we have N clustering results (R1 to RN) given by the N clustering method instances. Each result Ri is composed of ni clusters C^i_k, where 1 ≤ k ≤ ni. In order to evaluate the similarity between two results we compute the confusion matrix between these two clustering results. We calculate ½ N × (N − 1) confusion matrices, one for each pair of clustering results. From these confusion matrices we observe the repartition of the objects through the clusters of each result. The confusion matrix between two results Ri and Rj is defined as the ni × nj matrix M^{i,j} with elements

α^{i,j}_{k,l} = |C^i_k ∩ C^j_l| / |C^i_k| .

For each cluster of each result Ri we search which cluster from each other result Rj (j ≠ i) is its corresponding cluster. A cluster C^j_l of the result Rj is the corresponding cluster of the cluster C^i_k of the result Ri if it is the most similar to C^i_k. For that, we have defined a similarity measure that does not use a distance between objects. The similarity measure ω^{i,j}_k between a cluster C^i_k of Ri and the clusters of the result Rj is defined as follows:
Fig. 1. Collaborative multi-strategical clustering process

ω^{i,j}_k = ρ^{i,j}_k α^{j,i}_{k_m,k}, where α^{i,j}_{k,k_m} = max {α^{i,j}_{k,l}}_{l=1...nj} .   (1)

It is evaluated by observing the relationship between the size of their intersection and the size of the cluster itself,

α^{i,j}_k = (α^{i,j}_{k,l})_{l=1,...,nj} where α^{i,j}_{k,l} = |C^i_k ∩ C^j_l| / |C^i_k| ,   (2)

and by taking into account the distribution of the data in the other clusters,

ρ^{i,j}_k = Σ_{l=1}^{nj} (α^{i,j}_{k,l})² .   (3)
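A possible direct implementation (ours, not the authors' code) of the similarity measure (1)–(3), with each result represented as a list of clusters given as sets of object identifiers, is sketched below:

```python
# Sketch of the cluster similarity measure (1)-(3) between two clustering results.
import numpy as np

def alpha_row(Ci_k, Rj):
    """alpha^{i,j}_{k,l} = |C^i_k intersect C^j_l| / |C^i_k| for all clusters l of Rj."""
    return np.array([len(Ci_k & Cj_l) / len(Ci_k) for Cj_l in Rj])

def omega(Ri, Rj, k):
    """Similarity (1) of cluster k of Ri with respect to its corresponding cluster in Rj."""
    a_ij = alpha_row(Ri[k], Rj)            # row k of the confusion matrix M^{i,j}
    rho = np.sum(a_ij ** 2)                # rho^{i,j}_k, eq. (3)
    k_m = int(np.argmax(a_ij))             # index of the corresponding cluster in Rj
    a_ji = alpha_row(Rj[k_m], Ri)          # row k_m of M^{j,i}
    return rho * a_ji[k]                   # omega^{i,j}_k = rho^{i,j}_k * alpha^{j,i}_{k_m,k}

R1 = [{0, 1, 2, 3}, {4, 5, 6}]             # two toy clustering results over 7 objects
R2 = [{0, 1, 2}, {3, 4}, {5, 6}]
print(omega(R1, R2, 0))                    # 0.625
```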
Refinement During a phase of refinement of the results, several local resolutions are performed in parallel. The detection of the conflicts consists, for all the clusters
of all the results, in seeking all the couples (C^i_k, Rj), for two clustering results Ri and Rj, such that C^i_k ≠ C^j_k, where C^j_k denotes the corresponding cluster of C^i_k in Rj. A conflict importance coefficient is calculated according to the similarity measure as follows:

ψ^{i,j}_k = 1 − ω^{i,j}_k .   (4)

Then a conflict is selected according to the conflict importance coefficient (the most important one, for example) and its resolution is started. This conflict and all those concerning the two associated methods are removed from the list of conflicts. This process is repeated until the list of conflicts is empty. The resolution of a conflict consists in applying an operator to Ri and an operator to Rj. These operators are chosen according to the clusters C^i_{k_i} and C^j_{k_j} involved in the conflict:
• merging of clusters: the clusters to merge are chosen according to the representative clusters of the treated cluster. The representative clusters of a cluster C^i_k compared to the result Rj are the set of clusters from Rj having more than pcr % of their objects included in C^i_k (pcr is a parameter of the method).
• splitting a cluster into sub-clusters: all the objects of the cluster are classified into new sub-clusters.
• reclassification of a cluster: the cluster is removed and its objects are reclassified into all the other existing clusters.
However, the simultaneous application of operators on Ri and Rj is not always relevant. Indeed, it does not always increase the similarity of the results involved in the treated conflict (Red Queen effect: “success on one side is felt by the other side as failure to which must be responded in order to maintain one's chances of survival” [15]), and the iteration of conflict resolutions may lead to a trivial solution where all the methods are in agreement: a result with only one cluster including all the objects to classify, or a result having one cluster for each object. So we defined the local concordance and quality rate, which estimates the similarity and the quality for a pair of results, by

γ^{i,j} = ½ [ (Σ_{k=1}^{ni} (p_s ω^{i,j}_k + p_q δ^i_k)) / ni + (Σ_{k=1}^{nj} (p_s ω^{j,i}_k + p_q δ^j_k)) / nj ]   (5)

where p_s + p_q = 1 and δ^i_k is a cluster quality criterion chosen by the user (0 < δ^i_k ≤ 1). For example, with methods which include a distance measure, the user can select the intra-class inertia. Without a distance measure, the cluster predictivity (Cobweb) or variance (EM algorithm) could be used, for example. At the end of each conflict resolution, after the application of the operators, the couple of results (the two new results, the two old results, or one new result with one old result) which maximizes this rate is kept.
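The conflict importance (4) and the local concordance and quality rate (5) can be computed directly from the ω values and the chosen quality criterion δ; the sketch below (ours, assuming ω and δ have already been computed, e.g. as in the previous snippet) illustrates this:

```python
# Sketch of eqs. (4) and (5); omega_ij[k] = omega^{i,j}_k, delta_i[k] = delta^i_k.
import numpy as np

def conflict_importance(omega_ij_k):
    """psi^{i,j}_k = 1 - omega^{i,j}_k, eq. (4)."""
    return 1.0 - omega_ij_k

def local_concordance(omega_ij, delta_i, omega_ji, delta_j, p_s=0.5, p_q=0.5):
    """gamma^{i,j}, eq. (5); p_s + p_q must equal 1."""
    term_i = np.mean(p_s * np.asarray(omega_ij) + p_q * np.asarray(delta_i))
    term_j = np.mean(p_s * np.asarray(omega_ji) + p_q * np.asarray(delta_j))
    return 0.5 * (term_i + term_j)
```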
After that, a global application of the modifications proposed by the refinement step is decided according to the improvement of the global agreement coefficient:

Γ = (1/m) Σ_{i=1}^{m} Γ^i   (6)

where

Γ^i = (1/(m − 1)) Σ_{j=1, j≠i}^{m} γ^{i,j}   (7)
is the global concordance and quality rate of the result Ri with all the other results. Then a new iteration of the convergence phase is started if the global agreement coefficient has increased, and an intermediate unified result is calculated by combining all the results.
3.3 Combination of the Results
In the final step, all the results tend to have the same number of clusters, which are increasingly similar. We use the adapted voting algorithm [20] to compute a unified result combining the different results. This is a multi-view voting algorithm which makes it possible to combine many different clustering results that do not necessarily have the same number of clusters. The main idea is, for each object, to make all the methods vote for the cluster it has found and for the corresponding clusters of the other methods. So if we have N clustering results, N × N votes are computed for each object. Then a cluster is created for all the clusters having a majority, and the objects are associated to these clusters. During this phase of combination we define two new concepts which help us to give relevant information to the user: the concepts of relevant clusters and non-consensual objects. The relevant clusters correspond to the groups of objects of a same cluster which were classified in an identical way in a majority of the results. We can quantify this relevance by using the percentage of clustering methods that are in agreement. These clusters are interesting to highlight because they are, in the majority of cases, the relevant classes for the user. Reciprocally, a non-consensual object is an object that has not been classified identically in a majority of results, i.e. it does not belong to any of the relevant clusters. These objects often correspond to the edges of the clusters in the data space (Sect. 4 presents examples of non-consensual objects in object-oriented image analysis).
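A deliberately simplified, hedged sketch of the voting step is given below; it casts one vote per result into the cluster space of a single reference result (instead of the full N × N votes of the method in [20]) and flags objects without a majority as non-consensual. The data structures labels and corr are assumptions of ours, not part of the original algorithm:

```python
# Simplified illustration of majority voting over mapped clusters (not the full method of [20]).
from collections import Counter

def vote(labels, corr, n_objects, ref=0):
    """labels[i][x]: cluster of object x in result R_i; corr[i][k][j]: index in R_j of the
    corresponding cluster of C^i_k (with corr[i][k][i] == k). Each result casts one vote
    in the cluster space of the reference result R_ref."""
    N = len(labels)
    consensus = []
    for x in range(n_objects):
        votes = Counter(corr[i][labels[i][x]][ref] for i in range(N))
        cluster, count = votes.most_common(1)[0]
        consensus.append(cluster if count > N / 2 else None)   # None = non-consensual object
    return consensus
```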
4 Object-Oriented Image Analysis
In this section we present results obtained by applying our multi-strategical method to object-oriented image analysis. The object-oriented approach is composed of two steps:
• the segmentation: this step consists in regrouping neighboring pixels of the image which have homogeneous spectral values. The set of regions found forms a paving of the image. Each region is then characterized by a set of attributes derived from the spectral values of the pixels of the region (e.g. texture) or from the shape of the region (e.g. area, elongation).
• the classification: this step consists in assigning a class to each obtained region by using its characteristics.
Two different experiments are presented. The first one is the study of an urban area (Strasbourg, France) and the second one is the study of a coastal zone (Normandy, France). The aim of these experiments is to show that our system can be used for object-oriented image analysis to improve the extraction of knowledge from images and to provide a better scene understanding.
4.1 Creation of the Regions
The object-oriented approach consists in identifying in the image the objects composed of several connected pixels and having an interest for the domain expert. The first step of this kind of approach is usually the image segmentation, which decomposes the image into a set of regions. There exist many segmentation algorithms, like the watershed transform [19] or region growing [13]. In the following experiments, we have used a supervised segmentation algorithm [3]. The basic idea of this method is to apply a watershed transform on a fuzzy classification of the image to obtain the segmentation. The method is well adapted to complex remote sensing image segmentation and generally produces good results. The second step of an object-oriented approach is the characterization of the different regions obtained from the segmentation. The idea is to compute different features which describe the characteristics of each region. The features we used can be regrouped into two types:
1. spectral features: means of the spectral reflectance of the region, NDVI, SBI.
2. geometric shape features: area, perimeter, elongation and circularity ratio.
Spectral features: The spectral features we used are the means, for each band, of the spectral reflectance of the pixels composing the region. From these means we also compute the NDVI (Normalized Difference Vegetation Index) and the SBI (Soil Brightness Index), defined as

NDVI = (NIR − RED) / (NIR + RED)     SBI = √(RED² + NIR²)
where RED and NIR are the spectral reflectance measurements of the red band and the near-infrared band, respectively. The NDVI index varies between −1.0 and +1.0. For vegetated land it generally ranges from about 0.1 to 0.7, with values greater than 0.5 indicating dense vegetation. These two indexes have been suggested by the geographers of the LIV (Image & Ville Laboratory) because they are discriminant for characterizing some land covers.
Geometric shape features: We used the GeOxygene library (http://oxygene-project.sourceforge.net) to compute shape features. GeOxygene is an open source project developed by the IGN (Institut Géographique National), the French National Mapping Agency. It provides an extensible object data model (geographic features, geometry, topology and metadata) which implements OGC (Open Geospatial Consortium) specifications and ISO standards in the geographic information domain. Geometric shape features like area or perimeter are trivial to compute with GeOxygene. For the elongation we use the smallest value between the elongation computed from the ratio of the eigenvalues of the covariance matrix, and the elongation approximated using the bounding box and its degree of filling. The circularity ratio is defined as

circularity ratio = 4πS / P²

with S the surface and P the perimeter of the region. The circularity ratio ranges from 0 (for a line) to 1 (for a circle). These different shape features are useful to regroup regions according to their form and not only their spectral values (as in a traditional per-pixel classification). When all these different features are computed for each region, we are ready to use our multi-strategical collaborative method on the objects (the regions and their features) to try to identify relevant groups. In the next two sections we provide details of the experiments for two different case studies.
4.2 Urban Area Analysis
The image selected for this first experiment covers Strasbourg (France) and has been taken by Quickbird sensors. These sensors give one panchromatic channel with a resolution of 0.7 meter and 3 color channels with a resolution of 2.8 meters. The color channels are resampled to obtain a four-channel image with a resolution of 0.7 meter [16]. This area is representative of the urban structure of Western cities and is characterized by many different objects (e.g. buildings, streets, water, vegetation) that exhibit a diverse range of spectral reflectance values. We chose four clustering methods with different parameters for experimental tests on our image. Each of the four clustering methods was set up with different parameters:
• M1: K-means [12] with k = 10 and random initialization.
http://oxygene-project.sourceforge.net
• M2: K-means with k = 15 and random initializations.
• M3: Growing Neural Gas [8] with a maximum of 100 nodes, a maximum edge age of 88, 10000 iterations and a node insertion frequency of 100.
• M4: Cobweb [6] with an acuity of 7.5.

These four clustering methods were applied to our set of regions (Fig. 2 shows the segmentation used) and the collaborative system presented in Sect. 3 was then applied. The results obtained before and after the application of our refining algorithm are presented in Table 1 (ni is the number of clusters, Γi is the global agreement coefficient as defined in Sect. 3 and µ denotes the mean of the values).
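As an illustration of this setup, the sketch below runs two of the four clusterers (the two K-means instances) on a matrix of region feature vectors using scikit-learn. Growing Neural Gas and Cobweb are not available in scikit-learn, so they are only indicated by comments as hypothetical helpers, and the feature matrix is randomly generated here as a stand-in for the real region descriptors.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))          # placeholder: 500 regions, 8 features each
X = StandardScaler().fit_transform(X)  # put the features on comparable scales

# M1 and M2: K-means with k = 10 and k = 15, random initializations
results = {
    "M1": KMeans(n_clusters=10, init="random", n_init=5, random_state=1).fit_predict(X),
    "M2": KMeans(n_clusters=15, init="random", n_init=5, random_state=2).fit_predict(X),
    # "M3": growing_neural_gas(X, max_nodes=100, max_edge_age=88)   # hypothetical
    # "M4": cobweb(X, acuity=7.5)                                   # hypothetical
}
for name, labels in results.items():
    print(name, "found", len(np.unique(labels)), "clusters")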
Fig. 2. The segmentation obtained for urban area analysis
Table 1. Results before and after application of our refinement process

       ni (initial)   Γi (initial)   ni (final)   Γi (final)
R1     10             0.379          8            0.532
R2     15             0.336          8            0.538
R3     14             0.317          7            0.476
R4     21             0.186          8            0.362
µ      15             0.304          7.75         0.477

We then applied our multi-view voting algorithm to these results and obtained the unified result presented in Fig. 3. This result is composed of eight different clusters. Figure 4 shows the voting result for all the objects before and after the application of our collaborative method, with the following color legend:
• in white: all the methods agree on the clustering (relevant clusters);
• in gray: one method disagrees with the others;
• in black: the non-consensual objects (two or more methods classified these objects differently).

One can notice a significant reduction of non-consensual objects during the refining process. Figure 5 shows the evolution of the Γ value through the iterations of the algorithm and Fig. 6 the evolution of the number of clusters. One can also notice that the different clusterers agree more and more on a consensual clustering of the objects.

The study of non-consensual objects is very useful. In this experiment, it helps us to highlight the objects that are difficult to cluster. This information is very interesting for the geographer, who can concentrate his attention on these objects. In this image, non-consensual objects are often small houses whose spectral reflectance is close to that of roads and whose shape is unconventional. This information can also be used to improve the segmentation, as non-consensual objects are often the consequence of segmentation errors.
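The white/gray/black legend amounts to counting, for every region, how many of the clustering results agree once their clusters have been put in correspondence. A minimal sketch of that counting step is given below; it assumes the cluster correspondence has already been established, so that each method's result is an array of unified cluster labels, one per region.

import numpy as np

def agreement_map(label_arrays):
    """label_arrays: list of 1-D integer arrays, one per clustering method,
    giving the unified cluster label of each region."""
    labels = np.stack(label_arrays)                  # shape (n_methods, n_regions)
    n_methods = labels.shape[0]
    # For each region, size of the largest group of agreeing methods
    agree = np.array([np.bincount(col).max() for col in labels.T])
    status = np.full(labels.shape[1], "black", dtype=object)   # non-consensual
    status[agree == n_methods] = "white"                       # full agreement
    status[agree == n_methods - 1] = "gray"                    # one dissenter
    return status

# Toy example with 4 methods and 5 regions
print(agreement_map([np.array([0, 1, 2, 0, 1]),
                     np.array([0, 1, 2, 1, 2]),
                     np.array([0, 1, 2, 0, 0]),
                     np.array([0, 1, 1, 0, 3])]))
# -> ['white' 'white' 'gray' 'gray' 'black']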
4.3 Coastal Area Analysis

The second experiment was carried out on the image of a coastal zone (Normandy coast, Northwest of France) taken by the SPOT satellite. This area is very interesting because it is periodically affected by natural and anthropic phenomena which modify its structure. Furthermore, urbanization pressure due to increasing tourism activities and the development of recreational resorts also affects the morphology of the coast. Figure 7 shows the image used in this experiment and Fig. 8 the segmentation.

Our collaborative method was used with four K-means algorithms with different initializations, set up to produce 5, 10, 15 and 20 clusters, respectively. The results obtained before and after the application of our refinement method are presented in Table 2. Here again, one can notice that the clusterers agree more and more, as the global agreement coefficient increases from 0.44 to 0.776. To illustrate this increase of agreement between the clusterers, Fig. 9 shows the evolution of the result of the voting method used to unify the different results.
Fig. 3. The unified result. We can identify four interesting clusters: vegetation, road, water and houses
Fig. 4. Non-consensual objects: (a) first iteration; (b) last iteration
Fig. 5. Evolution of the Γ index (global agreement coefficient)
Fig. 6. Evolution of the number of clusters
Fig. 7. Normandy Coastal zone studied
Fig. 8. The segmentation obtained for coastal area analysis

Table 2. Results before and after application of our refinement process

       ni (initial)   Γi (initial)   ni (final)   Γi (final)
R1     5              0.419          10           0.760
R2     10             0.518          10           0.780
R3     15             0.440          10           0.782
R4     20             0.382          10           0.781
µ      12.5           0.440          10           0.776
For each iteration of the algorithm, we compute, for each object (i.e. region), how many clusterers agree on the classification of this object. Figure 9 shows that the different methods agree more and more on the classification of the objects, as the number of total agreements increases significantly through the iterations.
Fig. 9. Evolution of the agreement during the vote (1/4: all the methods have found different clusters for the object; 4/4: all the methods have found the same cluster)
Figure 10 shows the evolution of the number of conflicts between the different clustering methods. The number of conflicts slowly decreases when the
refinement process starts and falls at the end when no major conflicts are discovered.
Fig. 10. Evolution of the number of conflicts during the refinement
Figure 11 presents the non-consensual objects before and after the application of the collaborative process. At the first iteration (Fig. 11(a)) the different clusterers strongly disagree on the clustering of the objects. This can be explained by the large difference between the initial numbers of clusters of the methods. At the last iteration (Fig. 11(b)) the different methods agree much more, as the number of clusters of each method is identical.
5 Conclusion

In this paper we presented a process of collaborative multi-strategy clustering of complex data, and we showed how it can be used for the automatic object-oriented classification of remote sensing images. Our method makes it possible to carry out an unsupervised multi-strategy classification of these data, giving a single result that combines all the results suggested by the various clustering methods. It makes the different results converge towards a single one (without necessarily reaching it) and produces very similar clusters. In this way, it is possible to put the clusters found by the various methods in correspondence and, finally, to apply a unification algorithm such as a voting method. We can thus propose to the user a single result representing all the results found by the various clustering methods.

We are now interested in using domain knowledge to improve each step of the collaborative mining process. Indeed, in the collaborative step, some operators have been designed to lead the different algorithms towards a unified solution (split, merge, reclassification). We believe that equipping these operators with some knowledge, or designing new specific operators using background knowledge, will help to improve the final result.
Fig. 11. Non-consensual objects: (a) first iteration; (b) last iteration
Fig. 12. Final unified classification with ten clusters
We recently designed an ontology of urban objects [4]. An ontology [11] is a specification of an abstract, simplified view of the world represented for some purpose; it models a domain in a formal way. The ontology defines a set of concepts (buildings, water, etc.) and their relations to each other. These definitions make it possible to describe the studied domain and to reason about it. Preliminary tests have already been carried out to integrate knowledge extracted from this ontology into the collaborative process, and the first results are promising.
Acknowledgments

The authors would like to thank Nicolas Durand and Sébastien Derivaux for their work on the segmentation. This work is part of the FoDoMuST (http://lsiit.u-strasbg.fr/afd/sites/fodomust/) and ECOSGIL (http://ecosgil.u-strasbg.fr) projects.
References

1. Ayad HG, Kamel MS (2007) Cumulative voting consensus method for partitions with variable number of clusters. IEEE Trans Pattern Analysis Mach Intell, to appear
2. Boulis C, Ostendorf M (2004) Combining multiple clustering systems. In: Boulicaut J-F, Esposito F, Giannotti F, Pedreschi D (eds) Proc 8th European Conf Principles and Practice of Knowl Discov in Databases, Pisa, Italy. Springer, Berlin/Heidelberg, pp 63–74
3. Derivaux S, Lefèvre S, Wemmert C, Korczak J (2006) Watershed segmentation of remotely sensed images based on a supervised fuzzy pixel classification. In: Proc IEEE Int Geoscience and Remote Sensing Symp, Denver, CO, USA, pp 3695–3698
4. Durand N, Derivaux S, Forestier G, Wemmert C, Gançarski P, Boussaïd O, Puissant A (2007) Ontology-based object recognition for remote sensing image interpretation. In: Proc 19th IEEE Int Conf Tools with Artif Intell, Patras, Greece. IEEE Computer Society, Los Alamitos, pp 472–479
5. Fern XZ, Brodley CE (2004) Solving cluster ensemble problems by bipartite graph partitioning. In: Proc 21st Int Conf Mach Learn, Banff, AL, Canada. ACM, New York, pp 281–288
6. Fisher DH (1987) Knowledge acquisition via incremental conceptual clustering. Mach Learn 2:139–172
7. Fred A, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Analysis Mach Intell 27:835–850
8. Fritzke B (1994) Growing cell structures – a self-organizing network for unsupervised and supervised learning. Neural Networks 7:1441–1460
9. Gançarski P, Wemmert C (2007) Collaborative multi-step mono-level multi-strategy classification. J Multimedia Tools Appl 35:1–27
10. Gionis A, Mannila H, Tsaparas P (2005) Clustering aggregation. In: Proc 21st Int Conf Data Eng, Tokyo, Japan. IEEE Computer Society, Washington, pp 341–352
11. Gruber TR (1995) Toward principles for the design of ontologies used for knowledge sharing. Int J Human-Comp Studies 43:907–928
12. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J (eds) Proc 5th Berkeley Symp Math Stat and Probability, Berkeley, CA, USA. University of California Press, Berkeley, pp 281–297
13. Mueller M, Segl K, Kaufmann H (2004) Edge- and region-based segmentation technique for the extraction of large, man-made objects in high-resolution satellite imagery. Pattern Recogn 37:1619–1628
14. Oruc M, Marangoz AM, Buyuksalih G (2004) Comparison of pixel-based and object-oriented classification approaches using Landsat-7 ETM spectral bands. Int Archiv Photogrammetry Remote Sensing Spatial Inf Sci 35:1118–1122
15. Paredis J (1997) Coevolving cellular automata: be aware of the red queen! In: Bäck T (ed) Proc 7th Int Conf Genetic Algorithms, East Lansing, MI, USA. Morgan Kaufmann, San Francisco, pp 393–400
16. Puissant A, Ranchin T, Weber C, Serradj A (2003) Fusion of Quickbird MS and Pan data for urban studies. In: Goossens R (ed) Proc 23rd EARSeL Symp Remote Sensing in Transition, Ghent, Belgium. Millpress, Rotterdam, pp 77–83
17. Strehl A, Ghosh J (2002) Cluster ensembles – a knowledge reuse framework for combining multiple partitions. J Mach Learn Research 3:583–617
18. Topchy A, Jain AK, Punch W (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Analysis Mach Intell 27:1866–1881
19. Vincent L, Soille P (1991) Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Trans Pattern Analysis Mach Intell 13:583–598
20. Wemmert C, Gançarski P (2002) A multi-view voting method to combine unsupervised classifications. In: Hamza MH (ed) Proc IASTED Artif Intell Appl, Malaga, Spain. ACTA Press, Calgary, pp 447–452
21. Whiteside T, Ahmad W (2004) A comparison of object-oriented and pixel-based classification methods for mapping land cover in northern Australia. In: Proc Spatial Sci Inst Biennial Conf, Melbourne, Australia
22. Zhou Z-H, Tang W (2006) Clusterer ensemble. Knowl-Based Syst 19:77–83
Intrusion Detection in Computer Systems Using Multiple Classifier Systems

Igino Corona, Giorgio Giacinto, and Fabio Roli

Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d'Armi, 09123 Cagliari, Italy, igino.corona, giacinto, [email protected]

Summary. Multiple Classifier Systems (MCS) have been applied successfully in many different research fields, among them the detection of intrusions in computer systems. In the intrusion detection field, the MCS approach may be motivated, for example, by the presence of different network protocols (and related services, with specific features), multiple concurrent network connections, and distinct host applications and operating systems. In such a heterogeneous environment the MCS approach is particularly suitable, and different MCS designs have been proposed. In this work we present an overview of different MCS paradigms used in the intrusion detection field and discuss their peculiarities. In particular, MCS appear to be well suited to the anomaly detection paradigm, where attacks are detected as anomalies with respect to a model of normal (legitimate) event patterns. In addition, MCS may be used to increase the robustness of Intrusion Detection Systems (IDS) against attacks on the IDS itself. Finally, a practical application of MCS to the design of anomaly-based IDS is presented.

Key words: intrusion detection system, multiple classifier system, one-class classifier, anomaly detection
1 Introduction

Intrusion Detection Systems (IDS) are employed to detect patterns related to attacks on computer systems. Attacks (respectively, legitimate actions) may be defined as actions performed by users accessing computer system services and resources in a way that deviates from (respectively, conforms to) the use they have been deployed for. Such actions are evaluated through their measurable features. Depending on the type of input data, two types of IDS are currently used. Network-based IDS analyze the traffic of a computer network, whereas host-based IDS analyze audit data recorded by networked hosts. Traditionally, the design of IDS relies upon expert, hand-written definitions of models describing either legitimate or attack patterns in computer systems [24, 30]. According to the employed intrusion detection model, an alert is produced if
a pattern is not included in the model of legitimate action patterns (a.k.a. anomaly-based IDS), or if it is included in the models of attacks (a.k.a. misuse- or signature-based IDS).

In addition to requiring human expertise to model legitimate or attack patterns, this approach is time- and effort-expensive if its effectiveness is to be guaranteed, that is, attaining high detection rates and low false alarm rates. Finally, hand-written models offer poor capabilities for detecting novel (zero-day) attacks. This is especially true for attack pattern models, which, by definition, describe only known attacks. Conversely, anomaly-based IDS are in principle able to detect new attacks. The problem, in this case, is that it is difficult to produce effective models of legitimate patterns by hand.

To cope with these problems, the intrusion detection task has been formulated as a pattern recognition task based on machine learning algorithms. However, a large number of patterns related to computer system events, at different places and abstraction levels, have to be analyzed. Consequently, many features have to be considered during the intrusion detection task. For example, sequences of operating system calls, running applications, open IP ports, web server logs, logged-in users, file property changes and database logs are host-side features. Furthermore, network traffic, i.e. the packets exchanged between different hosts, must be analyzed. In particular, many protocols, at different abstraction levels and with different semantics, have to be considered: ICMP, IP, TCP, UDP, HTTP, FTP, SMTP and IMAP are some examples. Patterns extracted from this heterogeneous domain are very difficult to characterize. In fact, it is very difficult to take the domain knowledge into account as a whole in the design of a classifier. In general, as the number of features increases, the pattern recognition task becomes more complex and less tractable. Furthermore, the larger the number of features, the larger the number of training patterns needed to avoid the well-known "curse of dimensionality" problem [7].

MCS provide a solution to address these issues. The intuitive advantage given by the MCS approach is supported by the results in [20], where Lee et al. measured the "complexity" of the classification task with an information-theoretic approach. The subdivision of the classification problem into multiple sub-problems results in a decrease of the entropy of each subset of training patterns, which in general corresponds to the ability to construct a more precise model of the normal traffic. For example, the feature set can be subdivided into subsets, and a classifier can be trained for each subset. It can be proven that the combination of multiple classifiers trained with complementary information content may outperform a monolithic classifier [15, 38]. The intrusion detection domain can be seen as the union of different sub-domains, each one providing complementary information. A sub-domain may be characterized by a specific:
• abstraction level, i.e. a group of network protocols, or user logins on different hosts;
• place, i.e. the software applications running in a router (e.g. managing the routing table), or the software applications running on a host.

Each sub-domain may be decomposed and analyzed in more detail. For example, considering a group of protocols at a specific abstraction level (e.g. HTTP, FTP and SMTP), the syntax and semantics of each protocol can be taken into account in the design of service-specific classifiers (each protocol is associated with a specific network service, e.g. SMTP is used to send e-mails). This allows the sub-domain knowledge to be included, with a more precise and clear modeling of patterns. In fact, recent works have focused on analyzing specific applications or services precisely for this capability. An interesting example of a service-specific (HTTP) IDS is presented in [16]. In that paper the MCS approach is applied a step further, combining different classifiers, each one trained on a specific feature of the HTTP requests. We note that such an approach may be useful to deal with statistically correlated features. By considering each feature singularly, information derived from the joint analysis of different features may be lost. However, if such features are correlated, this loss may be small with respect to the increased precision of the model for each feature. Such precision may allow the monolithic approach to be outperformed. Moreover, it is possible to use more than one classifier for a specific feature. This is the case for the service-specific (FTP) IDS proposed in [1], where the legitimate sequences of commands submitted by an FTP client are modeled through a combination of Hidden Markov Models. Conversely, in [4] the MCS approach is proposed to detect attacks related to different network services. In particular, a different design, based on serially connected classifiers, is presented.

MCS have also proven to be useful in the intrusion detection domain when the training set is made up of unlabeled data, as in typical real-world cases it is very effort- and time-expensive to assign each pattern to the legitimate or to the attack class. On the other hand, unlabeled data are readily available, as they can be collected at network hubs without human intervention.

A novel issue is recently drawing the attention of the security community, as the data collected for deploying the detection engine of an IDS can be polluted by an adversary. Such pollution is produced by submitting specifically crafted malicious patterns aimed at providing either fake attack patterns or fake normal patterns. This aspect clearly affects those IDS based on pattern recognition techniques, which typically assume that the training set is only affected by random noise, so that the resulting IDS is "mis-trained". This problem could be neglected when a supervised approach is used, where such fakes can be identified and removed from the training set. Instead, it cannot be neglected when an unsupervised approach is used. A suitable MCS approach could address the problem by using classifiers based on different training algorithms for a certain set of features: an adversary would need to inject different types of malicious noise, each one conceived to mis-train a specific classifier of the ensemble. Also, such noise could affect
some feature subsets, without significantly decreasing the overall performance of the MCS. However, to date, the contribution of MCS to the problem of learning in the presence of malicious patterns still has to be investigated further. On the other hand, a recent paper showed that the MCS approach is able to increase the robustness against an adversarial environment during the operational phase of an IDS [28].

Finally, the MCS approach could be very helpful to implement effective alert verification mechanisms, as a practical problem of IDS is the high number of false alarms. Alert verification mechanisms can be defined as post-processing operations used to assess whether alerts are actually related to attacks or not, thus reducing false alarms. Using multiple, specific classifiers it is also possible to provide specialized alerts and more reliable alert verification processes. It is worth noting that, to the best of our knowledge, no work has addressed this issue, even though it appears to be important.

The paper is organized as follows. Section 2 outlines the peculiarities of some recent intrusion detection systems based on multiple classifiers [1, 4, 16]. In Sect. 3 the design of an anomaly-based IDS using unlabeled data is presented, while Sect. 3.1 describes a solution based on a modular MCS architecture. The detection algorithms and the combination techniques employed are described in Sect. 3.2. Experimental results on the KDD-Cup 1999 dataset are reported in Sect. 3.3, where comparisons with results in the literature are also shown. Conclusions are drawn in Sect. 4.
2 Multiple Classifier System Designs for Intrusion Detection

As discussed in the introduction, the intrusion detection domain is heterogeneous in the sense that many different systems, at different abstraction levels, and data flowing through specific protocols must be monitored. The MCS paradigm can be used to handle the complexity of the problem by subdividing the general classification problem into multiple, simpler classification problems. In this section three different MCS designs are discussed.

2.1 One-to-One Mapping between Features and Classifiers

Using a service-specific IDS it is possible to analyze the network traffic while including knowledge of the specific protocol. The service-specific approach is useful because of the differences in protocol semantics and syntax. Moreover, as new versions of protocols are usually extensions of current protocols, a flexible protocol-specific IDS could cope with such changes. Analogously, an application-specific IDS can exploit knowledge of the application's internal mechanisms. A service-specific IDS designed to detect intrusions in web servers is presented in [16]. Incoming HTTP requests extracted from web
server logs (which record every processed HTTP request coming from a web client, typically a web browser) are analyzed to model normal patterns (anomaly detection). In particular, the IDS analyzes HTTP requests made with the GET method, and the related request URIs are inspected. This analysis is motivated by the fact that the majority of web requests, as well as web attacks, use the GET method. This method allows user inputs to be submitted through the request URI to the web application which subsequently processes them. The vulnerabilities of web applications are typically exploited by an attacker by submitting well-crafted inputs.

The features used to characterize the HTTP traffic may be subdivided into two categories: spatial and temporal features. Spatial features are extracted from a single request URI, e.g. length, character distribution, character sequences, enumeration. Temporal features are extracted from multiple requests generated by a certain host, e.g. the frequency of requests on a specific resource, the order of requests, and the time delay between two requests. For each feature, a single classifier is used. It is worth noting that some features are correlated. For example, if Hidden Markov Models (HMM) are used to model the sequences of characters, the length of the sequence is also modeled, as the probability of a certain sequence decreases as its length increases. Using different models focusing on specific aspects of correlated features, and combining the information, allows effective models to be attained. In that work, a simple linear combination of the outputs (i.e. the probabilities that the pattern is normal) is used, so that an anomaly score for each pattern is computed as follows:

anomalyScore(pattern) = Σ_{i=1}^{M} wi · (1 − pi{pattern})
where M is the number of classifiers, wi is the weight applied to the output of the i-th classifier and pi{pattern} is the probability of normality assigned to the pattern by the i-th classifier.

The design proposed so far can be useful to deal with an adversarial environment. By employing a dedicated classifier for each feature, it can be simpler to deal with malicious patterns in the training data that are conceived to negatively affect the model selection. An adversary must explicitly conceive patterns with values that are "malicious" simultaneously for different features in order to mis-train the corresponding classifiers. Figure 1 shows the MCS design schema applied in [16]. This schema is applicable not only to HTTP traffic, but in general to every network protocol or application. It is worth noting that in [16] the adversarial environment was not considered during the training phase.
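A minimal sketch of this linear combination of per-feature models is given below. In practice the per-feature probabilities of normality would come from the models trained on each HTTP request feature (length, character distribution, and so on); here they are passed in directly, and the weights and threshold are illustrative values, not those of [16].

def anomaly_score(probs, weights=None):
    """probs   : list of p_i(pattern), each model's probability that the pattern is normal
       weights : optional list of w_i (defaults to uniform weights)"""
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)
    return sum(w * (1.0 - p) for w, p in zip(weights, probs))

# Example: six feature-specific models, one of which finds the request unusual
score = anomaly_score([0.97, 0.97, 0.95, 0.98, 0.12, 0.96])
print(round(score, 3))      # ~0.175
threshold = 0.15            # illustrative value, tuned on normal traffic
print("anomalous" if score > threshold else "normal")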
Fig. 1. MCS design schema followed in [16]. For each feature a model is used to characterize the statistical distribution of feature values. Each model assigns a probability to the input pattern of being normal. A sum of such outputs is used to calculate an anomaly score. A pattern is evaluated as intrusive if the anomaly score exceeds a certain threshold
2.2 Multiple Classifiers to Model Individual Features

For each feature, different classifiers can be used. This approach can be useful when the classifiers are based on training algorithms that may stop at different local minima depending upon their initial parameters. It has been shown that an ensemble of such classifiers may outperform a single one [7]. In [1] such an approach is applied to the analysis of the FTP command sequences issued by a client toward a server. In particular, the classifiers are based on Hidden Markov Models (HMM), aiming at modeling legitimate command sequences (anomaly detection). The Baum-Welch training algorithm used for HMM requires setting initial values for both the state transition and the state-conditional symbol emission probabilities [2]. Usually these values are chosen randomly, as typically there is no a priori information about the structure of the sequences. In addition, the choice of the most suitable number of "hidden" states is usually performed by a trial-and-error procedure, thus producing different models. The outputs of the different HMM models generated by varying the design parameters can be combined by a number of techniques. In [1] the following techniques are used: the geometric mean, the arithmetic mean, and decision templates [17]. This MCS design is shown in Fig. 2. The reported results show that the number of false alarms decreases when multiple models are combined, while the detection rate increases. In addition, this schema can increase the robustness of detection in an adversarial environment, as the final decision depends on
the combination of multiple different classifiers. However, this aspect needs to be further researched.
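The combination step can be sketched independently of the HMM training itself: given the probabilities that K differently initialized HMMs assign to an FTP command sequence, the arithmetic and geometric mean rules reduce to the few lines below (the decision-template rule used in [1] is omitted). The probability values and the threshold are made up for illustration.

import math

def arithmetic_mean(probs):
    return sum(probs) / len(probs)

def geometric_mean(probs):
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

# Probabilities of the same command sequence under K = 5 HMMs trained
# from different random initializations (illustrative values)
probs = [3.2e-7, 1.5e-7, 4.8e-7, 2.1e-7, 9.0e-8]

threshold = 1e-7   # illustrative; chosen for the desired false alarm rate
for name, rule in [("arithmetic", arithmetic_mean), ("geometric", geometric_mean)]:
    p = rule(probs)
    print(f"{name:10s} mean = {p:.2e} ->", "normal" if p >= threshold else "anomalous")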
Fig. 2. MCS design schema applied in [1]. K different HMMs are used to characterize a single feature, that is, the command sequences issued by an FTP client. Each HMM is trained with different initial training parameters. The number of states, the output combination method, and the decision threshold must be chosen so that a good tradeoff between detection rate and false alarm rate is obtained
2.3 Multiple-Stage MCS

The two previous MCS design paradigms were based on a parallel combination of classifiers: the classifier outputs are evaluated and correlated at the same time. This is of course not the only way to design an MCS. For example, classifier outputs may be evaluated serially, and the most general way to combine classifier outputs is a mixture of the two techniques. In [4] a multi-stage classification system for the analysis of network traffic is proposed. Here, the intrusion detection task is performed through a cascade of stages. For each stage a classifier is trained to distinguish between legitimate patterns and patterns pertaining to a specific attack class. If a stage assigns a certain pattern to the legitimate class, this pattern is forwarded to the next stage for further inspection. Thus, a pattern is evaluated as pertaining to the
legitimate class by the MCS only if all stages assign it to that class. Conversely, if a pattern is evaluated as intrusive, two actions are possible: (1) if the estimated classification reliability is higher than a fixed threshold, an alert is raised; (2) otherwise the pattern is sent to another stage, or the related connection is logged. Figure 3 summarizes the MCS design for intrusion detection proposed in [4]. It is worth noting that this figure generalizes the original schema proposed in [4] by introducing the condition on a "sufficient attack detail level". Actually, this condition and the further inspections are implemented through a cascade of two other stages that distinguish between different attack classes (e.g. Denial of Service and Probe attacks), provided that a generic attack has been recognized in the first stage. If an attack is recognized in any of the other stages, no further inspections are performed, that is, either an alert is raised or the related connection is stored. It is worth noting that this MCS can be designed only if supervised training is used. Unfortunately, in typical real cases supervised training is not applicable. Therefore, it may be interesting to devise such a multi-stage MCS paradigm for unlabeled data. In our opinion this schema presents at least two drawbacks:
• the overall response (classification) time can be higher than that of the previous schemas. Intuitively, whereas in the first two cases the response time depends on the model with the largest response time, here it depends on the sum of the response times of the stages involved;
• the model of output correlation is less flexible than in the previous schemas, as it is heavily tailored to the specific task at hand.

On the other hand, the originality of this design approach for the intrusion detection task lies in the proposal of a hierarchy of classifiers. For example, if the first stage detects a generic attack, the corresponding pattern may be inspected more thoroughly by specialized classifiers looking for different attacks. This allows for detailed analysis "on demand", providing for specific alert verification mechanisms.
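A minimal sketch of such a serial arrangement is given below; it uses generic placeholder stages that return a predicted class together with a reliability estimate, since the actual attack-specific classifiers of [4] are not reproduced here. The two toy rules and the reliability threshold are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List, Tuple

# Each stage returns ("legitimate" or an attack class name, reliability in [0, 1])
Stage = Callable[[dict], Tuple[str, float]]

@dataclass
class CascadeIDS:
    stages: List[Stage]
    reliability_threshold: float = 0.8

    def classify(self, pattern: dict) -> str:
        for stage in self.stages:
            label, reliability = stage(pattern)
            if label == "legitimate":
                continue                          # forward to the next stage
            if reliability >= self.reliability_threshold:
                return f"alert: {label}"          # confident attack decision
            return "log connection"               # uncertain: store for later analysis
        return "legitimate"                       # accepted by every stage

# Two toy stages: one looks for flooding, one for probing (illustrative rules)
dos_stage = lambda p: ("dos", 0.95) if p.get("pkts_per_s", 0) > 1000 else ("legitimate", 1.0)
probe_stage = lambda p: ("probe", 0.6) if p.get("distinct_ports", 0) > 50 else ("legitimate", 1.0)

ids = CascadeIDS(stages=[dos_stage, probe_stage])
print(ids.classify({"pkts_per_s": 5000}))                      # alert: dos
print(ids.classify({"distinct_ports": 80}))                    # log connection
print(ids.classify({"pkts_per_s": 10, "distinct_ports": 3}))   # legitimate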
3 Unsupervised, Anomaly-Based Intrusion Detection Using Multiple Classifier Systems

Anomaly-based detection was the first approach to be developed, on account of its theoretical ability to detect intrusions regardless of the specific system vulnerability and the type of attack [6]. Even though anomaly-based approaches are promising, they usually produce a relatively high number of false alarms, due to the difficulty of modeling normal patterns. Moreover, some patterns may differ from normal ones even though they do not actually represent any attack. For this reason, signature-based detectors are usually deployed in many organizations instead, thanks to their ability to reliably detect known attacks while producing a relatively low number of false alarms. However, it is worth noting that signature-based IDS may
Fig. 3. MCS design schema applied in [4]. A pattern is analyzed serially: if evaluated as normal by a classifier, it is submitted to the next one. Otherwise, an alarm is generated if the classification reliability is higher than a threshold. If pattern classification reliability is under this threshold, packets related to this pattern are stored. To define more thoroughly the attack class, further analysis may be employed
also produce false alarms when overstimulated, i.e. when they are explicitly induced to generate false alarms in order to pollute the IDS alert logs [27, 37]. It is worth noting that keeping attack signatures secret, as is done in commercial IDS, does not help to mitigate the problem [26]. Furthermore, attackers are constantly developing new attack tools designed to evade signature-based IDS. Techniques based on metamorphism and polymorphism are used to generate instances of the same attack that look syntactically different from each other, yet retain the same semantics and therefore the same effect on the victim [35].

Pattern recognition techniques for misuse-based IDS have also been explored thanks to their generalization ability, which may support the recognition of new "variants" of known attacks that cannot be reliably detected by signature-based IDS [5, 8, 10, 11]. In order to apply these techniques, a dataset containing examples of attack patterns as well as normal traffic is needed. However, it is very difficult and expensive to create such a dataset, and previous work in this field funded by DARPA and carried out by the MIT Lincoln Laboratory group [22] has been largely criticized [23, 25].

The reasons mentioned above motivate the renewed interest in network-based anomaly detection techniques and, in particular, in unsupervised or
unlabeled anomaly detection. Because it is very hard and expensive to obtain a labeled dataset, clustering and outlier detection techniques are applied to completely unlabeled traffic samples extracted from a real network. Alternatively, when a (small) set of labeled data samples is available in addition to the unlabeled data, semi-supervised techniques may be used [3]. Here an unsupervised method is presented, where the only a priori knowledge about the data is represented by the following assumptions: i) the extracted dataset contains two classes of data, normal and anomalous traffic; ii) the amount of anomalous traffic is by far smaller than the amount of normal traffic.

The first assumption comes from the consideration that purely legitimate traffic can only be generated by simulations in an isolated environment. However, this simulation process cannot reproduce the traffic patterns of a real network [23]. In turn, an anomaly detector derived from such an artificial dataset of normal traffic may easily produce a large number of false alarms during the operational phase, because real network traffic is in general different from the artificially generated one. On the other hand, it is very hard, time-consuming and expensive to produce a dataset of purely legitimate traffic by thoroughly "cleaning" real network traffic traces. As a result, traffic traces collected from real networks are not guaranteed to be attack-free. The second assumption is due to the fact that in general the majority of the network traffic is legitimate [29]. Moreover, it is possible to use existing signature-based IDS to eliminate the known attacks from the collected traffic, thus further reducing the percentage of attack events in the dataset.

Recently proposed anomaly detectors use clustering and outlier detection techniques on unlabeled data [9, 21, 29]. These techniques try both to cope with the possible presence of outlier attack patterns in the training data and to detect anomalous network events during the operational phase. They are based on the common assumption that, in a suitable feature space, attack traffic is statistically different from normal traffic [6, 14].

Here an unlabeled approach for network anomaly IDS based on a modular Multiple Classifier System (MCS) is presented [12], whereby: i) a module is designed for each group of protocols and services, so that it fits the characteristics of the normal traffic related to that specific group of services [20, 36]; ii) each module can be implemented by using an individual classifier as well as a combination of different classifiers. The presented modular architecture allows the designer to choose the rejection threshold of each module, so that the overall attack detection rate can be optimized given a desired total false alarm rate for the ensemble.

The work presented in the following is mainly inspired by [9] and [10]. The unlabeled anomaly detection problem is faced by applying multiple one-class classification techniques, which are often also referred to as outlier detection techniques. In particular, a heuristic to tune the false alarm rate produced by each anomaly detection module is proposed so that, given a fixed
tolerable false alarm rate for the IDS, the overall detection rate is optimized. Besides, the combination of one-class classifiers is a new and not completely explored research area. A heuristic approach to combine the outputs of multiple one-class classifiers was proposed by Tax et al. in [32]. Nevertheless, the proposed heuristic may present some problems, in particular when density-based and "distance-based" one-class classifiers are combined. Here a new heuristic is used to map the output of two different "distance-based" one-class classifiers to a class-conditional probability, which aims at overcoming the mentioned problem.

3.1 Modular MCS Architecture

Problem Definition

The traffic over a TCP/IP network consists of packets related to communications between hosts. The exchange of packets between hosts usually fits the client-server paradigm, whereby a client host requests some information offered by a service running on a server host. The set of packets related to the communication established between the client and (the service running on) the server forms a connection. Each connection can be viewed as a pattern to be classified, and the network-based anomaly detection problem can be formulated as follows [11]: given the information about connections between pairs of hosts, assign each connection to the class of either normal or anomalous traffic.
Modular Architecture

As mentioned above, each connection is related to a particular service. Different services are characterized by different peculiarities, e.g. the traffic related to the HTTP service is different from the traffic related to the SMTP service. Besides, as different services involve different software applications, attacks launched against different services manifest different characteristics. In this work, network services are divided into m groups, each one containing a number of "similar" services [10]. Therefore, m modules are used, each one modeling the normal traffic related to one group of services. An example of how the services can be grouped is shown in Fig. 4, where the grouping refers to the network from which the KDD-Cup 1999 dataset was derived (see Sect. 3.3).
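A sketch of this dispatching step is shown below: connections are routed to a service-group module according to their destination port or protocol, and each module applies its own one-class score and threshold. The port-to-group mapping, the dummy scoring function and the thresholds are illustrative placeholders, not the exact grouping or models used in the experiments.

# Map destination ports to service groups (illustrative, partial mapping)
PORT_TO_MODULE = {80: "HTTP", 20: "FTP", 21: "FTP", 69: "FTP",
                  25: "Mail", 110: "Mail", 143: "Mail"}

def module_for(connection):
    if connection.get("protocol") == "icmp":
        return "ICMP"
    port = connection["dst_port"]
    if port >= 49152:
        return "Private&Other"
    return PORT_TO_MODULE.get(port, "Miscellaneous")

# One (score function, threshold) pair per module; higher score = more normal.
# Every module uses the same dummy scorer here for brevity.
dummy_score = lambda conn: 0.9 if conn.get("bytes", 0) < 1e6 else 0.05
MODULES = {name: (dummy_score, 0.5)
           for name in ["HTTP", "FTP", "Mail", "ICMP", "Private&Other", "Miscellaneous"]}

def classify(connection):
    name = module_for(connection)
    score, threshold = MODULES[name]
    return name, ("anomalous" if score(connection) < threshold else "normal")

print(classify({"dst_port": 80, "protocol": "tcp", "bytes": 2_000_000}))  # ('HTTP', 'anomalous')
print(classify({"dst_port": 25, "protocol": "tcp", "bytes": 4_000}))      # ('Mail', 'normal')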
Fig. 4. Modular architecture
Overall vs. Service-Specific False Alarm Rate

Anomaly detection requires setting an acceptance threshold t, so that a traffic pattern x is labeled as anomalous if its similarity s(x, M) to the normal model M is less than t. Let m be the number of service-specific modules of the IDS; FAR the overall tolerable false alarm rate; FARi the false alarm rate related to the i-th module; ti the acceptance threshold for the i-th module; and P(Mi) = ni/n the prior distribution of the patterns related to the i-th group of services (i.e. the module) Mi in the training data, where ni is the number of patterns related to the services for which the module Mi is responsible and n is the total number of patterns in the training dataset. Given a fixed value of the tolerable false alarm rate FAR for the IDS, there are many possible ways to "distribute" the overall FAR over the m modules. Once a FARi has been set for each module Mi, the thresholds ti can be chosen accordingly. In this work we chose

FARi = FAR / (m · P(Mi))    (1)
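A minimal sketch of this allocation is given below, using made-up prior probabilities P(Mi) for the six service groups; it simply applies Eq. (1) and checks that the priors-weighted average of the per-module rates equals the overall budget.

FAR = 0.01  # overall tolerable false alarm rate (1%)

# Prior probability of each service group in the training data (illustrative values)
priors = {"HTTP": 0.60, "FTP": 0.05, "Mail": 0.10,
          "ICMP": 0.05, "Private&Other": 0.15, "Miscellaneous": 0.05}
m = len(priors)

far_per_module = {name: FAR / (m * p) for name, p in priors.items()}   # Eq. (1)
for name, far_i in far_per_module.items():
    print(f"{name:15s} FAR_i = {far_i:.4f}")

# Sanity check: the priors-weighted average of the FAR_i equals FAR (~0.01)
print(sum(priors[name] * far_per_module[name] for name in priors))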
This choice allowed us to attain a higher overall detection rate DR than that attained by choosing a fixed value FARi = FAR for each module. In order to set an acceptance threshold ti for the module Mi that yields the false alarm rate FARi computed as in (1), we propose the following heuristic. Let us first note that, for a given value of ti, the fraction pri(ti) of patterns rejected by Mi may contain both patterns related to attacks and false alarms. Let us denote by pai(ti) the fraction of rejected attack patterns for the threshold ti, and by fari(ti) the related fraction of false alarms. It is easy to see that pri(ti) = pai(ti) + fari(ti). We propose to assume pai(ti) = Pai, where Pai is the expected attack probability for the i-th service (in practice, if the network is already protected by "standard" security devices, e.g. a firewall or a signature-based IDS, Pai may be estimated from historical data about the attacks against network service i that occurred in the past). In other words, we assume that, for a given threshold value, the rejected patterns are
made up of all the attacks related to that service contained in the training set, plus a certain number of normal patterns. Thus, having fixed the value of pai(ti) = Pai, we can tune ti in order to obtain fari(ti) = FARi.

Service-Specific MCS

Lee et al. [19] proposed a framework for constructing the features used to describe the connections (the patterns). The derived set of features can be subdivided into two groups: i) features describing each single connection; ii) features related to statistical measures on "correlated" connections, namely different connections that have in common either the type of service they refer to or the destination host (i.e. the server host). The latter subset of features is usually referred to as traffic features. The first group of features can be further subdivided into two subsets, namely intrinsic features and content features. The intrinsic features are extracted from the headers of the packets related to the connection, whereas the content features are extracted from the payload (i.e. the data portion of the packets). We call F the entire set of features and I, C and T the subsets of intrinsic, content and traffic features respectively, so that F = I ∪ C ∪ T.

The problem of modeling the normal traffic for each module of the IDS can be formulated essentially in two different ways: i) a "monolithic" classifier can be trained using all the available features to describe a pattern; ii) subsets of features from the three groups described above can be used separately to train different classifiers whose outputs are then combined. Depending on the dimensionality d of the feature space and the size of the training set, one approach can outperform the other. In the following, when needed, an MCS consisting of either two or three classifiers is used, depending on the considered module Mi. When a two-classifier MCS is used, the module is implemented by training two classifiers on two different feature subsets, namely I ∪ C and I ∪ T. When a three-classifier MCS is used, the module is implemented by training one classifier on each single subset of features, namely one classifier is trained using the subset I, one using C and one using T (see Fig. 5).

3.2 Ensembles of Unsupervised Intrusion Detection Techniques

One-Class Classification

One-class classification (also referred to as outlier detection) techniques are particularly useful in those two-class problems where one of the classes of objects is well sampled, whereas the other one is severely undersampled because it is too difficult or expensive to obtain a significant number of training patterns.
Fig. 5. Feature subsets for service-specific MCS
The goal of one-class classification is to distinguish between a set of target objects and all the other possible objects, referred to as outliers [32, 33]. A number of one-class classification techniques have been proposed in the literature. They can be subdivided into three groups, namely density methods, boundary methods and reconstruction methods [33]. Here, one classification method from each category is used to implement the service-specific MCS modules described in Sect. 3.1. This allows different approaches that showed good results in other applications to be compared. In particular, we chose the Parzen density estimation [7] from the density methods, the ν-SVC [31] from the boundary methods and the k-means algorithm [13] from the reconstruction methods. These one-class classifiers exhibited good performance in a number of applications [33]. Besides, the outputs of the k-means and ν-SVC classifiers can be redefined as class-conditional probability density functions, so that they can be correctly combined with the output of the Parzen classifier (see Sect. 3.2). We also trained the clustering technique proposed by Eskin et al. [9] in order to compare the results of the combination of "standard" pattern recognition techniques with an algorithm tailored to the unlabeled intrusion detection problem.

Parzen Density Estimation

The Parzen-window approach [7] can be used to estimate the density of the distribution of the target objects. When the Gaussian kernel is used,

p(x|ωt) = (1/n) Σ_{i=1}^{n} 1/(2πs)^{d/2} · exp(−‖x − xi‖²/(2s)),    s = h²    (2)
where n is the total number of training patterns belonging to the target class ωt, xi is the i-th training pattern, h is the width of the Parzen window and p(x|ωt) is the estimated class-conditional probability density distribution. The one-class Parzen classifier can be obtained by simply setting a threshold θ whereby a pattern z is rejected (i.e. deemed an outlier) if p(z|ωt) < θ [33].
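A small sketch of this one-class Parzen classifier is given below; the window width h, the rejection threshold θ and the synthetic training data are illustrative and would normally be tuned on real traffic.

import numpy as np

class OneClassParzen:
    def __init__(self, h=0.5):
        self.h = h                          # Parzen-window width

    def fit(self, X):
        self.X = np.asarray(X, dtype=float)
        self.s = self.h ** 2
        self.d = self.X.shape[1]
        return self

    def density(self, z):                   # Eq. (2)
        diff = self.X - np.asarray(z, dtype=float)
        sq = np.sum(diff ** 2, axis=1)
        norm = (2.0 * np.pi * self.s) ** (self.d / 2.0)
        return np.mean(np.exp(-sq / (2.0 * self.s)) / norm)

    def is_outlier(self, z, theta):
        return self.density(z) < theta

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(200, 3))           # "normal traffic" sample
clf = OneClassParzen(h=0.5).fit(target)
print(clf.is_outlier([0.1, -0.2, 0.0], theta=1e-3))    # False: close to the target data
print(clf.is_outlier([6.0, 6.0, 6.0], theta=1e-3))     # True: far from the target data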
k-means

The k-means classifier is based on the well-known k-means clustering algorithm [13]. The algorithm identifies k clusters in the data by iteratively assigning each pattern to the nearest cluster. In order to allow this algorithm to produce an output that can be interpreted as a probability density function, we propose to use all the k distances between the test pattern x and the centroids µi as follows:

p(x|ωt) = (1/k) Σ_{i=1}^{k} 1/(2πs)^{d/2} · exp(−‖x − µi‖²/(2s)),    s = avg_{i,j} ‖µi − µj‖,  i, j = 1, 2, …, k    (3)
In other words, we model the distribution of the target class by a mixture of k normal densities, each one centered on a centroid µi. A heuristic is used to compute s as the average distance between the k centroids. As with the one-class Parzen classifier, the one-class k-means classifier based on (3) can be obtained by setting a threshold θ whereby a pattern x is rejected if p(x|ωt) < θ. The number of centroids is in general chosen to be small, therefore s can be computed efficiently. This means that the proposed probability density estimate does not add appreciable complexity to the classifier.
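The sketch below turns scikit-learn's k-means into the one-class classifier of Eq. (3): the mixture is centered on the k centroids and the common width s is the average pairwise distance between centroids. The number of clusters, the threshold and the synthetic data are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist

class OneClassKMeans:
    def __init__(self, k=5, random_state=0):
        self.k = k
        self.random_state = random_state

    def fit(self, X):
        km = KMeans(n_clusters=self.k, n_init=10, random_state=self.random_state).fit(X)
        self.centroids = km.cluster_centers_
        self.s = pdist(self.centroids).mean()      # average centroid distance
        self.d = X.shape[1]
        return self

    def density(self, x):                          # Eq. (3)
        sq = np.sum((self.centroids - np.asarray(x, float)) ** 2, axis=1)
        norm = (2.0 * np.pi * self.s) ** (self.d / 2.0)
        return np.mean(np.exp(-sq / (2.0 * self.s)) / norm)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (150, 4)), rng.normal(5, 1, (150, 4))])
clf = OneClassKMeans(k=5).fit(X)
theta = 1e-5                                       # illustrative rejection threshold
print(clf.density([0, 0, 0, 0]) > theta)           # True: inside the target data
print(clf.density([20, 20, 20, 20]) > theta)       # False: rejected as an outlier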
ν-SVC

The ν-SVC classifier was proposed by Schölkopf et al. in [31] and is inspired by the Support Vector Machine classifier proposed by Vapnik [34]. The one-class classification problem is formulated as finding a hyperplane that separates a desired fraction of the training patterns from the origin of the feature space F. When the Gaussian kernel is used, the output of the ν-SVC can be formulated in terms of a class-conditional density of the form

p(x|ωt) = Σ_{i=1}^{n} αi · 1/(2πs)^{d/2} · exp(−‖x − xi‖²/(2s))    (4)
where αi is the i-th Lagrange multiplier and s is a parameter of the Gaussian kernel.
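In practice a ν-SVC of this kind is available off the shelf; the sketch below uses scikit-learn's OneClassSVM with an RBF kernel, whose nu parameter roughly controls the fraction of training patterns allowed to fall outside the learned boundary. The data and parameter values are illustrative, not those used in the experiments.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(500, 6))        # unlabeled, mostly "normal" traffic

# nu ~ expected fraction of outliers in the training data; gamma sets the RBF width
clf = OneClassSVM(kernel="rbf", nu=0.02, gamma=0.1).fit(train)

test = np.vstack([rng.normal(0, 1, size=(5, 6)),      # normal-like patterns
                  rng.normal(8, 1, size=(5, 6))])     # far from the training data
print(clf.predict(test))   # +1 marks inliers (target class), -1 marks outliers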
Combining One-Class Classifiers

Traditional pattern classifiers can be combined by using many different combination rules and methods [18]. Among the combination rules, the min, max, mean and product rules [15] are some of the most commonly used. These combination rules can be easily applied when the output of the classifiers can be viewed as an a posteriori probability Pi(ωj|x), where the subscript i refers to the i-th classifier and ωj is the j-th class of objects. In this case there are two classes. By considering the distribution of the outliers to be constant in a suitable region of the feature space [32], the a posteriori probability for the target class can be approximated as

Pi(ωt|x) = pi(x|ωt)P(ωt) / (pi(x|ωt)P(ωt) + θi·P(ωo)),    i = 1, …, L    (5)
where ωt represents the target class, ωo represents the outlier class and θi is the uniform density distribution assumed for the outlier patterns. Let us now consider the traditional mean combination rule. We need to compute

µ(ωt|x) = (1/L) Σ_{i=1}^{L} Pi(ωt|x),    µ(ωo|x) = (1/L) Σ_{i=1}^{L} Pi(ωo|x)    (6)

and the decision criterion is

µ(ωt|x) < µ(ωo|x)  ⇒  x is an outlier    (7)
If we assume pi(x) ≃ p(x) for all i, we can write

µ(ωj|x) = (1/L) Σ_{i=1}^{L} pi(x|ωj)P(ωj)/p(x) = (P(ωj)/p(x)) · (1/L) Σ_{i=1}^{L} pi(x|ωj)    (8)

where j = t, o (i.e. (8) is applied to both the target and the outlier class). In this case we can compute

yavg(x) = (1/L) Σ_{i=1}^{L} pi(x|ωt)    (9)

θ = (P(ωo)/P(ωt)) · (1/L) Σ_{i=1}^{L} θi    (10)

and the decision criterion (7) becomes simply

yavg(x) < θ  ⇒  x is an outlier    (11)
which means that we can combine the class-conditional probability density functions instead of the a posteriori probabilities estimated by each classifier. The resulting yavg(x) can be used as a standard one-class classifier output, and the threshold θ can be tuned independently to attain the desired trade-off between false positives (i.e. target objects classified as outliers) and false negatives (i.e. outliers classified as belonging to the target class). This approach is (almost) exactly the one proposed in [32] and [33] and can be extended to the min, max and product rules.
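The decision rule (9)-(11) can be sketched in a few lines: given the class-conditional densities pi(x|ωt) estimated by the L classifiers for a pattern, together with their individual thresholds θi and the assumed priors, the pattern is rejected when the averaged density falls below the combined threshold. The numerical values below are illustrative.

def mean_rule_outlier(densities, thetas, p_target=0.985, p_outlier=0.015):
    """densities : [p_i(x|w_t)], one value per one-class classifier
       thetas    : [theta_i], the classifiers' individual thresholds
       Implements Eqs. (9)-(11): reject x if y_avg(x) < theta."""
    L = len(densities)
    y_avg = sum(densities) / L                               # Eq. (9)
    theta = (p_outlier / p_target) * sum(thetas) / L         # Eq. (10)
    return y_avg < theta                                     # Eq. (11): True = outlier

# Three one-class classifiers looking at the same connection (illustrative values)
print(mean_rule_outlier(densities=[2e-3, 5e-4, 1e-3], thetas=[1e-2, 2e-2, 1e-2]))  # False
print(mean_rule_outlier(densities=[1e-6, 3e-6, 2e-6], thetas=[1e-2, 2e-2, 1e-2]))  # True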
3.3 Experimental Results

Experiments were carried out on a subset of the DARPA 1998 dataset distributed as part of the UCI KDD Archive (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). The DARPA 1998 dataset was created by the MIT Lincoln Laboratory group in the framework of the 1998 Intrusion Detection Evaluation Program (http://www.ll.mit.edu/IST/ideval). This dataset was obtained from the network traffic produced by simulating the computer network of an air-force base. In order to perform experiments with unlabeled intrusion detection techniques, we removed the labels of all the training patterns to simulate the unlabeled collection of network traffic. Nominal features were converted into numerical values according to the procedure in [9].

According to the description of the modular architecture presented in Sect. 3.1, we divided the traffic of the dataset into six subsets, each one related to "similar" services: HTTP, containing the traffic related to the HTTP protocol; FTP, containing the traffic related to the control and data flows of the FTP protocol, and the traffic related to the TFTP protocol; Mail, containing the traffic related to the SMTP, POP2, POP3, NNTP and IMAP4 protocols; ICMP, containing the traffic related to the ICMP protocol; Private&Other, containing the traffic related to TCP/UDP ports higher than 49,152; and Miscellaneous, containing all the remaining traffic. For each module, the features taking a constant value for all patterns were discarded, provided that these features have a constant value by "definition" for that service, and not by chance.

As mentioned above, the prior probabilities of the attack classes in the training portion of the KDD-Cup 1999 dataset cannot be considered representative of the traffic in a real network. In the experiments, we reduced the overall percentage of attacks to 1.5% of all the training patterns, so that the resulting training set is made up of 97,277 patterns of normal traffic and 1,482 patterns related to attacks. It is worth noting that attacks are not distributed uniformly among the different services, as the percentage of attacks related to different services ranges from 0.17% for the HTTP and Mail traffic to 30.34% for the ICMP traffic. On the other hand, the test set contains a very large fraction of attacks, as it was designed to test the performance of IDS and not to be representative of realistic network traffic. In particular, 60,593 patterns are related to normal traffic, while 248,816 patterns are related to attacks. More details on the dataset can be found in [12].

Performance Evaluation

We divided the performance evaluation experiments into two phases. In the first phase, we evaluated the performance of one module of the IDS at a time. In particular, for each module the performance of a "monolithic" classifier is compared to the performance attained by combining classifiers trained on distinct feature subsets (see Sect. 3.1). Table 1 summarizes the results on the test set in terms of the Area Under the Curve (AUC) for the ν-SVC, the k-means, the Parzen classifier and the clustering algorithm proposed in [9], respectively. For each algorithm, the parameters were tuned on the training set. It is worth noting that in the case of the ICMP protocol only intrinsic and traffic features were available, thus only the third
kind of experiment could be performed, by combining two one-class classifiers trained on intrinsic and traffic features, respectively.

Table 1. Performance attained by the ν-SVC, k-means, Parzen and Cluster classifiers on the six modules in terms of AUC. Only the combinations that attained the best result for at least one module are shown (the best performance in each column is marked with *)

                          Features      HTTP    FTP     Mail    ICMP    Private  Miscell
ν-SVC                     F             0.995*  0.894   0.971*  0.862   0.922    0.987
ν-SVC - max rule          I+C+T         0.807   0.566   0.956   0.929*  0.918    0.939
ν-SVC - min rule          I+C+T         0.773   0.973*  0.954   0.913   0.904    0.944
ν-SVC - mean rule         I∪C + I∪T     0.952   0.962   0.970   -       0.957    0.965
ν-SVC - mean rule         I+C+T         0.865   0.972   0.953   0.879   0.921    0.988*
k-means - max rule        I∪C + I∪T     0.864   0.874   0.926   -       0.917    0.974
k-means - max rule        I+C+T         0.872   0.335   0.930   0.913   0.917    0.889
k-means - min rule        I+C+T         0.814   0.926   0.630   0.750   0.907    0.284
k-means - mean rule       I∪C + I∪T     0.859   0.778   0.913   -       0.965    0.932
k-means - product rule    I∪C + I∪T     0.858   0.777   0.913   -       0.965    0.932
Parzen                    F             0.977   0.878   0.932   0.743   0.921    0.982
Parzen - max rule         I+C+T         0.858   0.368   0.581   0.872   0.864    0.909
Parzen - min rule         I∪C + I∪T     0.987   0.868   0.940   -       0.988*   0.974
Parzen - mean rule        I+C+T         0.858   0.867   0.582   0.872   0.891    0.909
Parzen - product rule     I+C+T         0.959   0.924   0.941   0.725   0.891    0.898
Cluster                   F             0.967   0.839   0.891   0.739   0.847    0.973
Cluster - max rule        I+C+T         0.740   0.478   0.949   0.918   0.390    0.141
Cluster - mean rule       I∪C + I∪T     0.932   0.829   0.962   -       0.915    0.876
Cluster - mean rule       I+C+T         0.983   0.874   0.970   0.872   0.847    0.958
In the second phase, the modules related to the different services are combined, and the performance of the overall IDS is evaluated. Performance evaluation was carried out by Receiver Operating Characteristic (ROC) curve analysis, i.e. by computing the detection rate as a function of the false alarm rate. Different ROCs can be compared by computing the AUC. The AUC measures the average performance of the related classifier, so that the larger the AUC of a classifier, the higher its performance [33]. In order to analyze the performance of the overall IDS, we built three systems: (1) an "optimal" system made up, for each module, of the classification technique that provided the highest value of the AUC, according to Table 2; (2) a system made up of one "monolithic" ν-SVC for each module; we chose ν-SVC classifiers because on average they provide better results than the other considered classifiers; (3) as in the second system, we chose ν-SVC classifiers, but for each module we chose between the "monolithic" and the MCS approach, according to the best performance results reported in Table 1. In order to evaluate the performance of the three IDS systems, we computed some working points according to the heuristic proposed in Sect. 3.1.
Table 2. Summary of the best results in terms of AUC attained for each module

          HTTP    FTP     Mail       ICMP       Private      Miscell
          ν-SVC   ν-SVC   ν-SVC      ν-SVC      Parzen       ν-SVC
          -       -       min rule   max rule   min rule     mean rule
          F       F       I+C+T      I+C+T      I∪C+I∪T      I+C+T
The attained results are reported in Table 3. In particular, if the false alarm rate is set to 1%, the algorithms trained on the entire training set provide a detection rate near 18% (see [12] for more details), while the proposed modular approaches provide detection rates from 67% to 79% (see Table 3). As the effectiveness of an IDS depends on its capability of providing high detection rates at small false alarm rates, the proposed modular approaches are very effective compared to the “monolithic” approaches. A low false alarm rate is fundamental in practical applications, and for small false alarm rates the MCS approach evidently outperforms the monolithic one. This result is even more evident if we compare the Bayesian detection rates for the different approaches at a fixed false positive rate. Let us denote by A and I the occurrence of an alarm and an intrusion, respectively. We may choose a reasonable value for the false positive rate, P(A|¬I) = 0.01. The a priori probabilities are P(I) = 0.015 and P(¬I) = 0.985. In the case of the monolithic approach using the clustering algorithm proposed in [9], the detection rate is P(A|I) = 0.1837, and the Bayesian detection rate is P(I|A) = 0.2186. In the case of the monolithic ν-SVC, P(A|I) = 0.1791, and the Bayesian detection rate is P(I|A) = 0.2143. On the other hand, in the case of the proposed modular approach with the ν-SVC classifier, P(A|I) = 0.6796, and the obtained Bayesian detection rate is P(I|A) = 0.5085, which is much higher than the Bayesian detection rate attained using the monolithic approach. Although additional efforts are needed to further increase the Bayesian detection rate, the modular approach is promising and should be considered as a basic scheme for the development of more accurate anomaly detection systems.

Table 3. Results attained by the three proposed modular systems. FAR and DR stand for the False Alarm Rate and Detection Rate, respectively. ‘Best Modules’ and ‘Best ν-SVC Modules’ mean the best modules in terms of AUC

        Best Modules        ν-SVC               Best ν-SVC Modules
        FAR      DR         FAR      DR         FAR      DR
        0.87%    75.34%     0.91%    67.31%     0.88%    79.27%
        2.10%    80.35%     2.06%    75.61%     2.07%    89.45%
        2.64%    80.80%     2.65%    77.10%     2.66%    89.67%
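The Bayesian detection rates quoted above follow directly from Bayes' theorem with P(I) = 0.015 and P(A|¬I) = 0.01; the short sketch below is our own illustration and reproduces them.

```python
# Sketch reproducing the Bayesian detection rates quoted in the text.
def bayesian_detection_rate(p_a_given_i, p_i=0.015, p_false_alarm=0.01):
    """P(I|A) = P(A|I)P(I) / (P(A|I)P(I) + P(A|not I)P(not I))."""
    p_not_i = 1.0 - p_i
    return (p_a_given_i * p_i) / (p_a_given_i * p_i + p_false_alarm * p_not_i)

for name, detection_rate in [("monolithic clustering", 0.1837),
                             ("monolithic nu-SVC", 0.1791),
                             ("modular nu-SVC", 0.6796)]:
    print(name, round(bayesian_detection_rate(detection_rate), 4))
# prints approximately 0.2186, 0.2143 and 0.509, matching the values above up to rounding
```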
4 Conclusion
In this paper we discussed the application of the MCS approach to the design of intrusion detection systems. Such an approach may be used to cope with the complexity of the classification problem by decomposing it into complementary sub-problems. Generally, an MCS design allows for increasing the performance of an IDS, i.e. attaining a high detection rate and a low false alarm rate, with respect to the monolithic approach. In the intrusion detection domain it is expensive, in time and effort, to assign a label to each analyzed pattern. Also, there is the need to recognize new threats through the analysis of these patterns. Therefore, recent works have focused on both unsupervised training and anomaly-based detection techniques. On the other hand, the model selection performed by unsupervised training techniques may be controlled by an adversary by submitting well-crafted malicious patterns. We noted that the MCS approach may help to enhance the robustness of a system against this malicious noise. However, more rigorous research is necessary. In contrast, in practical applications it has been shown that MCSs are more robust in an adversarial environment during the operational phase. Three different MCS design schemes for intrusion detection proposed in the literature have been discussed. Each one shows specific peculiarities. Furthermore, a detailed example of an anomaly-based IDS based on unsupervised training has been discussed. A problem with MCSs is that classifier outputs must be uniform and comparable to be combined. A technique aimed at mapping the outputs of one-class classifiers to density functions is used in this IDS. In this way, the outputs of different classifiers can be combined using well-known fixed combination techniques developed for multi-class classifiers. Results confirmed the validity of the MCS approach for the intrusion detection problem. Finally, to the best of our knowledge there are no works that have investigated the advantages of MCSs for the implementation of reliable and specific alert verification mechanisms. This post-processing task appears promising for decreasing the number of false alarms, since classification errors cannot be avoided.
Acknowledgements
The authors would like to thank Roberto Perdisci and Mauro Del Rio, who allowed us to complete this paper with an excerpt of their previous work [12], presented in Sect. 3.
References 1. Ariu D, Giacinto G, Perdisci R (2007) Sensing attacks in computers networks with hidden markov models. In: Perner P (ed) Proc the 5th Int Conf
Mach Learn Data Mining in Pattern Recognition, Leipzig, Germany. Springer, Berlin/Heidelberg, pp 449–463
2. Baum LE, Petrie T, Soules G, Weiss N (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat 41:164–171
3. Cohen I, Cozman FG, Sebe N, Cirelo MC, Huang T (2004) Semi-supervised learning of classifiers: theory, algorithms and their applications to human-computer interaction. IEEE Trans Pattern Analysis and Mach Intell 26:1553–1567
4. Cordella LP, Limongiello A, Sansone C (2004) Network intrusion detection by a multi-stage classification system. In: Roli F, Kittler J, Windeatt T (eds) Proc the 5th Int Workshop Multiple Classifier Syst, Cagliari, Italy. Springer, Berlin/Heidelberg, pp 324–333
5. Debar H, Becker M, Siboni D (1992) A neural network component for an intrusion detection system. In: Proc 1992 IEEE Symp Research in Security and Privacy, Oakland, CA, USA. IEEE Computer Society, Los Alamitos, pp 240–250
6. Denning DE (1987) An intrusion-detection model. IEEE Trans Software Engin 13:222–232
7. Duda RO, Hart PE, Stork DG (2000) Pattern classification. Wiley-Interscience, Hoboken
8. Elkan C (2000) Results of the KDD'99 classifier learning. ACM SIGKDD Explorations 1:63–64
9. Eskin E, Arnold A, Prerau M, Portnoy L, Stolfo S (2002) A geometric framework for unsupervised anomaly detection: detecting intrusions in unlabeled data. In: Barbara D, Jajodia S (eds) Applications of Data Mining in Computer Security. Springer, Berlin/Heidelberg
10. Giacinto G, Roli F, Didaci L (2003) A modular multiple classifier system for the detection of intrusions in computer networks. In: Windeatt T, Roli F (eds) Proc the 4th Int Workshop Multiple Classifier Syst, Guildford, UK. Springer, Berlin/Heidelberg, pp 346–355
11. Giacinto G, Roli F, Didaci L (2003) Fusion of multiple classifiers for intrusion detection in computer networks. Pattern Recognition Letters 24:1795–1803
12. Giacinto G, Perdisci R, Del Rio M, Roli F (2008) Intrusion detection in computer networks by a modular ensemble of one-class classifiers. Inf Fusion 9:69–82
13. Jain AK, Dubes RC (1998) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs
14. Javits H, Valdes A (1993) The NIDES statistical component: description and justification. SRI Annual Rep A010, Comp Sci Lab, SRI Int
15. Kittler J, Hatef M, Duin RPW, Matas J (1998) On combining classifiers. IEEE Trans Pattern Analysis and Mach Intell 20:226–229
16. Kruegel K, Vigna G, Robertson W (2005) A multi-model approach to the detection of web-based attacks. Int J Comp Telecomm Networking 48:717–738
17. Kuncheva L, Bezdek JC, Duin RPW (2001) Decision templates for multiple classifier fusion. Pattern Recognition 34:299–314
18. Kuncheva L (2004) Combining pattern classifiers: methods and algorithms. Wiley-Interscience, Hoboken
19. Lee W, Stolfo S (2000) A framework for constructing features and models for intrusion detection systems. ACM Trans Inf Syst Security 3:227–261
20. Lee W, Xiang D (2001) Information-theoretic measures for anomaly detection. In: Proc 2001 IEEE Symp Security and Privacy, Oakland, CA, USA. IEEE Computer Society, Los Alamitos, pp 130–143
21. Leung K, Leckie C (2005) Unsupervised anomaly detection in network intrusion detection using clusters. In: Estivill-Castro V (ed) Proc the 28th Australasian Comp Sci Conf, Newcastle, NSW, Australia. Australian Computer Society, pp 333–342
22. Lippmann R, Haines JW, Fried DJ, Korba J, Das K (2000) The 1999 DARPA off-line intrusion detection evaluation. Computer Networks 34:579–595
23. Mahoney MV, Chan PK (2003) An analysis of the 1999 DARPA/Lincoln laboratory evaluation data for network anomaly detection. In: Vigna G, Jonsson E, Krügel C (eds) Proc 6th Int Symp Recent Advances in Intrusion Detection, Pittsburgh, PA, USA. Springer, Berlin/Heidelberg, pp 220–237
24. McHugh J, Christie A, Allen J (2000) Defending yourself: the role of intrusion detection systems. IEEE Software 17:42–51
25. McHugh J (2000) Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA intrusion detection system evaluations as performed by Lincoln laboratory. ACM Trans Inf Syst Security 3:262–294
26. Mutz D, Kruegel C, Robertson W, Vigna G, Kemmerer RA (2005) Reverse engineering of network signatures. In: Clark A, Kerr K, Mohay G (eds) Proc the 4th AusCERT Asia Pacific Inf Technology Security Conf, Gold Coast, QE, Australia, pp 1–12
27. Patton S, Yurcik W, Doss D (2001) An Achilles' heel in signature-based IDS: squealing false positives in SNORT. In: Lee W, Mé L, Wespi A (eds) Proc the 4th Int Symp Recent Advances in Intrusion Detection, Davis, CA, USA. Springer, Berlin/Heidelberg
28. Perdisci R (2006) Statistical pattern recognition techniques for intrusion detection in computer networks: challenges and solutions. PhD Thesis, University of Cagliari, Cagliari
29. Portnoy L, Eskin E, Stolfo S (2001) Intrusion detection with unlabeled data using clustering. In: Proc ACM CSS Workshop Data Mining Applied to Security, Philadelphia, PA, USA, pp 76–105
30. Proctor PE (2001) Practical intrusion detection handbook. Prentice-Hall, Upper Saddle River
31. Schölkopf B, Platt J, Shawe-Taylor J, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comp 13:1443–1471
32. Tax DMJ, Duin RPW (2001) Combining one-class classifiers. In: Kittler J, Roli F (eds) Proc the 2nd Multiple Classifier Syst, Cambridge, UK. Springer, Berlin/Heidelberg, pp 299–308
33. Tax DMJ (2001) One-class classification: concept learning in the absence of counter examples. PhD Thesis, Delft University of Technology, Delft
34. Vapnik V (1998) Statistical learning theory. Wiley, Hoboken
35. Vigna G, Robertson W, Balzarotti D (2004) Testing network-based intrusion detection signatures using mutant exploits. In: Atluri V, Pfitzmann B, McDaniel PD (eds) Proc the 11th ACM Conf Comp and Communications Security, Washington DC, USA. ACM, New York, pp 21–30
36. Wang K, Stolfo SJ (2004) Anomalous payload-based network intrusion detection. In: Jonsson E, Valdes A, Almgren M (eds) Proc the 7th Int Symp
Recent Advances on Intrusion Detection, Sophia Antipolis, France. Springer, Berlin/Heidelberg, pp 203–222
37. Yurcik W (2002) Controlling intrusion detection systems by generating false positives: squealing proof-of-concept. In: Proc the 27th Annual IEEE Conf Local Comp Networks, Tampa, FL, USA. IEEE Computer Society, Los Alamitos, pp 134–135
38. Xu L, Krzyzak A, Suen CY (1992) Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Trans Syst Man Cybern 22:418–435
Ensembles of Nearest Neighbors for Gene Expression Based Cancer Classification
Oleg Okun1 and Helen Priisalu2
1 University of Oulu, P.O.Box 4500, FI-90014 Oulu, Finland, [email protected]
2 Teradata, Valkijärventie 7E, FI-2130 Espoo, Finland, [email protected]
Summary. Gene expression levels are useful in discriminating between cancer and normal examples and/or between different types of cancer. In this chapter, ensembles of k-nearest neighbors are employed for gene expression based cancer classification. The ensembles are created by randomly sampling subsets of genes, assigning each subset to a k-nearest neighbor (k-NN) to perform classification, and finally, combining k-NN predictions with majority vote. Selection of subsets is governed by the statistical dependence between dataset complexity and classification error, confirmed by the copula method, so that least complex subsets are preferred since they are associated with more accurate predictions. Experiments carried out on six gene expression datasets show that our ensemble scheme is superior to a single best classifier in the ensemble and to the redundancy-based filter, especially designed to remove irrelevant genes. Key words: Ensemble of classifiers, k-nearest neighbor, gene expression, cancer classification, dataset complexity, copula, bolstered error
1 Introduction
Gene expression is a two-stage process including the transcription of deoxyribonucleic acid (DNA) into messenger ribonucleic acid (mRNA), which is then translated into protein by the ribosome. When a protein is produced, a gene is said to be expressed. Proteins are large compounds of amino acids joined together in a chain; they are essential parts of organisms and participate in every process within cells. Cancer classification based on gene expression levels is one of the topics of intensive research in bioinformatics, since it was shown in numerous works [1, 2, 3] that expression levels provide valuable information for discrimination between normal and cancer examples. However, the classification task is not easy since there are typically thousands of expression levels versus a few dozen examples. In addition, expression levels are noisy due to the complex procedures and technologies involved
in the measurements of gene expression levels, thus causing ambiguity in classification. Hence, the original set of genes must be reduced to those genes that are relevant to discrimination between different classes. This operation is called feature or gene selection3 . Genes preserved or selected as a result of feature selection are then used to classify data. Basically, general feature selection methods widely applied to the datasets with many more examples than features can be (and they are) utilized for gene selection, too. These methods can be approximately divided into filters and wrappers. Wrappers base their decisions on which feature(s) to select by employing a classifier. For small sample size gene expression datasets, they can easily introduce the induction bias when some genes are preferred over the others. Since those preferred genes might not be always relevant to cancer, we turn our attention to the filters that do gene selection independently of any classifier and solely based on data characteristics. Hence, they are less prone to the induction bias. This brief analysis prompted us to concentrate on random gene selection where genes to be used with a classifier are randomly sampled from the original set of genes, irrespectively of class information and a classifier. The additional fact that caused us to make such a decision was the work [4], where it was concluded that differences in classification performance among feature selection algorithms are less significant than performance differences among the error estimators used to implement these algorithms. In other words, the way of how error is computed has a larger influence on classification accuracy than the choice of a feature selection algorithm. Among several error estimators we opted for the bolstered resubstitution error because it provides a low-bias, low-variance estimate of classification error, which is what is needed for high dimensional gene expression data [5]. However, a single random sample cannot guarantee that sampled genes will lead to good classification results. Hence, we need to sample genes several times to be more certain about the outcome, which, in turn, implies several classifications have to be done. Thus, it is natural to combine predictions of several classifiers into a single prediction. Such a scheme is termed an ensemble of classifiers in the literature [6]. It is well known that under certain conditions an ensemble can outperform its most accurate member. In the context of dimensionality reduction, an ensemble composed of a small number of classifiers, each working with a small subset of genes, results in the desired effect. For instance, if the original set comprises 1000 genes, five classifiers, each employing 20 genes4 , lead to a 10-fold dimensionality reduction. Thus, using an ensemble instead of a single classifier can be beneficial for both dimensionality reduction and classification performance.
3 Further, the words ‘feature’ and ‘gene’ will be used interchangeably, since they have the same meaning in this work.
4 Let us assume that there is no overlap between different subsets of genes.
As a base classifier in the ensemble, a k-nearest neighbor (k-NN) is used because it performed well for cancer classification compared to more sophisticated classifiers [7]. Besides, it is a simple method that has a single parameter (the number of nearest neighbors) to be pre-defined, given that the distance metric is Euclidean. Six gene expression datasets containing different types of cancer are utilized in our work. We begin with the recently proposed redundancy-based filter [8], especially designed to filter out irrelevant genes. As this filter turned out to be quite aggressive in removing genes, we proceed to experiment with k-NN ensembles. In particular, we compare ensembles created based on the concept called dataset complexity. Based on the copula method [9, 10, 11], which is useful in exploring association (dependence or concordance) relations in multivariate data, we hypothesize that there is positive dependence between dataset complexity measured by the Wilcoxon rank sum statistic [12] and the bolstered resubstitution error [5], with low (high) complexity associated with small (large) error. As a result, selecting a low-complexity subset of genes implies an accurate k-NN, which, in turn, implies an accurate k-NN ensemble. Experimental results clearly favor the complexity-based scheme of k-NN ensemble generation over 1) a single best k-NN in the ensemble and 2) the redundancy-based filter. The chapter has the following structure. Section 2 describes the gene expression datasets employed in the experiments. The redundancy-based filter is briefly introduced in Section 3. Section 4 defines the dataset complexity characteristic while Section 5 defines the bolstered resubstitution error. The link between the two is explored and analyzed in Section 6. Ensemble generation is presented in Section 7 and experimental results obtained on six datasets are given in Section 8. Finally, Section 9 concludes the chapter.
2 Gene Expression Datasets 2.1 SAGE Dataset 1 In this dataset, there are expressions of 822 genes in 74 cases (24 cases are normal while 50 cases are cancerous) [13]. The dataset contains 9 different types of cancer. We decided to ignore the difference between cancer types and to treat all cancerous cases as belonging to a single class. No preprocessing was done. 2.2 Colon Dataset This dataset introduced in [1] contains expressions of 2000 genes for 62 cases (22 normal and 40 colon tumor cases). Preprocessing includes the logarithmic transformation to base 10, followed by normalization to zero mean and unit variance as usually done with this dataset.
2.3 Brain Dataset This dataset introduced in [2] contains two classes of brain tumor. The dataset (also known as Dataset B) contains 34 medulloblastoma cases, 9 of which are desmoplastic and 25 are classic. Preprocessing consists of thresholding of gene expressions with a floor of 20 and ceiling of 16000; filtering with exclusion of genes with max / min ≤ 3 or max − min < 100, where max and min refer to the maximum and minimum expressions of a certain gene across the 34 cases, respectively; base 10 logarithmic transformation; normalization across genes to zero mean and unit variance. As a result, 5893 out of 7129 original genes are only retained. 2.4 SAGE Dataset 2 This is a larger SAGE dataset containing 31 normal and 59 cancer (10 types of cancer) cases with 27679 expressed genes. As with the smaller dataset, no preprocessing was done and all cancer types were assigned to a single class. 2.5 Prostate Dataset 1 This dataset introduced in [3] includes the expressions of 12600 genes in 52 prostate and 50 normal cases. No preprocessing of the data was done. 2.6 Prostate Dataset 2 This dataset was obtained independently of the one described in the previous section. It has 25 prostate and 9 normal cases with 12600 expressed genes. No preprocessing was done.
3 Redundancy-Based Filter
The redundancy-based filter (RBF) [8] is based on the concept of an approximate Markov blanket (AMB). Finding the complete Markov blanket is computationally prohibitive for high dimensional gene expression data, thus the approximation is used instead. The goal is to find for each gene Fi an AMB Mi that subsumes the information content of Fi. In other words, if Mi is a true Markov blanket for Fi, the class C is conditionally independent of Fi given Mi, i.e. p(C|Fi, Mi) = p(C|Mi). To efficiently find an AMB, two types of correlations are employed: 1) individual C-correlation between a gene Fi and the class C and 2) combined C-correlation between a pair of genes Fi and Fj (i ≠ j) and the class C. Both correlations are defined through symmetrical uncertainty SU(X, C), where X is either Fi (individual C-correlation) or Fi,j (combined C-correlation), with SU(X, C) defined as
SU(X, C) = 2 · IG(X|C) / (H(X) + H(C)),
where H(·) is entropy and IG(X|C) = H(X) − H(X|C) is the information gain from knowing the class information. SU is a normalized characteristic whose values lie between 0 and 1, where 0 indicates that X and C are independent. To reduce the variance and noise of the original data, continuous expression levels were converted to nominal values -1, 0, and +1, representing the under-expression, baseline, and over-expression of genes, which correspond to (−∞, µ − σ/2), [µ − σ/2, µ + σ/2], and (µ + σ/2, +∞), respectively, with µ and σ being the mean and the standard deviation of all expression levels for a given gene. For nominal variables, the entropies needed in the formulas above are computed as follows:

H(X) = − Σ_i P(x_i) log2(P(x_i)),
H(X|C) = − Σ_k P(c_k) Σ_i P(x_i|c_k) log2(P(x_i|c_k)),

where P(x_i) is the probability that X = x_i and P(x_i|c_k) is the probability that X = x_i given C = c_k. For combined C-correlation, x_i in these formulas should be replaced with the pair (x_i, x_j) so that

H(X) = − Σ_{i,j} P(x_i, x_j) log2(P(x_i, x_j)),
H(X|C) = − Σ_k P(c_k) Σ_{i,j} P(x_i, x_j|c_k) log2(P(x_i, x_j|c_k)).
Since x_i (x_j) can take only three values, -1, 0, +1, there are nine pairs ((-1,-1), (-1,0), (-1,+1), (0,-1), (0,0), (0,+1), (+1,-1), (+1,0), (+1,+1)) for which probabilities (and hence, entropies) need to be evaluated. RBF starts by computing the individual C-correlation for each gene and sorting all correlations in descending order. The gene with the largest correlation is considered predominant (no AMB exists for it) and hence it is put into the list S of the selected genes and used to filter out other genes. After that, the iteration begins with picking the first gene Fi from S and proceeds as follows. For all remaining genes, if Fi forms an AMB for Fj, the latter is removed from further analysis. The following conditions must be satisfied for this to happen: 1) the individual C-correlation for Fi must be larger than or equal to the individual C-correlation for Fj, which means that a gene with a larger individual correlation provides more information about the class than a gene with a smaller individual correlation, and 2) the individual C-correlation for Fi must be larger than or equal to the combined C-correlation for Fi and Fj, which means that if combining Fi and Fj does not provide more discriminating power than Fi alone, Fj is deemed redundant. After one round of filtering, RBF
takes the next (according to the magnitude of individual C-correlation) still unfiltered gene and the filtering process is repeated again. Since a lot of genes are typically removed at each round (gene expression data contain a lot of redundancy) and removed genes do not participate in the next rounds, RBF is much faster than the typical hill climbing (greedy forward or backward search). Another attractive characteristic of RBF is that the number of genes to be selected is automatically found.
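To make the procedure concrete, here is a compact sketch of the discretization, symmetrical uncertainty, and AMB-based filtering loop described above. It is our simplified reading of RBF, not the reference implementation from [8]; all function names are ours, and the pair of genes is encoded as a single nominal variable for the combined C-correlation.

```python
import numpy as np

def discretize(expr):
    """Map continuous expression levels of one gene to {-1, 0, +1} around mean +/- std/2."""
    mu, sigma = expr.mean(), expr.std()
    out = np.zeros(len(expr), dtype=int)
    out[expr < mu - sigma / 2] = -1
    out[expr > mu + sigma / 2] = +1
    return out

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def symmetrical_uncertainty(x, c):
    """SU(X, C) = 2 * IG(X|C) / (H(X) + H(C)) for nominal x and class labels c."""
    h_x, h_c = entropy(x), entropy(c)
    h_x_given_c = sum((c == k).mean() * entropy(x[c == k]) for k in np.unique(c))
    ig = h_x - h_x_given_c
    return 0.0 if h_x + h_c == 0 else 2 * ig / (h_x + h_c)

def rbf_select(X, y):
    """Simplified redundancy-based filtering: a gene is dropped when an already
    selected gene forms an approximate Markov blanket for it (genes in columns)."""
    X = np.apply_along_axis(discretize, 0, X)
    ind = np.array([symmetrical_uncertainty(X[:, j], y) for j in range(X.shape[1])])
    order = list(np.argsort(-ind))                 # descending individual C-correlation
    selected = []
    while order:
        i = order.pop(0)
        selected.append(i)
        remaining = []
        for j in order:
            pair = X[:, i] * 3 + X[:, j]           # unique nominal code for the pair (x_i, x_j)
            combined = symmetrical_uncertainty(pair, y)
            if ind[i] >= ind[j] and ind[i] >= combined:
                continue                           # gene j is redundant given gene i
            remaining.append(j)
        order = remaining
    return selected
```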
4 Dataset Complexity
In general, the performance of classifiers is strongly data-dependent. To gain insight into a supervised classification problem (two-class problems are assumed), one can adopt dataset complexity characteristics. The goal of such characteristics is to provide a score reflecting how well classes of the data are separated. Given a set of features, the data of each class are projected onto Fisher's linear discriminant axis by using only these features (for details, see [14]). Projection coordinates then serve as input for the Wilcoxon rank sum test for equal medians [12] (the null hypothesis of this test is that the two medians are equal; the 5% significance level is used). Given a sample divided into two groups according to class membership, all the observations are ranked as if they were from a single sample and the rank sum statistic W is computed as the sum of the ranks in the smaller group. The value of the rank sum statistic is employed as a score characterizing the separability power of a given set of features. The higher this score, the larger the overlap in the projections of the two classes, i.e. the worse the separation between classes. To compare W coming from different datasets, each W can be normalized by the sum of all ranks, i.e. if N is the sample size, then the sum of all ranks will be Σ_{i=1}^{N} i = N(N + 1)/2. The normalized W lies between 0 and 1.
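A possible implementation of this complexity score for one candidate gene subset is sketched below (Fisher projection, then the normalized rank sum of the smaller class); the function name, the use of scipy, and the small regularization term are our own choices.

```python
import numpy as np
from scipy.stats import rankdata

def complexity_score(X, y):
    """Normalized Wilcoxon rank sum W of Fisher-discriminant projections.
    X: (n_samples, n_features) restricted to the candidate gene subset,
    y: binary labels (0/1). Higher score = larger class overlap."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.atleast_2d(np.cov(X0, rowvar=False)) + np.atleast_2d(np.cov(X1, rowvar=False))
    Sw = Sw + 1e-6 * np.eye(X.shape[1])          # regularize for tiny sample sizes (our addition)
    w = np.linalg.solve(Sw, m1 - m0)             # Fisher direction
    proj = X @ w
    ranks = rankdata(proj)                       # rank all projections as one sample
    small = 0 if len(X0) <= len(X1) else 1
    W = ranks[y == small].sum()                  # rank sum of the smaller group
    n = len(y)
    return W / (n * (n + 1) / 2)                 # normalize by the sum of all ranks
```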
5 Bolstered Resubstitution Error
This is a low-variance and low-bias classification error estimation method proposed in [5]. Unlike the cross-validation techniques reserving a part of the original data for testing, it permits using the whole dataset for error estimation. Since the sample size of gene expression datasets is very small compared to the data dimensionality, using all available data is an important positive factor. However, one should be aware of the effect of overfitting in this case, when a classifier demonstrates excellent performance on the training data but fails on independent unseen data. Braga-Neto and Dougherty [5] avoided this pitfall by randomly generating a number of artificial points (examples) in the neighborhood of each training point. These artificial examples then act as a test set, and the classification error on this set is called bolstered. In this chapter,
we utilize the bolstered variant of the conventional resubstitution error known as bolstered resubstitution error. Briefly, bolstered resubstitution error is estimated as follows [5]. Let A0 and A1 be the two decision regions corresponding to the classification generated by a given algorithm, N be the number of training points, and M_MC be the number of random samples drawn from the D-variate normal distribution per training point (M_MC = 10 as advocated in [5]). The bolstered resubstitution error is then defined as

ε_bresub ≈ (1 / (N · M_MC)) Σ_{i=1}^{N} ( Σ_{j=1}^{M_MC} I(x_ij ∈ A1) I(y_i = 0) + Σ_{j=1}^{M_MC} I(x_ij ∈ A0) I(y_i = 1) ),   (1)

where {x_ij}, j = 1, ..., M_MC, are samples drawn from the spherical Gaussian density 1/((2π)^(D/2) σ_i^D) · exp(−||x − x_i||² / (2σ_i²)). The bolstered resubstitution error is thus equal to the sum of all error contributions divided by the number of points. Samples are drawn based on the Marsaglia polar normal random number generator [15]. In a 2-D space, samples come from a circle centered at a particular training point. In a D-dimensional case, they are drawn from a hypersphere. Hence, the radius of this hypersphere, determined by σ_i, is of importance since its selection amounts to choosing the degree of bolstering. Typically, σ_i should vary from point to point in order to be robust to the data. In [5], σ_i = d̂(y_i)/c_p for i = 1, ..., N, where d̂(y_i) is the mean minimum distance between points belonging to the class of y_i (y_i can be either 0 or 1; d̂(y_i) is determined by first computing the minimum distance from each point x_i to all other points x_j, j ≠ i, of the same class as that of x_i, and then averaging the obtained minimum distances), and c_p is a constant called the correction factor, defined as the inverse of the chi-square cdf (cumulative distribution function) with parameters 0.5 and D, because interpoint distances in the Gaussian case are distributed as a chi random variable with D degrees of freedom. Thus, c_p is a function of the data dimensionality. The parameter 0.5 is chosen so that points inside a hypersphere will be evenly sampled.
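The following sketch estimates the bolstered resubstitution error of a k-NN along the lines of (1). It is our own illustration: scikit-learn's k-NN stands in for the classifier, and taking the square root of the chi-square quantile to obtain the chi quantile for c_p is our assumption about the description above.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import chi2
from sklearn.neighbors import KNeighborsClassifier

def bolstered_resubstitution_error(X, y, k=3, m_mc=10, rng=None):
    """Monte-Carlo bolstered resubstitution error for a k-NN (two classes, y in {0, 1})."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)

    # sigma_i = dhat(y_i) / c_p, dhat(y_i) being the mean minimum within-class distance.
    c_p = np.sqrt(chi2.ppf(0.5, d))   # assumption: chi quantile = sqrt of chi-square quantile
    sigma = np.empty(n)
    for c in np.unique(y):
        Xc = X[y == c]
        D = cdist(Xc, Xc)
        np.fill_diagonal(D, np.inf)
        sigma[y == c] = D.min(axis=1).mean() / c_p

    errors = 0
    for i in range(n):
        samples = X[i] + sigma[i] * rng.standard_normal((m_mc, d))   # bolstering kernel around x_i
        errors += np.sum(clf.predict(samples) != y[i])
    return errors / (n * m_mc)
```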
6 Dataset Complexity and Classification Error
Our main idea to build ensembles of k-NNs is based on the hypothesis that the dataset complexity and the bolstered resubstitution error are related. To verify our hypothesis, 10000 feature subsets were randomly sampled for each dataset (subset size ranged from 1 to 50) and both the complexity and the bolstered resubstitution error for 3-NN were computed. Anomalous complexity values lying three standard deviations from the average complexity were treated as outliers and therefore removed from further analysis. The result of such simulation is shown in Figs. 1-6 together with marginal histograms for each variable. It can be observed that univariate distributions vary from dataset to dataset and
often they are non-Gaussian. However, the dependence between complexity and error is clearly detectable when looking at Figs. 1–6.
Fig. 1. (SAGE 1) Bivariate distribution of normalized complexity and bolstered resubstitution error and univariate marginal histograms
To quantify this dependence, the rank correlation coefficients Spearman's ρ and Kendall's τ were computed (see Table 1) and a test of positive correlation at the 0.05 significance level was performed, which confirmed the existence of such correlation (all p-values were equal to zero). The rank correlations measure the degree to which large (small) values of one random variable correspond to large (small) values of another variable (concordance relations7 among variables). They are useful descriptors in our case since high (low) complexity implies that the data are difficult (easy) to accurately classify, which, in turn, means high (low) classification error. Unlike the linear correlation coefficient, ρ and τ are preserved under any monotonic (strictly increasing) transformation of the underlying random variables. To explore dependence relations, we employed the copula method [9, 10, 11]. Copulas are functions that describe dependencies among variables and allow modeling correlated multivariate data by combining univariate distributions. A copula is a multivariate probability distribution, where each random variable has a uniform marginal distribution on the interval [0,1]. The depen-
7 Since the definitions of these relations by ρ and τ are different, there is a difference in absolute values in Table 1.
Fig. 2. (Colon) Bivariate distribution of normalized complexity and bolstered resubstitution error and univariate marginal histograms
Fig. 3. (Brain) Bivariate distribution of normalized complexity and bolstered resubstitution error and univariate marginal histograms
Fig. 4. (SAGE 2) Bivariate distribution of normalized complexity and bolstered resubstitution error and univariate marginal histograms
Fig. 5. (Prostate 1) Bivariate distribution of normalized complexity and bolstered resubstitution error and univariate marginal histograms
Fig. 6. (Prostate 2) Bivariate distribution of normalized complexity and bolstered resubstitution error and univariate marginal histograms

Table 1. Spearman's ρ and Kendall's τ estimated for all datasets

Dataset no.        ρ        τ
1 (SAGE 1)       0.8570   0.6685
2 (Colon)        0.9001   0.7542
3 (Brain)        0.8176   0.6919
4 (SAGE 2)       0.8170   0.6319
5 (Prostate 1)   0.8982   0.7252
6 (Prostate 2)   0.6294   0.5365
dence between random variables is completely separated from the marginal distributions in the sense that random variables can follow any marginal distributions, and still have the same rank correlation. Though there are multivariate copulas, we will only talk about bivariate ones since our dependence relation includes two variables. Sklar’s theorem, which is the foundation theorem for copulas, states that for a given joint multivariate distribution function H(x, y) = P (X ≤ x, Y ≤ y) of a pair of random variables X and Y and the relevant marginal distributions F (x) = P (X ≤ x) and G(y) = P (Y ≤ y), there exists a copula function C relating them, i.e. H(x, y) = C(F (x), G(y)). If F and G are continuous, C is unique. Otherwise, C is uniquely determined on RanX×RanY , where ‘RanX’ (‘RanY’) stands for the range of X (Y ).
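As a concrete illustration of the transformation to the copula scale and of the rank correlations used here, the sketch below maps both variables to (0, 1) through their empirical cdfs (a simple stand-in for the kernel cdf estimator mentioned above) and computes ρ, τ, and the θ used later for the Frank copula; the function names are ours.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr, kendalltau

def to_copula_scale(x):
    """Empirical-cdf transform to (0, 1); a crude substitute for a kernel cdf estimator."""
    return rankdata(x) / (len(x) + 1)

def rank_dependence(complexity, error):
    """complexity[i], error[i]: normalized W and bolstered error of the i-th gene subset."""
    u, v = to_copula_scale(complexity), to_copula_scale(error)
    rho, _ = spearmanr(complexity, error)      # unchanged by the monotone transform above
    tau, _ = kendalltau(complexity, error)
    theta = np.corrcoef(u, v)[0, 1]            # Pearson correlation of (U, V), the chapter's choice of theta
    return rho, tau, theta
```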
Thus, a copula is a function C from I² to I with the following properties [10]:
1. For every u, v in I, C(u, 0) = 0 = C(0, v) and C(u, 1) = u, C(1, v) = v.
2. For every u1, u2, v1, v2 in I such that u1 ≤ u2 and v1 ≤ v2, C(u2, v2) − C(u2, v1) − C(u1, v2) + C(u1, v1) ≥ 0.
If F and G are continuous, the following formula is used to construct copulas from the joint distribution functions: C(u, v) = H(F^{−1}(u), G^{−1}(v)) [10], where F^{−1} means a quasi-inverse of F, G^{−1} means a quasi-inverse of G, and U and V are uniform random variables distributed between 0 and 1. That is, the typical copula-based analysis of multivariate (or bivariate) data starts with the transformation from the (X, Y) domain to the (U, V) domain, and all manipulations with the data are then done in the latter. Such a transformation to the copula scale (unit square I²) can be achieved through a kernel estimator of the cumulative distribution function (cdf). After that, the copula function C(u, v) is generated according to the appropriate definition for a certain copula family (see, e.g., (2) below). In [16] it was shown that Spearman's ρ and Kendall's τ can be expressed solely in terms of the copula function as follows:

ρ = 12 ∫∫ (C(u, v) − uv) du dv = 12 ∫∫ C(u, v) du dv − 3,
τ = 4 ∫∫ C(u, v) dC(u, v) − 1,

where integration is over I². The integrals in these formulas can be interpreted as the expected value of the function C(u, v) of uniform [0,1] random variables U and V whose joint distribution function is C, i.e.

ρ = 12 E(UV) − 3,   τ = 4 E(C(U, V)) − 1.

As a consequence, ρ for a pair of continuous random variables X and Y is identical to Pearson's linear correlation coefficient for the random variables U = F(X) and V = G(Y) [10]. In general, the choice of a particular copula may be based on the observed data. Among numerous copula families, we preferred the Frank copula belonging to the Archimedean family, based on the visual look of the plots in Figs. 1-6 and for dependence in the tail. Besides, this copula type permits negative as well as positive dependence. We are particularly concerned with lower tail dependence, when low complexity is associated with small classification error
as this forms the basis for ensemble construction in our approach. The Frank copula is a one-parameter copula (θ is the parameter, θ ∈ ]−∞, +∞[ \ {0}) defined for uniform variables U and V (both are defined over the unit interval) as

C_θ(u, v) = −(1/θ) ln[1 + (e^{−θu} − 1)(e^{−θv} − 1)/(e^{−θ} − 1)],   (2)

with θ determining the degree of dependence between the marginals (we set θ to Pearson's correlation coefficient between U and V so that as θ increases, the positive dependence also increases). Correlation coefficients measure the overall strength of the association, but give no information about how that varies across the distribution. The magnitude of τ or ρ is not an absolute indicator of such strength since for some distributions the attainable interval can be very small. Hence, additional characteristics of dependence structure are necessary. They are quadrant dependence, tail monotonicity, and stochastic monotonicity.
6.1 Quadrant Dependence
Random variables X and Y are positively quadrant dependent (PQD) if ∀(x, y) in R², either inequality holds [10]:
P(X ≤ x, Y ≤ y) ≥ P(X ≤ x)P(Y ≤ y),
P(X > x, Y > y) ≥ P(X > x)P(Y > y).
X and Y are PQD if the probability that they are simultaneously small (or simultaneously large) is at least what it would be were they independent. In terms of C, the PQD condition can be written as C(u, v) ≥ uv for all (u, v) in I². By checking the last inequality, we found that complexity and bolstered resubstitution error are PQD for all datasets. Spearman's ρ (or, to be precise, ρ/12) can be interpreted as a measure of “average” quadrant dependence (both positive and negative) for random variables whose copula is C [10]. It is interesting to ask when one continuous bivariate distribution H1 is more PQD (more concordant) than another H2. The answer is readily provided by comparing ρ or τ [11]: if ρ(H1) ≤ ρ(H2) or τ(H1) ≤ τ(H2), then H2 is more PQD (more concordant) than H1. From Table 1 it can be seen that Colon and Prostate 1 are more PQD than the other datasets, i.e. concordance relations between complexity and bolstered resubstitution error are much stronger for these data than for the other datasets.
6.2 Tail Monotonicity
As we mentioned above, we are interested in tail dependence, when low (high) complexity is associated with small (large) classification error. Tail monotonicity reflects this type of association and it is a stronger condition for dependence than PQD. Let X and Y be random variables. Then four types of tail monotonicity can be defined as follows [10]:
• Y is left tail decreasing in X (LTD(Y|X)) if P(Y ≤ y|X ≤ x) is a nonincreasing function of x for all y.
• X is left tail decreasing in Y (LTD(X|Y)) if P(X ≤ x|Y ≤ y) is a nonincreasing function of y for all x.
• Y is right tail increasing in X (RTI(Y|X)) if P(Y > y|X > x) is a nondecreasing function of x for all y.
• X is right tail increasing in Y (RTI(X|Y)) if P(X > x|Y > y) is a nondecreasing function of y for all x.
In terms of a copula and its first-order partial derivatives these conditions are equivalent to
• LTD(Y|X) iff for any v in I, ∂C(u, v)/∂u ≤ C(u, v)/u for almost all u.
• LTD(X|Y) iff for any u in I, ∂C(u, v)/∂v ≤ C(u, v)/v for almost all v.
• RTI(Y|X) iff for any v in I, ∂C(u, v)/∂u ≤ (v − C(u, v))/(1 − u) for almost all u.
• RTI(X|Y) iff for any u in I, ∂C(u, v)/∂v ≤ (u − C(u, v))/(1 − v) for almost all v.
For the Frank copula, the first-order partial derivatives are

∂C_θ(u, v)/∂u = e^{−θu}(e^{−θv} − 1) / [e^{−θ} − 1 + (e^{−θu} − 1)(e^{−θv} − 1)],   (3)
∂C_θ(u, v)/∂v = e^{−θv}(e^{−θu} − 1) / [e^{−θ} − 1 + (e^{−θu} − 1)(e^{−θv} − 1)].   (4)
Tail monotonicity is also guaranteed if ρ ≥ τ ≥ 0 is met [10]. We verified that for all datasets bolstered resubstitution error is left tail decreasing in complexity, complexity is left tail decreasing in bolstered resubstitution error, bolstered resubstitution error is right tail increasing in complexity, and complexity is right tail increasing in bolstered resubstitution error. Thus, dependence in the tail between these two variables exists.
6.3 Stochastic Monotonicity
Stochastic monotonicity is stronger than tail monotonicity. According to [10],
• Y is stochastically increasing in X (SI(Y|X)) if P(Y > y|X = x) is a nondecreasing function of x for all y.
• X is stochastically increasing in Y (SI(X|Y)) if P(X > x|Y = y) is a nondecreasing function of y for all x.
Alternatively, stochastic monotonicity can be expressed as
• SI(Y|X) iff for any v in I, C(u, v) is a concave function of u.
• SI(X|Y) iff for any u in I, C(u, v) is a concave function of v.
A concave function implies that the second-order derivatives must be less than or equal to zero. For the Frank copula, these derivatives are

∂²C_θ(u, v)/∂u² = θ e^{−θu}(e^{−θv} − 1)(e^{−θv} − e^{−θ}) / [e^{−θ} − 1 + (e^{−θu} − 1)(e^{−θv} − 1)]²,   (5)
∂²C_θ(u, v)/∂v² = θ e^{−θv}(e^{−θu} − 1)(e^{−θu} − e^{−θ}) / [e^{−θ} − 1 + (e^{−θu} − 1)(e^{−θv} − 1)]².   (6)

Since θ > 0 in our case (positive dependence as expressed by the rank correlation coefficients), it is easy to verify that ∂²C_θ(u, v)/∂u² ≤ 0 and ∂²C_θ(u, v)/∂v² ≤ 0, which, in turn, implies that C_θ(u, v) is concave. Thus, for all datasets in our study, bolstered resubstitution error is stochastically increasing in complexity and complexity is stochastically increasing in bolstered resubstitution error.
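A small numerical sketch (ours) of the Frank copula in (2) and of the checks discussed in this section: the PQD condition C(u, v) ≥ uv and the concavity of the sections, verified here by second differences on a grid rather than by evaluating (5) and (6) symbolically.

```python
import numpy as np

def frank_copula(u, v, theta):
    """Frank copula C_theta(u, v), eq. (2); theta must be non-zero."""
    num = (np.exp(-theta * u) - 1) * (np.exp(-theta * v) - 1)
    return -np.log(1 + num / (np.exp(-theta) - 1)) / theta

def check_pqd_and_concavity(theta, grid=50):
    """Check C(u, v) >= uv and numerically confirm concavity of u -> C(u, v)."""
    u = np.linspace(0.01, 0.99, grid)
    U, V = np.meshgrid(u, u)
    C = frank_copula(U, V, theta)
    pqd = np.all(C >= U * V - 1e-12)
    d2 = C[:, 2:] - 2 * C[:, 1:-1] + C[:, :-2]   # second difference along u
    concave = np.all(d2 <= 1e-9)
    return pqd, concave

print(check_pqd_and_concavity(theta=0.8))   # (True, True) for any positive theta
```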
7 Ensemble of k-NNs
An ensemble of classifiers consists of several classifiers (members) that make predictions independently of each other. After that, these predictions are combined together to produce the final prediction. As a base classifier, 3-NN was used. As a combination technique, the conventional majority vote was chosen. An ensemble can outperform its best performing member if ensemble members make mistakes on different cases, so that their predictions are as uncorrelated and diverse as possible. On the other hand, an ensemble must include a sufficient number of accurate classifiers, since if there are only a few good votes, they can easily be drowned out among many bad votes. So far, many definitions of diversity have been proposed [6], but unfortunately a precise definition is still largely elusive. Because of this fact, we decided not to follow any explicit definition of diversity, but to introduce diversity implicitly instead: each ensemble member works with its own feature subset randomly sampled from the original set of genes. Given that it is difficult to carry out biological analysis of many genes, we restricted the number of genes to be sampled to 50, i.e. each ensemble member works with 1 to 50 randomly selected genes. An ensemble consisting of L 3-NN classifiers is formed as follows. Randomly select M > L (e.g. M = 100) feature subsets and compute the dataset complexity for each of them. Rank the subsets according to their complexity and select the L least complex subsets while ignoring the others. Classify the data with each classifier and combine the votes.
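A sketch of the ensemble construction just described is given below; it reuses the complexity_score function from the sketch in Sect. 4, relies on scikit-learn's 3-NN, and all remaining names are ours.

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

def build_complexity_guided_ensemble(X, y, L=5, M=100, max_genes=50, rng=None):
    """Sample M random gene subsets, keep the L least complex, fit one 3-NN per subset."""
    rng = np.random.default_rng(rng)
    n_genes = X.shape[1]
    subsets = []
    for _ in range(M):
        size = rng.integers(1, max_genes + 1)
        genes = rng.choice(n_genes, size=size, replace=False)
        subsets.append((complexity_score(X[:, genes], y), genes))   # see the Sect. 4 sketch
    subsets.sort(key=lambda s: s[0])
    members = []
    for _, genes in subsets[:L]:                                    # L least complex subsets
        clf = KNeighborsClassifier(n_neighbors=3).fit(X[:, genes], y)
        members.append((genes, clf))
    return members

def ensemble_predict(members, X_new):
    """Majority vote over the ensemble members."""
    votes = np.array([clf.predict(X_new[:, genes]) for genes, clf in members])
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])
```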
8 Experimental Results We set L to 3, 5, 7, 9, and 11. Table 2 represents the dataset complexity as estimated by the normalized rank sum statistic W for different values of L. For
each dataset, two values are given: average minimum and average maximum complexity (averaging over 100 runs) of the selected feature subsets. It can be observed that complexity for each dataset is rather stable as L grows. Prostate 1 appears to be far more complex than the other datasets while Prostate 2 and Brain seem to be the least complex. For comparison, Table 3 lists dataset complexity when all features are considered in computing W . Again Prostate 1 looks the most complex while Prostate 2 and Brain are among least complex.
Table 2. Average minimum and maximum normalized W for feature subsets selected with our ensemble generating approach for various values of L

Dataset no.  L = 3             L = 5             L = 7             L = 9             L = 11
             avr.min avr.max   avr.min avr.max   avr.min avr.max   avr.min avr.max   avr.min avr.max
1            0.1082  0.1087    0.1082  0.1096    0.1082  0.1106    0.1082  0.1115    0.1082  0.1125
2            0.1295  0.1295    0.1295  0.1295    0.1295  0.1295    0.1295  0.1295    0.1295  0.1295
3            0.0756  0.0756    0.0756  0.0756    0.0756  0.0756    0.0756  0.0756    0.0756  0.0756
4            0.1279  0.1324    0.1273  0.1362    0.1278  0.1383    0.1279  0.1407    0.1275  0.1425
5            0.2438  0.2452    0.2437  0.2466    0.2437  0.2481    0.2439  0.2488    0.2438  0.2502
6            0.0756  0.0756    0.0756  0.0756    0.0756  0.0756    0.0756  0.0756    0.0756  0.0756
Table 3. Unnormalized and normalized rank sum statistic W when all features are used

Dataset no.     1      2      3      4      5      6
W               465    409    48     496    1959   45
N               74     62     34     90     102    34
normalized W    0.17   0.21   0.08   0.12   0.37   0.009
Table 4 summarizes the average bolstered resubstitution error (over 100 runs) and its standard deviation achieved with 3-NN ensembles using randomly sampled subsets of genes. As L increases, the average error tends to become smaller. It should be noted that one should not seek dependence between ensemble error in Table 4 and feature subset complexity in Table 2, since our hypothesis is only applied to the error of the individual classifiers. For comparison, we also included experiments with RBF [8], followed by 3-NN classification using selected genes. Table 5 lists the average bolstered
Table 4. Average bolstered resubstitution error and its standard deviation for our ensemble scheme for different values of L

Dataset no.   L = 3         L = 5         L = 7         L = 9         L = 11
1             0.120±0.018   0.101±0.017   0.096±0.013   0.092±0.014   0.092±0.013
2             0.093±0.014   0.077±0.012   0.072±0.011   0.067±0.011   0.064±0.009
3             0.107±0.026   0.078±0.020   0.072±0.020   0.068±0.019   0.062±0.016
4             0.131±0.019   0.111±0.020   0.099±0.018   0.097±0.018   0.089±0.017
5             0.093±0.018   0.075±0.013   0.067±0.013   0.064±0.011   0.062±0.009
6             0.032±0.022   0.019±0.015   0.010±0.011   0.006±0.008   0.005±0.007
resubstitution error and its standard deviation computed over 100 runs when RBF was applied to each dataset prior to 3-NN classification. The third column contains the number of genes retained after filtering. Results of 3-NN classification without prior gene selection are given in the last column. It can be observed that our ensemble scheme almost always outperforms RBF+3-NN, except for the Brain and Prostate 2 datasets, which were easy to classify according to dataset complexity.

Table 5. Average bolstered resubstitution error and its standard deviation 1) when RBF was applied before 3-NN classification (RBF+3-NN) and 2) with 3-NN classification without gene selection

Dataset no.   RBF+3-NN      #genes   3-NN
1             0.199±0.011   12       0.160±0.005
2             0.107±0.010   3        0.098±0.006
3             0.055±0.010   6        0.074±0.008
4             0.145±0.005   152      0.132±0.001
5             0.117±0.008   2        0.099±0.002
6             0.003±0.003   1        0.029±0.000
We also provide a comparison of an ensemble and a single best classifier (SBC) in the ensemble. Let e_SBC and e_ENS be the bolstered resubstitution error achieved with a SBC and an ensemble, respectively. The following statistics were computed over 100 ensemble generations (a sketch of their computation is given after this list):
• win-tie-loss count, where ‘win’/‘tie’/‘loss’ means the number of times an ensemble was superior/equal/inferior in terms of bolstered resubstitution error to a SBC in the ensemble (in other words, the number of times when e_ENS < e_SBC, e_ENS = e_SBC, e_ENS > e_SBC, respectively);
• ‘min. win’, ‘max. win’, ‘avr. win’ (minimum, maximum, and average differences e_SBC − e_ENS when an ensemble outperforms its SBC);
• ‘min. loss’, ‘max. loss’, ‘avr. loss’ (minimum, maximum, and average differences e_ENS − e_SBC when a SBC outperforms an ensemble).
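The sketch referred to above (our own, purely illustrative):

```python
import numpy as np

def win_tie_loss_stats(e_sbc, e_ens):
    """e_sbc, e_ens: paired error estimates over repeated ensemble generations."""
    e_sbc, e_ens = np.asarray(e_sbc), np.asarray(e_ens)
    wins, ties, losses = (e_ens < e_sbc).sum(), (e_ens == e_sbc).sum(), (e_ens > e_sbc).sum()
    win_diff = (e_sbc - e_ens)[e_ens < e_sbc]
    loss_diff = (e_ens - e_sbc)[e_ens > e_sbc]

    def summary(d):
        return (d.min(), d.max(), d.mean()) if d.size else None   # None plays the role of 'no' in Tables 6-10

    return (wins, ties, losses), summary(win_diff), summary(loss_diff)
```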
Tables 6-10 contain values of these statistics. If there were no losses, this fact is marked as ‘no’. As one can see, an ensemble was largely superior to a SBC on all six datasets. The degree of success, however, varied, depending on dataset complexity. For example, Prostate 2 was much easier to classify compared to other datasets and therefore a SBC often reached the top performance so that an ensemble had nothing to improve on.

Table 6. Comparison of a SBC and an ensemble when L = 3

Dataset no.   win-tie-loss   min.win   max.win   avr.win   min.loss   max.loss   avr.loss
1             93/0/7         0.0014    0.0743    0.0231    0.0014     0.0243     0.0085
2             96/0/4         0.0048    0.0565    0.0263    0.0032     0.0177     0.0089
3             94/1/5         0.0029    0.0912    0.0370    0.0029     0.0176     0.0094
4             97/0/3         0.0022    0.0811    0.0414    0.0022     0.0078     0.0044
5             94/0/6         0.0020    0.0735    0.0329    0.0010     0.0235     0.0113
6             68/7/25        < 10^-4   0.0618    0.0217    0.0029     0.0412     0.0161
Table 7. Comparison of a SBC and an ensemble when L = 5

Dataset no.   win-tie-loss   min.win   max.win   avr.win   min.loss   max.loss   avr.loss
1             97/0/3         0.0041    0.0770    0.0346    0.0027     0.0054     0.0041
2             99/0/1         0.0048    0.0581    0.0340    0.0032     0.0032     0.0032
3             100/0/0        < 10^-4   0.1000    0.0525    no         no         no
4             100/0/0        0.0167    0.0911    0.0584    no         no         no
5             95/1/4         0.0059    0.0892    0.0388    0.0010     0.0049     0.0027
6             82/1/17        < 10^-4   0.0559    0.0222    0.0029     0.0265     0.0119
Table 8. Comparison of a SBC and an ensemble when L = 7

Dataset no.   win-tie-loss   min.win   max.win   avr.win   min.loss   max.loss   avr.loss
1             95/1/4         0.0041    0.0649    0.0336    0.0014     0.0068     0.0041
2             100/0/0        0.0032    0.0629    0.0371    no         no         no
3             98/1/1         0.0029    0.1059    0.0547    0.0029     0.0029     0.0029
4             99/0/1         0.0156    0.1056    0.0630    0.0033     0.0033     0.0033
5             98/0/2         0.0020    0.0833    0.0408    0.0020     0.0127     0.0074
6             73/9/18        0.0029    0.0529    0.0208    0.0029     0.0353     0.0096
Table 9. Comparison of a SBC and an ensemble when L = 9

Dataset no.   win-tie-loss   min.win   max.win   avr.win   min.loss   max.loss   avr.loss
1             96/0/4         0.0014    0.0649    0.0345    0.0027     0.0189     0.0081
2             100/0/0        0.0081    0.0694    0.0400    no         no         no
3             99/0/1         0.0059    0.0941    0.0556    0.0059     0.0059     0.0059
4             100/0/0        0.0056    0.1033    0.0664    no         no         no
5             99/0/1         0.0029    0.1029    0.0412    0.0078     0.0078     0.0078
6             79/9/12        < 10^-4   0.0529    0.0193    0.0029     0.0118     0.0042
Table 10. Comparison of a SBC and an ensemble when L = 11

Dataset no.   win-tie-loss   min.win   max.win   avr.win   min.loss   max.loss   avr.loss
1             97/0/3         0.0014    0.0676    0.0312    0.0068     0.0284     0.0149
2             99/0/1         0.0032    0.0742    0.0383    0.0016     0.0016     0.0016
3             100/0/0        0.0206    0.1000    0.0574    no         no         no
4             100/0/0        0.0333    0.1133    0.0737    no         no         no
5             98/0/2         0.0010    0.0794    0.0431    0.0069     0.0167     0.0118
6             88/3/9         < 10^-4   0.0441    0.0172    0.0029     0.0176     0.0078
9 Conclusion We proposed a new ensemble generating scheme using a 3-NN as a base classifier. Our approach leads to lower bolstered resubstitution error compared to a single best classifier in the ensemble and to a 3-NN preceded by the RBF algorithm [8], designed to deal with redundancy among genes. Our approach originates from the link between dataset complexity and bolstered resubstitution error established through the copula method. We found that there is positive dependence between complexity and error, where low (high) complexity corresponds to small (large) error. Hence, the dataset complexity serves as a reliable indicator of the expected classification performance. As a result, selection of least complex subsets of features implies more accurate ensemble members and therefore it ensures better ensemble performance. Extensive experiments with six gene expression datasets containing different types of cancer show feasibility of our approach. Its extra attractiveness comes from the fact that good performance is achieved with a few 3-NNs, which limits the number of genes to further analyze.
References 1. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Proc Natl Acad Sci 96:6745–6750 2. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JYH, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM,
Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR (2002) Nature 415:436–442
3. Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR (2002) Cancer Cell 1:203–209
4. Sima C, Attoor S, Braga-Neto U, Lowey J, Suh E, Dougherty ER (2005) Error estimation confounds feature selection in expression-based classification. In: Proc IEEE Int Workshop Genomic Sign Proc and Stat, Newport, Rhode Island
5. Braga-Neto U, Dougherty ER (2004) Pattern Recognition 37:1267–1281
6. Kuncheva L (2004) Combining pattern classifiers: methods and algorithms. John Wiley & Sons, Hoboken
7. Dudoit S, Fridlyand J (2003) Classification in microarray experiments. In: Speed T (ed) Statistical analysis of gene expression microarray data. Chapman & Hall/CRC Press, Boca Raton
8. Yu L (2008) Feature selection for genomic data analysis. In: Liu H, Motoda H (eds) Computational methods of feature selection. Chapman & Hall/CRC, Boca Raton
9. Sklar A (1959) Fonctions de répartition à n dimensions et leurs marges. Publications of the Institute of Statistics, University of Paris
10. Nelsen RB (2006) An introduction to copulas. Springer Science+Business Media, New York
11. Joe H (1997) Multivariate models and dependence concepts. Chapman & Hall/CRC Press, Boca Raton
12. Zar JH (1999) Biostatistical analysis. Prentice Hall, Upper Saddle River
13. Gandrillon O (2004) Guide to the gene expression data. In: Proc ECML/PKDD Discovery Challenge Workshop, Pisa, Italy, pp 116–120
14. Bø TH, Jonassen I (2002) Genome Biology 3:0017.1–0017.11
15. Box GEP, Müller ME (1958) The Annals of Mathematical Statistics 29:610–611
16. Schweizer B, Wolff EF (1981) The Annals of Statistics 9:879–885
Multivariate Time Series Classification via Stacking of Univariate Classifiers∗
Carlos Alonso1, Óscar Prieto1, Juan José Rodríguez2, and Aníbal Bregón1
1 Grupo de Sistemas Inteligentes, Departamento de Informática, Universidad de Valladolid, España
2 Lenguajes y Sistemas Informáticos, Universidad de Burgos, España
Summary. This work explores the capacity of Stacking to generate multivariate time series classifiers from classifiers of their univariate time series components. The Stacking scheme proposed uses k-nearest neighbors (K-NN) with dynamic time warping (DTW) as a dissimilarity measure for the level 0 learners. Support vector machines and Naïve Bayes are applied at level 1. The method has been tested on two data sets: Continuous plant diagnosis and Japanese vowels. Experimental results show that for these data sets the proposed Stacking configuration performs well when multivariate DTW fails to produce precise K-NN classifiers, increasing the accuracy achieved by K-NN as a stand-alone method by an order of magnitude. This is an interesting issue because good univariate time series classifiers do not always perform satisfactorily when adapted to the multivariate case. On the contrary, if the multivariate classifier is accurate, Stacking univariate classifiers may perform worse. Key words: multivariate time series classifier, dynamic time warping, stacking
1 Introduction
Time series classification is a current research problem. Time series are of interest in several research areas like statistics, signal processing, and control theory [1]. Consequently, several techniques have been proposed to tackle this problem; a detailed review of them may be found in [2]. Hidden Markov Models [3, 4], Artificial Neural Networks [5, 6] or Dynamic Time Warping [4, 7, 8, 9] have been successfully applied to univariate time series classification. Feature extraction is also a popular method. The complexity of the problem increases when we consider multivariate time series: not every univariate method scales up to cope with multivariate problems. For instance, k-nearest neighbors (K-NN) using Dynamic Time Warping (DTW) as a dissimilarity measure behaves reasonably well for most univariate
∗ This work has been partially funded by the Spanish Ministry of Education and Culture through grant DPI2005-08498, and Junta Castilla y León VA088A05.
problems [7] but it may degrade in the multivariate case. In this work we explore the possibility of using univariate classification methods to handle each multivariate time series component – itself a univariate time series – by introducing an additional classifier to obtain the final class. This approach is a variant of the Stacking ensemble method. Stacking is characterized by two classification levels: level 0, which produces different classifiers for the same data set, and level 1, which tries to generalize from level 0 classifications. Level 1 generalization is accomplished by inducing another classification model from the same data set, once it has been processed by the level 0 classifiers [10]. Usually, Stacking applies different classification methods to the same data set at level 0. In this work, we first split each multivariate time series into each of its components and then we induce a K-NN classifier for each component. Given that the multivariate series classes are not usually characterized by a single component, we do not expect that a simple combination method, e.g. majority voting, will produce accurate results. Nevertheless, the principles of Stacking generalization also apply to this particular configuration. In this chapter, we analyze whether the generalization introduced by the first level classifier may compensate the information loss by the univariate classifiers at level 0 and produce an accurate multivariate classifier. To test the method, we have selected two data sets: Continuous plant diagnosis and Japanese vowels. Continuous plant diagnosis is the data set that inspired the use of Stacking of univariate time series classifiers, because multivariable DTW does not work well with this data set. Japanese vowels has been selected because it is an example of the opposite situation: multivariate DTW produces excellent K-NN classifiers for this data set. In a first set of experiments, we compare the behavior of Stacking against K-NN with DTW as a stand alone method. Afterwards, to cope with the difficulties found on the Japanese vowels data set and to further explore the properties of this proposal, we extend the set of Stacking configurations. The rest of this chapter is organized as follows. We first explore the method on the Continuous plant data set: the motivating example that inspires the work, the machine learning scheme proposed, the experimental setting and the results obtained are presented in detail. In the third section we apply the same learning method to the Japanese vowel data set. The fourth section explores new Stacking configurations on both data sets. Finally, some conclusions are stated.
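To make the proposed scheme concrete, the sketch below implements a plain DTW distance, a DTW-based 1-NN for a single univariate component at level 0, and a level-1 learner trained on the level-0 predictions. It is our simplified illustration: Gaussian Naïve Bayes stands in for the level-1 learner, class labels are assumed to be numeric, and level-0 meta-features for the training set are obtained here by leave-one-out, which may differ from the protocol actually used in the experiments.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def dtw_distance(a, b):
    """Plain O(len(a)*len(b)) dynamic time warping between two univariate series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_dtw_predict(train_series, train_labels, query, exclude=None):
    """1-NN with DTW over one univariate component; optionally exclude one training index."""
    best, best_d = None, np.inf
    for idx, s in enumerate(train_series):
        if idx == exclude:
            continue
        d = dtw_distance(query, s)
        if d < best_d:
            best, best_d = train_labels[idx], d
    return best

def stacked_fit_predict(train, labels, query):
    """train: list of examples, each a list of univariate component series; query: one new example."""
    labels = list(labels)
    n_comp = len(train[0])
    # level-0 meta-features on the training set, leave-one-out to avoid trivial self-matches
    meta_X = [[knn_dtw_predict([t[c] for t in train], labels, train[i][c], exclude=i)
               for c in range(n_comp)] for i in range(len(train))]
    level1 = GaussianNB().fit(meta_X, labels)    # stand-in for the SVM / Naive Bayes level-1 learners
    meta_q = [knn_dtw_predict([t[c] for t in train], labels, query[c]) for c in range(n_comp)]
    return level1.predict([meta_q])[0]
```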
2 Continuous Plant Diagnosis Data Set 2.1 Motivating Example Continuous dynamic industrial processes have two peculiarities related to the accessible observations. The first one is that a set of measured observations
Stacking for Multivariate Time Series Classification
137
is fixed because of the cost and additional complexity of introducing new sensors to the system. Hence, reasoning about the process situation is usually limited to available observations and knowledge of the process. Second one, as a consequence of its dynamic nature, observations values vary in time and are usually stored. Hence, the historic values of each observable variable may be considered as a univariate time series and the set of historic values of all the variables of interest as a multivariate time series. Therefore diagnosis of past and current faults may be achieved by analyzing the past multivariate time series, particularly if each diagnosed fault generates similar behavior patterns on the time series. If this is the case, each fault mode may be associated to a unique class and the diagnosis problem may be cast as the induction of multivariate time series classifiers. Of course, machine learning has been widely used for diagnosis problems. For instance, Inductive Logic Programming [11], Neural Networks [12], KDD techniques [13], decision trees [14] and combination of techniques like, Wavelet On-Line Pre-processing (WOLP), Autonomous Recursive Task Decomposition (ARTD) [15], have been used with different degree of success. However, to the best of our knowledge, few authors formulate the problem as a multivariate time series classification [15, 16]. For this work, we have use the laboratory scale plant shown in Fig. 1. FT 01
T1
LT 01
LC 01
P1
P2 ON/OFF
TT 03
T3
FT 03
FT 02
T2
TT 02
P3
P4 ON/OFF
v
v
ON/OFF R3
ON/OFF R2
T4
LT 04 TT 04
LC 04
P5
Fig. 1. The diagram of the plant
ON/OFF
FT 04
138
C. Alonso et al.
Although a laboratory plant, its complexity is comparable to the one encountered in several subsystems of real processes. It is made of four tanks {T 1...T 4}, five pumps {P 1...P 5}, and two PID’s controllers acting on pumps P 1, P 5 to keep the level of {T 1, T 4} close to their specified set point. Temperatures of tanks {T 2, T 3} are controlled by two PID’s that command, respectively, resistors {R2, R3}. The plant may work with different configurations and a simple setting without recirculation – pumps P 3, P 4 and resistor R2 are off – has been chosen. There are eleven different measured variables or parameters in the system: levels of tanks T 1 and T 4, the value of the PID control signal on pumps P 1 and P 5, in-flow on tank T 1, out-flow on tanks T 2, T 3 and T 4 and temperatures on tanks T 2, T 3 and T 4. We have considered fourteen different faults, listed in Table 1. Each fault produces a different pattern in the time series of eleven components that describe the historic behavior of the plant. Figure 2 shows fourteen instances of the series, one for each faulty class. In addition, in absence of faults, the plan works in the stationary state and time series values remain constant. Table 1. Fault types or classes Class Component Description FM01 FM02 FM03 FM04 FM05 FM06 FM07 FM08 FM09 FM10 FM11 FM12 FM13 FM14
T1 T1 T1 T1 T3 T3 T2 T2 T4 T4 P1 P2 P5 R2
Small leakage in tank T1 Big leakage in tank T1 Pipe blockage T1 (left outflow) Pipe blockage T1 (right outflow) Leakage in tank T3 Pipe blockage T3 (right outflow) Leakage in tank T2 Pipe blockage T2 (left outflow) Leakage in tank T4 Pipe blockage T4 (right outflow) Pump failure Pump failure Pump failure Resistor failure in tank T2
As Fig. 2 illustrates, classes can not be characterized by a single univariate series. However, there are univariate patterns that seem to be strongly associated to particular faults. For instance, class FM11 is the only one that produces overflow in tank 1, as shown LT1 time series. This circumstance may be better identified by a univariate time series classifier. 2.2 Stacking of Univariate Classifiers We want to examine whether Stacking allows the induction of multivariate time series classifiers from univariate classifiers induced from each component
Stacking for Multivariate Time Series Classification
139
FM14 FM13 FM12 FM11 FM10 FM09 FM08 FM07 FM06 FM05 FM04 FM03 FM02 FM01
FT01 FT02 FT03 FT04 LT01 LT04 PI.1 PI.4 TT02 TT03 TT04 control control
Fig. 2. Examples of different fault classes. Each row shows an instance of a certain fault class. Each column shows one of the measurable variables
140
C. Alonso et al.
of the original series. To accomplish it, we have used the Stacking configuration shown in Fig. 3. If the original multivariate time series has dimension n, it is split into its n components to obtain n univariate times series. For each time series, K-NN algorithm is used. Instead of returning only the class, KNN computes the confidence for every class. Consequently, the output vector of level 0 has nN components, being N the total number of classes: 11 × 14 in our setting. This output vector is the level 1 input. As level 1 learner we have chosen Na¨ıve Bayes, NB. Level 1 output is the output of the classifier. Ten-fold stratified cross validation is used to create the level 1 learning set.
output Level 1
Naive Bayes
Kneighbours DTW distance
Kneighbours DTW distance
Kneighbours DTW distance
Time series 1
Time series 2
Time series 3
...
Kneighbours DTW distance
Level 0
Time series n
inputs Fig. 3. Schema of the Stacking variant used in this work
For the level 0 classifiers we have opted for K-NN with DTW as dissimilarity measure because of its simplicity and good behavior as univariate time series classifier. NB has been selected for the level 1 classifier because it is a global and continuous classifier; these are desirable properties for the level 1 classifier [10]. NB is based on the independence assumption, but this assumption does not usually hold for time series. For the domain problem that we are interested in, it seems that some subsets of components are adequate to predict some classes and other subsets to predict other classes. Hence, it is reasonable to expect some independence between different subsets of the multivariate time series components. On the contrary, some dependence may exist among the compo-
Stacking for Multivariate Time Series Classification
141
nents of each subset. Nevertheless, the fact of training the level 0 classifiers with different and disjoint data gives a chance to increase independence. 2.3 Experimental Settings Due to the high cost of obtaining enough data for the fourteen class classification problem from the laboratory plant, we have to resort to a detailed, non linear quantitative simulation of the plant. We ran twenty simulations for each class, adding noise to the sensors readings. In order to obtain more realistic data, each fault mode is modeled via a parameter, affected by a factor α in the [0, 1] range which allows to graduate the severity of the fault. Simulations start at a stationary state and run for 900 seconds. The time of occurrence of each fault is randomly generated in the interval [180, 300]. To train the learners, we have sampled the original series with a three seconds period. Hence, the length of each univariate time series is 300. We randomly chose the fault magnitude, α. The multivariate time series obtained have been used to train the following learners: NB, 1-NN DTW, 3-NN DTW, 5-NN DTW, Stacking (level 0 1NN DTW, level 1 NB), Stacking (level 0 3-NN DTW, level 1 NB), Stacking (level 0 5-NN DTW, level 1 NB). This allows us comparing the behavior of Stacking with respect to NB and K-NN when they are employed alone. K-NN for the multivariate times series uses a simple extension of one-dimensional DTW. DTW is computed for each component, normalized and summed over all components to obtain a global dissimilarity measure. 2.4 Results and Discussion All the experiments have been performed on the WEKA tool [17]. Error estimations and hypotheses testing have been obtained with the slightly conservative corrected resampled t-test. Instead of the standard cross validation, we have made fifteen training-test splits. In each split, we have randomly selected 90% of available data for the training set and 10% for the test set. This experimental setting is proposed in [18]. Table 2 shows the error estimation for the seven learners considered. As was expected, neither NB nor K-NN are accurate classifiers for this problem. NB seems to be affected by the dependence among the univariate time series. K-NN, estimating a global dissimilarity measured by adding DTW results for every univariate component, obtains somewhat better error rates, but the improvement is not significant. However, Stacking increases dramatically the accuracy of the classifier, reducing error estimation by the order of magnitude. Table 3 shows the significance test results at 0.01 level. Any of the Stacking configurations is significantly better than the stand alone methods. The dependence of the univariate time series does not seem to affect NB at the level 1, which behaves extremely well with this data set.
142
C. Alonso et al.
Table 2. Continuous Plant Diagnosis data set: percentage of error for each method (N0 and N1 mean level 0 and level 1, respectively) Methods
Mean (deviation)
NB 17.14 (5.26) 1-NN DTW dist. 12.38 (6.31) 3-NN DTW dist. 15.00 (4.90) 5-NN DTW dist. 16.19 (6.02) Stacking: N0: 1-NN DTW dist. N1: NB 0.95 (1.63) Stacking: N0: 3-NN DTW dist. N1: NB 1.19 (2.20) Stacking: N0: 5-NN DTW dist. N1: NB 1.19 (2.58) Table 3. Continuous Plant Diagnosis data set: t-test results ( means that the method in the column is significantly better than the method in the row; N0 and N1 mean level 0 and level 1, respectively) a
b
c
-
d e f - -
g
-
a = NB b = 1-NN DTW dist. c = 3-NN DTW dist. d = 5-NN DTW dist. e = Stacking: N0: 1-NN DTW dist. N1: NB f = Stacking: N0: 3-NN DTW dist. N1: NB g = Stacking: N0: 5-NN DTW dist. N1: NB
In this work we try to evaluate the capacity of Stacking to address multivariate time series classification from univariate time series classifiers induced for each component of the original time series. This is an important question because good univariate classifiers methods are known but they do not always produce good results when extended to the multivariate case. The results obtained for the Continuous Plant Diagnosis data set with the proposed Stacking scheme are good enough to suggest that this approach may be viable. Nevertheless, the plant data set has the property that multivariate DTW does not behave satisfactory for the classification task. Hence, to test the generality of this approach, we have selected a data set where K-NN with multivariate DTW produces classifiers with a low error rate: Japanese vowels. We have repeated the same set of experiments with this data set.
3 Japanese Vowels Data Set 3.1 Data Set The data set was introduced in [19]; it is a speaker recognition problem. It is available in the UCI KDD Archive [20]. In this archive, the data set is
Stacking for Multivariate Time Series Classification
143
described as “this dataset records 640 time series of 12 LPC cepstrum coefficients taken from nine male speakers.” Hence, the number of classes is 9, the number of examples is 640 and the number of variables is 12. The series length is variable, the minimum length is 7 and the maximum length is 29. Figure 4 shows the first 6 variables for two examples of the first two classes. 3.2 Results and Discussion The same methodology as used with the Continuous Plant Diagnosis data set has been applied to the Japanese Vowels data set. The experiments have been performed on the WEKA tool [17]. Error estimations and hypothesis testing have been obtained with the corrected resampled t-test. Instead of the standard cross validation, we have made fifteen training-test splits. As was suggested in [18], in each split we have randomly selected 90% of available data for the training set and 10% for the test set. Table 4 shows the error estimation for the seven learners considered. First, it must be noticed that the classifiers induced with the global DTW, that is, 1,3,5-NN, provide the lowest error rate for this data set. NB alone behaves worse than K-NN. More important, Stacking univariate K-NN classifiers with NB does not work for this data set: Stacking with 1-NN as a level 0 classifier and NB as a level 1 classifier provides the worst error rate. In the best configuration with 5-NN at level 0 and NB at level 1, the error rate is comparable to the obtained with NB alone. Table 4. Japanese Vowels data set: Percentage of error for each method (N0 and N1 mean level 0 and level 1, respectively) Methods
Mean (deviation)
NB 7.41( 2.69) 1-NN DTW dist. 1.35( 1.27) 3-NN DTW dist. 1.46( 1.38) 5-NN DTW dist. 1.87( 1.34) Stacking: N0: 1-NN DTW dist. N1: NB 16.90( 4.26) Stacking: N0: 3-NN DTW dist. N1: NB 10.11( 3.32) Stacking: N0: 5-NN DTW dist. N1: NB 7.61( 3.64)
Table 5 shows the significance test results at 0.01 level. Any the stand alone K-NN classifiers are significantly better than the Stacking configurations. For this data set, the non independence of the univariate time series seems to affect severely NB at the level 1. There are two properties of the Japanese Vowels data set that might explain the obtained results. One is the strong dependency between the univariate time series for a given class; this could explain why NB can not produce a good classifier from univariate K-NN classifiers. The other is that there seems
144
C. Alonso et al. 1.8
1.8
1.6
1.6
1.4
1.4
1.2
1.2
1
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2 0
5
10
15
20
25
0.2
0
5
10
15
20
25
0
5
10
15
20
25
0
5
10
15
20
25
0
5
10
15
20
25
0
5
10
15
20
25
0
5
10
15
20
25
0.2
0
0
-0.2
-0.2
-0.4
-0.4
-0.6
-0.6
-0.8
-0.8
-1
-1 0
5
10
15
20
25
0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2 0.1
0.1
0
0 0
5
10
15
20
25
0.4
0.4
0.2
0.2
0
0
-0.2
-0.2
-0.4
-0.4
-0.6
-0.6
-0.8
-0.8 0
5
10
15
20
25
0.6
0.6
0.4
0.4
0.2
0.2
0
0
-0.2
-0.2
-0.4
-0.4 0
5
10
15
20
25
0.2
0.2
0.1
0.1
0
0
-0.1
-0.1
-0.2
-0.2
-0.3
-0.3
-0.4
-0.4
-0.5
-0.5
-0.6
-0.6
-0.7
-0.7 0
5
10
15
20
25
Fig. 4. Several series for the Japanese Vowels data set. The plots on the left are for the first class and the plots on the right are for the second class. Two examples of the same class are shown in each plot. Six variables (the first half) are shown, one for each row of graphs
Stacking for Multivariate Time Series Classification
145
Table 5. Japanese Vowels data set: t-test results ( means that the method in the column is significantly better than the method in the row; N0 and N1 mean level 0 and level 1, respectively) a
b
c
d e f
g
-
-
a = NB b = 1-NN, DTW dist. c = 3-NN, DTW dist. d = 5-NN, DTW dist. e = Stacking: N0: 1-NN DTW dist. N1: NB f = Stacking: N0: 3-NN DTW dist. N1: NB g = Stacking: N0: 5-NN DTW dist. N1: NB
to be no irrelevant univariate time series for any given class; this is coherent with the good behavior exhibited by K-NN with a global DTW dissimilarity measure. In order to test the former hypotheses, we decided to extend the set of Stacking configurations. First, we have considered Support Vector Machines, SVM, as a first level classifier. SVM are robust against attribute dependencies and it may improve the performance of Stacking on the Japanese Vowels data set. Second, we have added a global multivariate classifier to the level 0 univariate classifiers. In this way, the first level classifier might generalize from the univariate and the multivariate classifiers, which may improve the behavior of both of them. The next section describes in detail the new configurations.
4 Extending Stacking Configurations 4.1 Experimental Settings We propose two modifications for the Stacking scheme for multivariate time series classification. First one consists of using a method that was not affected by attribute dependencies at the first level. We chose Support Vector Machine (SVM) [21] and ran the algorithm with two different kernels: linear and perceptron [22]. The second modification is to add a global multivariate classifier at level 0; adding this classifier gives global information of the multivariate time series at level 1. We have used the K-NN algorithm with global DTW. The new configurations of the algorithm are shown in Fig. 5. We can use NB or SVM with linear kernel or with kernel perceptron at level 1 and we can employ the global multivariate classifier at level 0. This gives us a total of fifteen new configurations of the Stacking scheme for multivariate time series classification.
146
C. Alonso et al.
output Global Classifier Level 1 Classifier
K-NN GlobalDTW distance
K-NN
K-NN
DTW distance
DTW distance
DTW distance
Time series 2
Time series 3
Time series n
K-NN DTW distance
Time series 1
Level 1: SVM or NB
..
K-NN Level 0
input Fig. 5. Schema of the Stacking variant using the K-NN global DTW at level 0
4.2 Results and Discussion The same methodology has been used with both data sets and the fifteen new learning schemes. All the experiments have been performed on the WEKA tool [17]. Error estimations and hypothesis testing have been obtained with the corrected resampled t-test. We have made fifteen training-test iterations for each learning method: in each iteration we have randomly selected 90% of the available data for the training set and 10% for the test set. Table 6 shows the error estimation for the twenty two learners considered and the two data sets used. The results of the seven methods already examined have been included for comparison. For the Continuous Plant Diagnosis data set, the Stacking configuration with NB at level 1 and univariate 1-NN at level 0 provides the best error rate. The new configurations do not improve this previously obtained result. The change of NB to SVM at level 1 slightly increases the error rate. Nevertheless, Table 7 which shows the significance test results at 0.01 level, reveals that the differences in performance are not significant. The inclusion of the global multivariate classifier at level 0 does not improve Stacking generalization, either. For the Japanese Vowels data set, the best result is obtained with Stacking using SVM (perceptron kernel) at level 1 when adding the global multivariate classifier to the level 0 univariate time series classifiers. However, the error rate is quite similar to that achieved by 1-NN with global multivariate DTW. Actually, Table 8 shows that the difference is not significant. NB with global DTW also increases its performance respect to the univariate version, but
Stacking for Multivariate Time Series Classification
147
Table 6. Both data sets: Percentage of error for each method (N0 and N1 mean level 0 and level 1, respectively; P and L correspond to the linear and perceptron SVM kernels, respectively)
Methods
Mean (deviation) Plant Vowels
NB 1-NN DTW dist. 3-NN DTW dist. 5-NN DTW dist. Stacking: N0: 1-NN, N1: NB Stacking: N0: 3-NN, N1: NB Stacking: N0: 5-NN, N1: NB Stacking: N0: 1-NN DTW dist. N1: SVM (L) Stacking: N0: 3-NN DTW dist. N1: SVM (L) Stacking: N0: 5-NN DTW dist. N1: SVM (L) Stacking: N0: 1-NN DTW dist. N1: SVM (P) Stacking: N0: 3-NN DTW dist. N1: SVM (P) Stacking: N0: 5-NN DTW dist. N1: SVM (P) Stacking: N0: 1-NN DTW dist.+ global, N1: NB Stacking: N0: 3-NN DTW dist.+ global, N1: NB Stacking: N0: 5-NN DTW dist.+ global, N1: NB Stacking: N0: 1-NN DTW dist.+ global, N1: SVM Stacking: N0: 3-NN DTW dist.+ global, N1: SVM Stacking: N0: 5-NN DTW dist.+ global, N1: SVM Stacking: N0: 1-NN DTW dist.+ global, N1: SVM Stacking: N0: 3-NN DTW dist.+ global, N1: SVM Stacking: N0: 5-NN DTW dist.+ global, N1: SVM
17.14(5.26) 12.38(6.31) 15.00(4.90) 16.19(6.02) 0.95(1.63) 1.19(2.20) 1.19(2.58) 2.38(2.92) 2.14(2.96) 1.67(2.65) 2.62(3.16) 2.62(3.43) 1.67(2.65) 1.19(1.74) 1.19(2.20) 1.19(2.58) 1.67(2.65) 1.90(2.98) 1.67(2.65) 1.67(2.65) 2.38(3.21) 1.90(2.65)
(L) (L) (L) (P) (P) (P)
7.41(2.69) 1.35(1.27) 1.46(1.38) 1.87(1.34) 16.90(4.26) 10.11(3.32) 7.61(3.64) 10.12(2.77) 5.62(2.93) 4.59(1.82) 7.42(2.73) 4.48(2.50) 3.64(2.17) 7.00(2.49) 4.59(2.48) 4.28(2.62) 1.56(1.43) 1.35(1.29) 1.35(1.16) 1.35(1.27) 1.25(1.05) 1.25(1.05)
behaves worse than SVM, and its error rate is larger than that achieved by KNN as a stand alone classifier. All SVM configurations are significantly better than the corresponding scheme with NB at level 1. By summarizing, for these data sets, it seems that if a stand alone multivariate classifier behaves well, Stacking of classifiers trained for each univariate time series component does not improve on the simpler, easier to train, stand alone method. On the contrary, if the stand alone multivariate classifier is not satisfactory, the use of univariate time series combined by Stacking may be a good alternative.
5 Conclusion This work tries to evaluate the capacity of Stacking to address multivariate time series classification. This is an important question because good univariate classifiers methods are known but they do not always produce good results when extended to the multivariate case. In the proposed setting, the
-
a
-
b
-
c
-
d
-
-
f
e
-
g
-
h
-
i
-
j
-
k
-
l
-
m
-
n
-
o
-
p
-
q
-
r
-
s
-
t
-
u
-
v a = NB b = 1-NN DTW dist. c = 3-NN DTW dist. d = 5-NN DTW dist. e=Stacking:N0:1-NN DTW dist. N1:NB f=Stacking:N0:3-NN DTW dist. N1:NB g=Stacking:N0:5-NN DTW dist. N1:NB h=Stacking:N0:1-NN DTW dist. N1:SVM (L) i=Stacking:N0:3-NN DTW dist. N1:SVM (L) j=Stacking:N0:5-NN DTW dist. N1:SVM (L) k=Stacking:N0:1-NN DTW dist. N1:SVM (P) l=Stacking:N0:3-NN DTW dist. N1:SVM (P) m=Stacking:N0:5-NN DTW dist. N1:SVM (P) n=Stacking:N0:1-NN DTW dist.+global, N1:NB o=Stacking:N0:3-NN DTW dist.+global, N1:NB p=Stacking:N0:5-NN DTW dist.+global, N1:NB q=Stacking:N0:1-NN DTW dist.+global, N1:SVM (L) r=Stacking:N0:3-NN DTW dist.+global, N1:SVM (L) s=Stacking:N0:5-NN DTW dist.+global, N1:SVM (L) t=Stacking:N0:1-NN DTW dist.+global, N1:SVM (P) u=Stacking:N0:3-NN DTW dist.+global, N1:SVM (P) v=Stacking:N0:5-NN DTW dist.+global, N1:SVM (P)
Table 7. Continuous Plant Diagnosis data set: t-test results (means that the method in the column is significantly better than the method in the row; N0 and N1 mean level 0 and level 1, respectively; P and L correspond to the linear and perceptron SVM kernels, respectively)
148 C. Alonso et al.
b
c
d
-
a
e
-
f
h
i
j
k l m
-
g o p q r
s t u v
a=NB b=1-NN DTW dist. c=3-NN DTW dist. d=5-NN DTW dist. e=Stacking:N0: 1-NN DTW dist.N1:NB f=Stacking:N0:3-NN DTW dist.N1:NB g=Stacking:N0: 5-NN DTW dist.N1:NB h = Stacking:N0:1-NN DTW dist.N1:SVM (L) i=Stacking:N0:3-NN DTW dist.N1:SVM (L) j=Stacking:N0:5-NN DTW dist.N1:SVM (L) k=Stacking:N0:1-NN DTW dist.N1:SVM (P) l=Stacking:N0:3-NN DTW dist.N1:SVM (P) m=Stacking:N0:5-NN DTW dist.N1:SVM (P) n=Stacking:N0:1-NN DTW dist.+global,N1:NB o=Stacking:N0:3-NN DTW dist.+global,N1:NB p=Stacking:N0:5-NN DTW dist.+ global,N1:NB q=Stacking:N0:1-NN DTW dist.+global,N1:SVM (L) r=Stacking:N0:3-NN DTW dist.+global,N1:SVM (L) s=Stacking:N0:5-NN DTW dist.+global,N1:SVM (L) t=Stacking:N0:1-NN DTW dist.+global,N1:SVM (P) u=Stacking:N0:3-NN DTW dist.+global,N1:SVM (P) - v=Stacking:N0:5-NN DTW dist.+global,N1:SVM (P)
n
Table 8. Japanese Vowels data set: t-test results (means that the method in the column is significantly better than the method in the row; N0 and N1 mean level 0 and level 1, respectively; P and L correspond to the linear and perceptron SVM kernels, respectively)
Stacking for Multivariate Time Series Classification 149
150
C. Alonso et al.
classifiers at level 0 of Stacking are univariate classifiers, that is, there is a classifier for each variable of the multivariate time series. It is also possible to include an additional classifier at level 0, named “global”, that is trained with the multivariate series. The considered approach has been evaluated using two data sets, Continuous Plant Diagnosis and Japanese Vowels. It has been compared with classifiers that are obtained directly from the multivariate series. The results for these data sets are very different. We can conclude that if the multivariate classifier is good on a certain data set such as Japanese vowels, Stacking with univariate classifiers will not produce better classifiers. If the global classifier is also included at level 0, then it is possible to have better classifiers, but for this data set the differences are very small and not significant. On the other hand, if the multivariate classifier is not accurate (as is with Continuous Plant Diagnosis data set), Stacking with univariate classifiers is significantly better as demonstrated by our experiments. Moreover, many different Stacking configurations can be used to outperform the multivariate classifiers. A possible explanation for this difference can be that in the first case, the individual univariate series data are interrelated. Thus, the construction of independent classifiers for each variable is harmful since it breaks up these relations. In contrast, in the second case, some variables can be irrelevant or redundant. Given that the instance based methods can be very sensitive to irrelevant attributes, a combination of several redundant variables can further degrade classification performance. In the future, our approach will be evaluated by using other data sets in order to get extra insight into the expected ensemble behavior, depending on the characteristics of the multivariate time series data.
References 1. Kadous MW (2002) Temporal classification: extending the classification paradigm to multivariate time series. PhD Thesis, University of New South Wales, Sydney, http://www.cse.unsw.edu.au/~ waleed/phd/ 2. Rodr´ıguez JJ, Alonso CJ (2004) T´ecnicas de aprendizaje autom´ atico para la clasificaci´ on de series. In: Gir´ aldez R, Riquelme JC, Aguilar-Ruiz JS (eds), Tendencias de la Miner´ıa de Datos en Espa˜ na: Red Espa˜ nola de Miner´ıa de Datos, Universidad de Valladolid, Espa˜ na, pp 217–228 3. Bengio Y (1999) Markovian models for sequential data. Neural Computing Surveys 2:129–162, http://www.iro.umontreal.ca/~ lisa/bib/pub\ _subject/markov/pointeurs/hmms.ps 4. Rabiner L, Juang B-H (1993) Fundamentals of speech recognition. Prentice Hall, Upper Saddle River 5. Bishop CM (1995) Neural networks for pattern recognition. Oxford University Press, Oxford 6. Haykin S (1998) Neural networks: a comprehensive foundation. Prentice Hall, Upper Saddle River
Stacking for Multivariate Time Series Classification
151
7. Keogh E, Ratanamahatana CA (2005) Exact indexing of dynamic time warping. Knowl Inf Syst 7:358–386 8. Colomer J, Mel´endez J, Gamero FI (2002) Qualitative representation of process trends for situation assessment based on cases. In: Proc the 15th IFAC World Congress, Barcelona, Spain. Elsevier, Amsterdam 9. Myers CS, Rabiner LR (1981) A comparative study of several dynamic timewarping algorithms for connected word recognition. The Bell Syst Tech J 60:1389–1409 10. Wolpert DH (1992) Stacked generalization. Neural Networks 5:241–259, http: //citeseer.csail.mit.edu/wolpert92stacked.html 11. Feng C (1992) Inducting temporal fault diagnostic rules from a qualitative model. In: Muggleton S (ed) Inductive Logic Programming. Academic Press, London 12. Venkatasubramanian V, Chan K (1989) A neural network methodology for process fault diagnosis. The Amer Inst of Chemical Engineers J 35:1993–2002 13. Sleeman D, Mitchell F, Milne R (1996) Applying KDD techniques to produce diagnostic rules for dynamic systems. Tech Report AUCS/TR9604, University of Aberdeen, Scotland 14. Su´ arez AJ, Abad PJ, Ortega JA, Gasca RM (2002) Diagnosis progresiva en el tiempo de sistemas din´ amicos. In: Ortega JA, Parra X, Pulido B (eds) Proc IV Jornadas de ARCA, Sistemas Cualitativos y Diagnosis, pp 111–120 15. Roverso D (2003) Fault diagnosis with the Aladdin transient classifier. In: Willett PK, Kirubarajan T (eds) Proc of the SPIE Conf Syst Diagnosis and Prognosis: Security and Condition Monitoring Issues III, Orlando, FL, USA. SPIE Press, Bellingham, pp 162–172 16. Alonso C, Rodr´ıguez JJ, Pulido B (2004) Enhancing consistency based diagnosis with machine learning techniques. In: Conejo R, Urretavizcaya M, P´erez-dela-Cruz J-L (eds) Proc the 10th Conf Spanish Assoc Artif Intell, San Sebastian, Spain. Springer, Berlin/Heidelberg, pp 312–321 17. Witten IH, Frank E (2000) Data mining: practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco 18. Nadeau C, Bengio Y (2003) Inference for the generalization error. Mach Learn 52:239–281 19. Kudo M, Toyama J, Shimbo M (1999) Multidimensional curve classification using passing-through regions. Pattern Recogn Lett 20:1103–1111 20. Hettich S, Bay SD (1999) The UCI KDD Archive, University of California, Irvine, http://kdd.ics.uci.edu 21. Gunn SR (1998) Support vector machines for classification and regression. Tech Report, University of Southampton, UK 22. Lin H-T, Li L (2005) Novel distance-based SVM kernels for infinite ensemble learning. In: Proc the 12th Int Conf Neural Inf Proc, Taipei, Taiwan, pp 761– 766
Gradient Boosting GARCH and Neural Networks for Time Series Prediction Jos´e M. Mat´ıas1 , Manuel Febrero2 , Wenceslao Gonz´alez-Manteiga3, and Juan C. Reboredo4 1 2 3 4
Dpt. of Statistics, Dpt. of Statistics, Dpt. of Statistics, Dpt. of Economic
[email protected]
University of Vigo,
[email protected] University of Santiago de Compostela,
[email protected] University of Santiago de Compostela,
[email protected] Analysis, University of Santiago de Compostela,
Summary. This work develops and evaluates new algorithms based on neural networks and boosting techniques, designed to model and predict heteroskedastic time series. The main novel elements of these new algorithms are as follows: a) in regard to neural networks, the simultaneous estimation of conditional mean and volatility through the likelihood maximization; b) in regard to boosting, its simultaneous application to trend and volatility components of the likelihood, and the use of likelihood-based models (e.g. GARCH) as the base hypothesis rather than gradient fitting techniques using least squares. The behavior of the proposed algorithms is evaluated over simulated data, resulting in frequent and significant improvements in relation to the ARMA-GARCH models. Key words: gradient boosting, GARCH, ARMA, neural networks, time series, heteroskedasticity
1 Introduction A fundamental issue in financial economics is predicting the price of financial assets. Given the random nature of these prices, they are generally characterized as a stochastic process {Yt , t ∈ T} (where, henceforth, T = N or Z, natural or integer numbers, respectively), for which we wish to know the conditional density function pYt |At−1 , where At−1 is the σ-algebra generated by (Yt−1 , Yt−2 , ...), representing all the information available to the market prior to instant t. A large range of both parametric and non-parametric statistical techniques has been developed to predict prices, (e.g. [3, 5, 8, 9, 12, 17]), including, in the last decade, neural networks (e.g. [10, 11, 13]). The primal aim of these techniques is prediction. Traditional linear techniques (such as the ARMA models) that assume a constant variance for the J.M. Mat´ıas et al.: Gradient Boosting GARCH and Neural Networks for Time Series Prediction, Studies in Computational Intelligence (SCI) 126, 153–164 (2008) c Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com
154
J.M. Mat´ıas et al.
process {Yt } have begun to lose ground in terms of financial time series prediction primarily because they usually show heteroskedasticity. From the financial point of view, investors require rewards that are proportionate to risk, so any variations in risk levels may alter the behavior of individual investors. From the statistical point of view, on the other hand, changes in variance over time have implications for the validity and efficiency of estimators that assume homoscedasticity. With a view to resolving these problems, recent prediction techniques take into account different hypotheses in relation to the conditional variance (volatility) VarYt |At−1 (Yt ), by endeavoring to model its temporal dynamics. Of the different approaches to tackling this problem such as direct modeling of the dynamics of volatility, ([2, 4]) or modeling of the conditional distribution ([15, 16]), etc., our article focuses on the former.
2 The GARCH Model The Generalized Autoregressive Conditional Heteroskedasticity Process or GARCH(r, s) models [2] combine a stationary series model (usually Autoregressive Moving Average (ARMA)) with a stationary linear squared innovations model in the form: Yt = µt + εt = f (Yt−1 , Yt−2 , ..., εt−1 , εt−2 , ...) + εt , ε t = h t θt with: h t = a0 +
r
aj ht−j +
j=1
s
bj ε2t−j
(1a) (1b)
(2)
j=1
where: {εt , t ∈ T} is a process with Eεt |At−1 (εt ) = 0 and Var(εt ) = σε2 , {θt , t ∈ T} is a process with independent variables of mean zero and Var(θt ) = 1 for all t ∈ T and they satisfy suitable restrictions on the process {θt } and the parameters {aj }rj=0 , {bj }sj=1 so that Varεt |At−1 (εt ) = Eεt |At−1 (ε2t ) > 0 and the stationarity hypothesis for {ε2t } are satisfied. The GARCH models are estimated using the maximum likelihood method (with restrictions in the parameters in order to guarantee the positivity of the variance). For example, in the case of a Gaussian conditional distribution Yt |At−1 ∼ N (µt , σt2 ) the minus log-likelihood is: (εt , ht ) = − ln p(εt ; ht ) =
1 ε2 (ln ht + t + ln 2π) . 2 ht
(3)
Other alternatives are to consider distributions with heavier tails for εt , such as the Student-t distribution with ν degrees of freedom.
Gradient Boosting GARCH and Neural Networks
155
3 New Algorithms for the Prediction of Heteroskedastic Time Series The new algorithms developed here achieve a flexible modeling of volatility dynamics. These algorithms are: 1. Algorithms based on neural networks aimed at simultaneously modeling conditional mean and conditional variance in time series through the likelihood maximization. 2. Boosting algorithms [14] (especially, Gradient Boost [7]) with a loss function based on likelihood and which can simultaneously model both conditional mean and volatility. GARCH models can be used as the base hypothesis for these algorithms. 3.1 Plain Neural Networks Algorithms This group of algorithms permits simultaneous estimation of series conditional mean and volatility and is described as follows: 1. M(m, p)-GARCH(r, s). A combination of a feed-forward neural network M(m, p), e.g. multilayer perceptron (MLP) or radial basis functions (RBF), for modeling the series conditional mean with p lags and m neurons, and a GARCH model for modeling the volatility dynamics. The models are estimated jointly through maximum likelihood (Gaussian in this work) using the conjugated gradient algorithm. The algorithm is similar to that described immediately below, with the sole difference that in the latter the GARCH equations are also modeled using neural networks. 2. M(m1 , p)-M(m2 , r, s). These algorithms are similar to those described above, but model the dynamics of volatility using a neural network M (RBF or MLP) with m2 neurons, r lags in ht and s lags in ε2t . Both models (conditional mean and volatility) are estimated jointly through maximum likelihood (Gaussian in this work) using the conjugated gradient algorithm. In other words, if {Yt }nt=1 is the set of observations and conditional mean and volatility are modeled using neural networks (MLP or RBF): µt = µt (θ1 ) = f (Yt−1 , ..., Yt−p ; θ1 ) , ht = ht (θ2 ) = g(ht−1 , ..., ht−r , ε2t−1 , ..., ε2t−s ; θ2 ) , the algorithm iteratively minimizes the empirical risk: n n 1 ˆ n (θ) ¯ = 1 R (εt (θ1 ), ht (θ2 )) = (yt − µt (θ1 ), ht (θ2 )) n t=u+1 n t=u+1
156
J.M. Mat´ıas et al.
where θ¯ = (θ1T , θ2T )T , u = p + max(r, s) and (εt , ht ) = − ln p(εt , ht ) is the minus log-likelihood loss. In the case p = r = s = 1, these models are: ⎧ m 1 ⎪ ⎪ cj ψ(wj Yt−1 + wj0 ) + c0 + εt , ⎨ Yt = j=1
m 2 ⎪ ⎪ cj ψ(wjT ht−1 + wj0 ) + c0 ⎩ ht = j=1
or
⎧ m 1 ⎪ ⎪ cj ψ(|Yt−1 − wj | /wj0 ) + c0 + εt , ⎨ Yt = j=1
m 2 ⎪ ⎪ cj ψ( ht−1 − wj /wj0 ) + c0 , ⎩ ht = j=1
depending on whether MLP or RBF networks are used, and where ht−1 = (ht−1 , ε2t−1 )T . If no significant conditional mean is postulated for the series of interest, the first of the algorithms is reduced to a GARCH model and the second will not include the first component M(m1 , p) (or this will possess a very small number m1 of hidden units, e.g. m1 = 1). 3.2 Boosting Algorithms The boosting algorithms developed here are designed to model the dynamics of volatility but can also simultaneously estimate the series conditional mean if that mean is postulated. These algorithms are generally denoted as Boost[M1 -M2 ], where M1 is the base hypothesis used for the estimation of the first component of the gradient in cases where conditional mean is postulated, and M2 is the base hypothesis used for the estimation of the gradient component resulting from the volatility. Assuming a likelihood-based loss (εt , ht ) = l(yt , (µt , ht )) the components µt and ht are estimated simultaneously via Gradient Boost applied to each of the base hypotheses M1 and M2 . For the sake of notational simplicity, we assume p = r = s = 1 (the general case is analogous) and that the estimations of the algorithm in the iteration (j − 1)-th, j = 2, 3... are for t = 3, ..., n: ⎧ j−1 ⎪ ⎪ ⎪ µ ˆ = fk (yt−1 ), ⎪ t,j−1 ⎨ k=1
j−1 ⎪ ⎪ ⎪ ˆ ˆ t−1,j−1 , εˆ2 ⎪ h = gk (h ⎩ t,j−1 t−1,j−1 ) k=1
where the components fk and gk are the estimation of the functional gradient in the k-th iteration (see below) and:
Gradient Boosting GARCH and Neural Networks
157
⎧ j−1 ⎪ ⎪ 2 2 ⎪ ˆt−1,j−1 ) = (yt−1 − fk (yt−2 ))2 , ⎪ ⎨ εˆt−1,j−1 = (yt−1 − µ j−1 ⎪ ⎪ ⎪ ˆ t−2,j−1 , εˆ2 ˆ ⎪ gk (h ⎩ ht−1,j−1 = t−2,j−1 ).
k=1
k=1
ˆ t−1,j−1 , εˆ2 Taking as the vector of covariables xt,j−1 = (yt−1 , h t−1,j−1 ), and if the base hypotheses M1 and M2 are of the form φ = φ(·; θ1 ), φ = φ(·; θ2 ) respectively, with φ, φ linear combinations of basic parameterized functions (as is the case of the RBF and MLP neural networks), the projection of the negative gradient to the hypotheses space in the j-th iteration is obtained by resolving: θ1,j = arg min θ1
θ2,j = arg min θ2
n t=3 n
(1)
(4a)
(2)
(4b)
[−ξj (xt,j−1 ) − φ(xt,j−1 ; θ1 )]2 , [−ξj (xt,j−1 ) − φ(xt,j−1 ; θ2 )]2 .
t=3
Thus, the new components fj (yt−1 ), gj (yt−1 , ε2t−1 ) for conditional mean and volatility are obtained by performing a line search in the direction of the projected negative gradient. In the end, the final hypothesis produced by the algorithm in the B-th iteration is a vectorial function: ⎛ ⎞T B B T ˆt , ˆ ht =⎝ fj (yt−1 ), gj (ht−1 , ε2t−1 )⎠ . fB,t = µ j=1
j=1
For the above general schema we have used the RBF and MLP neural networks as the base hypotheses; nonetheless, in principle, any kind of model that can be estimated using least squares through (4) can be used. However, in this work we have also used ARMA-GARCH as base hypothesis with a view to evaluating the benefits of modeling any possible heteroskedasticity that the gradient series may inherit from the original series, thereby obtaining a better fit for this gradient in each boosting iteration. These models have the peculiarity that they are not fitted using least squares, as in (4), but through the maximization of likelihood. They therefore use a discrepancy criterion with the negative gradient different from that of the least squares. Finally, when a significant conditional mean is not postulated for the series, the Boost[M1 -M2 ] model is reduced to the form BoostM, a particular case5 in the development above and the same kind of general algorithm used in [1], with the only difference being the base hypotheses used and the criterion of discrepancy for fitting the negative gradient. 5
In the BoostM algorithm, the theoretical search space is R∞ (variance functions), whereas in the Boost[M1 -M2 ] algorithm this is R2×∞ (conditional mean and variance functions). With n data, the search space is R2×n instead of Rn .
158
J.M. Mat´ıas et al.
4 Evaluation of the New Algorithms In this section we present the results of the evaluation of the proposed algorithms with different sets of artificial data. Unless otherwise indicated, all the models based on neural networks use a single lag, with the corresponding parameter therefore omitted from the symbology that identifies the model. For example, a model MLP(m)-GARCH(1,1) indicates a model MLP(m, 1)-GARCH(1,1), in other words, an MLP model with m hidden units and just a single lag for the mean, and a GARCH(1,1) model for volatility. In our evaluation of the different algorithms presented as reference the results are obtained by ARMA-GARCH models estimated via maximum Gaussian likelihood, except in certain cases (to be indicated), in which Student-t likelihood is used with several degrees of freedom. This will permit a comparison between the highly non-linear models described above with non-Gaussian models. The criteria used to evaluate the performance of the algorithms were as follows, with N denoting the sample size (training or test sample, as appropriate): 1. Average squared error for the series. If fˆ(yt−1 , yt−2 , ...) is the prediction at t: N 1 ASRy = [yt − fˆ(yt−1 , yt−2 , ...)]2 . (5) N t=1 2. Average squared error for the series of variances when these have been simulated and are known. If σt2 are the true variances and gˆ(ε2t−1 , ε2t−2 , ..., ht−1 , ht−2 , ...) the predictions: ASRh =
N 1 2 [σ − gˆ(ε2t−1 , ε2t−2 , ..., ht−1 , ht−2 , ...)]2 . N t=1 t
(6)
3. The minus log-likelihood (Gaussian unless otherwise indicated): N ˆ t) (yt ; µ ˆt , h −ln Lk =
(7)
t=1
with
⎧ (yt − µt )2 ⎪ ⎪ + ln 2π), ⎨ (yt , µt , ht ) = 12 (ln ht + ht µ ˆt = fˆ(yt−1 , yt−2 , ....), ⎪ ⎪ ⎩ˆ ht = gˆ(ε2t−1 , ε2t−2 , ..., ht−1 , ht−2 , . . .).
The model selection criteria used in the different algorithms were as follows:
Gradient Boosting GARCH and Neural Networks
159
1. Both the MLP and RBF networks were trained through maximum likelihood via conjugated gradient with line search and using no more than 20 iterations as stopping rule. The initial values of the parameters were obtained, in the case of the MLP networks, by the Bayesian algorithm (least squares) [6], and in the case of the RBF, by using the k-means clustering algorithm to select widths. Both methods permit automatic model selection. The maximum number of basic functions were determined via a preliminary exploration of a sample of each set of data used. 2. The Gradient Boost algorithms were stopped after a few iterations when their performance over the test sample started to worsen without investigating whether improvement could occur subsequently. A task for the future, therefore, will be to develop stop mechanisms. The sets of artificial data were generated using the following models, all with Gaussian innovations: 1. Heteroskedastic model with sinusoidal mean: Yt = Yt−1 sin(Yt−1 ) + εt with two alternative models for volatility: a) GARCH(1, 1): ht = 0.1 + 0.85ht−1 + 0.1ε2t−1
(8)
(9)
b) Logarithmic model: ln ht = 0.6 ln ht−1 + 0.3 ln(0.5 + 0.2ε2t−1 ) 2. Heteroskedastic zero-mean model Yt = ht εt ; ht = f(yt−1 , ht−1 )
(10)
(11)
with two alternative models for volatility: a) Additive volatility model used in [1]: f(y, h) = (0.1+0.2 |y|+0.9y 2)∗(0.8 exp(−1.5 |y| |h|))+(0.4y 2 +0.5h)3/4 (12) b) Threshold model: ⎧ if y < 0 ⎨ 0.2 + 0.3y 2 (13) f(y, h) = 0.3 + 0.2y 2 + 0.6h if y ≥ 0 and h < 0.5 ⎩ 0.7 + 0.8h if y ≥ 0 and h ≥ 0.5 Next we will describe the results of the best algorithms for each set of data, offering for reference purposes those obtained by the GARCH(1,1) model.
160
J.M. Mat´ıas et al.
4.1 Heteroskedastic Series with Sinusoidal Mean For the models defined by (8), (9) and (8), (10) 50 samples of 3000 observations each were generated, of which the n = 1000 first observations were used to train the algorithms and the ntest = 2000 remaining observations constituted the test sample for the evaluation. Table 1 shows the mean results produced by the various algorithms with the series with sinusoidal mean and the GARCH model for the variance. The table reflects, for each algorithm: the values ASRh (6), ASRy (5) and the minus log-likelihood (7) in the test sample and also the estimations of the parameters in the GARCH model in (9). Table 1. Results of the different algorithms for the series defined by (8) and (9): Yt sin(Yt )+GARCH(1,1) Model
− ln Lktest ASRhtest ASRytest a ˆ0
a ˆ1
ˆb1
ARMA(0,0)-GARCH(1,1) ARMA(1,0)-GARCH(1,1) GARCH(1,1) after RBF(20) RBF(20)-GARCH(1,1) GARCH(1,1) after MLP(13) MLP(13)-GARCH(1,1)
3901.94 3898.38 3536.65 3535.59 3500.11 3498.90
0.3156 0.2335 0.8358 0.8392 0.8817 0.8627
0.3645 0.4101 0.1018 0.0998 0.0717 0.0808
12.9037 16.6013 1.5550 1.5954 1.1005 1.0681
3.7077 3.8140 2.2807 2.2778 2.1493 2.1316
1.0418 1.1826 0.1241 0.1240 0.0857 0.0855
Table 2 shows the mean results obtained with the series with sinusoidal mean and the logarithmic model for the variances. Table 2. Results of the different algorithms for the series defined by (8) and (10): Yt sin(Yt )+Logarithmic. The results of the boosting algorithm were obtained after 10 iterations Model
− ln Lktest ASRhtest ASRytest
ARMA(0,0)-GARCH(1,1) ARMA(1,0)-GARCH(1,1) BoostARMA(1,0)GARCH(1,1) -ARMA(1,0)GARCH(1,1) GARCH(1,1) after RBF(20) RBF(20)-GARCH(1,1) GARCH(1,1) after MLP(13) MLP(13)-GARCH(1,1) MLP(13)-MLP(1)
2964.56 2817.14
0.3702 0.5792
1.1848 1.0951
2710.67 2529.39 2528.72 2510.53 2508.51 2506.95
0.2067 0.0076 0.0066 0.0060 0.0057 0.0076
0.9956 0.7411 0.7405 0.7228 0.6872 0.6834
Gradient Boosting GARCH and Neural Networks
161
These tables reveal the satisfactory results produced by the algorithms based on neural networks that represent conditional mean via a non-linear model6 . Specifically: 1. In regard to the Table 1, of particular note are the M(m)-GARCH models, which propose variance models very similar to the true one ((9)). For these models, the simultaneous estimation of conditional mean and variance produces better results than the sequential procedure typically used in practice (estimating the GARCH model after subtracting the mean estimated through M). As can be observed, the estimation of the parameters of the true GARCH model is totally erroneous in ARMA(0,0)-GARCH(1,1) and ARMA(1,0)GARCH(1,1) models. The high value of the constant (ˆ a0 in the table) of the GARCH model proposed by these models indicates that these have interpreted the variability due to mean as if it were originated by the variance. 2. In regard to Table 2 (sinusoidal process with logarithmic variance), similar results are obtained. Here the MLP(13)-MLP(1) algorithm produces a slight improvement over the MLP(13)-GARCH(1,1). This algorithm does not include variance ht in the model (it is a non-linear ARCH model that only includes the square ε2t of the innovations): its inclusion slightly worsened the results (− ln Lktest = 2510) perhaps due to the instability of the intermediate estimations for ht . 4.2 Heteroskedastic Series with Zero Mean Table 3 shows the mean results for various algorithms with 50 realizations of the additive (11), (12) and threshold (11), (13) volatility models. 1. In regard to the first series, almost all the non-linear models evaluated improve the results of the GARCH(1,1) model in terms of likelihood, although the results in terms of quadratic loss are more heterogeneous. In regard to the minus log-likelihood, of note is the behavior of the BoostMLP(3) and BoostARX(2,1) models in the table. The BoostARX(2,1) model uses ARX(2,1) as the base hypothesis to fit the gradient7 in which the exogenous covariable is the square of the innovations ε2t and in which the initial hypothesis for the algorithm is the variance in the 6
7
In order to correctly evaluate the results obtained in terms of likelihood, it is necessary to bear in mind the logarithmic scale used but also the difficulty implied in producing improvements in this criterion, given that likelihood measures the quality of the prediction of future returns and future volatilities, and is much more volatile than squared error, which measures only the quality of the prediction of future volatilities [1]. Note that the lag order of the base hypothesis of boosting cannot be compared to that of the ARMA-GARCH applied to the original series.
162
J.M. Mat´ıas et al.
Table 3. Results for the best algorithms with the additive volatility series (11), (12) and threshold volatility series (11), (13). The second column indicates, for the boosting algorithms, whether these initiate (yes) with a GARCH (1,1) model or, on the other hand (no), they use the variance of the sample as the initial hypothesis. The third column indicates the number of iterations (B) for these algorithms Volatility Additive Model Model
IniG B − ln Lktest ASRhtest
ARMA(0,0)-GARCH(1,1) MLP(1)-MLP(3) BoostMLP(3) BoostARX(2,1) BoostARX(2,1) BoostGARCH(1,1)-ARMA(1,0)GARCH(1,1)
– – no no yes yes
– – 3 6 43 2
1394.96 1393.43 1392.76 1392.01 1393.54 1394.35
0.0105 0.0288 0.0168 0.0137 0.0096 0.0094
Volatility Threshold Model Model
IniG B − ln Lktest ASRhtest
ARMA(0,0)-GARCH(1,1) – – 1229.52 BoostMLP(1)-MLP(3) yes 2 1223.91 BoostGARCH(1,1)-ARMA(1,0)GARCH(1,1) yes 2 1223.54
0.3697 0.3697 0.3683
data (IniG=no). The utilization of a GARCH (1,1) model as the initial hypothesis (IniG=yes) produces less satisfactory results for this algorithm in terms of likelihood, but more satisfactory results in terms of quadratic mean. The relative improvements produced by these algorithms in relation to the GARCH(1,1) reference model are similar to those obtained by [1] using regression trees and projection pursuit as the base hypothesis8 . 2. In regard to the threshold model, the BoostMLP(1)-MLP(3) and BoostGARCH(1,1)-ARMA(1,0)-GARCH(1,1) algorithms noticeably improve the results of the GARCH(1,1) in terms of volatility. Despite the fact that the MLP(1)-MLP(2) model produces very good results in many of the samples, its overall performance is affected by the fact that it is highly unstable in some of them. In this series all the boosting algorithms improved when a GARCH(1,1) model was used as the initial hypothesis.
8
A precise comparison between these results and those obtained by [1] is not possible, given that important differences may exist in the different sets of data, e.g. due to the initial value of the variance used on generating each realization of the process. The relative improvements with respect to the results of the GARCH(1,1) model for each case are similar.
Gradient Boosting GARCH and Neural Networks
163
5 Conclusion The evaluation of the proposed algorithms with a single lag in various sets of artificial data produces the following conclusions: 1. For all the sets of data analyzed, there was always at least one algorithm that significantly improved on the results of a GARCH model. On the other hand, no algorithm was revealed as the best in all the situations studied. 2. The algorithms that combine neural networks for conditional mean with GARCH models or neural networks for variance appear to represent a considerable improvement over the ARMA-GARCH models when a nonlinear trend function is present. 3. The Gradient Boost algorithms tend to produce good results particularly with highly nonlinear volatility. For this family of algorithms, the ARMAGARCH models were the most stable, surest (and fastest) base hypotheses in our tests. As models estimated via maximum likelihood, the ARMAGARCH models depart from the usual Gradient Boost philosophy of fitting the empirical negative gradient via least squares, although the good results obtained in our tests would support their use. It would appear that the empirical gradient is not merely a simple n-dimensional vector but a time series in itself, with possible heteroskedasticity inherited from the original series. Future lines of research are the development of stopping rules for the boosting algorithm, the study of its behavior with different base hypotheses and the systematic evaluation of the algorithms for series of real data.
Acknowledgments The research of W. Gonz´alez-Manteiga, M. Febrero and J. M. Mat´ıas was supported by the Spanish Ministry of Education and Science, Grant No. MTM2005-00820.
References 1. Audrino F, B¨ uhlmann P (2002) Volatility estimation with functional gradient descent for very high dimensional financial time series. J Comp Finance 6:65–89 2. Bollerslev T (1986) Generalized autoregressive conditional heteroskedasticity. J Econometrics 31:27–37 3. Box GEP, Jenkins GM, Reinsel GC (1994) Time series analysis: forecasting and control. Prentice-Hall, Upper Saddle River 4. Engle RF (1982) Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50:987–1007
164
J.M. Mat´ıas et al.
5. Fan J, Yao Q (2003) Nonlinear time series: nonparametric and parametric methods. Springer, Berlin/Heidelberg 6. Foresee FD, Hagan T (1997) Gauss-Newton approximation to Bayesian regularization. In: Proc IEEE Int Joint Conf Neural Networks, Houston, TX, USA. IEEE, Piscataway, pp 1930–1935 7. Friedman J (2001) Greedy function approximation: a gradient boosting machine. The Annals of Stat 39:1189-1232 8. Granger CWJ, Newbold P (1986) Forecasting economic time series. Academic Press, London 9. Granger CWJ, Ter¨ asvirta T (1993) Modelling nonlinear economic relationships. Oxford University Press, Oxford 10. Medeiros MC, Ter¨ asvirta T, Rech G (2002) Building neural networks models for time series: a statistical approach. Tech Report 461, Stockholm School of Economics, Stockholm 11. Miranda FG, Burgess N (1997) Modelling market volatilities: the neural network perspective. The European J Finance 3:137–157 12. Priestley MB (1981) Spectral analysis and time series. Academic Press, London 13. Refenes APN, Burgess AN, Bentz Y (1997) Neural networks in financial engineering: a study in methodology. IEEE Trans Neural Networks 8:1222–1267 14. Schapire RE (1990) The strength of weak learnability. Mach Learn 5:197–227 15. Schittenkopf C, Dorffner G, Dockner EJ (2000) Forecasting time-dependent conditional densities: a seminonparametric neural network approach. J Forecasting 19:355–374 16. Tino P, Schittenkopf C, Dorffner G (2001) Financial volatility trading using recurrent neural networks. IEEE Trans Neural Networks 12:865–874 17. Tsay RS (2001) Analysis of financial time series. John Wiley and Sons, Hoboken 18. Weigend AS, Gershenfeld NA (1993) Time series prediction: forecasting the future and understanding the past. Perseus Books, New York
Cascading with VDM and Binary Decision Trees for Nominal Data Jes´ us Maudes, Juan J. Rodr´ıguez, and C´esar Garc´ıa-Osorio Escuela Polit´ecnica Superior – Lenguajes y Sistemas Inform´ aticos Universidad de Burgos, Francisco de Vitoria s/n, 09006, Burgos, Spain, jmaudes, jjrodriguez,
[email protected] Summary. In pattern recognition, many learning methods need numbers as inputs. This paper analyzes two-level classifier ensembles to improve numeric learning methods on nominal data. A different classifier was used at each level. The classifier at the base level transforms the nominal inputs into continuous probabilities that the classifier at the meta level uses as inputs. An experimental validation is provided over 27 nominal datasets for enhancing a method that requires numerical inputs (e.g. Support Vector Machine, SVM). Cascading, Stacking and Grading are used as two-level ensemble implementations. Experiments combine these methods with another symbolic-to-numerical transformation – Value Difference Metric, VDM. The results suggest that Cascading with Binary Decision Trees at base level and SVM with VDM at meta level produces a better accuracy than other possible two-level configurations. Key words: classifier ensemble, nominal data, cascade generalization, support vector machine, decision tree
1 Introduction Data can be classified into numerical and qualitative. Qualitative data can only take values from a finite, predefined set. If no order is assumed between such values, the data is referred to as nominal or categorical. In pattern recognition many methods require numerical inputs, so there is a mismatch when they are applied to nominal data. One approach for adapting nominal data to numerical methods is to transform each nominal feature into n binary features (NBF) [1]), where n is the number of possible values the nominal feature can take. Following this method, a nominal value is converted into a group of binary attributes, where all but one values are zeroes. The only non-zero attribute serves to uniquely characterize the associated nominal value so as to distinguish binary features related to different nominal values. An alternative approach to binary conversion is transforming symbolic values into numeric continuous values. Value Difference Metric (VDM) that is J. Maudes et al.: Cascading with VDM and Binary Decision Trees for Nominal Data, Studies in Computational Intelligence (SCI) 126, 165–178 (2008) c Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com
one such transformation for measuring distances between symbolic values, is presented in [2]. Duch [3] used VDM to transform symbolic features into numeric ones before classification. In [3] VDM is tested with Feature Space Mapping (FSM) networks and k-nearest neighbor (k-NN) classifiers, and better or similar results were obtained with the VDM conversion than without it. VDM replaces each nominal value x of an attribute A with a probability vector v = (v1, ..., vc), where c is the number of classes and vi = P(class = ci | A = x). Thus, VDM can significantly increase the input dimensionality in multiclass problems. Nominal-to-binary conversion also increases the input dimensionality, but in that case the increase is due to a different reason.

Cascading [4] is an n-level ensemble, where n is typically two. Two-level cascades consist of a base level classifier and a meta level classifier. The classifier at the base level extends the original data by inserting new attributes. These new attributes are derived from the class probability distribution provided by the base classifier. The meta classifier then takes the extended data as input and generates the final classification.

In [5], Cascading is used on nominal datasets. In that work, an SVM with a linear kernel is enhanced using Cascading with an SVM at the meta level and a Decision Tree at the base level. Decision Trees can directly deal with nominal data without any prior transformation, so this Cascading implementation extends the nominal data with continuous probabilities as new features. These continuous values can then be handled by an SVM. Experimental validation in [5] shows that this Cascading configuration performs better than (1) VDM applied to SVM, and (2) other combinations of one SVM and one Decision Tree (e.g. Stacking and Grading of one SVM and one Decision Tree). Cascading adds c new attributes to the original attributes (c is the number of classes); it only adds new continuous dimensions while leaving the nominal attributes unchanged, so the nominal-to-numerical transformation is still required if the classifier at the meta level needs numerical inputs.

According to [5], in nominal data problems for linear classifiers such as the linear SVM, it can be better to use extra pre-built features from a base classifier (as Cascading does) than to convert nominal features into continuous ones using VDM, because:

1. If two instances have the same symbolic value for some nominal feature but belong to different classes, VDM will compute the same probability vector for both of them. However, it would be preferable to provide different values to facilitate the classification task, for example by taking into account the values of the remaining attributes of the instance, as the classifier at the base level does. This issue is of special interest wherever linear separability is required.

2. Cascading does not replace the original attributes, thus requiring a method like NBF to perform such a replacement. NBF (or any other nominal-to-numerical transformation) can easily lead to a non-linearly separable
representation of the data (see the example in Table 1). However, adding new input dimensions that are probability estimates (as Cascading does) can help to achieve linear separability.
Table 1. Nominal-to-binary data conversion can lead to non-linearly separable regions. The table shows an example where it is impossible to obtain a hyperplane with coefficients xi and constant k that separates class c1 points from class c2 points

Samples:
  (a1, b1, c1) (a1, b2, c2)
  (a2, b2, c1) (a3, b1, c2)
  (a3, b3, c1) (a2, b3, c2)

Binarized samples:
  (1, 0, 0, 1, 0, 0, c1) (1, 0, 0, 0, 1, 0, c2)
  (0, 1, 0, 0, 1, 0, c1) (0, 0, 1, 1, 0, 0, c2)
  (0, 0, 1, 0, 0, 1, c1) (0, 1, 0, 0, 0, 1, c2)

Class c1 points:            Class c2 points:
  x1 + x4 + k > 0             x1 + x5 + k < 0
  x2 + x5 + k > 0             x3 + x4 + k < 0
  x3 + x6 + k > 0             x2 + x6 + k < 0

Sum of c1 inequalities: x1 + x2 + x3 + x4 + x5 + x6 + 3k > 0
Sum of c2 inequalities: x1 + x2 + x3 + x4 + x5 + x6 + 3k < 0
⇒ incompatibility ⇒ non-linear separability
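To make the two transformations concrete, the following minimal Python sketch (illustrative only, not the WEKA implementation used later in the paper) encodes a single nominal attribute with NBF (one binary feature per possible value) and with VDM (one estimated conditional class probability per class). The toy data reuse the first attribute and the class labels of the samples in Table 1.

```python
from collections import Counter, defaultdict

def nbf_encode(values):
    """NBF: one-hot encode a single nominal attribute."""
    domain = sorted(set(values))
    index = {v: i for i, v in enumerate(domain)}
    return [[1 if index[v] == j else 0 for j in range(len(domain))] for v in values]

def vdm_encode(values, labels):
    """VDM: map each value x to (P(class=c1 | A=x), ..., P(class=cc | A=x))."""
    classes = sorted(set(labels))
    joint = defaultdict(Counter)               # joint[x][c] = #{A = x and class = c}
    for v, y in zip(values, labels):
        joint[v][y] += 1
    return [[joint[v][c] / sum(joint[v].values()) for c in classes] for v in values]

# First attribute and class labels of the six samples in Table 1.
attr = ["a1", "a1", "a2", "a3", "a3", "a2"]
cls  = ["c1", "c2", "c1", "c2", "c1", "c2"]

print(nbf_encode(attr))        # six 3-dimensional one-hot vectors
print(vdm_encode(attr, cls))   # every value gets [0.5, 0.5]: VDM alone cannot
                               # tell the two classes apart on this attribute
```

On this toy attribute the VDM vectors are identical for every value, which is exactly the situation described in point 1 above.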
Experimental results in [5] also show that VDM applied to the input of a Decision Tree is an interesting method. This suggests that VDM, Decision Trees and Cascading can be combined in some way to obtain a more accurate ensemble for nominal data, and this idea is the main motivation behind this paper. We will show experimentally that the Cascading configuration proposed in [5] can be improved if Decision Trees with binary splits (Binary Decision Trees) are used instead of multiple-split Decision Trees, and if VDM is used instead of NBF for transforming the SVM input features.

This paper is an extension of [5]. For the sake of self-containment, some content is repeated here. The paper is structured as follows. Section 2 describes the two-level ensembles that we will use in the experimental validation (Cascading, Stacking and Grading) and how they can be used on nominal data. Section 3 analyzes VDM applied to Decision Trees and Binary Decision Trees. Section 4 derives some equivalences in order to simplify the experimental validation. Section 5 contains the experimental validation. Conclusions are drawn in Sect. 6.
2 Two-Level Ensembles and Nominal Data

A two-level ensemble consists of a meta level and a base level, with a different classifier at each level. The base level classifier output is used as input for the meta level classifier. The classifier requiring numerical inputs will be used as the meta classifier, and a classifier which can work directly with nominal data
will be used as the base classifier. We have considered three two-level schemas (i.e. Cascading, Stacking and Grading).

Cascade Generalization (also known as Cascading) [4] is an architecture for designing classification algorithms. Cascading is commonly used with two levels. Level-1 (base level) is trained with the raw dataset, whereas Level-2 (meta level) uses all the original features from the dataset plus the output of the base level classifier as its inputs. Base level outputs are vectors representing a conditional probability distribution (p1, ..., pc), where pi is the estimated probability that the input example belongs to class i, and c is the number of classes in the dataset. Cascading can be extended to more than two levels. Training one classifier with the output of another classifier can lead to overfitting. However, Cascading is not degraded by overfitting, because the base level and meta level classifiers are different, and because the meta level classifier also uses the original features as inputs.

Sometimes the nominal-to-numerical conversion (e.g. NBF) can result in a data representation where the classes are not linearly separable. Cascading can help to solve this problem, because the input space is augmented with new features that transform the non-linearly separable data representation into another that is likely to avoid the problem. In our approach, these new features are generated by a base level classifier (e.g. a Decision Tree) which can deal with nominal data. The Cascading meta level input consists of the nominal-to-numerical converted attributes (e.g. using NBF or VDM) plus the base level classifier output. While the VDM input dimensionality grows as the number of nominal attributes increases, the Cascading base level output only adds c new dimensions (where c is the number of classes). However, the meta level input dimensionality is also high, because the nominal-to-binary conversion is still required.

Stacked Generalization, also known as Stacking [6], also uses two or more levels. Levels are numbered in a different way in [6], but for the sake of simplicity we keep the base/meta notation for levels. Stacking works with more than one base classifier, and those base classifiers are usually different. Let b be the number of base classifiers and c the number of classes. To train the n × b base classifiers, n disjoint partitions (or folds) of the training data are made. Each base classifier is trained with the data from n − 1 partitions and tested with the data from the remaining partition (cross validation). Thus each instance is used for testing exactly b times (once for each of the b base classifiers). After that, the meta classifier is trained with a new dataset of c × b dimensionality plus the class attribute. For each original instance, these c × b features are the c probability estimates from the b classifiers that tested that instance. Finally, the n × b classifiers are discarded, and a new set of b base classifiers is constructed, this time using the whole training set. Overfitting is therefore avoided, because the base and meta classifiers are different and because, due to cross-validation, the meta classifier training set differs from the base classifiers' training sets.
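The fold-based construction of the Stacking meta-level training set described above can be sketched as follows. This is only a simplified single-base-classifier version (b = 1); it assumes generic classifier objects with fit/predict_proba methods and NumPy arrays, and it is not the WEKA implementation used later in the paper.

```python
import copy
import numpy as np

def stacking_meta_features(base_clf, X, y, n_folds=10):
    """Build the meta-level inputs for Stacking with a single base classifier:
    every instance gets the c class probabilities predicted by a copy of the
    base classifier trained on the folds that do not contain that instance."""
    n = len(y)
    n_classes = len(np.unique(y))
    folds = np.array_split(np.random.permutation(n), n_folds)
    meta_X = np.zeros((n, n_classes))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        clf = copy.deepcopy(base_clf)          # a fresh base classifier per fold
        clf.fit(X[train_idx], y[train_idx])
        meta_X[test_idx] = clf.predict_proba(X[test_idx])
    return meta_X   # the meta classifier is then trained on (meta_X, y);
                    # the original attributes are NOT part of its input
```

Grading, described next, uses the same fold mechanism, but grades the base predictions as correct or incorrect instead of passing the probabilities on.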
Grading [7] is another two-level method that uses n folds like Stacking. In [7] the levels are also numbered in a different way than in Cascading, but again we keep our base/meta notation. Just as in Stacking, the set of n × b base classifiers is constructed beforehand and tested using cross validation. The output of each base learner is a binary attribute that indicates whether its prediction is correct or not (graded prediction). Subsequently, b meta classifiers (all of the same type) are produced, each being trained with the original features plus the graded predictions obtained from one group of n base classifiers. The graded predictions are used as the meta classifier's class attribute. Finally, the n × b base classifiers are discarded, and the new b base classifiers are trained with the whole training set. The final prediction is made by the meta classifiers using a voting rule.

Stacking and Grading can be used for nominal data in the same way we have used Cascading (i.e. with a single Decision Tree at the base level). Note that Stacking and Grading require an internal cross-validation, so they are computationally more expensive than Cascading even when only one base classifier is required. On the other hand, Stacking meta classifiers do not use the nominal-to-binary converted data, so the meta input dimension is not increased and is always equal to the number of classes when only one base classifier is used. However, the Cascading and Grading meta input dimension increases, because their meta classifiers take both the base prediction and the nominal-to-binary converted attributes as input. Usually, Stacking and Grading are used with several base classifiers, but we have tested them with one base classifier in order to compare Cascading with other combinations of one SVM with one Decision Tree.

In this paper we use the notation Two-Level-Ensemble[M=C1( ); B=C2( )](x), which represents a two-level ensemble where classifier C1( ) is used as the meta classifier and classifier C2( ) as the base classifier. The x represents an input sample. So Cascading[M=SVM( ); B=DecisionTree( )](x) is a Cascading ensemble with an SVM at the meta level and a Decision Tree at the base level. Whenever the meta or base classifier cannot directly deal with nominal data, the NBF transformation is assumed. For example, the SVM in Cascading[M=SVM( ); B=DecisionTree( )](x) needs NBF, but the SVM in Stacking[M=SVM( ); B=DecisionTree( )](x) does not.
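To summarize the scheme that matters most in what follows, here is a minimal sketch of two-level Cascading in the same spirit. It again assumes generic classifier objects with fit/predict_proba; the base classifier must accept nominal inputs directly, as a Decision Tree does, and X_converted stands for the NBF- or VDM-converted attributes required by the meta classifier.

```python
import numpy as np

def cascading_fit(base_clf, meta_clf, X_nominal, X_converted, y):
    """Cascading[M=meta_clf; B=base_clf]: the base level sees the raw nominal
    data; the meta level sees the converted attributes extended with the c
    class probabilities produced by the base classifier."""
    base_clf.fit(X_nominal, y)
    base_probs = base_clf.predict_proba(X_nominal)       # c extra continuous features
    meta_clf.fit(np.hstack([X_converted, base_probs]), y)
    return base_clf, meta_clf

def cascading_predict(base_clf, meta_clf, X_nominal, X_converted):
    base_probs = base_clf.predict_proba(X_nominal)
    return meta_clf.predict(np.hstack([X_converted, base_probs]))
```

Unlike Stacking and Grading, no internal cross-validation is needed, which is why Cascading is the cheapest of the three schemes.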
3 Binary Decision Trees vs. VDM on Decision Trees

Let (x, y) be an instance from the training set, where x is a multidimensional input vector and y is the output (class) variable. In this paper, VDM(x) represents another vector in which every component xi of x is mapped into a group of components VDM(xi). If xi is nominal, VDM(xi) is obtained by applying VDM to xi. If xi is numerical, VDM(xi) is equal to xi. So the dimension of VDM(x) is greater than or equal to that of x.
Therefore DecisionTree(VDM(x)) is a decision tree that takes VDM-transformed features as input. In [5] DecisionTree(VDM(x)) is tested experimentally and turns out to be an interesting method because of its accuracy and low computational cost.

Decision Trees applied to nominal data usually split their nodes into more than two branches. Each branch corresponds to a symbolic value of an attribute. However, the VDM feature transformation makes Decision Trees split their nodes into exactly two branches, one representing instances with values bigger than a threshold and the other branch for the rest. Working with non-binary splits increases the number of branches in each split and makes each branch classify fewer samples, so if the dataset cardinality is small, a non-binary split can lead to an overly greedy partition of the input space. Thus it seems better to use binary trees than non-binary ones (i.e. DecisionTree(VDM(x)) should be better than DecisionTree(x)).

However, Decision Trees can be forced to make binary splits with some slight changes in the algorithm, even when working with nominal data and without VDM. We denote Binary Decision Trees as DecisionTreeBin(x). Binary splits in DecisionTreeBin(x) test whether the value of a nominal attribute is equal to or distinct from a certain symbolic value, whereas binary splits in DecisionTree(VDM(x)) test whether the value of an attribute is larger than a threshold or not. So DecisionTreeBin(x) will lead to slightly different results than DecisionTree(VDM(x)). Note that when VDM is applied, both Decision Trees and Binary Decision Trees have only "bigger than" splitting nodes. That is:

    DecisionTree(VDM(x)) ≡ DecisionTreeBin(VDM(x)).    (1)
This equivalence will considerably reduce the number of possible combinations containing Decision Trees in two-level ensembles.
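The two kinds of binary test discussed above can be written as simple predicates. The attribute "colour" and the 0.4 threshold below are purely hypothetical, used only to contrast the split forms.

```python
# DecisionTreeBin(x): binary split on a raw nominal attribute,
#   "does the instance have colour == red, or not?"
def split_nominal(instance):
    return instance["colour"] == "red"

# DecisionTree(VDM(x)): binary split on a VDM-transformed feature,
#   "is P(class = c1 | colour) greater than a threshold, or not?"
def split_vdm(instance):
    return instance["vdm_colour_c1"] > 0.4
```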
4 Equivalent Ensembles

VDM can be implemented as a filter that keeps numerical attributes and transforms nominal attributes into a set of probabilities. We use C(VDM( )) to denote a classifier C that applies the VDM filter to the data before classifying them. Thereby, C(VDM( )) can be used to indicate that VDM is applied to one level of a two-level ensemble. For example, Cascading[M=SVM(VDM( )); B=DecisionTree( )](x) applies VDM only to the SVM inputs.

Taking (1) into account, we can derive many equivalent two-level ensembles. The first and second rows of Table 2 show some examples that can be derived from this equivalence. The third row shows that VDM calculated beforehand for Cascading is the same as applying the VDM filter to each level. This is only true for Cascading: in Stacking and Grading, the VDM estimates
computed on the whole data set are not the same as those computed within each fold. So VDM calculated beforehand for Stacking and Grading is not the same as applying the VDM filter at each level.

Table 2. Examples of equivalent two-level ensembles with VDM. In each row, the method in the second column is equivalent to the method in the third column

1. Stacking[M=SVM( ); B=DecisionTree(VDM( ))](x)  ≡  Stacking[M=SVM( ); B=DecisionTreeBin(VDM( ))](x)
2. Cascading[M=DecisionTree( ); B=DecisionTreeBin( )](VDM(x))  ≡  Cascading[M=DecisionTreeBin( ); B=DecisionTree( )](VDM(x))
3. Cascading[M=SVM(VDM( )); B=DecisionTree(VDM( ))](x)  ≡  Cascading[M=SVM( ); B=DecisionTree( )](VDM(x))
4. Stacking[M=SVM( ); B=DecisionTree( )](x)  ≡  Stacking[M=SVM(VDM( )); B=DecisionTree( )](x)
5. Stacking[M=DecisionTree( ); B=SVM( )](x)  ≡  Stacking[M=DecisionTreeBin( ); B=SVM( )](x)
6. Stacking[M=DecisionTree( ); B=SVM( )](x)  ≡  Stacking[M=DecisionTreeBin(VDM( )); B=SVM( )](x)
Because the meta level in Stacking only uses continuous probabilities as inputs, some other equivalences can be derived:

1. It is not useful to apply VDM to the meta level in Stacking (see the example in the fourth row of Table 2).
2. There is no difference between using Binary Decision Trees and non-binary Decision Trees at the Stacking meta level (see the example in the fifth row of Table 2).

These two rules can be combined, resulting in transitive equivalences; see, for example, the sixth row of Table 2. All these simplification rules will be taken into account to reduce the number of methods to test in the experimental validation.
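As an illustration of how such equivalences can be used to prune the experimental grid, the following sketch canonicalizes a single-base-classifier Stacking configuration using only the two rules above. The tuple encoding and the names "SMO", "J48" and "J48bin" are our own shorthand, not code or notation from the paper.

```python
def canonical_stacking(meta, meta_vdm, base, base_vdm):
    """Canonical form of a single-base Stacking configuration: the Stacking
    meta level sees nothing but base-level probabilities, so VDM at the meta
    level (rule 1) and binary vs. non-binary meta trees (rule 2) are irrelevant."""
    meta_vdm = False                 # rule 1
    if meta == "J48bin":
        meta = "J48"                 # rule 2
    return meta, meta_vdm, base, base_vdm

# Rows 4-6 of Table 2 collapse to the same canonical configurations:
assert canonical_stacking("SMO", False, "J48", False) == \
       canonical_stacking("SMO", True,  "J48", False)        # row 4
assert canonical_stacking("J48", False, "SMO", False) == \
       canonical_stacking("J48bin", False, "SMO", False)      # row 5
assert canonical_stacking("J48", False, "SMO", False) == \
       canonical_stacking("J48bin", True, "SMO", False)       # row 6
```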
5 Experimental Validation

We have implemented the Cascade Generalization and VDM algorithms in Java within WEKA [8]. We have tested the following methods, using 27 datasets:

1. A Decision Tree (binary or not). WEKA's J.48 is an implementation of Quinlan's C4.5 Release 8 Decision Tree [13]. We denote the binary-split version as J.48bin. Both implementations (J.48 and J.48bin) were also used in the two-level ensembles that require a Decision Tree as their base or meta method.
2. SMO [9], an implementation of SVM provided by WEKA. A linear kernel was used. This implementation was also used in all two-level ensembles that require an SVM as their base or meta classifier.
3. J.48 (binary or not) with VDM, and SMO with VDM. In both cases, the nominal features were replaced by the VDM output.
4. Cascading with a Decision Tree at the base level and an SVM at the meta level, and vice versa. VDM was also applied to both levels, to neither of them, or to only one of them.
5. The WEKA Stacking implementation, with ten folds, using J.48 (binary or not) at the base level and SMO at the meta level, and vice versa. VDM was also applied to both levels, to neither of them, or to only one of them.
6. The WEKA Grading implementation, with ten folds, using J.48 (binary or not) at the base level and SMO at the meta level, and vice versa. VDM was also applied to both levels, to neither of them, or to only one of them.

The equivalences derived in Sect. 4 were applied, which gave rise to 57 methods. A 10×10 fold cross validation was used. The corrected resampled t-test statistic from [10] was used (significance level 5%) for comparing the methods; the 100 (10×10) results for each method and dataset are the input for the test.

Table 3 shows the datasets. Most of them are from the UCI repository [11], and the rest are from Statlib. All the selected datasets have no numerical or ordinal attributes (some attributes are in fact ordinal, but for convenience we have treated them as nominal, just as they appear on the Weka web site, retrieved 3 March 2007 from http://www.cs.waikato.ac.nz/ml/weka/). The only modifications made to the datasets were: (i) ID attributes were removed (i.e. in the Molecular biology promoters and Splice datasets); (ii) in Monks-1 and Monks-2, only the training set is considered, because the test set is a subset of the training set; (iii) in Monks-3 and Spect, the training and test sets were merged into a single set.

Table 3. Datasets used in the experimental validation. #E: number of instances, #A: number of attributes including class, #C: number of classes. (U) from UCI, (S) from Statlib

Dataset              #E     #A  #C        Dataset              #E     #A  #C
Audiology (U)        226    70  24        Postop. patient (U)  90     9   3
Boxing1 (S)          120    4   2         Primary tumor (U)    339    18  21
Boxing2 (S)          132    4   2         Solar flare 1 C (U)  323    11  3
Breast cancer (U)    286    10  2         Solar flare 1 M (U)  323    11  4
Car (U)              1728   7   4         Solar flare 1 X (U)  323    11  2
Dmft (S)             797    5   6         Solar flare 2 C (U)  1066   11  8
Fraud (S)            42     12  2         Solar flare 2 M (U)  1066   11  6
Kr-vs-kp (U)         3196   37  2         Solar flare 2 X (U)  1066   11  3
Mol Biol Prmtrs (U)  106    58  2         Soybean (U)          683    36  19
Monks-1 (U)          432    7   2         Spect (U)            267    23  2
Monks-2 (U)          432    7   2         Splice (U)           3190   61  3
Monks-3 (U)          438    7   2         Tic-tac-toe (U)      958    10  2
Mushroom (U)         8124   23  2         Vote (U)             435    17  2
Nursery (U)          12960  9   5

The first two columns of Tables 4 and 5 rank the methods according to the abovementioned statistical test. For each dataset and each pair of methods, the test was run. The test can conclude that there is no statistically significant difference between two methods or that one method is better than the other. Hence, each method has a number of wins and losses, and the difference between these two counts is used for ranking the methods. Equivalent methods were omitted from Tables 4 and 5. In both tables, 'W-L' stands for the difference between the number of significant wins and losses of a method, while 'W-L Rank' is a rank assigned from this difference, so that a better performing method is ranked higher. 'Casc', 'Stack', and 'Grad' mean Cascading, Stacking, and Grading, respectively.

From the ranking, one can draw the following conclusions:

1. J.48bin(x) seems to perform better than J.48(VDM(x)), and both of them perform better than J.48(x).
2. SMO(x) is not improved by VDM (i.e. SMO(VDM(x)) is not better than SMO(x)).
3. The best ranked methods are Cascading configurations using SMO at the meta level and a Decision Tree at the base level. The two best methods are Cascading[M=SMO(VDM( )); B=J.48bin( )](x) and Cascading[M=SMO( ); B=J.48bin( )](x). So using J.48bin at the base level seems to be a better solution than using VDM at the meta level.
4. The two most computationally expensive configurations (i.e. Stacking and Grading) perform worse than Cascading, and sometimes even worse than J.48 and SMO on their own.
5. A Decision Tree at the base level leads to better Cascading performance than a scheme with SMO at the base level.

The latter fact apparently contradicts [4], where a strategy for choosing an appropriate algorithm for each Cascading level is given, based on the following three points:
• "Combine classifiers with different behavior from a Bias-Variance analysis."
• "At low level use algorithms with low variance."
• "At high level use algorithms with low bias."
Variance and bias are terms from a loss function measuring classification error [14]. An unstable classifier, such as a Neural Network or a Decision Tree, has a high variance, whereas the variance of a more stable classifier, such as an SVM or Boosting, will be lower. Increasing variance commonly leads to decreasing bias and vice versa. So the application of Cascading with the above rules is an attempt to combine learners, one with low bias and the other with low variance, in order to arrive at a new learner possessing lower values of both measures. Note that [4] prefers low variance at the low level and low bias at the high level, because by "selecting learners with low bias for high level, we are able to fit more complex decision surfaces, taking into account the 'stable' surfaces drawn by the low level learners". An experimental validation provided in [4] over 26 UCI datasets shows this, but nominal and continuous attributes are mixed in the tested datasets.

Table 4. Methods for which W − L > 0 and their ranks

W-L rank  W-L  Method
1         294  Casc[Meta=SMO(VDM( ));Base=J.48bin( )](x)
2         250  Casc[Meta=SMO( );Base=J.48bin( )](x)
3         215  Casc[Meta=SMO( );Base=J.48( )](VDM(x))
4         185  Casc[Meta=J.48(VDM( ));Base=J.48bin( )](x)
5         172  Casc[Meta=SMO( );Base=J.48( )](x)
6         146  Casc[Meta=SMO( );Base=J.48(VDM( ))](x)
7.5       142  Casc[Meta=J.48bin( );Base=J.48(VDM( ))](x)
7.5       142  Casc[Meta=SMO(VDM( ));Base=J.48( )](x)
9         131  Casc[Meta=J.48( );Base=J.48bin( )](VDM(x))
10        128  J.48bin(x)
11        127  Casc[Meta=J.48bin( );Base=SMO( )](x)
12        126  Casc[Meta=J.48( );Base=J.48bin( )](x)
13        118  Stack[Meta=J.48( );Base=J.48bin( )](x)
14        111  Casc[Meta=J.48bin(VDM( ));Base=J.48( )](x)
15        107  Casc[Meta=J.48bin( );Base=J.48( )](x)
16        102  Casc[Meta=J.48( );Base=J.48bin(VDM( ))](x)
17        99   Grad[Meta=J.48( );Base=J.48bin( )](x)
18        94   J.48(VDM(x))
19        90   Stack[Meta=J.48bin( );Base=J.48(VDM( ))](x)
20        88   Stack[Meta=J.48bin( );Base=J.48( )](VDM(x))
21        83   Stack[Meta=SMO( );Base=J.48bin( )](x)
22.5      71   Grad[Meta=J.48( );Base=J.48bin(VDM( ))](x)
22.5      71   Casc[Meta=J.48( );Base=SMO( )](VDM(x))
24        69   Stack[Meta=SMO( );Base=J.48( )](VDM(x))
25        68   Grad[Meta=J.48(VDM( ));Base=J.48bin( )](x)
26        67   Stack[Meta=SMO( );Base=J.48(VDM( ))](x)
27        62   Casc[Meta=J.48bin( );Base=SMO(VDM( ))](x)
28        38   Casc[Meta=J.48(VDM( ));Base=SMO( )](x)
29        13   Grad[Meta=J.48(VDM( ));Base=J.48bin(VDM( ))](x)
30        12   Grad[Meta=SMO(VDM( ));Base=J.48bin( )](x)
31        9    Grad[Meta=SMO( );Base=J.48bin( )](x)
32        8    Grad[Meta=J.48bin( );Base=J.48(VDM( ))](x)
33        3    Casc[Meta=J.48( );Base=SMO( )](x)
Table 5. Methods for which W − L ≤ 0 and their ranks

W-L rank  W-L   Method
34        0     Grad[Meta=J.48bin( );Base=J.48( )](VDM(x))
35        -23   Grad[Meta=SMO( );Base=J.48(VDM( ))](x)
36        -26   Grad[Meta=SMO(VDM( ));Base=J.48(VDM( ))](x)
37        -53   Grad[Meta=SMO( );Base=J.48( )](VDM(x))
38        -72   J.48(x)
39        -75   Casc[Meta=J.48( );Base=SMO(VDM( ))](x)
40        -95   Grad[Meta=J.48( );Base=SMO( )](x)
41        -117  Grad[Meta=J.48bin( );Base=SMO( )](x)
42        -126  Stack[Meta=SMO( );Base=J.48( )](x)
43.5      -134  Grad[Meta=J.48( );Base=SMO(VDM( ))](x)
43.5      -134  Stack[Meta=J.48bin( );Base=J.48( )](x)
44.5      -137  SMO(x)
44.5      -137  Grad[Meta=J.48(VDM( ));Base=SMO( )](x)
47        -173  SMO(VDM(x))
47.5      -174  Grad[Meta=J.48bin(VDM( ));Base=J.48( )](x)
47.5      -174  Grad[Meta=J.48bin( );Base=J.48( )](x)
50        -179  Grad[Meta=J.48( );Base=SMO( )](VDM(x))
51        -184  Grad[Meta=J.48(VDM( ));Base=SMO(VDM( ))](x)
52        -185  Grad[Meta=SMO(VDM( ));Base=J.48( )](x)
53        -187  Grad[Meta=J.48bin( );Base=SMO(VDM( ))](x)
54        -210  Grad[Meta=SMO( );Base=J.48( )](x)
55.5      -278  Stack[Meta=J.48( );Base=SMO( )](VDM(x))
55.5      -278  Stack[Meta=J.48( );Base=SMO(VDM( ))](x)
57        -290  Stack[Meta=J.48( );Base=SMO( )](x)
However, those "stable surfaces" can be drawn in an inappropriate way when the data are nominal, especially if the method at the low level cannot directly deal with nominal data and needs some kind of conversion, as an SVM does. So we think there is no contradiction between [4] and our results, because the experiments were focused on different kinds of data.

To further analyze the results, we focus only on the top performing methods of Table 4 and on SMO, J.48 and J.48bin with and without VDM. The results are summarized in Tables 6 and 7. Bold font in these tables marks the best method for each dataset. The numbers of significant wins and losses against the best ranked configuration (Cascading using an SVM filtered with VDM as the meta classifier and a Binary Decision Tree as the base classifier) are also given in the last row of each table. The symbol "◦" indicates a significant win and the symbol "•" a significant loss. According to this last row, Cascading[M=SMO(VDM( )); B=J.48bin( )](x) in Table 6 is the best method, but the differences between this method and the rest of the Cascading methods are not very important. Binary trees seem to be the best single-classifier option, and the use of binary trees as base classifiers in Cascading ensembles seems to be the reason behind the good Cascading performance.

Table 6. Accuracy of the top Cascading ensembles: E1 - Casc[M=SMO(VDM( )); B=J.48bin( )](x); E2 - Casc[M=SMO( ); B=J.48bin( )](x); E3 - Casc[M=SMO( ); B=J.48( )](x); E4 - Casc[M=SMO(VDM( )); B=J.48( )](x); E5 - Casc[M=SMO( ); B=J.48( )](VDM(x)); E6 - Casc[M=J.48(VDM( )); B=J.48bin( )](x); E7 - Casc[M=J.48( ); B=J.48bin( )](VDM(x))

Dataset           E1      E2      E3      E4      E5      E6      E7
Audiology         84.21   80.91•  82.65   84.74   84.34   76.65•  76.86•
Boxing1           85.33   84.00   83.67   85.42   81.17   82.83   81.25
Boxing2           79.77   80.98   80.44   78.76   79.91   79.41   79.75
Breast cancer     70.50   70.40   74.14   74.28   70.92   70.43   70.96
Car               96.72   96.76   95.14•  95.03•  97.32   97.21◦  97.54
Dmft              20.10   20.22   19.81   19.89   20.18   19.72   19.72
Fraud             70.75   68.85   73.60   75.65   85.50◦  72.55   86.40◦
Kr-vs-kp          99.44   99.44   99.44   99.44   99.35   99.44   99.37
M.Biol.Prm        88.98   90.05   91.42   91.43   86.55   78.90•  76.82•
Monks-1           98.17   98.19   96.60   96.60   76.21•  99.51   76.12•
Monks-2           94.79   94.89   67.14•  67.14•  89.70•  96.37   91.14
Monks-3           98.63   98.63   98.63   98.63   98.63   98.63   98.63
Mushroom          100.0   100.0   100.0   100.0   100.0   99.99   100.0
Nursery           99.36   99.36   98.29•  98.25•  99.42   99.59◦  99.59◦
Post.patient      69.11   69.11   67.22   67.11   68.78   69.22   69.33
Prim. tumor       43.13   43.25   44.31   43.93   43.55   40.61   40.56
Solar f.1 C       89.70   89.73   88.95   88.92   88.18•  89.61   88.15•
Solar f.1 M       89.24   89.55   89.67   89.33   88.99   89.64   89.58
Solar f.1 X       97.84   97.84   97.84   97.84   97.84   97.84   97.84
Solar f.2 C       82.59   82.67   82.91   82.77   82.76   82.70   82.90
Solar f.2 M       96.62   96.62   96.62   96.62   96.62   96.62   96.62
Solar f.2 X       99.53   99.53   99.53   99.53   99.53   99.53   99.53
Soybean           93.91   93.78   94.13   93.95   93.83   92.43   92.75
Spect             81.96   81.62   81.62   81.96   82.14   81.84   82.29
Splice            94.93   93.96•  93.55•  95.33   94.80   94.32•  94.24
Tic-tac-toe       93.81   94.16   97.35◦  85.53•  94.28   94.06   94.49
Vote              96.75   96.69   96.69   96.75   96.75   97.19   97.19
wins/ties/losses  0/27/0  0/25/2  1/22/4  0/23/4  1/23/3  2/22/3  2/21/4
Table 7. Accuracy of SMO, J.48, and J.48bin with and without VDM. M1 - SMO(x); M2 - SMO(VDM(x)); M3 - J.48(x); M4 - J.48(VDM(x)); M5 - J.48bin(x)

Dataset           M1      M2      M3      M4      M5
Audiology         80.77•  84.16   77.26•  76.73•  76.92•
Boxing1           81.58   83.67   87.00   81.08   85.33
Boxing2           82.34   79.15   80.44   79.91   79.62
Breast cancer     69.52   68.97   74.28   70.88   70.50
Car               93.62•  93.20•  92.22•  97.30   96.63
Dmft              21.14   20.73   19.60   20.06   19.82
Fraud             76.10   73.10   63.05   86.40◦  66.10
Kr-vs-kp          95.79•  96.79•  99.44   99.36   99.44
Mol.Biol.Prm      91.01   91.65   79.04•  76.22•  79.09•
Monks-1           74.86•  75.00•  96.60   76.28•  98.33
Monks-2           67.14•  67.14•  67.14•  90.07•  94.31
Monks-3           96.12•  95.89•  98.63   98.63   98.63
Mushroom          100.00  100.00  100.00  100.00  99.99
Nursery           93.08•  93.08•  97.18•  99.42   99.36
Post. patient     67.33   67.11   69.78   69.33   70.11
Primary tumor     47.09   42.69   41.39   41.22   41.19
Solar flare1 C    88.49•  88.19•  88.95   88.15•  89.73
Solar flare1 M    89.70   89.33   89.98   89.61   89.76
Solar flare1 X    97.84   97.84   97.84   97.84   97.84
Solar flare2 C    82.91   82.77   82.93   82.89   82.72
Solar flare2 M    96.62   96.62   96.62   96.62   96.62
Solar flare2 X    99.53   99.53   99.53   99.53   99.53
Soybean           93.10   93.38   91.78•  92.77   92.30
Spect             83.61   82.79   81.35   81.69   81.35
Splice            92.88•  95.47   94.17   94.28   94.36•
Tic-tac-toe       98.33◦  73.90•  85.28•  94.28   93.79
Vote              95.77   96.04   96.57   96.57   96.57
wins/ties/losses  1/17/9  0/19/8  0/20/7  1/21/5  0/24/3

6 Conclusion

Many learning algorithms need numerical inputs. A nominal-to-numerical data transformation allows these methods to be applied to symbolic data; NBF and VDM are typical examples of such transformations. However, a better solution is obtained when another classifier, taking nominal data as input and generating numerical data as output, is employed. This solution can result in more linearly separable classes of data, which is an important issue for many learning algorithms. This idea prompted us to test three two-level ensemble schemes: Cascading, Stacking, and Grading. Tests were carried out on 27 datasets containing nominal features.

The experimental validation showed that employing a method producing numerical output from nominal input as the base (first level) classifier indeed leads to well-performing Cascading ensembles. In contrast, this did not always work with the other two schemes, Stacking and Grading, though there were notable exceptions when Decision Trees were selected for both levels of an ensemble. Since Cascading is the fastest of the three schemes, this makes Cascading very attractive in real-world applications. As the base method in Cascading ensembles, a Binary Decision Tree can be highly recommended.
We also tried the VDM nominal-to-numerical transformation at one or both levels of the Cascading, Stacking and Grading ensembles. Although VDM appears in the best ranked method, the experimental validation revealed that the differences from the other winning methods are not very significant. Moreover, many of the results showed that SVM sometimes performs even worse when VDM is applied instead of NBF. Our future work is to study Cascading ensembles with numerical methods that do not require linear class separability (e.g. SVM with non-linear kernels).
References

1. Grabczewski K, Jankowski N (2003) Transformations of symbolic data for continuous data oriented models. In: Kaynak O, Alpaydin E, Oja E, Xu L (eds) Proc Artif Neural Networks and Neural Inf Proc, Istanbul, Turkey. Springer, Berlin/Heidelberg, pp 359–366
2. Stanfill C, Waltz D (1986) Toward memory-based reasoning. Communications of the ACM 29:1213–1229
3. Duch W, Grudzinski K, Stawski G (2000) Symbolic features in neural networks. In: Rutkowski L, Częstochowa R (eds) Proc the 5th Conf Neural Networks and Soft Computing, Zakopane, Poland, pp 180–185
4. Gama J, Brazdil P (2000) Cascade generalization. Mach Learn 41:315–343
5. Maudes J, Rodríguez JJ, García-Osorio C (2007) Cascading for nominal data. In: Haindl M, Kittler J, Roli F (eds) Proc the 7th Int Workshop on Multiple Classifier Syst, Prague, Czech Republic. Springer, Berlin/Heidelberg, pp 231–240
6. Wolpert D (1992) Stacked generalization. Neural Networks 5:241–260
7. Seewald AK, Fürnkranz J (2001) An evaluation of grading classifiers. In: Hoffmann F, Hand DJ, Adams NM, Fisher DH, Guimarães G (eds) Proc the 4th Int Conf Advances in Intell Data Analysis, Cascais, Portugal. Springer, Berlin/Heidelberg, pp 115–124
8. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Francisco
9. Platt J (1999) Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges C, Smola A (eds) Advances in kernel methods. MIT Press, Cambridge, pp 185–208
10. Nadeau C, Bengio Y (2003) Inference for the generalization error. Mach Learn 52:239–281
11. Blake CL, Merz CJ (1998) UCI Repository of machine learning databases (http://www.ics.uci.edu/~mlearn/MLRepository.html)
12. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Research 7:1–30
13. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann, San Francisco
14. Kohavi R, Wolpert D (1996) Bias plus variance decomposition for zero-one loss functions. In: Saitta L (ed) Proc the 13th Int Conf Mach Learn, Bari, Italy. Morgan Kaufmann, San Francisco, pp 275–283
Index
k-means, 3, 8, 9, 11, 13, 16, 17
k-nearest neighbor, 115, 117, 135, 166
aggregation, 53–57, 59, 67, 68, 72
anomaly detection, 91, 95, 96, 99–101, 109
bioinformatics, 50, 115
bipartite graph, 38, 39
bolstered resubstitution error, 120–125, 127–131, 133
boosting, 153, 155–157, 160–163, 173
cancer classification, 115, 117
cascading, 165–173, 175–178
categorical data, 31–35, 37, 39, 41, 43, 45, 47
classifier ensemble, 35, 165
cluster ensemble, 3, 4, 7, 26, 27
cluster ensembles, 31, 72
  accuracy-diversity trade-off, 35
  consensus function, 32, 35–37
clustering
  k-means, 32, 49, 53–57, 59–61, 64–66, 71, 73, 79–81, 159
  k-modes, 32
  COOLCAT, 33
  modes, 36
consensus function, 72
consistency index, 3, 19, 20, 25, 26
copula, 115, 117, 122, 125–129, 133
dataset complexity, 115, 117, 120, 121, 129–133
decision tree, 137, 165–173, 175, 177
entropy, 33, 92, 119
evidence accumulation clustering, 3, 14–16, 27
feature selection, 116
GARCH, 153–156, 160, 161, 163
gene expression, 4, 49, 64, 66, 115–118, 120, 133
grading, 165–173, 177, 178
graph partitioning problem, 35, 37
hierarchical clustering, 3, 13, 20, 32, 65, 73
high dimensionality, 31, 51
intrusion detection, 91–94, 97, 98, 103, 104, 107, 110
Jaccard distance, 15, 36, 38, 39
Markov blanket, 118
METIS, 15, 37, 72
multiple classifier system, 91, 94, 98, 100
Naïve Bayes, 135, 140
neural network, 72, 135, 137, 153, 155, 157, 158, 161, 163, 173
nominal data, 165–170, 175, 176
normalized mutual information, 3, 17, 24, 25, 40, 65
one-class classifier, 91, 101, 104–106, 108, 110
RANDCLUST, 53, 64–67
random projections, 12, 49–51, 54, 59, 65, 67, 68
random subspaces, 35
stacking, 135, 136, 138, 140–143, 145–147, 150, 165–173, 177, 178
stochastic discrimination, 35
support vector machine, 105, 135, 145, 165
time series, 135–138, 140–143, 145–147, 150, 153–155