HYBRID METHODS IN PATTERN RECOGNITION
World Scientific
Series in Machine Perception and Artificial Intelligence - Vol. 47
HYBRID METHODS IN PATTERN RECOGNITION Editors
H Bunke University of Bern, Switzerland
A Kandel University of South Florida, USA
World Scientific
New Jersey • London • Singapore • Hong Kong
SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE*
Editors: H. Bunke (Univ. Bern, Switzerland), P. S. P. Wang (Northeastern Univ., USA)

Vol. 34: Advances in Handwriting Recognition (Ed. S.-W. Lee)
Vol. 35: Vision Interface: Real World Applications of Computer Vision (Eds. M. Cheriet and Y.-H. Yang)
Vol. 36: Wavelet Theory and Its Application to Pattern Recognition (Y. Y. Tang, L. H. Yang, J. Liu and H. Ma)
Vol. 37: Image Processing for the Food Industry (E. R. Davies)
Vol. 38: New Approaches to Fuzzy Modeling and Control: Design and Analysis (M. Margaliot and G. Langholz)
Vol. 39: Artificial Intelligence Techniques in Breast Cancer Diagnosis and Prognosis (Eds. A. Jain, A. Jain, S. Jain and L. Jain)
Vol. 40: Texture Analysis in Machine Vision (Ed. M. K. Pietikäinen)
Vol. 41: Neuro-Fuzzy Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 42: Invariants for Pattern Recognition and Classification (Ed. M. A. Rodrigues)
Vol. 43: Agent Engineering (Eds. Jiming Liu, Ning Zhong, Yuan Y. Tang and Patrick S. P. Wang)
Vol. 44: Multispectral Image Processing and Pattern Recognition (Eds. J. Shen, P. S. P. Wang and T. Zhang)
Vol. 45: Hidden Markov Models: Applications in Computer Vision (Eds. H. Bunke and T. Caelli)
Vol. 46: Syntactic Pattern Recognition for Seismic Oil Exploration (K. Y. Huang)
Vol. 47: Hybrid Methods in Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 48: Multimodal Interface for Human-Machine Communications (Eds. P. C. Yuen, Y. Y. Tang and P. S. P. Wang)
Vol. 49: Neural Networks and Systolic Array Design (Eds. D. Zhang and S. K. Pal)
*For the complete list of titles in this series, please write to the Publisher.
Published by World Scientific Publishing Co. Pte. Ltd.
P O Box 128, Farrer Road, Singapore 912805
USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
HYBRID METHODS IN PATTERN RECOGNITION
Series in Machine Perception and Artificial Intelligence, Vol. 47
Copyright © 2002 by World Scientific Publishing Co. Pte. Ltd.
All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4832-6
Printed in Singapore.
Dedicated to The Honorable Congressman C. W. Bill Young House of Representatives for his vision and continuous support in creating the National Institute for Systems Test and Productivity at the Computer Science and Engineering Department, University of South Florida
Preface
The discipline of pattern recognition has seen enormous progress since its beginnings more than four decades ago. Over the years various approaches have emerged, based on statistical decision theory, structural matching and parsing, neural networks, fuzzy logic, artificial intelligence, evolutionary computing, and others. Obviously, these approaches are characterized by a high degree of diversity. In order to combine their strengths and avoid their weaknesses, hybrid pattern recognition schemes have been proposed, combining several techniques into a single pattern recognition system. Hybrid methods have been known for a long time, but they have gained new interest only recently. An example is the area of classifier combination, which has attracted enormous attention over the past few years. The contributions included in this volume cover recent advances in hybrid pattern recognition. In the first chapter by H. Ishibuchi and M. Nii, a novel type of neural network architecture is introduced, which can process fuzzy input data. This type of neural net is quite powerful because it can simultaneously deal with different data formats, such as real or fuzzy numbers and intervals, as well as linguistic variables. The following two chapters deal with hybrid systems that aim at the application of neural networks in the domain of structural pattern recognition. In the second chapter by G. Adorni et al., an extension of the classical backpropagation algorithm that can be applied in the graph domain is proposed. This extension allows us to apply multilayer perceptron neural networks not only to feature vectors, but also to patterns represented by means of graphs. A generalization of self-organizing maps from n-dimensional real space to the domain of graphs is proposed in Chap. 3, by S. Günter and H. Bunke. In particular, the problem of finding the optimal number of clusters in a graph clustering task is addressed.
In Chap. 4, A. Bargiela and W. Pedrycz introduce a general framework for clustering through identification of information granules. It is argued that the clusters, or granules, produced by this method are particularly suitable for hybrid systems. The next two chapters describe combinations of neural networks and hidden Markov models. First, in Chap. 5, G. Rigoll reviews a number of possible combination schemes. Most of them originated in the context of speech and handwriting recognition; however, they are applicable to a much wider spectrum of applications. In Chap. 6, by T. Artieres et al., a system for on-line recognition of handwritten words and sentences is investigated. The main building blocks of this system are a hidden Markov model and a neural net. The following three chapters address the emerging field of multiple classifier systems. First, in Chap. 7, T. K. Ho provides a critical survey of the field. She identifies the lessons learned from previous work, points out the remaining problems, and suggests ways to advance the state-of-the-art. Then, in Chap. 8, F. Roli and G. Giacinto describe procedures for the systematic generation of multiple classifiers and their combination. Finally, in Chap. 9, A. Verikas et al. propose an approach to the integration of multiple neural networks into an ensemble. Both the generation of the individual nets and the combination of their outputs are described. In the final three chapters of the book, applications of hybrid methods are presented. In Chap. 10, A. Klose and R. Kruse describe a system for the interpretation of remotely sensed images. This system integrates methods from the fields of neural nets, fuzzy logic, and evolutionary computation. In Chap. 11, D.-W. Jung and R.-H. Park address the problem of fingerprint identification. The authors use a combination of various methods to achieve robust recognition at a high speed. Last but not least, M. Junker et al. describe a system for automatic text categorization. Their system integrates symbolic rule-based learning with subsymbolic learning using support vector machines. Although it is not possible to cover all current activities in hybrid pattern recognition in one book, we believe that the papers included in this volume are a valuable and representative sample of up-to-date work in this emerging and important branch of pattern recognition. We hope that the contributions are valuable and will be useful to many of our colleagues working in the field.
The editors are grateful to all the authors for their cooperation and the timely submission of their manuscripts. Finally, we would like to thank Scott Dick and Adam Schenker of the Computer Science and Engineering Department at the University of South Florida for their assistance and support.
Horst Bunke, Bern, Switzerland Abraham Kandel, Tampa, Florida August 2001
Contents

Preface (H. Bunke and A. Kandel) ... vii

Neuro-Fuzzy Systems
Chapter 1: Fuzzification of Neural Networks for Classification Problems (H. Ishibuchi and M. Nii) ... 1

Neural Networks for Structural Pattern Recognition
Chapter 2: Adaptive Graphic Pattern Recognition: Foundations and Perspectives (G. Adorni, S. Cagnoni and M. Gori) ... 33
Chapter 3: Adaptive Self-Organizing Map in the Graph Domain (S. Günter and H. Bunke) ... 61

Clustering for Hybrid Systems
Chapter 4: From Numbers to Information Granules: A Study in Unsupervised Learning and Feature Analysis (A. Bargiela and W. Pedrycz) ... 75

Combining Neural Networks and Hidden Markov Models
Chapter 5: Combination of Hidden Markov Models and Neural Networks for Hybrid Statistical Pattern Recognition (G. Rigoll) ... 113
Chapter 6: From Character to Sentences: A Hybrid Neuro-Markovian System for On-Line Handwriting Recognition (T. Artieres, P. Gallinari, H. Li, S. Marukatat and B. Dorizzi) ... 145

Multiple Classifier Systems
Chapter 7: Multiple Classifier Combination: Lessons and Next Steps (T. K. Ho) ... 171
Chapter 8: Design of Multiple Classifier Systems (F. Roli and G. Giacinto) ... 199
Chapter 9: Fusing Neural Networks Through Fuzzy Integration (A. Verikas, A. Lipnickas, M. Bacauskiene and K. Malmqvist) ... 227

Applications of Hybrid Systems
Chapter 10: Hybrid Data Mining Methods in Image Processing (A. Klose and R. Kruse) ... 253
Chapter 11: Robust Fingerprint Identification Based on Hybrid Pattern Recognition Methods (D.-W. Jung and R.-H. Park) ... 275
Chapter 12: Text Categorization Using Learned Document Features (M. Junker, A. Abecker and A. Dengel) ... 301
CHAPTER 1
FUZZIFICATION OF NEURAL NETWORKS FOR CLASSIFICATION PROBLEMS
Hisao Ishibuchi
Department of Industrial Engineering, Osaka Prefecture University
1-1 Gakuen-cho, Sakai, Osaka 599-8531, Japan
E-mail: [email protected]

Manabu Nii
Department of Computer Engineering, Himeji Institute of Technology
2167 Shosha, Himeji, Hyogo 671-2201, Japan
E-mail: nii@comp.eng.himeji-tech.ac.jp
This chapter explains the handling of linguistic knowledge and fuzzy inputs in multi-layer feedforward neural networks for pattern classification problems. First we show how fuzzy input vectors can be classified by trained neural networks. The input-output relation of each unit is extended to the case of fuzzy inputs using fuzzy arithmetic. That is, fuzzy outputs from neural networks are defined by fuzzy arithmetic. The classification of each fuzzy input vector is performed by a decision rule using the corresponding fuzzy output vector. Next we show how neural networks can be trained from fuzzy training patterns. Our fuzzy training pattern is a pair of a fuzzy input vector and a non-fuzzy class label. We define a cost function to be minimized in the learning process as a distance between a fuzzy output vector and a non-fuzzy target vector. A learning algorithm is derived from the cost function in the same manner as the well-known back-propagation algorithm. Then we show how linguistic rules can be extracted from trained neural networks. Our linguistic rule has linguistic antecedent conditions, a non-fuzzy consequent class, and a certainty grade. We also show how linguistic rules can be utilized in the learning process. That is, linguistic rules are used as training data. Our learning scheme can simultaneously utilize linguistic rules and numerical data in the same framework. Finally we describe the architecture, learning, and application areas of interval-arithmetic-based
neural networks, which can be viewed as a basic form of our fuzzified neural networks.

1. Introduction

Multilayer feedforward neural networks can be fuzzified by extending their inputs, connection weights and/or targets to fuzzy numbers (Buckley and Hayashi 1994). Various learning algorithms have been proposed for adjusting the connection weights of fuzzified neural networks (for example, Hayashi et al. 1993, Krishnamraju et al. 1994, Ishibuchi et al. 1995a, 1995b, Feuring 1996, Teodorescu and Arotaritei 1997, Dunyak and Wunsch 1997, 1999). Fuzzified neural networks have many promising application areas such as fuzzy regression analysis (Dunyak and Wunsch 2000, Ishibuchi and Nii 2001), decision making (Ishibuchi and Nii 2000, Kuo et al. 2001), forecasting (Kuo and Xue 1999), fuzzy rule extraction (Ishibuchi and Nii 1996, Ishibuchi et al. 1997), and learning from fuzzy rules (Ishibuchi et al. 1993, 1994). The approximation ability of fuzzified neural networks was studied by Buckley and Hayashi (1999) and Buckley and Feuring (2000). Perceptron neural networks were fuzzified in Chen and Chang (2000). In this chapter, we illustrate how fuzzified neural networks can be applied to pattern classification problems. We use multilayer feedforward neural networks with fuzzy inputs, non-fuzzy connection weights, and non-fuzzy targets for handling uncertain patterns and linguistic rules such as "If x1 is small and x
Since each linguistic value is specified by a membership function on the real axis ℝ, linguistic values can be handled in the same framework as fuzzy numbers. Next we discuss the learning of neural networks from fuzzy training patterns. Labeled fuzzy patterns are used as training data. That is, each training pattern is a pair of a fuzzy input vector and its class label. In the same manner as the well-known back-propagation algorithm (Rumelhart et al. 1986), a learning algorithm is derived from a cost function defined by a fuzzy output vector and a non-fuzzy target vector. Then we illustrate the linguistic rule extraction from neural networks. Linguistic rules of the following form are extracted from a neural network trained for an n-dimensional pattern classification problem:

Rule R_p: If x1 is a_p1 and ... and xn is a_pn then Class C_p with CF_p,   (1)

where R_p is the label of the p-th rule, x = (x1, ..., xn) is an n-dimensional pattern vector, a_pi is an antecedent linguistic value on the i-th feature, C_p is a consequent class, and CF_p is a certainty grade. The n antecedent linguistic values are presented as an n-dimensional fuzzy input vector to the trained neural network. The consequent class and the certainty grade are specified based on the corresponding fuzzy output vector. We also discuss the learning of neural networks from linguistic rules of the form in (1). In this case, the antecedent linguistic values are used as a fuzzy input vector as in the fuzzy rule extraction. The corresponding target vector is determined by the consequent class. The certainty grade can be used for adjusting the importance of each linguistic rule in the learning process. Finally, we describe interval-arithmetic-based neural networks. Since fuzzy arithmetic is numerically performed on the level sets (i.e., α-cuts) of the fuzzy input vector, interval-arithmetic-based neural networks can be viewed as a basic form of fuzzified neural networks.
We illustrate some applications of interval-arithmetic-based neural networks to pattern classification problems. For example, they can be used for handling incomplete patterns with missing inputs, where each missing input is represented by an interval including its possible values. They can also be used for decreasing the number of inputs required for the classification of new patterns. We show an interval-arithmetic-based approach where each unmeasured input is represented by an interval including its possible values. When human knowledge is represented by intervals such as "If x1 is in [10, 30] and x2 is in [4, 7] then Class 2", interval-arithmetic-based neural networks can be used for incorporating such knowledge into the learning of neural networks.
2. Classification of Fuzzy Patterns by Trained Neural Networks

In this section, we concentrate our attention on the classification of uncertain patterns by trained neural networks. The learning of neural networks from uncertain training patterns is discussed in the next section.

2.1. Classification Task
Let us assume that a standard three-layer feedforward neural network (Rumelhart et al. 1986) has already been trained for an n-dimensional pattern classification problem with c classes. The number of input units is the same as the dimensionality of the pattern classification problem (i.e., n). The number of hidden units, which is denoted by n_H in this chapter, can be arbitrarily specified. The number of output units is the same as the number of classes (i.e., c). Thus our three-layer feedforward neural network has the n x n_H x c structure. When an n-dimensional real vector x_p = (x_p1, ..., x_pn) is presented to our neural network, the input-output relation of each unit is written as follows (Rumelhart et al. 1986):

[Neural Network Architecture]

Input units:  o_pi = x_pi,  i = 1, 2, ..., n,   (2)

Hidden units:  o_pj = f( Σ_{i=1}^n w_ji o_pi + θ_j ),  j = 1, 2, ..., n_H,   (3)

Output units:  o_pk = f( Σ_{j=1}^{n_H} w_kj o_pj + θ_k ),  k = 1, 2, ..., c.   (4)

In this formulation, w is a connection weight and θ is a bias. We use the following sigmoidal activation function for the hidden and output units:

f(x) = 1 / (1 + exp(−x)).   (5)

Normally the input vector x_p is classified by the output unit with the largest output value. This means that we use the following decision rule:

If o_pk < o_pl for k = 1, 2, ..., c (k ≠ l), then classify x_p as Class l.   (6)
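As a concrete illustration of Eqs. (2)-(6), the crisp forward pass and the decision rule can be sketched as follows; the 2 x 2 x 2 network and its weights are hypothetical values chosen for this sketch, not taken from the chapter.

```python
import math

def sigmoid(x):
    # Sigmoidal activation f(x) = 1 / (1 + exp(-x)) of Eq. (5)
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W_hid, b_hid, W_out, b_out):
    """Three-layer feedforward pass, Eqs. (2)-(4)."""
    # Hidden units: o_pj = f(sum_i w_ji * x_pi + theta_j)
    hidden = [sigmoid(sum(w * xi for w, xi in zip(w_row, x)) + b)
              for w_row, b in zip(W_hid, b_hid)]
    # Output units: o_pk = f(sum_j w_kj * o_pj + theta_k)
    return [sigmoid(sum(w * h for w, h in zip(w_row, hidden)) + b)
            for w_row, b in zip(W_out, b_out)]

def classify(outputs):
    # Decision rule (6): pick the output unit with the largest value
    return max(range(len(outputs)), key=lambda k: outputs[k])

# Illustrative 2 x 2 x 2 network with hand-picked weights (not from the chapter)
W_hid = [[4.0, 0.0], [0.0, 4.0]]
b_hid = [-8.0, -8.0]
W_out = [[6.0, -6.0], [-6.0, 6.0]]
b_out = [0.0, 0.0]

o = forward([3.0, 1.0], W_hid, b_hid, W_out, b_out)
print(classify(o))   # class index 0
```

Here the class labels are 0-based indices of the output units; the argmax in `classify` is equivalent to rule (6), which assigns Class l exactly when every other output is smaller.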
Fuzzification
of Neural Networks for Classification
Problems
5
Fig. 1 is an example of the classification boundary generated by a trained neural network using this classification rule. Fig. 1 also shows training data used in the learning of the neural network.
Fig. 1. Classification boundary and training patterns.

Fig. 2. Examples of membership functions of "about 2" and "about 5."
Our task in this section is to classify uncertain patterns represented by fuzzy vectors. For example, let us consider the classification of a fuzzy vector x_A = (2, 2) using the trained neural network in Fig. 1. The meaning of each fuzzy number is mathematically specified by a membership function on the real axis ℝ. For example, the fuzzy number 2 may be defined by a triangular membership function as shown in Fig. 2. Roughly speaking, the membership function μ_2(x) of 2 specifies the possible range of 2 on the real axis ℝ. More specifically, the value of μ_2(x) for a specific input x denotes the extent (i.e., membership grade) to which x is compatible with the fuzzy concept "about 2". The membership function μ_2(x) of 2 in Fig. 2 is written as

μ_2(x) = max{0, 1 − |2 − x|}.   (7)

While the fuzzy vector x_A = (2, 2) involves a certain amount of uncertainty, the neural network may be able to classify x_A as Class 1 because x_A is located far from the classification boundary (see Fig. 1). On the other hand, it seems to be difficult for the neural network to classify another fuzzy vector x_B = (5, 5) because x_B is located near the classification boundary (see Fig. 1 for the location of x_B and Fig. 2 for the membership function of 5). In this section, we mathematically formulate these intuitive discussions as a decision rule for fuzzy input vectors.

2.2. Calculation of Fuzzy Outputs
Let x_p = (x_p1, ..., x_pn) be an n-dimensional fuzzy input vector to our neural network. Note that x_pi can be a real number or an interval because they are represented in the same framework as fuzzy numbers. For example, a real number a and an interval A = [a1, a2] are represented by the following membership functions:

μ_a(x) = 1 if x = a, and μ_a(x) = 0 otherwise.   (8)

μ_A(x) = 1 if a1 ≤ x ≤ a2, and μ_A(x) = 0 otherwise.   (9)

Thus the fuzzy input vector x_p = (x_p1, ..., x_pn) can be a mixture of fuzzy numbers, intervals and real numbers such as x_p = (5, [2, 3], 3.48). When the fuzzy input vector x_p = (x_p1, ..., x_pn) is presented to the neural network, the input-output relation of each unit in (2)-(5) is defined by fuzzy arithmetic (Kaufmann and Gupta 1985). For example, the fuzzy output õ_pj from the j-th hidden unit is calculated by extending the input vector in (2)-(3) to the fuzzy vector x_p = (x_p1, ..., x_pn) as

õ_pj = f( Σ_{i=1}^n w_ji x_pi + θ_j ).   (10)
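The uniform treatment of real numbers, intervals, and fuzzy numbers in Eqs. (8)-(10) can be sketched by reducing every input component to an interval at a given α level; the `("tri", center, spread)` encoding of a symmetric triangular fuzzy number is an assumption made for this sketch, not notation from the chapter.

```python
def alpha_cut(value, alpha):
    """Return the alpha-cut [lower, upper] of one input component.

    A real number a has the singleton cut [a, a] (Eq. 8); an interval
    [a1, a2] is its own cut at every alpha (Eq. 9); a symmetric triangular
    fuzzy number (center, spread) shrinks linearly toward its center.
    """
    if isinstance(value, (int, float)):           # real number, Eq. (8)
        return (float(value), float(value))
    kind, *params = value
    if kind == "interval":                        # interval, Eq. (9)
        lo, hi = params
        return (float(lo), float(hi))
    if kind == "tri":                             # triangular fuzzy number
        center, spread = params
        half = spread * (1.0 - alpha)
        return (center - half, center + half)
    raise ValueError("unknown input kind")

# A mixed fuzzy input vector like x_p = (5, [2, 3], 3.48) from the text
x_p = [("tri", 5.0, 1.0), ("interval", 2.0, 3.0), 3.48]
cuts = [alpha_cut(v, 0.5) for v in x_p]
print(cuts)   # [(4.5, 5.5), (2.0, 3.0), (3.48, 3.48)]
```

With spread 1.0, the triangular component reproduces the membership shape of Eq. (7); all three input formats become ordinary closed intervals, which is what makes the interval-arithmetic feedforward calculation below applicable to mixed inputs.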
In the same manner, the fuzzy output õ_pk from the k-th output unit is calculated as

õ_pk = f( Σ_{j=1}^{n_H} w_kj õ_pj + θ_k ).   (11)

As we can see from (10)-(11), the calculation of the fuzzy outputs õ_pj and õ_pk involves the multiplication of fuzzy numbers by real numbers, the addition of fuzzy numbers, and the nonlinear mapping of fuzzy numbers by the sigmoidal activation function f(·). These operations on fuzzy numbers are illustrated in Fig. 3 and Fig. 4.

Fig. 3. Illustration of w · a and a + b.

Numerical calculations of the fuzzy outputs are performed on the level sets (i.e., α-cuts) of the fuzzy input vector using interval arithmetic (Moore 1979). This is because the fuzzy output from each unit cannot be represented in a simple parameterized form due to its nonlinear shape (see the fuzzy output in Fig. 4). In such numerical calculations, each fuzzy input is first discretized into a set of its α-cuts for some α's (e.g., α = 0.2, 0.4, 0.6, 0.8, 1.0). The α-cut of a fuzzy number a is defined as follows (see Fig. 5):

[a]_α = {x | μ_a(x) ≥ α}  for  0 < α ≤ 1.   (12)
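For the triangular membership function of Eq. (7), the α-cut of Eq. (12) has the closed form [a − (1 − α), a + (1 − α)]. A minimal sketch of the discretization of Fig. 6:

```python
def tri_membership(a, x):
    # Triangular membership mu_a(x) = max{0, 1 - |x - a|}, as in Eq. (7)
    return max(0.0, 1.0 - abs(x - a))

def tri_alpha_cut(a, alpha):
    # For this triangular shape, [a]_alpha = [a - (1 - alpha), a + (1 - alpha)]
    return (a - (1.0 - alpha), a + (1.0 - alpha))

# Discretize the fuzzy number "about 2" into five alpha-cuts, as in Fig. 6
levels = [0.2, 0.4, 0.6, 0.8, 1.0]
cuts = {alpha: tri_alpha_cut(2.0, alpha) for alpha in levels}
# alpha-cuts are nested closed intervals: a higher alpha gives a narrower cut,
# and at alpha = 1.0 the cut collapses to the single point a.
print(cuts[1.0])   # (2.0, 2.0)
```

The nesting property is what allows the fuzzy output shape in Figs. 7 and 8 to be reconstructed level by level from interval computations.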
Fig. 4. Illustration of the fuzzy activation function f(x).

Fig. 5. A fuzzy number a and its α-cut.

Fig. 6. Representation of a fuzzy number a by its five α-cuts for α = 0.2, 0.4, 0.6, 0.8, 1.0.
As shown in Fig. 5, α-cuts of any fuzzy numbers are closed intervals. In Fig. 6, a fuzzy number a is discretized into its five α-cuts for α = 0.2, 0.4, 0.6, 0.8, 1.0. The α-cut of the fuzzy output õ_pk from each output unit is calculated using interval arithmetic from the α-cut of the fuzzy input vector x_p. That is, the input-output relations for fuzzy numbers in (10)-(11) are rewritten for α-cuts (i.e., intervals) as follows (see Buckley and Hayashi (1994), Hayashi et al. (1993), and Ishibuchi et al. (1993)):

[õ_pj]_α = f( Σ_{i=1}^n w_ji [x_pi]_α + θ_j ),   (13)

[õ_pk]_α = f( Σ_{j=1}^{n_H} w_kj [õ_pj]_α + θ_k ).   (14)

The fuzzy output õ_pk from each output unit is constructed by calculating its α-cut [õ_pk]_α for various values of α using interval arithmetic. Details of interval arithmetic in the neural network are given in Section 6 of this chapter.

2.3. Classification of Fuzzy Input Vectors

As we have already mentioned, when we try to classify an uncertain pattern by a trained neural network, we first represent the uncertain pattern as a fuzzy vector x_p = (x_p1, ..., x_pn). Then x_p is used as an input vector to the neural network. The corresponding fuzzy output vector õ_p = (õ_p1, ..., õ_pc) from the neural network is calculated using interval arithmetic on the α-cut of x_p. In this subsection, we discuss the classification of the fuzzy input vector x_p using the fuzzy output vector õ_p. By directly extending the decision rule in (6) to the case of the fuzzy output vector õ_p, we have its fuzzy version:

If õ_pk < õ_pl for k = 1, 2, ..., c (k ≠ l), then classify x_p as Class l.   (15)

In this decision rule for the fuzzy input vector x_p, we have to define the inequality relation between the fuzzy numbers õ_pk and õ_pl. Since the fuzzy output vector is numerically calculated by interval arithmetic on the α-cut of the fuzzy input vector x_p, we define the inequality õ_pk < õ_pl based on the α-cuts of õ_pk and õ_pl. We use the following inequality relation between the α-cuts of õ_pk and õ_pl:

[õ_pk]_α < [õ_pl]_α  ⇔  [õ_pk]_α^U < [õ_pl]_α^L,   (16)

where the superscripts "U" and "L" denote the upper limit and the lower limit of an α-cut, respectively (see Fig. 5). Using this inequality relation for
a prespecified value of α, we modify the decision rule in (15) for the fuzzy input vector as

If [õ_pk]_α^U < [õ_pl]_α^L for k = 1, 2, ..., c (k ≠ l), then classify x_p as Class l.   (17)

Fig. 7 shows the fuzzy output vector õ_B = (õ_B1, õ_B2, õ_B3) corresponding to the fuzzy input vector x_B = (5, 5) in Fig. 1. In Fig. 7, õ_B1 has no fuzziness (i.e., õ_B1 = 0.0). As shown in this figure, x_B can be classified as Class 3 by our decision rule in (17) when the value of α is larger than 0.43. If the value of α is smaller than 0.43, there is no class that satisfies our decision rule for the fuzzy output vector õ_B in Fig. 7. On the other hand, x_A = (2, 2) can be classified as Class 1 by our decision rule for almost all values of α. Fig. 8 shows the corresponding fuzzy output vector õ_A = (õ_A1, õ_A2, õ_A3). As shown in Fig. 8, each element of õ_A can be viewed as õ_A1 ≅ 1.0, õ_A2 ≅ 0.0, and õ_A3 ≅ 0.0. Thus x_A = (2, 2) is classified as Class 1 regardless of the specification of α in (17).

Fig. 7. Fuzzy output vector õ_B = (õ_B1, õ_B2, õ_B3). The shape of each fuzzy output is depicted by calculating its α-cuts for α = 0.01, 0.02, ..., 1.00.

Fig. 8. Fuzzy output vector õ_A = (õ_A1, õ_A2, õ_A3). The shape of each fuzzy output is depicted by calculating its α-cuts for α = 0.01, 0.02, ..., 1.00.
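A sketch of the interval feedforward calculation of Eqs. (13)-(14) and the decision rule (17). Because the sigmoid is monotone increasing, each output bound comes from the bounds of the weighted-sum interval: a positive weight pairs lower with lower, a negative weight pairs lower with upper. The network weights are hypothetical values chosen for illustration, not taken from the chapter.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def interval_layer(cuts, W, b):
    """Propagate interval inputs through one layer (Eqs. 13-14)."""
    out = []
    for w_row, bias in zip(W, b):
        lo = hi = bias
        for w, (xl, xu) in zip(w_row, cuts):
            # sign-split pairing of weight and interval bounds
            lo += w * xl if w >= 0 else w * xu
            hi += w * xu if w >= 0 else w * xl
        out.append((sigmoid(lo), sigmoid(hi)))
    return out

def classify_cut(out_cuts):
    """Decision rule (17): Class l wins iff every other upper bound lies
    below l's lower bound; return None when no class dominates (rejection)."""
    for l, (ll, _) in enumerate(out_cuts):
        if all(ou < ll for k, (_, ou) in enumerate(out_cuts) if k != l):
            return l
    return None

# Illustrative 2 x 2 x 2 network (hand-picked weights, not from the chapter)
W_hid, b_hid = [[4.0, 0.0], [0.0, 4.0]], [-8.0, -8.0]
W_out, b_out = [[6.0, -6.0], [-6.0, 6.0]], [0.0, 0.0]

cuts = [(2.8, 3.2), (0.8, 1.2)]   # an alpha-cut of a fuzzy input vector
hidden = interval_layer(cuts, W_hid, b_hid)
print(classify_cut(interval_layer(hidden, W_out, b_out)))   # class index 0
```

A wide input interval inflates the output intervals until no class dominates, which is exactly the rejection behavior the text describes for x_B at small α.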
As shown in Fig. 7 and Fig. 8, x_A = (2, 2) is classifiable by our decision rule for a wide range of α while x_B = (5, 5) is classifiable only for a narrower range of α. This means that the classification of x_A has a high confidence grade while that of x_B has a low confidence grade. The confidence grade of the classification of the fuzzy input vector x_p can be defined by the width of the range of α for which x_p is classifiable. In the case of Fig. 7, x_B = (5, 5) is classified as Class 3 with confidence grade 0.57. On the other hand, x_A = (2, 2) is classified as Class 1 with confidence grade 1.00. Note that the classification of the fuzzy input vector x_p is rejected regardless of the specification of α if x_p is not classifiable for α = 1.0. When x_p is classifiable for α = 1.0, the confidence grade can be efficiently searched for in the unit interval [0, 1] using a simple bisection method.

3. Training of Neural Networks from Fuzzy Training Patterns

3.1. Learning Task

Let us assume that we have m fuzzy training patterns x_p = (x_p1, ..., x_pn), p = 1, 2, ..., m, each of which has already been classified as one of c classes. Note that x_p may be a mixture of linguistic values, fuzzy numbers, intervals and real numbers, which can be represented in the same framework as fuzzy numbers using membership functions. Fuzzy training patterns may be obtained from human experts as linguistic knowledge such as "If x1 is small and x2 is large then Class 3". This fuzzy rule is handled as a fuzzy training pattern (small, large) labeled as Class 3. Fuzzy training patterns may also be obtained from measurements involving uncertain and/or missing values such as (2.45, ?, 4). Each missing value is represented by an interval including its possible values. For example, if the domain of the second feature of the pattern (2.45, ?, 4) is the unit interval [0, 1], this pattern is transformed into the equivalent fuzzy pattern (2.45, [0, 1], 4) with no missing value. Our task in this section is to train the standard feedforward neural network defined by (2)-(5) using the m fuzzy training patterns x_p = (x_p1, ..., x_pn), p = 1, 2, ..., m.

3.2. Learning Algorithm

For training the neural network, each fuzzy training pattern x_p is used as an input vector. The corresponding fuzzy output vector õ_p = (õ_p1, ..., õ_pc) is
calculated using interval arithmetic on the α-cut of x_p. A non-fuzzy target vector t_p = (t_p1, ..., t_pc) for the fuzzy output vector õ_p is defined based on the given classification of the training pattern:

t_pk = 1 if x_p belongs to Class k, and t_pk = 0 otherwise, k = 1, 2, ..., c.   (18)

The neural network is trained to decrease the difference between the non-fuzzy target vector t_p and the fuzzy output vector õ_p. Since the α-cut [õ_pk]_α of each fuzzy output õ_pk is calculated in the feedforward calculation, we first consider the difference between the α-cut [õ_pk]_α and the target t_pk. This is the difference between an interval and a real number. Let us define an error measure between [õ_pk]_α and t_pk as

e_pkα = e_pkα^L + e_pkα^U,   (19)

where e_pkα^L and e_pkα^U are defined using the lower limit [õ_pk]_α^L and the upper limit [õ_pk]_α^U of the α-cut [õ_pk]_α as

e_pkα^L = (t_pk − [õ_pk]_α^L)² / 2,   (20)

e_pkα^U = (t_pk − [õ_pk]_α^U)² / 2.   (21)

That is, e_pkα^L and e_pkα^U are squared errors for the lower limit and the upper limit of the α-cut, respectively. This definition of the error measure e_pkα is illustrated in Fig. 9.

Fig. 9. Illustration of e_pkα^L and e_pkα^U. In this figure, t_pk = 1 is assumed.

Using the error measure e_pkα for the α-cut of the fuzzy output õ_pk from the k-th output unit, we define the error measure e_pα for the α-cut of the fuzzy output vector õ_p as

e_pα = Σ_{k=1}^c e_pkα.   (22)

The connection weights and the biases of the neural network can be adjusted by the gradient descent technique as in the standard back-propagation algorithm. For example, the update rule for the connection weight w_ji can be written as

Δw_ji = −η · ∂e_pα / ∂w_ji,   (23)

where η is a learning rate (i.e., a positive constant). The explicit calculation of the partial derivative in the update rule is shown in Section 6 of this chapter. The learning can be weighted by the value of α as

Δw_ji = −η · α · ∂e_pα / ∂w_ji.   (24)

In this case, the α-cut for a high level has a larger effect on the learning than that for a low level. The neural network is trained on all the m fuzzy training patterns and various values of α. Let s be the number of different values of α (i.e., α_1, α_2, ..., α_s). When incremental learning is used, the algorithm is:

[Learning Algorithm]
Step 1: Initialization of the value of α: Let α be α_1.
Step 2: Initialization of the fuzzy input vector: Let x_p be x_1.
Step 3: Feedforward calculation: Using interval arithmetic, calculate the α-cut of the fuzzy output vector õ_p corresponding to the α-cut of the fuzzy input vector x_p.
Step 4: Adjustment: Adjust the connection weights and biases to decrease the error measure e_pα using the gradient descent technique.
Step 5: Update of the training pattern: If the current x_p is not the last fuzzy pattern x_m, replace x_p with the next fuzzy pattern and return to Step 3.
Step 6: Update of the value of α: If the current α is not the last value α_s, replace α with the next value of α and return to Step 2.
Step 7: Termination test: If a pre-specified stopping condition is satisfied, terminate the learning. Otherwise, return to Step 1.
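The learning algorithm (Steps 1-7) with the α-weighted update of Eq. (24) can be sketched as below. Two simplifications are assumed for this sketch: the analytical partial derivatives derived in Section 6 of the chapter are replaced by finite differences, and the fuzzy inputs are taken to be symmetric triangular numbers given as (center, spread) pairs.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def interval_forward(cuts, layers):
    # Propagate interval inputs through each layer (Eqs. 13-14); a positive
    # weight pairs lower with lower, a negative weight pairs lower with upper.
    for W, b in layers:
        nxt = []
        for w_row, bias in zip(W, b):
            lo = hi = bias
            for w, (xl, xu) in zip(w_row, cuts):
                lo += w * xl if w >= 0 else w * xu
                hi += w * xu if w >= 0 else w * xl
            nxt.append((sigmoid(lo), sigmoid(hi)))
        cuts = nxt
    return cuts

def e_p_alpha(layers, cuts, target):
    # Eqs. (19)-(22): squared errors of both interval bounds, summed over outputs
    out = interval_forward(cuts, layers)
    return sum((t - lo) ** 2 / 2 + (t - hi) ** 2 / 2
               for (lo, hi), t in zip(out, target))

def params_of(layers):
    # Enumerate every adjustable weight and bias as (list, index) pairs
    for W, b in layers:
        for row in W:
            for j in range(len(row)):
                yield row, j
        for j in range(len(b)):
            yield b, j

def train(layers, patterns, alphas, eta=0.25, epochs=50, h=1e-5):
    # Steps 1-7 with the alpha-weighted update of Eq. (24); finite differences
    # stand in for the analytical derivatives of Section 6.
    for _ in range(epochs):
        for alpha in alphas:                      # Steps 1 and 6
            for fuzzy_x, target in patterns:      # Steps 2 and 5
                # alpha-cut of a symmetric triangular input (center, spread)
                cuts = [(c - (1 - alpha) * s, c + (1 - alpha) * s)
                        for c, s in fuzzy_x]
                for vec, j in list(params_of(layers)):   # Step 4
                    vec[j] += h
                    e_plus = e_p_alpha(layers, cuts, target)
                    vec[j] -= 2 * h
                    e_minus = e_p_alpha(layers, cuts, target)
                    vec[j] += h
                    vec[j] -= eta * alpha * (e_plus - e_minus) / (2 * h)

# A tiny 2 x 3 x 2 network with random initial weights (illustrative only)
random.seed(1)
layers = [([[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)],
           [0.0, 0.0, 0.0]),
          ([[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)],
           [0.0, 0.0])]
patterns = [([(1.0, 1.0), (3.0, 1.0)], (1.0, 0.0)),
            ([(5.0, 1.0), (5.0, 1.0)], (0.0, 1.0))]
train(layers, patterns, [0.2, 0.4, 0.6, 0.8, 1.0])
```

The factor `eta * alpha` in the inner update is the α-weighting of Eq. (24), so high-level cuts pull harder on the weights than low-level ones, as the text describes.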
3.3. Numerical
Examples
Example 1: Let us assume that we have the nine fuzzy patterns in Table 1. The membership function of each fuzzy number is assumed to be triangular as shown in Fig. 2. That is, the membership function of a is given by Ha(x) = max{0, 1 - \x - a\} .
(25)
We trained a three-layer feedforward neural network with two input, five hidden, and three output units. In the learning, we used the weighted update rule in (24) for five values of α (i.e., α = 0.2, 0.4, 0.6, 0.8, 1.0). As in the standard back-propagation algorithm, we added a momentum term to our update rule in (24). The learning rate and the momentum constant were specified as 0.25 and 0.9, respectively. We iterated the learning algorithm 1000 times (i.e., 1000 epochs) for each α-cut of each training pattern.

Table 1. Fuzzy training patterns in Example 1.

  p    xp1   xp2   Class label
  1     1     3    Class 1
  2     2     1    Class 1
  3     3     3    Class 1
  4     1     5    Class 2
  5     3     6    Class 2
  6     5     5    Class 2
  7     4     1    Class 3
  8     6     1    Class 3
  9     6     4    Class 3
Fig. 10. Fuzzy training patterns in Example 1 and the classification boundary by the trained neural network.
Fuzzification of Neural Networks for Classification Problems
Fig. 10 shows the classification boundary obtained by the trained neural network, together with the α-cuts of the nine fuzzy patterns for α = 0.2, 0.4, 0.6, 0.8, 1.0. From this figure, we can see that the neural network was successfully trained using the nine fuzzy training patterns in Table 1.

Example 2: Let us assume that we have the six training patterns in Table 2. They are fuzzy, interval, and real vectors. Using this example, we show that neural networks can be trained even when the training data are a mixture of fuzzy numbers, intervals and real numbers. In the same manner as in the previous example, we trained a neural network with two input, five hidden, and three output units. Fig. 11 shows the classification boundary obtained by the trained neural network together with the six training patterns. From this figure, we can see that our approach can handle fuzzy numbers, intervals and real numbers in the same framework. This is because intervals and real numbers, as well as fuzzy numbers, are represented by membership functions in our approach.

Table 2. Fuzzy training patterns in Example 2.

  p    xp1       xp2       Class label
  1    1         3         Class 1
  2    [1, 6]    [1, 2]    Class 1
  3    3.0       3.0       Class 2
  4    [1, 4]    [5, 6]    Class 2
  5    5         5         Class 3
  6    5         3         Class 3
Fig. 11. Fuzzy training patterns in Example 2 and the classification boundary by the trained neural network.
4. Linguistic Rule Extraction from Neural Networks

4.1. Assumptions
In this section, we show how we can extract linguistic rules of the following type from trained neural networks.

Rule Rp : If x1 is ap1 and ... and xn is apn then Class Cp with CFp ,  (26)

where api is an antecedent linguistic value such as "small" and "large". While neural networks simplified by pruning algorithms are usually used for rule extraction in the literature, we do not assume any special network structure or learning algorithm in this section. Our approach is applicable to arbitrary neural networks with the standard feedforward architecture. In this section, we assume that a standard three-layer feedforward neural network in (2)-(5) has already been trained for an n-dimensional pattern classification problem with c classes. We also assume that a set of linguistic values is given for each feature of the pattern classification problem by human users. For example, weight may be described in some situation by the three linguistic values "light", "middle", and "heavy" as shown in Fig. 12. Of course, different sets of linguistic values should be used in different situations.
Fig. 12. Linguistic values "light", "middle", and "heavy."

4.2. Rule Extraction
In our rule extraction method, the antecedent linguistic values ap1, ..., apn are presented to the trained neural network as the fuzzy input vector xp = ap = (ap1, ..., apn). The determination of the consequent class Cp is the same as the classification of the fuzzy input vector ap discussed in Section 2. The certainty grade CFp is specified as the confidence grade of the classification of ap. The calculation of the confidence grade has already been described in Section 2. In this manner, we can determine the consequent class Cp and certainty grade CFp for the antecedent linguistic values ap1, ..., apn using the trained neural network. The value of CFp can be used to decrease the number of extracted linguistic rules. For example, we can specify a lower bound CFmin for CFp. We extract the corresponding linguistic rule Rp only when CFp is larger than or equal to the lower bound CFmin.
The antecedent part of each linguistic rule is specified as a combination of the given linguistic values. Let Ki be the number of the given linguistic values for the i-th feature (i.e., i-th input). In addition to the Ki linguistic values for the i-th feature, we also use "don't care" as an antecedent linguistic value. Thus the total number of combinations of antecedent linguistic values for the n features is (K1 + 1) × ... × (Kn + 1). When the domain of the i-th feature is an interval Di, "don't care" for the i-th feature can be viewed as the interval Di itself. That is, the membership function of "don't care" for the i-th feature is specified as

μ_don't care(xi) = { 1, if xi ∈ Di,
                   { 0, otherwise.     (27)
To illustrate our rule extraction method, we show a simulation result using the trained neural network in Fig. 1. Let us assume that the pattern space in Fig. 1 is [0, 6] × [0, 6]. This means that [0, 6] is the domain of each feature (i.e., D1 = D2 = [0, 6]). In our approach, we need a set of linguistic values for each feature. We assume that the five linguistic values in Fig. 13
Fig. 13. Five linguistic values "small", "medium small", "medium", "medium large", and "large."
are given for each feature. Thus our task is to extract linguistic rules of the following type from the trained neural network in Fig. 1 using the five linguistic values in Fig. 13.

Rule Rp : If x1 is ap1 and x2 is ap2 then Class Cp with CFp .  (28)
In this task, we have (5 + 1) × (5 + 1) combinations of antecedent linguistic values: (ap1, ap2) = (don't care, don't care), (don't care, small), ..., (large, large). Each of these combinations was presented to the trained neural network as a fuzzy input vector. Then the consequent class and the certainty grade were determined by the corresponding fuzzy output vector. In this manner, we extracted linguistic rules from the trained neural network. The threshold value CFmin was specified as CFmin = 0.01. The extracted linguistic rules are shown in Fig. 14, where each real number in parentheses is the certainty grade of the corresponding linguistic rule. In addition to the 25 linguistic rules in Fig. 14, the following rule was also extracted:

If x1 is large and x2 is don't care then Class 3 with 0.08 .  (29)

Since the "don't care" condition can be omitted, this linguistic rule is simplified as

If x1 is large then Class 3 with 0.08 .  (30)
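The enumeration described above can be sketched as follows. The `classify_fuzzy` callback stands in for presenting one antecedent combination to the trained network as a fuzzy input vector and reading off the consequent class and certainty grade; both names are ours, used only for illustration.

```python
from itertools import product


def extract_rules(linguistic_values, classify_fuzzy, cf_min=0.01):
    """Enumerate all (K_1 + 1) x ... x (K_n + 1) antecedent combinations
    (each feature's values plus "don't care") and keep the rules whose
    certainty grade reaches cf_min.

    linguistic_values: list of label lists, one per feature.
    classify_fuzzy:    maps an antecedent tuple to (consequent_class, cf).
    """
    rules = []
    choices = [values + ["don't care"] for values in linguistic_values]
    for antecedent in product(*choices):
        consequent, cf = classify_fuzzy(antecedent)
        if cf >= cf_min:
            rules.append((antecedent, consequent, cf))
    return rules
```

With five linguistic values per feature and two features, exactly 36 combinations are examined, matching the (5 + 1) × (5 + 1) count above.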
From Fig. 1 and Fig. 14, we can see that linguistic rules near the classification boundary have small certainty grades. On the other hand,
[Figure: a 5 × 5 grid of extracted linguistic rules, each cell showing a consequent class with its certainty grade in parentheses; the grades range from 0.04 to 1.00.]
Fig. 14. Extracted linguistic rules. Each real number in parentheses shows the certainty grade of the corresponding linguistic rule.
Fig. 15. Classification boundary induced by the extracted linguistic rules.
linguistic rules far from the classification boundary have large certainty grades. Fig. 15 shows the classification boundary generated by the extracted linguistic rules. To depict the classification boundary, we used a fuzzy reasoning method based on a single winner rule (Ishibuchi et al. 1992, Ishibuchi et al. 1999b). From the comparison between Fig. 1 and Fig. 15, we can see that the classification boundary induced by the extracted linguistic rules is similar to that induced by the trained neural network.
4.3. Rule Selection

Our rule extraction method examines all combinations of antecedent linguistic values. This leads to an exponential increase in the number of examined linguistic rules relative to the dimensionality of the pattern classification problem. A simple trick for avoiding such an exponential increase is to examine only general linguistic rules with a few antecedent conditions. In other words, only linguistic rules with many "don't care" conditions are examined. Let us define the length of a linguistic rule by the number of its antecedent conditions excluding "don't care" conditions. For example, the length of the linguistic rule in (29) is one. Even when the total number of combinations of antecedent linguistic values is huge, the number of general linguistic rules is not large. General (i.e., short) rules are also preferable to specific (i.e., long) rules from the viewpoint of interpretability. It is usually difficult for human users to intuitively understand the meaning of a long linguistic rule with many antecedent conditions. A small number of relevant rules can be selected from a large number of extracted rules using genetic algorithms (Ishibuchi et al. 1997). Let us
assume that r linguistic rules are extracted from the trained neural network. In this case, any subset of the extracted rules can be represented by a binary string of length r. Genetic algorithms (Holland 1975 and Goldberg 1989) can be used to find a small number of linguistic rules that yield a high classification rate. The fitness value of each subset is defined by the number of linguistic rules and the classification rate on the training patterns.

5. Training of Neural Networks from Linguistic Rules

5.1. Learning Task

Our task in this section is to train neural networks using linguistic rules. As in the previous sections, we use a three-layer feedforward neural network with n input and c output units for an n-dimensional pattern classification problem with c classes. We assume that m linguistic rules Rp, p = 1, 2, ..., m, of the form (26) are given for training the neural network.

5.2. Learning from Linguistic Rules with No Certainty Grades

First let us consider a simpler case where the given linguistic rules have no certainty grades. In this case, the linguistic rules can be written as follows:

Rule Rp : If x1 is ap1 and ... and xn is apn then Class Cp ,  p = 1, 2, ..., m.  (31)

These linguistic rules can be viewed as fuzzy training patterns ap = (ap1, ..., apn), p = 1, 2, ..., m. The target vector tp = (tp1, ..., tpc) for each fuzzy training pattern ap is specified by the corresponding consequent class Cp in the same manner as in Section 3. The neural network can be trained in the same manner as in Section 3 using the m input-target pairs (ap, tp), p = 1, 2, ..., m. Since linguistic values, fuzzy numbers, intervals and real numbers can be handled in the same way by representing them as membership functions, various kinds of information (e.g., linguistic rules, fuzzy patterns, interval patterns, and real patterns) can be simultaneously utilized in the learning of neural networks. This is the main advantage of our approach over standard learning algorithms that can handle only numerical data.
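The conversion of a linguistic rule of the form (31) into an input-target pair can be sketched as follows. We assume here the usual target coding in which the output for the consequent class is 1 and the others are 0; Section 3, where the coding is defined, is not reproduced above, so treat this as an assumption.

```python
def rule_to_training_pair(antecedents, consequent_class, n_classes):
    """Convert one linguistic rule (31) into an (input, target) pair.

    antecedents:      tuple of antecedent linguistic values (a_p1, ..., a_pn);
                      this tuple *is* the fuzzy training pattern a_p.
    consequent_class: consequent class C_p, 1-indexed as in the text.
    n_classes:        number of classes c.
    """
    target = [0.0] * n_classes           # t_p = (t_p1, ..., t_pc)
    target[consequent_class - 1] = 1.0   # assumed one-hot coding for Class C_p
    return antecedents, target
```

For example, the rule "If x1 is small and x2 is large then Class 2" in a three-class problem yields the pair (("small", "large"), [0, 1, 0]).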
5.3. Learning from Linguistic Rules with Certainty Grades
When each of the given linguistic rules Rp, p = 1, 2, ..., m, has the certainty grade CFp, it can be used as the weight (or importance) of Rp in the learning of the neural network. More specifically, we can define the cost function e_pα for the α-cut of the fuzzy input vector ap as

e_pα = CFp · Σ_{k=1}^{c} e_pkα ,  (32)
where e_pkα is defined from the fuzzy input vector ap and the target vector tp in the same manner as in Section 3. To illustrate the effect of having a certainty grade for each linguistic rule on the learning of neural networks, let us assume that we have the following two linguistic rules for a two-dimensional pattern classification problem in the pattern space [0, 6] × [0, 6]:

Rule R1 : If x1 is large then Class 1 with CF1 = 1.0 ,  (33)

Rule R2 : If x2 is large then Class 2 with CF2 = 0.3 ,  (34)
where we use the same membership function for the linguistic value "large" as in the previous section (i.e., as in Fig. 13). These two linguistic rules conflict with each other in the input region compatible with both of them (i.e., the area described as "x1 is large and x2 is large"). We also assume that we have the numerical training data in Fig. 16. We trained a neural network with two input, five hidden, and three output units using these numerical
Fig. 16. Classification boundary by the trained neural network for the case of CF1 = 1.0 and CF2 = 0.3.
Fig. 17. Classification boundary by the trained neural network for the case of CF1 = 0.3 and CF2 = 1.0.
data and the two linguistic rules. Fig. 16 shows the classification boundary generated by the trained neural network after 1000 epochs. Let us focus our attention on the conflicting region with large x1 and large x2. As we can see from Fig. 16, this region is classified as Class 1 (i.e., the consequent class of the stronger rule). For comparison, we also trained the same neural network by specifying the certainty grades as CF1 = 0.3 and CF2 = 1.0. In this case, we obtained the classification boundary in Fig. 17, where the conflicting region is classified as Class 2 (i.e., the consequent class of the stronger rule).

6. Interval-Arithmetic-Based Neural Networks

In this section, we explicitly describe interval arithmetic and a learning algorithm for neural networks with interval input vectors. We also describe the learning from expert knowledge and the classification of incomplete patterns with missing inputs. Such classifications can be utilized for decreasing the number of inputs to be measured (i.e., for decreasing the measurement cost).

6.1. Feedforward Calculation

We first show the feedforward calculation in our three-layer feedforward neural network in (2)-(5) for an interval input vector Xp = (Xp1, ..., Xpn) where Xpi = [x^L_pi, x^U_pi]. If we view the interval input vector Xp as the α-cut of the fuzzy input vector xp, the following descriptions are directly related to the fuzzified neural networks in the previous sections.
The interval output Opi = [o^L_pi, o^U_pi] from each input unit i is the same as the interval input Xpi = [x^L_pi, x^U_pi]. Using interval arithmetic, the interval output Opj = [o^L_pj, o^U_pj] from each hidden unit j is calculated as

o^L_pj = f( Σ_{w_ji ≥ 0} w_ji · o^L_pi + Σ_{w_ji < 0} w_ji · o^U_pi + θj ) ,  (35)

o^U_pj = f( Σ_{w_ji ≥ 0} w_ji · o^U_pi + Σ_{w_ji < 0} w_ji · o^L_pi + θj ) .  (36)
u k] from each hidden In the same manner, the interval output Opk = [OpJpk' o^ k, pk unit k is calculated as
°pk
= /| E
w
kj°pj+
yiufc;)>o °pk
= 'l
E
w
E
fej°TO+6'fe
(37)
w
(38)
u)fcj
\wkj>0
kj°pj+
E
kjOpj+ok
t«fcj<0
The following operations on intervals are used in the above calculations:

A + B = [a^L, a^U] + [b^L, b^U] = [a^L + b^L, a^U + b^U] ,  (39)

w · A = w · [a^L, a^U] = [w · a^L, w · a^U] if w ≥ 0 ;  [w · a^U, w · a^L] if w < 0 ,  (40)

f(X) = f([x^L, x^U]) = [f(x^L), f(x^U)] .  (41)
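The feedforward calculation in (35)-(41) can be sketched compactly as follows, assuming the logistic sigmoid as the activation function f (an increasing function, so (41) applies); the function names are ours.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def interval_dot(weights, intervals, bias):
    """Interval-valued net input (Eqs. (39)-(40) applied to (35)-(38)):
    a positive weight takes the matching bound, a negative weight the
    opposite bound."""
    lo = hi = bias
    for w, (l, u) in zip(weights, intervals):
        if w >= 0:
            lo += w * l
            hi += w * u
        else:
            lo += w * u
            hi += w * l
    return lo, hi


def interval_layer(weight_matrix, biases, intervals):
    """One layer of the interval feedforward pass; since the sigmoid is
    increasing, it maps interval bounds to interval bounds (Eq. (41))."""
    out = []
    for weights, b in zip(weight_matrix, biases):
        lo, hi = interval_dot(weights, intervals, b)
        out.append((sigmoid(lo), sigmoid(hi)))
    return out
```

A degenerate interval [x, x] reproduces the ordinary (point-valued) forward pass, which is how real-valued inputs are handled as intervals with no width later in this section.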
6.2. Derivation of Learning Algorithm
As in Section 3, the target vector tp = (tp1, ..., tpc) is specified from the label (i.e., classification) of the interval input vector Xp. A cost function is defined from the interval output vector Op = (Op1, ..., Opc) and the target vector tp as

e_p = Σ_{k=1}^{c} (t_pk − o^L_pk)² / 2 + Σ_{k=1}^{c} (t_pk − o^U_pk)² / 2 .  (42)
In the gradient descent learning, the connection weights w_kj and w_ji are updated as

w_kj^new = w_kj^old − η · ∂e_p/∂w_kj ,  (43)

w_ji^new = w_ji^old − η · ∂e_p/∂w_ji .  (44)
The biases are also updated in the same manner. The partial derivative ∂e_p/∂w_kj is calculated as follows.

(1) When w_kj ≥ 0,

∂e_p/∂w_kj = −δ^L_pk · o^L_pj − δ^U_pk · o^U_pj ,  (45)

where

δ^L_pk = (t_pk − o^L_pk) · o^L_pk · (1 − o^L_pk) ,  (46)

δ^U_pk = (t_pk − o^U_pk) · o^U_pk · (1 − o^U_pk) .  (47)

(2) When w_kj < 0,

∂e_p/∂w_kj = −δ^L_pk · o^U_pj − δ^U_pk · o^L_pj .  (48)

The partial derivative ∂e_p/∂w_ji is calculated as follows.

(1) When w_ji ≥ 0,

∂e_p/∂w_ji = −( Σ_{w_kj ≥ 0} δ^L_pk · w_kj + Σ_{w_kj < 0} δ^U_pk · w_kj ) · o^L_pj · (1 − o^L_pj) · o^L_pi
            − ( Σ_{w_kj ≥ 0} δ^U_pk · w_kj + Σ_{w_kj < 0} δ^L_pk · w_kj ) · o^U_pj · (1 − o^U_pj) · o^U_pi .  (49)
(2) When w_ji < 0,

∂e_p/∂w_ji = −( Σ_{w_kj ≥ 0} δ^L_pk · w_kj + Σ_{w_kj < 0} δ^U_pk · w_kj ) · o^L_pj · (1 − o^L_pj) · o^U_pi
            − ( Σ_{w_kj ≥ 0} δ^U_pk · w_kj + Σ_{w_kj < 0} δ^L_pk · w_kj ) · o^U_pj · (1 − o^U_pj) · o^L_pi .  (50)
6.3. Learning from Expert Knowledge
Expert knowledge may be represented in various forms such as linguistic rules and prototypes. One simple form is based on intervals. Let us assume that we have the following rules for a two-dimensional pattern classification problem with the pattern space [0, 6] × [0, 6]:

If x1 ≤ 2 then Class 1 ,  (51)

If 3 ≤ x1 and 5 ≤ x2 then Class 2 ,  (52)

If 3 ≤ x1 ≤ 5 and 2 ≤ x2 ≤ 4 then Class 3 .  (53)

Since the domain of each feature (i.e., each input) is the interval [0, 6], these rules are rewritten as

If x1 is in [0, 2] and x2 is in [0, 6] then Class 1 ,  (54)

If x1 is in [3, 6] and x2 is in [5, 6] then Class 2 ,  (55)

If x1 is in [3, 5] and x2 is in [2, 4] then Class 3 .  (56)
Thus these rules are handled as three interval training patterns ([0, 2], [0, 6]), ([3, 6], [5, 6]) and ([3, 5], [2, 4]). Using these interval training patterns and some numerical training data, we trained a neural network with two input, five hidden and three output units by the above learning algorithm. In the experiment, real numbers were handled as a special case of intervals with no width. Our learning algorithm was iterated 1000 times (i.e., 1000 epochs) for each pattern, with an added momentum term. In Fig. 18, we show the classification boundary obtained from the trained neural network, together with the training data used in the learning. From this figure, we can see that all the interval and non-interval training patterns are correctly classified by the trained neural network.
Fig. 18. Classification boundary and training patterns.

6.4. Decreasing the Measurement Cost
As we have already explained in Section 2, incomplete input patterns with missing inputs can be represented as interval vectors where each missing input is replaced with its domain interval. Let us assume that we have an input pattern (?, 2) with a missing input. If we use the trained neural network in Fig. 18, this input pattern is represented by an interval vector X_A = ([0, 6], [2, 2]) since the domain of each input in Fig. 18 is the interval [0, 6]. This interval vector X_A is presented to the trained neural network. The corresponding interval output vector O_A = (O_A1, O_A2, O_A3) is calculated as shown in Fig. 19. The three interval outputs totally overlap with each other. Thus we cannot classify the input pattern (?, 2) from the interval output vector. This corresponds to the fact that we cannot
Fig. 19. Interval output vector corresponding to the input pattern (?, 2).
classify the input pattern (?, 2) in Fig. 18. On the other hand, we can classify another input pattern (1, ?) in Fig. 18. Even when the value of the second input x2 is missing, we can classify the input pattern (1, ?) as Class 1 from Fig. 18. This input pattern is handled as an interval input vector X_B = ([1, 1], [0, 6]). The corresponding interval output vector is calculated as O_B = (O_B1, O_B2, O_B3) ≈ ([1, 1], [0, 0], [0, 0]). Thus we can classify the input vector (1, ?) as Class 1 from the interval output vector. Let us describe the above discussion more formally. For classifying an input vector xp with missing inputs by a trained neural network, first such an input vector is denoted by an interval input vector Xp with no missing inputs. Each missing input is represented by its domain interval. Each of the given (i.e., measured) inputs is represented by the equivalent interval with no width (e.g., Xpi = [xpi, xpi]). Note that the interval input vector Xp includes all the possible values of the missing inputs. That is, xp ∈ Xp always holds for any actual value of each missing input. Then the interval input vector Xp is presented to the trained neural network for calculating the interval output vector Op = (Op1, ..., Opc). We use the following classification rule for examining the classifiability of the interval input vector Xp:

If o^U_pk < o^L_pl for k = 1, 2, ..., c (k ≠ l), then classify Xp as Class l.  (57)

Note that the following relation holds if o^U_pk < o^L_pl holds:

o_pk < o_pl for ∀o_pk ∈ O_pk and ∀o_pl ∈ O_pl .  (58)

Thus any input vector xp in Xp is classified as the same class as Xp when Xp is classifiable by our decision rule. In other words, Xp is classifiable by our decision rule only when the whole input region specified by Xp is classified as the same class by the trained neural network. In this case, we can classify the input pattern with missing inputs before acquiring the actual value of each missing input. The above idea can be used for decreasing the measurement cost for classifying new patterns in the classification phase. In standard applications of neural networks to pattern classification problems, all the input values should be measured for classifying new patterns. As we have described, not all input values are always necessary for the classification purpose. For example, a new pattern with x1 = 1 can be classified as Class 1 in Fig. 18 without measuring the second input x2. This example suggests that we may
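The decision rule (57) can be sketched as follows; it reports a 0-indexed class index, or None when the interval outputs overlap and the pattern is unclassifiable (the function name is ours).

```python
def classify_intervals(outputs):
    """Decision rule (57): assign Class l only when the lower bound of
    output l strictly exceeds the upper bounds of all other outputs;
    otherwise the interval input is unclassifiable.

    outputs: list of (lower, upper) interval outputs, one per class.
    Returns the 0-indexed winning class, or None.
    """
    for l, (lower_l, _) in enumerate(outputs):
        if all(upper_k < lower_l
               for k, (_, upper_k) in enumerate(outputs) if k != l):
            return l
    return None
```

For the pattern (1, ?) above, an output vector close to ([1, 1], [0, 0], [0, 0]) satisfies the rule for the first class, while the totally overlapping outputs of the pattern (?, 2) satisfy it for no class.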
be able to decrease the measurement cost if we first measure the first input x1. The second input is to be measured only when the new pattern is not classifiable by the first input alone (i.e., only when (xp1, ?) is not classifiable). We cannot, however, decrease the measurement cost if we first measure the second input x2 in Fig. 18. From the above discussions, we can see that the measurement order should be appropriately specified for efficiently decreasing the measurement cost. Let us assume that we have m training patterns. We also assume that a neural network has already been trained. In this case, we can determine the measurement order of the n inputs (i.e., n features) by the following procedure.

[Determination of the measurement order]
Step 1: Let Ψ = {x1, ..., xn}, where Ψ is the set of unmeasured inputs.
Step 2: Perform the following procedures (a)-(c) for all combinations of (|Ψ| − 1) inputs from Ψ, where |Ψ| denotes the number of inputs in Ψ.
  (a) Select (|Ψ| − 1) inputs from Ψ.
  (b) Classify the m training patterns using only the selected inputs. The other inputs are handled as missing inputs.
  (c) Calculate the classification rate in (b).
Step 3: Replace Ψ with the set of (|Ψ| − 1) inputs that has the highest classification rate in Step 2. If Ψ includes only a single input, stop the algorithm. Otherwise return to Step 2.
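The procedure above is a greedy backward elimination and can be sketched as follows. The `classification_rate` callback stands in for Step 2(b)-(c), i.e., classifying the m training patterns with the trained network while treating the unselected inputs as missing; the names are ours.

```python
from itertools import combinations


def measurement_order(n_inputs, classification_rate):
    """Greedy backward elimination (Steps 1-3 above).

    classification_rate: maps a frozenset of input indices to the
    classification rate achieved using only those inputs (the rest
    handled as missing, i.e., as domain intervals).
    Returns the input indices in the order they should be measured.
    """
    remaining = frozenset(range(n_inputs))  # Step 1: all inputs unmeasured
    dropped = []                            # inputs removed, least useful first
    while len(remaining) > 1:
        # Step 2: try every subset that removes exactly one input
        best = max((frozenset(c)
                    for c in combinations(sorted(remaining), len(remaining) - 1)),
                   key=classification_rate)
        dropped.append(next(iter(remaining - best)))
        remaining = best                    # Step 3
    dropped.append(next(iter(remaining)))
    # The input surviving longest is the most informative: measure it first.
    return dropped[::-1]
```

Since an input eliminated late is one the classifier could least afford to lose, reversing the elimination order yields a measurement order in which the most informative inputs are measured first.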
7. Conclusion

In this chapter, we described several approaches for applying fuzzified neural networks to pattern classification problems. The basic idea in this chapter is to handle different kinds of information (i.e., real numbers, intervals, fuzzy numbers, and linguistic values) in the same framework using membership functions. Real numbers and intervals are viewed as special cases of fuzzy numbers. As a result, human knowledge and numerical data can be simultaneously utilized in the learning of neural networks. Our basic idea makes it possible to classify a mixed input pattern of real numbers, intervals, fuzzy numbers and linguistic values by neural networks. We also
described interval-arithmetic-based neural networks. Our approach to decreasing the measurement cost can be implemented without causing any deterioration in the classification ability, because the measurement for each new pattern is continued until it becomes classifiable. One important issue, which was not discussed in this chapter, is the increase of fuzziness during the feedforward calculation in fuzzified neural networks (Ishibuchi et al. 1999a). Since the input-output relation is defined by fuzzy arithmetic, fuzzy outputs from output units always include excess fuzziness. Such excess fuzziness has a bad effect on the classification of fuzzy input vectors and on linguistic rule extraction. Interval outputs from output units also include excess width. Such excess width has a bad effect on the classification of interval input vectors and on the decrease of the measurement cost.

References

1. Buckley J. J. and Hayashi Y. "Fuzzy neural networks: A survey", Fuzzy Sets and Systems 66, 1-13 (1994).
2. Buckley J. J. and Hayashi Y. "Can neural nets be universal approximators for fuzzy functions?", Fuzzy Sets and Systems 101, 323-330 (1999).
3. Buckley J. J. and Feuring T. "Universal approximators for fuzzy functions", Fuzzy Sets and Systems 113, 411-415 (2000).
4. Chen J. L. and Chang J. Y. "Fuzzy perceptron neural networks for classifiers with numerical data and linguistic rules as inputs", IEEE Trans. on Fuzzy Systems 8, 730-745 (2000).
5. Dunyak J. and Wunsch D. "A training technique for fuzzy number neural networks", Proc. of 1997 IEEE International Conference on Neural Networks, 533-536 (1997).
6. Dunyak J. and Wunsch D. "Fuzzy number neural networks", Fuzzy Sets and Systems 108, 49-58 (1999).
7. Dunyak J. and Wunsch D. "Fuzzy regression by fuzzy number neural networks", Fuzzy Sets and Systems 112, 371-380 (2000).
8. Feuring T. "Learning in fuzzy neural networks", Proc.
of 1996 IEEE International Conference on Neural Networks, 1061-1066 (1996).
9. Goldberg D. E. Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading (1989).
10. Hayashi Y., Buckley J. J., and Czogala E. "Fuzzy neural network with fuzzy signals and weights", International Journal of Intelligent Systems 8, 527-537 (1993).
11. Holland J. H. Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor (1975).
12. Ishibuchi H. and Tanaka H. "An extension of the BP-algorithm to interval input vectors", Proc. of IEEE International Joint Conference on Neural Networks, 1588-1593 (1991).
13. Ishibuchi H., Nozaki K., and Tanaka H. "Distributed representation of fuzzy rules and its application to pattern classification", Fuzzy Sets and Systems 52, 21-32 (1992).
14. Ishibuchi H., Fujioka R., and Tanaka H. "Neural networks that learn from fuzzy if-then rules", IEEE Transactions on Fuzzy Systems 1, 85-97 (1993).
15. Ishibuchi H., Tanaka H., and Okada H. "Interpolation of fuzzy if-then rules by neural networks", International Journal of Approximate Reasoning 10, 3-27 (1994).
16. Ishibuchi H., Kwon K., and Tanaka H. "A learning algorithm of fuzzy neural networks with triangular fuzzy weights", Fuzzy Sets and Systems 71, 277-293 (1995a).
17. Ishibuchi H., Morioka K., and Turksen I. B. "Learning by fuzzified neural network", International Journal of Approximate Reasoning 13, 327-358 (1995b).
18. Ishibuchi H. and Nii M. "Generating fuzzy if-then rules from trained neural networks: Linguistic analysis of neural networks", Proc. of 1996 IEEE International Conference on Neural Networks, 1133-1138 (1996).
19. Ishibuchi H., Nii M., and Murata T. "Linguistic rule extraction from neural networks and genetic-algorithm-based rule selection", Proc. of 1997 IEEE International Conference on Neural Networks, 2390-2395 (1997).
20. Ishibuchi H. and Nii M. "Minimizing the measurement cost in the classification of new samples by neural-network based classifiers", Proc. of 5th International Conference on Soft Computing and Information/Intelligent Systems (IIZUKA'98), 634-637 (1998).
21. Ishibuchi H., Nii M., and Tanaka K. "Decreasing excess fuzziness in fuzzy outputs from neural networks for linguistic rule extraction", Proc. of International Joint Conference on Neural Networks, CD-ROM Proceedings (1999a).
22. Ishibuchi H., Nakashima T., and Morisawa T.
"Voting in fuzzy rule-based systems for pattern classification problems", Fuzzy Sets and Systems 103, 223-238 (1999b). 23. Ishibuchi H. and Nii M. "Neural networks for soft decision making", Fuzzy Sets and Systems 115, 121-140 (2000). 24. Ishibuchi H. and Nii M. "Fuzzy regression using asymmetric fuzzy coefficients and fuzzified neural networks", Fuzzy Sets and Systems 119, 273-290 (2001). 25. Kaufmann A. and Gupta M. M. Introduction to Fuzzy Arithmetic, Van Nostrand Reinhold, New York (1985). 26. Krishnamraju P. V., Buckley J. J., Reilly K. D., and Hayashi Y. "Genetic learning algorithms for fuzzy neural nets", Proc. of 1994 IEEE International Conference on Fuzzy Systems, 1969-1974 (1994). 27. Kuo R. J. and Xue K. C. "Fuzzy neural networks with application to sales forecasting", Fuzzy Sets and Systems 108, 123-143 (1999).
28. Kuo R. J., Chen C. H., and Hwang Y. C. "An intelligent stock trading decision support system through integration of genetic algorithm based fuzzy neural network and artificial neural network", Fuzzy Sets and Systems 118, 21-45 (2001).
29. Moore R. E. Methods and Applications of Interval Analysis, SIAM Studies in Applied Mathematics, Philadelphia (1979).
30. Rumelhart D. E., McClelland J. L., and the PDP Research Group Parallel Distributed Processing, MIT Press, Cambridge, Massachusetts (1986).
31. Teodorescu H. N. and Arotaritei D. "Analysis of learning algorithms performance for algebraic fuzzy neural networks", Proc. of 1997 International Fuzzy Systems Association World Congress IV, 468-473 (1997).
CHAPTER 2

ADAPTIVE GRAPHIC PATTERN RECOGNITION: FOUNDATIONS AND PERSPECTIVES
Giovanni Adorni and Stefano Cagnoni Dipartimento di Ingegneria dell'Informazione Universita di Parma, Viale delle Scienze, Parma, Italy
Marco Gori Dipartimento di Ingegneria dell'Informazione Universita di Siena, Via Roma, 56, 53100 Siena, Italy In this chapter we propose a new approach to pattern recognition, referred to as adaptive graphic pattern recognition, which lies in between decision-theoretic and structural pattern recognition. In particular we focus on the extension of classic supervised neural network-based approaches to pattern recognition, and show that the classic backpropagation learning scheme can naturally be extended to the case of patterns that are represented by directed ordered acyclic graphs. More general graphs can easily express complex patterns, but we demonstrate that the corresponding extension of classic neural network architectures and learning algorithms is less effective. This extended view of neural networks operating on graphs gives rise to a new wave of connectionist-based techniques. Experimental results on problems of pattern classification and image retrieval clearly indicate the effectiveness of the proposed approach, especially when neither purely structural nor purely subsymbolic representations are appropriate.
1. Introduction

In this chapter we propose a general framework for the development of a novel approach to pattern recognition which is based on learning in graphic domains. The data of these domains simultaneously possess the highly structured representation of classical syntactic and structural approaches,
G. Adorni, S. Cagnoni & M. Gori
and the sub-symbolic capabilities of decision-theoretic models, typical of connectionism and statistics. Preliminary efforts have been made to construct a general framework for these learning schemes.1 In this chapter we focus on the extension of classic neural network-based approaches to pattern recognition, with an emphasis on supervised learning. In particular, we show that the classic backpropagation learning scheme can naturally be extended to the case of patterns that are represented by directed ordered acyclic graphs. Unlike other more general graphical structures, this kind of abstract representation is sometimes a convoluted representation of the original pattern. However, it offers one of the simplest mechanisms for expressing the pattern's structure. More general graphs can easily express complex patterns, but we demonstrate that the corresponding extension of classic neural network architectures and learning algorithms is less effective. In the simplest case, the computation carried out by neural networks on graphs is independent of their nodes. This is in fact a strong hypothesis that dramatically simplifies the learning scheme but, unfortunately, it is not very appropriate to all pattern recognition problems. We briefly review recent attempts to face this problem, as well as other open problems in this area. This extended view of neural networks operating on graphs gives rise to a new wave of connectionist-based techniques for pattern recognition, which is in some sense in between traditional decision-theoretic and structural approaches. Throughout this chapter, this new approach is referred to as adaptive graphic pattern recognition, and its applications to pattern classification and image retrieval tasks are briefly reviewed. The chapter is organized as follows.
In Section 2 we introduce the problem of pattern representation from the two opposite viewpoints of decision-theoretic and structured pattern recognition. Section 3 briefly reviews the most popular approaches to pattern representation involving complex data structures. Section 4 introduces the basic concepts behind neural processing of structured data, while Section 5 proposes the new general framework of graphical pattern recognition. Finally, we draw some conclusions in Section 6 and give a perspective on research in the field.

2. Decision-Theoretic Versus Structural Approaches

In the last three decades, the emphasis in pattern recognition research has swung, pendulum-like, between decision-theoretic and structural approaches.
Adaptive Graphic Pattern Recognition: Foundations and Perspectives
Decision-theoretic methods are essentially based on numerical features that provide a global representation of a pattern via an appropriate preprocessing algorithm. Many different decision-theoretic methods have been developed in the framework of connectionist models, which operate on sub-symbolic pattern representations. On the other hand, syntactic and structural pattern recognition methods (and, additionally, artificial intelligence-based methods) have been developed that emphasize the symbolic nature of patterns. Since their main focus is on expectations that can be derived from previous knowledge of the components detected in the patterns under consideration, such methods are often referred to as "knowledge-based" methods. These different approaches to pattern recognition have given rise to the long-standing debate between traditional AI methods, based on symbols, and computational intelligence methods, which operate on numbers.2 However, both purely decision-theoretic and syntactical/structural approaches are of limited value when applied to many interesting real-world problems, for different reasons. Syntactical and structural methods can model the structure of patterns. However, these methods are not well suited for dealing with patterns corrupted by noise. This limitation was recognized early on, and several approaches have been pursued to incorporate statistical properties into structured approaches. The symbols used for either syntactical or structural approaches have been enriched with attributes, which are in fact vectors of real numbers representing appropriate features of the patterns. These attributes are expected to allow some statistical variability in the patterns under consideration. Error-correction mechanisms have been introduced to deal with either noise or distortions.3 Additionally, symbolic string parsing has been extended using stochastic grammars to model the uncertainty and randomness of the accepted strings.3,4 In these approaches, a probability measure is attached to the productions of the grammar G, and accepted strings can be assigned an attribute representing the probability with which they belong to the class represented by G. Lu and Fu5 combined error-correction parsing and stochastic grammars to attain a better integration with statistical approaches. A comprehensive survey on the incorporation of statistical approaches into syntactical and structural pattern recognition can be found in Ref. 6. Likewise, related approaches that integrate AI-based and decision-theoretic methods can be found in Ref. 7.
On the other hand, either parametric or non-parametric statistical methods can nicely deal with distorted patterns and noise, but they are severely limited in all cases in which the patterns are strongly structured. The feature extraction process in those cases seems to be inherently ill-posed; the features are either global or degenerate to the pixel level. The renewal of interest in artificial neural networks which began in the mid-eighties suggested shifting the emphasis from the complex task of feature selection and extraction to the development of effective architectures and learning algorithms. In principle, neural networks are capable of extracting by themselves optimal features for classification during the learning process. "Learning from examples," which is typical of connectionist models, does not require specific assumptions on the data probability distribution. The field of neural networks, however, has now reached the point where it is necessary to state that neglecting the issue of pattern representation and relying exclusively on learning is neither theoretically nor experimentally justified. There is evidence to claim that complex pattern recognition tasks require architectures with a huge number of parameters to make the loading of the weights effective. This makes generalization to new examples very hard.8 On the other hand, the adoption of architectures with few parameters, which would facilitate generalization, results in very hard optimization problems which are typically populated by many suboptimal local minima. The nature of this problem is partially addressed in the critical analyses on connectionist models by Fodor and Pylyshyn9 and Minsky.10
3. On the Extraction of Structured Representations

As pointed out early on by Wiener, a pattern can often be regarded as an arrangement characterized by the order of the elements of which it is made, rather than by the intrinsic nature of these elements. Hence, the causal, hierarchical, and topological relations between parts of a given pattern yield significant information which seems to be useful in human recognition processes.
As pointed out in the previous section, structured pattern descriptors are usually opposed to feature-based descriptors, and can be regarded as high-level vs. low-level representations, respectively. However, we notice that while representing patterns using a set of features does not require making the structure of the patterns explicit, when using structured
representations one deals with a high level of abstraction and, moreover, can take low-level features into account. If one considers the inherently structured nature of some patterns (e.g. logos, sketches) and the different levels of abstraction at which they can be analyzed, hierarchical pattern representations are a rather natural choice. For the purpose of this chapter, structured representations can be divided into knowledge-based representations, image-partitioning representations, and multi-resolution transforms and representations. In knowledge-based representations, elementary features like edges can be combined to form segments. These can be further combined to compose geometrical shapes that identify objects or object parts. Then, different objects can be related to one another to give a scene its meaning. Such relationships can be of two kinds, the is-a relationship and the part-of relationship, which correspond to abstraction and to the combination of parts into wholes, respectively. The main idea behind image-partitioning representations is to hierarchically subdivide an image until it is decomposed into basic components that present uniform properties. Such "atomic" elements can either be uniform regions or contour segments. Finally, multi-resolution representations of data are based on the observation that, in the real world, objects look different depending on the scale of observation. In the remainder of the section, we will only briefly review image-partitioning representations, which have been mainly adopted so far in the pattern recognition approach proposed in this chapter.^a

3.1. Image-Partitioning Representations
Image-partitioning representations can be positioned half-way between multi-resolution transforms, which translate an image from one domain to another, spreading information over different resolution levels, and most knowledge-based representations, in which an image is hierarchically represented on the basis of a decomposition driven by high-level symbolic processes and logic inferences. Whereas in classical structured approaches the "atomic" parts into which an image is decomposed are geometrical primitives, in partitioning methods the atomic elements are extracted directly by classic image processing techniques. Different partitioning methods are
^a To the best of our knowledge, multi-resolution and knowledge-based representations have not been used so far as graphic inputs to neural networks. However, in principle, they can be utilized in the same manner as image-partitioning representations.
used depending on whether the representation is based on portions of the image or on its edges.

• Region-based representations

Region-based image-partitioning constructs a representation beginning with regions of the image, which are segmented on the basis of predefined uniformity criteria. Binary partition trees provide one of the simplest region-based representations. They are structured representations of the regions that can be obtained from an initial partition: the leaves represent regions that belong to the initial partition, the remaining nodes represent regions that are obtained by appropriately merging regions and, finally, the root node represents the whole image. This kind of representation is closely related to multi-resolution representations. Binary partition trees have been shown to be suitable for several applications, such as information retrieval, segmentation, and filtering.11 However, the representation turns out to be strongly application-dependent, since not all possible mergings of regions belonging to the initial partition are represented in the tree, and their selection depends on the merging criterion. In addition, the choice of the initial partition is arbitrary, and affects the representation significantly. Quad-trees12,13 are amongst the oldest and most widely adopted region-based image-partitioning representations. An extensive survey on quad-tree representations and their applications can be found in Refs. 14 and 15. Octrees16,17 are the three-dimensional extension of the quad-tree. They are built by starting from a cubical volume (pattern) and recursively subdividing this volume into eight congruent disjoint cubes (octants) until a uniformity criterion is satisfied. Several operations can be defined on quad-trees, which are reviewed in Refs. 17 and 18. In particular, for binary images, set-theoretic operations such as union and intersection are quite simple to implement.12
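As a minimal illustration of the quad-tree construction just described (the helper names and the uniformity criterion are illustrative, not from the chapter), the following sketch recursively splits a square binary image into four quadrants until each block is uniform:

```python
def build_quadtree(img):
    """Recursively partition a square binary image (list of lists)
    into a quad-tree. A leaf is ('leaf', value); an internal node is
    ('node', [NW, NE, SW, SE]). Recursion stops when a block is
    uniform, i.e. all its pixels are equal."""
    flat = [p for row in img for p in row]
    if min(flat) == max(flat):              # uniformity criterion
        return ('leaf', flat[0])
    h = len(img) // 2
    nw = [row[:h] for row in img[:h]]
    ne = [row[h:] for row in img[:h]]
    sw = [row[:h] for row in img[h:]]
    se = [row[h:] for row in img[h:]]
    return ('node', [build_quadtree(q) for q in (nw, ne, sw, se)])

# A 4x4 binary image: the left half is 0, the right half is 1.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
tree = build_quadtree(img)
```

For this image the root has four uniform children, one per quadrant; a real implementation would also store the spatial extent of each block and typically bound the recursion depth.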
• Contour-based representations

Contour-based representations use image contours as primitives. Strip trees18-21 are based on a top-down approach to curve approximation. They are binary trees representing a single curve, obtained by repeatedly subdividing the curve into segments. Strip trees are very efficient representations of a curve for applications in computer graphics.22
A chain code23,24 is a vector of integers that samples the orientation of a one-pixel-wide contour at each point. To build the pattern representation, one starts from a point on the contour and looks for an adjacent point in its 3 x 3 neighborhood, using a 3-bit code to represent the relative position of that point. An exponent can be added to each element of the chain, representing how many consecutive pixels share the same direction. This procedure is iterated until the starting pixel is reached again. Chain code representations have the same drawbacks as strip trees, since they require that the image be pre-processed before it can be used. They are also very sensitive to noise (occlusions, impulsive noise), since the chain code can change in length and content with each possible alteration of the contour. Contour tree25 is a general term referring to several hierarchical contour representations. Generally speaking, a contour tree is a data structure that is built according to the following rules:

1. The root corresponds to a closed contour that includes all other contours, if any. Otherwise, the root corresponds to the image border.
2. Each node of the tree corresponds to a closed contour.
3. A contour lying inside another contour is represented as the latter's child.

An example of contour-tree construction from a given pattern is shown in Fig. 1. The contour-tree representation is suitable for many applications in the field of image processing where the representation of concentric iso-surfaces is important.26,27 However, the need to pre-segment the image into closed contours limits the versatility of this representation. To overcome this problem, the closed-contour constraint can be relaxed, and suitable image processing algorithms can be used to extract closed contours (see e.g. Ref. 28). Contour-tree based representations have been primarily used in conjunction with the pattern recognition approach proposed in this chapter.
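A minimal sketch of the 3-bit chain code idea, assuming the contour has already been extracted as a list of adjacent pixels (the direction table follows the common Freeman convention; the names are illustrative, not from the chapter):

```python
# Freeman directions: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE
# (image coordinates: y grows downwards)
DIRS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
        (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code(contour):
    """3-bit Freeman chain code of a closed contour given as a list of
    adjacent (x, y) pixels; the last step closes back to the start."""
    codes = []
    for i in range(len(contour)):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % len(contour)]
        codes.append(DIRS[(x1 - x0, y1 - y0)])
    return codes

# A 2x2 pixel square traversed clockwise.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
codes = chain_code(square)
```

The run-length "exponent" mentioned above would simply compress consecutive repetitions of the same code.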
4. Neural Networks for Structured Pattern Representations

In this section we introduce the basic principles behind adaptive processing of structured information. We focus on supervised learning, extending the backpropagation algorithm for classic multilayer perceptrons to the case of directed acyclic graphs.
Fig. 1. The contour-tree algorithm and the corresponding representation for a company logo. Note that, unlike the typical pre-processing schemes adopted for neural networks, this representation is invariant under rotation.
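The three construction rules above can be sketched as follows; for brevity, contours are reduced to bounding boxes and nesting is tested on boxes, which is only a stand-in for a real point-in-polygon test (all names are illustrative):

```python
def contains(outer, inner):
    """True if the box `inner` lies strictly inside `outer`.
    Boxes are (x0, y0, x1, y1) tuples."""
    (ox0, oy0, ox1, oy1), (ix0, iy0, ix1, iy1) = outer, inner
    return ox0 < ix0 and oy0 < iy0 and ix1 < ox1 and iy1 < oy1

def contour_tree(contours):
    """Build a contour tree from closed contours given as boxes: each
    contour becomes a child of the smallest contour containing it;
    top-level contours hang off a root standing for the image border."""
    def area(c):
        return (c[2] - c[0]) * (c[3] - c[1])
    tree = {c: [] for c in contours}
    tree['border'] = []
    for c in contours:
        parents = [p for p in contours if p != c and contains(p, c)]
        parent = min(parents, key=area) if parents else 'border'
        tree[parent].append(c)
    return tree

outer = (0, 0, 10, 10)
inner = (2, 2, 5, 5)
tree = contour_tree([outer, inner])
```

Here the inner contour is attached as a child of the outer one, which in turn hangs off the image-border root, mirroring rules 1-3 above.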
4.1. Multilayer Perceptrons for Static Representations
Feedforward neural networks29 are directed acyclic graphs whose nodes carry out a forward computation based on any topological sort^b S of the vertices. If we denote by pa[v] the parents of v, then the corresponding neural output is, for each v ∈ S,

x_v = σ( Σ_{z ∈ pa[v]} w_{v,z} · x_z ),

where σ(·) = tanh(·) is the node output function. In the case of multilayer networks, the computational scheme reduces to a pipeline of the layers. Let

^b Topological sorting arises whenever we have a problem involving a partial ordering. For instance, a large glossary containing definitions of technical terms might require topological sorting. We can write w1 ≺ w2 provided that the definition of term w2 depends directly or indirectly on that of term w1. Since we allow no circular definitions, we have a problem of topological sorting; that is, we want to arrange the terms in such a way that no term is used before it has been defined.
C = {(u_a, d_a), u_a ∈ U} be the training set and let M be a feedforward neural network. The degree of "matching" between the desired target d(u) and the response x(u, w) of the neural network can be expressed by

E_M = Σ_{u ∈ U} e(u),    (1)

where e(u) is any metric that yields the distance between d(u) and x(u, w). For instance:

E = (1/2) Σ_{u ∈ U} (x(u, w) − d(u))².    (2)
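The forward computation recalled at the beginning of this section — states evaluated in topological order, with x_v = tanh(Σ_z w_{v,z} x_z) — can be sketched as follows; the graph encoding and all names are illustrative, not from the chapter:

```python
import math

def topological_sort(parents):
    """Naive topological sort of a DAG given as node -> list of parents."""
    order, placed = [], set()
    while len(order) < len(parents):
        for v in parents:
            if v not in placed and all(p in placed for p in parents[v]):
                order.append(v)
                placed.add(v)
    return order

def forward(parents, weights, inputs):
    """Compute x_v = tanh(sum over parents z of w[v,z] * x_z); nodes
    without parents are input nodes and take their value from `inputs`."""
    x = {}
    for v in topological_sort(parents):
        if parents[v]:
            x[v] = math.tanh(sum(weights[(v, z)] * x[z] for z in parents[v]))
        else:
            x[v] = inputs[v]
    return x

# Tiny network: inputs a, b feed hidden node h, which feeds output o.
parents = {'a': [], 'b': [], 'h': ['a', 'b'], 'o': ['h']}
weights = {('h', 'a'): 1.0, ('h', 'b'): -1.0, ('o', 'h'): 2.0}
x = forward(parents, weights, {'a': 0.5, 'b': 0.5})
```

The same scheme covers any feedforward architecture, since a sequence of layers is just one particular topological order of the vertices.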
The optimization of E typically involves adjusting a huge number of parameters. In the case of large optimization problems, the gradient heuristics is commonly used:

dw/dt = −η ∇_w E.    (3)
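In its discrete form, the gradient heuristics of Eq. (3) is the familiar update w ← w − η ∇_w E. A toy sketch on the quadratic error of Eq. (2), with a scalar linear model x(u, w) = w·u standing in for the network (names illustrative, not from the chapter):

```python
def grad_step(w, data, eta=0.1):
    """One gradient descent step on E = 0.5 * sum (x(u,w) - d(u))^2
    for the scalar model x(u, w) = w * u."""
    grad = sum((w * u - d) * u for u, d in data)   # dE/dw
    return w - eta * grad

# Targets follow d(u) = 2u, so the minimum of E is at w = 2.
data = [(1.0, 2.0), (2.0, 4.0)]
w = 0.0
for _ in range(100):
    w = grad_step(w, data)
```

With a non-convex E, as in multilayer networks, the same trajectory can of course stop in a local minimum instead.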
Of course, the trajectory can get stuck in local minima of E(w). The gradient is calculated by Backpropagation, which turns out to be an optimal algorithm in the sense of computational complexity.^c In the last few years, multilayer perceptrons have been extensively used in pattern recognition. Let us consider the typical pattern recognition learning task depicted in Fig. 2. In many applications to pattern recognition, neural-based learning machines simply process de-sampled images so as to provide a more compact, yet sufficiently accurate representation. In so doing, the task of selecting appropriate features is delegated to the learning
^c In the literature there is often confusion in the use of the term "Backpropagation." Many people refer to Backpropagation as the algorithm for performing the gradient descent of the error function, an algorithm based on a forward-backward computation of the gradient. Using a more appropriate meaning of the term, others call Backpropagation simply that special gradient computation scheme. Let m = |w| be the number of weights; then the forward-backward steps take Θ(m). This is obviously also a lower bound on the computational complexity; that is, no other algorithm can perform (asymptotically) the same function more efficiently, since one needs at least to load all the weights to compute the gradient. This was pointed out in Ref. 30. It is worth mentioning that classic numerical gradient computation algorithms based on the evaluation of the function E take O(m²). In many pattern recognition applications where m is a huge number, the Backpropagation algorithm makes optimization-based learning a viable approach, whereas any classic optimization method which does not take the neural network structure into account would get stuck simply in the gradient computation.
Fig. 2. The typical computation carried out by multilayer perceptrons for pattern recognition while choosing a simple pre-processing based on image de-sampling.
process, which is expected to provide feature-based input representations in the first hidden layer. Multilayer neural networks are powerful computational devices, which go well beyond the limitations pointed out by Minsky and Papert for Rosenblatt's perceptrons.10 In spite of this clear advantage, however, in many real-world problems the Backpropagation learning procedure does not yield satisfactory results. Enthusiastic reports of experimental results for some problems and bad failures for others clearly indicate that, for the learning to be effective, a number of architectural solutions and tricks are of crucial importance; at the same time, the learning task itself can be hard to attack. Amongst the different design choices, the effect of the number of parameters seems to be quite clear. Experimental evidence accumulated in many artificial and real-world problems allows us to conclude that when the number of network parameters increases, the Backpropagation learning algorithm has a better chance of finding global minima but, unfortunately, generalisation to new examples is likely to be less effective. Hence, generalisation to new examples and optimal convergence of the learning algorithm look like conjugate variables in quantum mechanics: increasing the number of parameters typically results in better convergence behavior, but the drawback is that generalisation capabilities are likely to decrease, and vice versa.
Adaptive
Graphic Pattern Recognition: Foundations
and Perspectives
43
Fig. 3. The hat cannot be detected robustly by simply inspecting local features at the top of the picture; that would result in a trivial error in the case of policemen with one (or two) raised hand(s). On the other hand, the extraction of a graphical representation simplifies the learning task dramatically.
In pattern recognition, a potential advantage of the scheme depicted in Fig. 2 is that of delegating the feature extraction to the learning process. Unfortunately, there are learning tasks in which this straightforward representation makes the subsequent learning process very hard. Let us consider the artificial learning task depicted in Fig. 3, in which we want to recognise policemen with hats. The artificial example makes it clear that the hat is not always located in the same portion of the picture. The hat cannot be detected robustly by
simply inspecting local features at the top of the picture; that would result in a trivial error in the case of policemen with one (or two) raised hand(s). Hence, in order to learn the concept, we need to incorporate shift invariance. When using multilayer perceptrons, shift invariance and other complex mappings, including scale and rotation invariance, can be implemented with appropriate architectures.^d Unfortunately, the problem is moved to learnability; a huge number of parameters might be required for avoiding local minima in the error function. Consequently, an appropriate generalisation to new examples requires a corresponding growth in the number of examples, which in turn leads to an explosion of the computational cost of the learning.
4.2. Processing Directed Acyclic Graphs
In most interesting problems of pattern analysis and recognition, data are inherently structured. If we consider the policeman in Fig. 3, we immediately conclude that a graphical representation of the pattern is definitely more appropriate than the simple static representation based on the image de-sampling often adopted in conjunction with multilayer perceptrons. Like the data, the model can itself be structured, in the sense that the generic variable x_{i,v} might be independent of q_k^{-1} x_{j,v}, where, following the notation introduced in Ref. 1, q_k^{-1} is the operator which denotes the k-th child of a given node. The structure of independence for some variables represents a form of prior knowledge. For instance, a classic form of independence arises when the connections of any two state variables, x_v and x_w, only take place between components x_{i,v} and x_{i,w} with the same index i. In the case of lists, this assumption means that only local-feedback connections are permitted for the state variables. Likewise, other statements of independence might involve input-state variables and/or state-output variables. An explicit statement of independence can be regarded as a sort of prior knowledge on the transduction that the machine is expected to learn. In general, these statements can also differ from node to node and can be conveniently expressed by a graphical structure that is referred to as a recursive network. An example of a recursive network R is shown on the left side of Fig. 4. In this case, R is simply a graph which states the full dependency of
^d In particular, multilayer neural networks guarantee universal approximation even with one hidden layer, provided that enough units are adopted in the hidden layer.
Fig. 4. Compiling the encoding network from the recursive network and the given data structure by the function ψ_r.
the state variable from the states of the children q_1^{-1} x_v, q_2^{-1} x_v, q_3^{-1} x_v and the label u_v attached to the node. Unlike in Fig. 4, in many real-world problems the knowledge in a recursive network R yields topological constraints that often make it possible to cut the number of trainable parameters significantly. Let us consider a directed ordered graph. For any node v one can identify a set, possibly empty, of ordered children ch[v]. Let x_{ch[v]} be the state associated with the set ch[v] and θ be the vector of learning parameters. The state x_v and the output y_v of each node v follow the equations

x_v = f(x_{ch[v]}, u_v, v, θ),    y_v = g(x_v, u_v, v, θ).    (4)
This is a straightforward extension of classic causal models in system theory. The hypothesis of dealing with directed acyclic graphs turns out to be useful for carrying out a forward computation, and the hypothesis of considering ordered sets of children is used in order to define the position of the parameters in the functions f and g. Alternatively, one can keep essentially the same computational scheme for directed positional acyclic graphs, in which the children of each node are associated with an integer. The difference with respect to directed ordered acyclic graphs is that the latter consider only ordered sets of children, and do not include the case in which the children of a given node are not given in a sequential ordering. For instance, in Fig. 3, the difference between the two patterns is kept in the graphical representation in the case of directed positional acyclic graphs, but is lost in the case of a representation based on directed acyclic graphs. Given
the recursive network R and any DOAG u, we can construct an encoded representation of u on the basis of the independence constraints expressed by R, that is, u_r = ψ_r(R, u). The scheme adopted for compiling u_r is depicted in Fig. 4, while a detailed description of the mathematical process involved is given in Ref. 1. From the encoding network depicted in Fig. 4 we can see a pictorial representation of the computation taking place in the recursive neural network. Each nil pointer, represented by a small box, is associated with a frontier state x_v, which is in fact an initial state used to terminate the recursive equation. The graph plays its own role in the computation, either because of the information attached to its nodes or because of its topology. Any formal description of the computation on the input graph requires sorting the nodes, so as to define for which nodes the state must be computed first. As already pointed out for the computation of the activation of the neurons of a feedforward neural network, the computation can be based on any topological sorting. One can use a data-flow computation model where the state of a given node can only be computed once all the states of its children are known. To some extent, the computation of the output y_v can be regarded as a transduction of the input graph u to an output y with the same skeleton^e as u. These IO-isomorph transductions are the direct generalisation of the classic concept of transduction of lists. When processing graphs, the concept of IO-isomorph transductions can also be extended to the case in which the skeleton of the graph is modified. Because of the kind of problems considered in this chapter, however, this case will not be treated. The classification of DOAGs is in fact the most important IO-isomorph transduction for applications to pattern recognition.
The output of the classification process corresponds to y_s, that is, the output value of the variables attached to the supersource in the encoding network. Basically, when the focus is on classification, we disregard all the outputs y_v of the IO-isomorph transduction apart from the final value y_s of the forward computation. The information attached to the recursive network, however, needs to be integrated with a specific choice of the functions f and g which must be suitable for learning the parameters θ. The connectionist assumption for the functions f and g turns out to be adequate especially to fulfill computational
^e The skeleton of a graph is the structure of the data regardless of the information attached to the nodes.
complexity requirements. The extension to the case of DOAGs is straightforward. Let o be the maximum outdegree of the given directed graph. The dependence of node v on its children ch[v] can be expressed by pointer matrices A_v(k) ∈ ℝ^{n×n}, k = 1, …, o. Likewise, the information attached to the nodes can be propagated by a weight matrix B_v ∈ ℝ^{n×m}. Hence, the first-order connectionist assumption yields

x_v = σ( Σ_{k=1}^{o} A_v(k) · q_k^{-1} x_v + B_v · u_v ).    (5)

The output can be computed by means of y_v = σ(C_v · x_v + D_v · u_v), where C_v ∈ ℝ^{p×n} and D_v ∈ ℝ^{p×m}. Hence the learning parameters can be grouped for each node of the graph in

θ_v = {A_v(1), …, A_v(o), B_v, C_v, D_v}.
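A sketch of the first-order unit of Eq. (5) with randomly initialized matrices; the dimensions n, m, o are illustrative, and nil children are mapped to null frontier states:

```python
import numpy as np

def node_state(x_children, u, A, B):
    """First-order recursive unit of Eq. (5):
    x_v = tanh(sum_k A[k] @ x_child_k + B @ u_v)."""
    s = B @ u
    for k, xc in enumerate(x_children):
        s = s + A[k] @ xc
    return np.tanh(s)

n, m, o = 3, 2, 2                 # state size, label size, max outdegree
rng = np.random.default_rng(0)
A = [rng.standard_normal((n, n)) for _ in range(o)]   # pointer matrices
B = rng.standard_normal((n, m))                       # label matrix

x_nil = np.zeros(n)               # frontier (nil) state
x_v = node_state([x_nil, x_nil], np.array([1.0, -1.0]), A, B)
```

The output map y_v = σ(C_v x_v + D_v u_v) has exactly the same shape with its own matrices C and D.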
The most attractive feature of the connectionist assumptions for f and g is that they provide universal approximation capabilities by means of a graphical structure with units of a few different types (e.g. the sigmoid). A strong consequence of this graphical representation of f and g is that, for any input graph, an encoding neural network can be created which is itself a graph with neurons as nodes. Hence, the connectionist assumption makes it possible to go one step beyond the general independence constraints expressed by means of the concept of recursive network. The encoding neural network u_n associated with Eq. (5) is constructed by replacing each node of the encoding network u_r with the chosen connectionist map; that is, φ_n = ψ_n ∘ ψ_r : u → u_n = φ_n(u) = ψ_n(ψ_r(R, u)).
The construction of the encoding neural network u_n from the encoding network u_r is depicted in Fig. 5. In the particular case of stationary models, the parameters θ_v are independent of the node v. Encoding neural networks turn out to be weighted graphs; that is, there is always a real variable (a weight) attached to the edges. Note that the architectural choice expressed by Eq. (5) can easily be extended so as to express the functions f and g by general feedforward neural architectures. Of course, the composition of directed acyclic graphs (data) with the local node computation based on feedforward neural networks, which are directed acyclic
Fig. 5. The construction of a first-order recursive neural network from the encoding network of Fig. 4. The construction holds under the assumption that the frontier states are null.
graphs, yields in general encoding neural networks which are still acyclic graphs. As a result, the supervised learning of a given set of DOAGs results in the supervision of the corresponding encoding neural networks. Because of the stationarity hypothesis, the parameters are independent of the node and, therefore, the learning of the weights θ can be framed as an optimization problem. We can thus use the Backpropagation algorithm for training. Since the backpropagation of the error takes place on neural networks which encode the structure of the given examples, the corresponding algorithm for the gradient computation is, in this case, referred to as Backpropagation through structure.1,31 This algorithm uses the classical forward and backward steps, the only difference being that the parameters of the different encoding neural networks must be shared.

4.3. Cycles, Non-stationarity, and Beyond
The computation scheme proposed for directed acyclic graphs is a straightforward extension of the case of static data. The hypothesis that the children of each node are ordered is fundamental, and allows us to attach the appropriate set of weights to each child. The assumption that the graph is acyclic yields acyclic encoding neural networks for which the Backpropagation algorithm holds. Finally, the stationarity hypothesis makes it possible to attach the same set of weights to each node. The hypothesis of dealing with ordered graphs can be relaxed in different ways. A straightforward solution is to share the pointer matrices A_k among the children. In so doing, a unique matrix A is used for all the children, which overcomes the problem of defining the position in the computation of the functions f and g. Alternatively, given ch[v] one can consider the set of its permutations P(ch[v]) and calculate the functions f and g by an appropriate sharing of the weights.32 The second solution is more general than the first in terms of computational capabilities, but turns out to be effective only when the outdegree of the graphs is quite small; otherwise, the cardinality of P(ch[v]) explodes. The construction of the encoding neural network gives rise to feedforward neural networks in the case of acyclic graphs. As shown in Fig. 6, for general graphs and directed graphs with cycles, the same construction produces a recurrent neural network. As a result, the computation of each graph cannot be performed by a simple forward step. The feedback loops in the neural network can produce complex dynamics, which do not necessarily correspond to convergence to an equilibrium point. It is worth mentioning that cyclic and undirected pattern representations can be extracted in a more natural way than directed ordered graphs. However, the drawback of this approach is that the corresponding learning process is significantly more expensive.
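The first relaxation above — a unique pointer matrix A shared among all children — can be sketched as follows; since the children's states are simply summed, the resulting state is invariant to any permutation of ch[v] (names illustrative, not from the chapter):

```python
import numpy as np

def shared_pointer_state(x_children, u, A, B):
    """Unordered-children variant of the first-order unit: a single
    pointer matrix A is shared among all children, so
    x_v = tanh(A @ sum_k x_child_k + B @ u_v)
    does not depend on the order of ch[v]."""
    return np.tanh(A @ sum(x_children) + B @ u)

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))        # shared pointer matrix
B = rng.standard_normal((3, 2))        # label matrix
u = np.array([0.3, -0.7])
c1, c2 = rng.standard_normal(3), rng.standard_normal(3)

x1 = shared_pointer_state([c1, c2], u, A, B)
x2 = shared_pointer_state([c2, c1], u, A, B)   # permuted children
```

The second relaxation — combining f and g over all permutations in P(ch[v]) — is more expressive, but costs on the order of |ch[v]|! evaluations per node, which is why it only pays off for small outdegrees.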
In general, given a planar graph, one can construct a corresponding DOAG provided that an anchor node is also specified. Unfortunately, in pattern recognition one cannot always rely on the availability of such an anchor; there are cases in which the corresponding graphical extraction is likely not to be very robust.
G. Adorni, S. Cagnoni & M. Gori
Fig. 6. The encoding of cyclic graphs yields cyclic encoding networks, which, in turn, give rise to recurrent neural network architectures.
The proposed models represent a natural extension of the processing of sequences by causal dynamical systems. In pattern recognition, the hypothesis of causality could profitably be removed, since there is no need to carry out an on-line computation at node level. Having homogeneous computations at node level may not be adequate for many pattern recognition problems. This has already been pointed out in Ref. 33, where a simple solution was adopted to account for non-stationarity: the graphs are partitioned into different sets depending on the number of nodes, and are processed separately. A more general computational scheme has been devised in Ref. 34, where a linguistic description of non-stationarity is given, which is used to compile the encoding neural networks.

5. Graphical Pattern Recognition

The term adaptive graphical pattern recognition was first introduced in Ref. 33, but early experiments using this approach were carried out in Ref. 35. Graphs are either in the data or in the computational model. The adopted connectionist models inherit the structure of the data graph and, moreover, they have their own graphical structure that expresses the
Adaptive Graphic Pattern Recognition: Foundations and Perspectives
dependencies on the single variables. Basically, graphical pattern recognition methods integrate domain structure into decision-theoretic models. The structure can be introduced at two different levels. First, we can introduce a bias on the map (e.g. receptive fields). In so doing, the pattern of connectivity in the neural network is driven by the prior knowledge in the application domain. Second, each pattern can be represented by a corresponding graph. As put forward in the previous section, the hypothesis of directed ordered graphs can be profitably exploited to generalize the forward and backward computation of classical feedforward networks. The proposed approach can be pursued in most interesting pattern recognition problems. In this chapter we focus attention on supervised learning schemes, but related extensions have recently been conceived for unsupervised learning.

5.1. Classification
Recursive neural networks seem to be very appropriate for either classification or regression. Basically, the structured input representation is converted to a static representation (the neural activations in the hidden layers), which is subsequently encoded into the required class. This approach shares the advantages and disadvantages of related MLP-based approaches for static data. In particular, the approach is well-suited for complex discrimination problems. The effectiveness of recursive neural networks for pattern classification has been shown in Ref. 36 by massive experimentation on logo recognition. In particular, it has been shown that the network performance is improved by properly filtering the logo image before extracting the data structure. The patterns were represented using trees extracted by an appropriate modification of the contour-tree algorithm. That modification plays a fundamental role in the creation of data structures that enhance the structure of the pattern. The experimental results show that, though in theory the rotation invariance of the contour tree no longer holds, in practice the performance depends only very slightly on the rotation angle. These experimental results indicate that adaptive graphical pattern recognition is appropriate when we need to recognize patterns in the presence of noise, and under rotation and scale invariance. These very promising results suggest that the proposed method nicely bridges decision-theoretic approaches based on numerical features and syntactic and structural approaches.
Network growing and pruning can be successfully used for improving the learning process. It is worth mentioning that recursive neural networks can profitably be used for classification of highly structured inputs, like image document representations by XY-trees. Unfortunately, in this particular kind of application the major limitation turns out to be that the number of classes is fixed in advance, a limitation which is inherited from multilayer networks. Neural networks in structured domains can be used in verification problems, where one wants to establish whether a given pattern belongs to a given class. Unlike pattern classification, one does not know in advance the kind of inputs to be processed. It has been pointed out that sigmoidal multilayered neural networks are not appropriate for this task.37 Consequently, our recursive neural networks are also not appropriate for verification tasks. However, as for multilayer networks, the adoption of radial basis function units suffices to remove this limitation.

5.2. Image Retrieval
The neural networks introduced in this chapter and their related extensions are good candidates for many interesting image retrieval tasks. In particular, the proposed models introduce a new notion of similarity, which is constructed on the basis of user feedback. In most approaches proposed in the literature, queries involve either global or local features, and disregard the pattern structure. The proposed approach makes it possible to retrieve patterns on the basis of a strong involvement of the pattern structure, since the graph topology plays a crucial role in the computation. On the other hand, since the nodes contain a vector of real-valued features, the proposed approach is also able to exploit the sub-symbolic nature of the patterns. Figure 7 shows a possible graphical representation of the images of a given database. The database has been created using an attributed plex grammar as described in Ref. 38. Unlike pattern classification, in which the learning scheme is a straightforward extension of backpropagation for static data, learning the notion of similarity requires the definition of an appropriate target function. For each pair of images, the user provides feedback on how relevant the retrieved image is to the query. Consequently, the learning process consists of adapting the weights so as to incorporate the user feedback. Given any pair of images, the user is asked whether they look similar and is expected to
Fig. 7. Extraction of an appropriate graphical representation from the given images (panels: original image, segmented image, point of view, graph extraction, features insertion, DOAG extraction, extracted graph).
provide a simple Boolean answer. In case the images are not similar (see e.g. Fig. 8), their corresponding points in the hidden layer of the recursive neural network must be moved far apart, whereas in case the images are similar, the corresponding points must be moved close to each other (see e.g. Fig. 9). Let u1 and u2 be the graphical representations of two images for which the user is evaluating the similarity, and let x_s^(1) and x_s^(2) be the corresponding state representations in the hidden layer. The error function for the pair penalizes similar images whose states are far apart and dissimilar images whose states are close:

    E12(θ) = ||x_s^(2) - x_s^(1)||^2,            if u1 and u2 are judged similar and ||x_s^(2) - x_s^(1)|| > d_s,
    E12(θ) = (d_d - ||x_s^(2) - x_s^(1)||)^2,    if u1 and u2 are judged dissimilar and ||x_s^(2) - x_s^(1)|| < d_d,
    E12(θ) = 0,                                  otherwise,
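The pairwise error just described can be sketched as a hinge-style loss on the hidden-layer states. The exact functional form and the threshold names d_s and d_d are assumptions for illustration rather than the authors' exact definition.

```python
import numpy as np

def pair_loss(x1, x2, similar, d_s=0.5, d_d=2.0):
    """Hinge-style pairwise error on the hidden-layer states: a pair
    judged similar is penalized when its states are farther apart than
    d_s, a dissimilar pair when its states are closer than d_d
    (both thresholds are hypothetical names and values)."""
    dist = float(np.linalg.norm(x2 - x1))
    if similar and dist > d_s:
        return dist ** 2             # pull similar pairs together
    if not similar and dist < d_d:
        return (d_d - dist) ** 2     # push dissimilar pairs apart
    return 0.0
```

Summing such terms over the user-labeled pairs yields an error function that can be minimized by gradient descent, in the spirit of backpropagation through structure.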
Fig. 8. The user feedback information ("they do not look similar") is learned by moving the state-based representations in the hidden layer far apart.
Fig. 9. The user feedback information ("but they look similar") is learned by moving the state-based representations in the hidden layer close to each other.
Fig. 10. Typical answer to a query, where we can see the effect of learning.
where d_s and d_d are two thresholds determined experimentally. For any given image, the system proposes the closest patterns in the database. Pairs can properly be extracted and submitted to the learning scheme based on recursive networks, in which the error function is created at run-time as shown above. E12(θ) can be optimized using a gradient-descent scheme whose computation is very much related to backpropagation through structure. A typical example of an answer to a given query is depicted in Fig. 10. A proposal for a more general framework for image retrieval based on graphical representations, along with some experimental results, can be found in Ref. 39.

6. Conclusions

In this chapter we have introduced a new approach to pattern recognition, in which the highly structured pattern representations typical of classical syntactic and structural approaches and the sub-symbolic capabilities of decision-theoretic models, typical of connectionism and statistics, are properly merged. The chapter emphasizes the extension of classic neural network-based approaches to the case in which rich graphical representations are given for the patterns. Statistical approaches to learning can also be extended within the general framework of graphical models (see e.g. Ref. 1). In a sense, the approach described in this chapter, referred to as adaptive graphical pattern recognition, is in between decision-theoretic and
structural pattern recognition, and it is very promising in pattern classification and image retrieval applications. It is worth mentioning that, so far, the application to pattern recognition has been essentially limited to stationary processing of DOAGs. Recent analyses of graphs different from DOAGs, as well as models which allow one to incorporate non-stationarity, have not been applied so far to pattern recognition. On the other hand, we are confident that more appropriate graphical representations of data and non-homogeneous computations at the node level are of crucial importance for further developments in the field.
Acknowledgments

The preliminary ideas behind this chapter emerged at an informal workshop on "Adaptive computation on structured domains", which was held in Siena in March 1997. The content of the chapter is mainly based on the invited talk that Marco Gori gave at SSPR'2000. The authors would like to thank Horst Bunke for fruitful discussions on the links of the proposed approach to structural pattern recognition. Special thanks to Paolo Frasconi, Christoph Goller, Alessandro Sperduti, and Andreas Kuechler, who strongly contributed to the development of the general theory of adaptive computation on structured domains. Finally, Marco Maggini, Markus Hagenbuchner, Michelangelo Diligenti, and Ciro De Mauro provided crucial support for the software development and for experimentation on many different pattern recognition problems.
References

1. P. Frasconi, M. Gori, and A. Sperduti. A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5):768-786, September 1998.
2. J. C. Bezdek. What is computational intelligence? In J. Zurada, editor, Computational Intelligence: Imitating Life, pages 1-12. IEEE Press, 1994.
3. A. V. Aho and T. G. Peterson. A minimum distance error-correcting parser for context-free languages. SIAM Journal of Computing, (4):305-312, 1972.
4. K. S. Fu. Syntactic Pattern Recognition. Prentice-Hall, Englewood Cliffs, NJ, 1982.
5. S. Y. Lu and K. S. Fu. Stochastic error-correction syntax analysis for recognition of noisy patterns. IEEE Transactions on Computers, (26):1268-1276, 1977.
6. W.-H. Tsai. Combining statistical and structural methods. In H. Bunke and A. Sanfeliu, editors, Syntactic and Structural Pattern Recognition: Theory and Applications, chapter 12, pages 349-366. World Scientific, 1990.
7. H. Bunke. Hybrid pattern recognition methods. In H. Bunke and A. Sanfeliu, editors, Syntactic and Structural Pattern Recognition: Theory and Applications, chapter 11, pages 307-347. World Scientific, 1990.
8. E. B. Baum and D. Haussler. What size net gives valid generalization? Neural Computation, 1(1):151-160, 1989.
9. J. A. Fodor and Z. W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Connections and Symbols, pages 3-72, 1989. A Cognition Special Issue.
10. M. L. Minsky and S. A. Papert. Perceptrons - Expanded Edition. MIT Press, Cambridge, 1988.
11. P. Salembier and L. Garrido. Binary partition tree as an efficient representation for filtering, segmentation and information retrieval. In IEEE Int. Conference on Image Processing, ICIP'98, volume 2, pages 252-256, Los Alamitos, CA, 1998. IEEE Comp. Soc. Press.
12. G. M. Hunter and K. Steiglitz. Operations on images using quadtrees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1:145-153, 1979.
13. C. Dyer, A. Rosenfeld, and H. Samet. Region representation: Boundary codes from quadtrees. Communications of the ACM, 23:171-179, 1980.
14. H. Samet. The quadtree and other related hierarchical data structures. ACM Computer Surveys, 16:187-260, 1984.
15. H. Samet. Spatial data structures. In W. Kim, editor, Modern Database Systems: The Object Model, Interoperability, and Beyond, pages 361-385. Addison Wesley/ACM Press, Reading, MA, 1995.
16. G. M. Hunter. Efficient Computation and Data Structures for Graphics. PhD thesis, Dept. of Electrical Engineering and Computer Science, Princeton University, Princeton, NJ, 1978.
17. D. Meagher. Geometric modelling using octree encoding. Computer Graphics and Image Processing, pages 129-147, 1982.
18. H. Samet and R. E. Webber. Storing a collection of polygons using quadtrees. ACM Transactions on Graphics, 4:182-222, 1985.
19. D. Ballard. Strip trees: a hierarchical representation for map features. In Proc. of the 1979 IEEE Computer Society Conference on Pattern Recognition and Image Processing, pages 278-285, New York, NY, 1979. IEEE.
20. D. Ballard. Strip trees: a hierarchical representation for curves. Communications of the ACM, 24:310-321, 1981.
21. H. Asada and M. Brady. The curvature primal sketch. In M. Caudill and C. Butler, editors, Proc. Workshop on Computer Vision, pages 609-618, Annapolis, MD, 1984.
22. O. Gunther and S. Dominguez. Hierarchical schemes for curve representation. IEEE Computer Graphics & Applications, 13:55-63, 1993.
23. H. Freeman. Computer processing of line-drawing images. ACM Computer Surveys, 6:57-97, 1974.
24. I. Pitas. Digital Image Processing Algorithms. Prentice Hall, Englewood Cliffs, NJ, 1993.
25. H. Freeman. On the encoding of arbitrary geometric configurations. IRE Trans., EC-10:260-268, 1961.
26. I. S. Kweon and T. Kanade. Extracting topographic terrain features from elevation maps. CVGIP: Image Understanding, 59:171-182, 1994.
27. M. van Kreveld, R. van Oostrum, C. Bajaj, V. Pascucci, and D. Schikore. Contour trees and small seed sets for isosurface traversal. In Proc. Thirteenth Annual Symposium on Computational Geometry, pages 212-220, New York, NY, 1997. ACM.
28. R. Jain, R. Kasturi, and B. Schunk. Introduction to Machine Vision. McGraw-Hill, New York, NY, 1995.
29. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 8, pages 318-362. MIT Press, Cambridge, 1986. Reprinted in Ref. 40.
30. M. Gori and A. Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76-86, January 1992.
31. C. Goller and A. Küchler. Learning task-dependent distributed structure representations by backpropagation through structure. In IEEE International Conference on Neural Networks, pages 347-352, 1996.
32. M. Bianchini, M. Gori, and F. Scarselli. Processing acyclic graphs with recursive neural networks. Technical Report TR-DII-11-2001, Dipartimento di Ingegneria dell'Informazione, Università di Siena, Siena, Italy, 2001.
33. M. Diligenti, M. Gori, M. Maggini, and E. Martinelli. Graphical pattern recognition. In Proceedings of ICAPR98, pages 425-432, 1998.
34. P. Frasconi, M. Gori, and A. Sperduti. Integration of graphical-based rules with adaptive learning of structured information. In Stefan Wermter and Ron Sun, editors, Hybrid Neural Networks, pages 211-225. Springer Verlag, Vol. 1778, 2000.
35. P. Frasconi, M. Gori, S. Marinai, J. Sheng, G. Soda, and A. Sperduti. Logo recognition by recursive neural networks. In Proceedings of GREC97, pages 144-151, 1998.
36. M. Diligenti, M. Gori, M. Maggini, and E. Martinelli. Adaptive graphical pattern recognition for the classification of company logos. Pattern Recognition, 34(10):2049-2061, October 2001.
37. M. Gori and F. Scarselli. Are multilayer perceptrons adequate for pattern recognition and verification? IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(10):1121-1132, October 1998.
38. H. Bunke, M. Gori, M. Hagenbuchner, C. Irniger, and A. C. Tsoi. Generation of image databases using attributed plex grammars. In Proceedings of GbR2001, pages 200-209, Ischia, Italy, June 2001.
39. C. De Mauro, M. Gori, and M. Maggini. Apex: An adaptive visual information retrieval system. In Proceedings of ICDAR, Seattle, WA, September 2001. To appear.
40. J. A. Anderson and E. Rosenfeld, editors. Neurocomputing: Foundations of Research. MIT Press, Cambridge, 1988.
CHAPTER 3

ADAPTIVE SELF-ORGANIZING MAP IN THE GRAPH DOMAIN
Simon Günter and Horst Bunke
Department of Computer Science, University of Bern
Neubrückstrasse 10, CH-3012 Bern, Switzerland
E-mail: [email protected], bunke@iam.unibe.ch
A new clustering algorithm, which is an extension of self-organizing map (som) into the domain of graphs, is introduced first. Then two adaptive versions of this algorithm are derived, which are able to find the optimal number of clusters automatically. While one of these adaptive clustering methods needs to be initialized with a number of clusters that is at least as large as the optimal number, the other does not need any prior knowledge about the true number of clusters. The suitability of both adaptive methods is experimentally evaluated. The paper is concluded with a discussion of the applicability of the proposed clustering algorithms, depending on available a priori knowledge about the optimal number of clusters.
1. Introduction

Self-organizing map (som) is a very well established method in the field of neural networks.1 A som can be used to learn the distribution of an unlabeled population of patterns, and approximate it in a lower-dimensional space. Another important application of som is clustering, which is the process of dividing a set of given patterns into groups, or clusters, such that all patterns in the same group are similar to each other, while patterns from different groups are dissimilar. Clustering problems are ubiquitous in pattern recognition. For recent surveys on the most important clustering methods see, for example, Refs. 2 and 3. As a matter of fact, almost all clustering algorithms published in the literature are based on pattern representations in terms of feature vectors. There are only a few works addressing
the clustering of symbolic data structures, in particular graphs.4,5,6 This lack of clustering algorithms for the symbolic domain is an unfortunate restriction, because symbolic data structures have a higher representational power than feature vectors. Many clustering algorithms require the number of clusters to be known beforehand. To overcome this restriction, various cluster validation indices have been proposed.2,3 These indices make it possible to measure the quality of a clustering. However, they suffer from the fact that the underlying clustering algorithm has to be executed multiple times, once for each feasible number of clusters. Therefore, the process of finding the optimal number of clusters automatically by means of cluster validation indices is computationally quite costly. In the present paper two issues will be addressed. First, a graph clustering algorithm that was recently introduced by the authors7,8 will be reviewed. Secondly, an adaptive version of the same algorithm will be described, which is able to find the optimal number of clusters automatically. The paper is organized as follows. In Section 2 the basic concepts and terminology will be introduced. Then the new graph clustering algorithm will be described in Section 3. The following section provides an extension of this algorithm so as to find the optimal number of clusters automatically. Experimental results will be presented in Section 5, and conclusions drawn in Section 6.

2. Preliminaries
In this paper we consider graphs with labeled nodes and edges. Let L_V and L_E denote sets of node and edge labels, respectively. Formally, a graph is a 4-tuple, g = (V, E, μ, ν), where V is the set of nodes, E ⊆ V × V is the set of edges, μ : V → L_V is a function assigning labels to the nodes, and ν : E → L_E is a function assigning labels to the edges. A graph isomorphism from a graph g to a graph g' is a bijective mapping from the nodes of g to the nodes of g' that preserves all labels and the structure of the edges. Graph isomorphism is a useful concept to find out if two patterns are the same, up to invariance properties inherent to the underlying graph representation. Real world objects are usually affected by noise, such that the graph representations of identical objects may not exactly match. Therefore it is necessary to integrate some degree of error tolerance into the graph
matching process. A powerful concept to deal with noise and distorted graphs is error-correcting graph matching using graph edit distance. In its most general form, a graph edit operation is either a deletion, insertion, or substitution (i.e. label change). Edit operations can be applied to nodes as well as to edges. Their purpose is to correct errors and distortions in graphs. Formally, let g1 = (V1, E1, μ1, ν1) and g2 = (V2, E2, μ2, ν2) be two graphs. An error-correcting graph matching (ecgm) from g1 to g2 is a bijective function f : V̂1 → V̂2, where V̂1 ⊆ V1 and V̂2 ⊆ V2. We say that node x ∈ V̂1 is substituted by node y ∈ V̂2 if f(x) = y. If μ1(x) = μ2(f(x)) then the substitution is called an identical substitution. Otherwise it is termed a non-identical substitution. Any node from V1 − V̂1 is deleted from g1, and any node from V2 − V̂2 is inserted in g2 under f. The mapping f directly implies an edit operation on each node in g1 and g2, i.e., nodes are substituted, deleted, or inserted, as described above. Additionally, the mapping f indirectly implies substitutions, deletions, and insertions on the edges of g1 and g2. By means of the edit operations implied by an ecgm, differences between two graphs that are due to noise and distortions are modeled. In order to enhance the noise modeling capabilities, a cost is often assigned to each edit operation. The costs are non-negative real numbers. They are application dependent. Typically, the more likely a certain distortion is to occur, the lower is its cost. The cost c(f) of an ecgm f from a graph g1 to a graph g2 is the sum of the costs of the individual edit operations implied by f. An ecgm f from a graph g1 to a graph g2 is optimal if there is no other ecgm from g1 to g2 with a lower cost. The edit distance, d(g1, g2), of two graphs is equal to the cost of an optimal ecgm from g1 to g2, i.e.

    d(g1, g2) = min{ c(f) | f : V̂1 → V̂2 is an ecgm }

In other words, the edit distance is equal to the minimum cost required to transform one graph into the other. For more details on error-correcting graph matching and graph edit distance, including computational procedures, see, for example, Ref. 9.

3. Graph Clustering Using SOM
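The minimization over all ecgms defined in Section 2 can be illustrated with a deliberately tiny brute-force sketch. It handles node labels only (edges are omitted for brevity) and enumerates all partial bijections, so it is exponential and purely didactic; practical computational procedures are the subject of Ref. 9.

```python
from itertools import permutations

def edit_distance(nodes1, nodes2, sub_cost, del_cost, ins_cost):
    """Brute-force node edit distance: pad both node-label lists with
    None so that every partial bijection shows up as a full alignment,
    then minimize the summed cost of the implied substitutions,
    deletions and insertions (edges are ignored for brevity)."""
    a = nodes1 + [None] * len(nodes2)
    b = nodes2 + [None] * len(nodes1)
    best = float("inf")
    for perm in permutations(range(len(b))):
        cost = 0.0
        for i, j in enumerate(perm):
            x, y = a[i], b[j]
            if x is not None and y is not None:
                cost += sub_cost(x, y)   # x is substituted by y
            elif x is not None:
                cost += del_cost(x)      # x is deleted from g1
            elif y is not None:
                cost += ins_cost(y)      # y is inserted in g2
        best = min(best, cost)
    return best
```

For the toy costs sub = |x - y| and del = ins = value, the distance between the node sets {1, 2} and {2} is 1: node 2 is mapped identically and node 1 is deleted.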
In Refs. 7 and 8 a new graph clustering algorithm is introduced. For the purpose of completeness, this algorithm will be briefly reviewed in the present
section. It is derived from the self-organizing map (som), as described in Ref. 1. While the 'classical' som-algorithm is based on vectorial pattern representations, the new clustering algorithm works in the domain of graphs. Hence it can be considered a generalization that includes the original som as a special case. A pseudo code description of the classical som-algorithm is given in Fig. 1. The algorithm can serve two purposes, either clustering, or mapping a high-dimensional pattern space to a lower-dimensional one. In the present paper we focus on its application to clustering. Given a set of patterns, X, the algorithm returns a prototype y_i for each cluster i. The prototypes are sometimes called neurons. The number of clusters, M, is a parameter that must be provided a priori. In the algorithm, first each prototype y_i is randomly initialized (line 4). In the main loop (lines 5-11) one randomly selects an element x ∈ X and determines the neuron y* that is nearest to x. In the inner loop (lines 8, 9) one considers all neurons y that are within a neighborhood N(y*) of y*, including y*, and updates them according to the formula in line 9. The effect of neuron updating is to move neuron y closer to pattern x. The degree by which y is moved towards x is controlled by the parameter γ, which is called the learning rate. It has to be noted that γ is dependent on the distance between y and y*, i.e. the smaller this distance is, the larger is the change on neuron y. After each iteration through the repeat-loop, the learning rate γ is reduced by a small amount, thus facilitating convergence of the algorithm. It can be expected that after

som-algorithm
(1)   input: a set of patterns, X = {x1, ..., xN}
(2)   output: a set of prototypes, Y = {y1, ..., yM}
(3)   begin
(4)     initialize Y = {y1, ..., yM} randomly
(5)     repeat
(6)       select x ∈ X randomly
(7)       find y* such that d(x, y*) = min{d(x, y) | y ∈ Y}
(8)       for all y ∈ N(y*) do
(9)         y = y + γ(x - y)
(10)      reduce learning rate γ
(11)    until termination condition is true
(12)  end

Fig. 1. The som-algorithm.
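The algorithm of Fig. 1 can be rendered for ordinary feature vectors in a few lines. In this sketch the neighborhood N(y*) is reduced to the winner itself, and the initialization, iteration count, and decay schedule are illustrative choices rather than those of Ref. 1.

```python
import random

def som(X, M, iters=2000, gamma=0.5, decay=0.999):
    """Minimal vectorial som (Fig. 1 with the neighborhood N(y*)
    reduced to the winner itself): pick a random pattern, find the
    nearest prototype, move it toward the pattern, shrink gamma."""
    rng = random.Random(0)
    Y = [list(rng.choice(X)) for _ in range(M)]    # random init from data
    for _ in range(iters):
        x = rng.choice(X)
        y = min(Y, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        for i in range(len(y)):                    # y = y + gamma*(x - y)
            y[i] += gamma * (x[i] - y[i])
        gamma *= decay
    return Y

# two well-separated clusters; the prototypes settle near their centers
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
Y = som(X, M=2)
```

Each prototype returned by `som` then acts as a cluster center, exactly as described in the text that follows.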
a sufficient number of iterations the y_i's have moved into areas where many x_j's are concentrated. Hence each y_i can be regarded as a cluster center. The cluster around center y_i consists of exactly those patterns that have y_i as closest neuron. More detail about this algorithm can be found in Ref. 1. In the original version of the som-algorithm all x_j's and y_i's are feature vectors.1 To make the algorithm applicable in the graph domain, two new concepts are needed. First, a graph distance measure has to be provided in order to find the graph y* that is closest to x (see line 7). For this purpose the use of graph edit distance as described in Section 2 has been proposed in Refs. 7 and 8. Secondly, a graph updating procedure implementing line 9 has to be found. Such a procedure has been described in Refs. 7 and 8. It is derived from graph edit distance computation. The result of computing the edit distance, d(g1, g2), of two graphs, g1 and g2, is a sequence of edit operations e1, ..., en that transforms g1 into g2 with minimum cost. If we take a subsequence e_i1, ..., e_ik of those edit operations and apply them to g1, we obtain a new graph, g3. This new graph can be regarded as a version of g1 that has been changed so as to make it more similar to g2, in the same way as y is made more similar to x by means of the operation in line 9 of Fig. 1. Moreover, the cost of the subsequence e_i1, ..., e_ik corresponds to γ, controlling the degree by which g1 is made more similar to g2. With the two new concepts described in the last paragraph, the som-algorithm becomes in fact applicable in the domain of graphs. For further details, including the initialization procedure (line 4), definition and computation of the neighborhood N(y*) (line 8), and termination (line 11), see Refs. 7 and 8.

4. Adaptive Graph Clustering

Many clustering algorithms, including the original and the graph-based version of som, require the number of clusters given as an input parameter.
This is a serious problem as this number is often not known. In Ref. 10 methods are discussed to solve this problem for the traditional version of som by dynamically adding and deleting neurons during the execution of the som procedure. Using similar techniques, we propose two adaptive versions of som for the graph domain that are able to find the optimal number of clusters automatically. Both versions are based on the concept of neuron utility, which will be introduced in Section 4.1. Then in Section 4.2 the complete procedure for adaptive clustering in the graph domain is presented.
4.1. Neuron Utility
The utility of a neuron y is an indicator reflecting how much y contributes to the approximation of the input data, i.e., it shows the contribution of y to minimizing the sum S of squared distances between all inputs and their nearest neuron. Neurons with a high utility are important and should be kept, while a neuron with a low utility can be removed without significantly changing S. Let y1 and y be neurons and x an input pattern. Then the utility of y1 is defined as follows:

    U(y1) = Σ_{x ∈ near(y1)} [ min{d(y, x)² | y ≠ y1} − d(y1, x)² ]

where

    near(y1) = { x | (∀y)(y ≠ y1) ⇒ d(y, x) > d(y1, x) }

In these equations, d(_, _) denotes the graph edit distance. The utility U(y1) of neuron y1 is obtained by computing, for all inputs x ∈ near(y1), the difference between the squared distance of the second nearest neuron to x and the squared edit distance of y1 to x. Set near(y1) consists of all inputs that have y1 as closest neuron. Clearly, if y1 has a small utility then either for each input pattern in its immediate neighborhood there is another neuron close by, or there are no input patterns in its immediate neighborhood at all. In the second case, near(y1) = ∅ and U(y1) = 0. On the other hand, if U(y1) is large, then the removal of y1 would leave all inputs in set near(y1) without a neuron close by, and the sum of squared distances of all inputs to their nearest neuron would significantly increase.

4.2. Finding the Optimal Number of Clusters
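The utility defined in Section 4.1 can be transcribed almost literally. In this sketch the distance d is passed in as a parameter; the simple absolute-difference d used below is a stand-in for the graph edit distance, chosen only to make the example self-contained.

```python
def utilities(neurons, patterns, d):
    """Neuron utility as defined above: for every input x whose nearest
    neuron is y1, accumulate the squared distance to the second-nearest
    neuron minus the squared distance to y1 itself."""
    U = {y: 0.0 for y in neurons}
    for x in patterns:
        ranked = sorted(neurons, key=lambda y: d(y, x))
        y1, y2 = ranked[0], ranked[1]
        U[y1] += d(y2, x) ** 2 - d(y1, x) ** 2
    return U

d = lambda y, x: abs(y - x)   # stand-in for the graph edit distance
U = utilities([0.0, 10.0], [0.0, 1.0, 9.0], d)
```

A neuron that is nearest to no input keeps utility 0 and is thus the first candidate for removal.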
Assume that the number of clusters given to the som algorithm shown in Fig. 1 as input parameter is too small. Then it can be expected that, upon termination of the algorithm, the utility of each neuron is high. On the other hand, we anticipate one or several neurons with a low utility if the number of clusters given a priori is too high. Based on these considerations, the algorithm given in Fig. 1 is modified as follows. After a certain number, N > 1, of runs through the repeat loop (lines 5-11), the utility of each neuron is checked. As this check is computationally expensive, it should not be done often. On the other hand, if the utility check isn't done frequently enough, the algorithm may converge
towards undesired solutions. In the experiments described in Section 5, N was set equal to 200. After the utility of each neuron has been computed, the value u_min of the neuron y_min with the smallest utility is determined. Also the average utility u_av of all neurons is computed. If u_min < u_av/c, where c > 1 is a user-defined constant, the neuron y_min is deleted. Otherwise, if u_min > u_av/c, a new neuron is inserted. Each time the repeat loop (lines 5-11 of Fig. 1) is started with a new number of neurons, the learning rate γ is reset to the value it had when the last utility check was performed, so as to give the new population of neurons more flexibility to adapt to the input patterns. Upon termination of the som algorithm a postprocessing procedure is executed, where the utility of all neurons is checked one more time. Any remaining neuron with a utility smaller than u_av/c is deleted by this postprocessing procedure.
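A single utility check of this adaptive scheme might look as follows. The constant c and the rule for creating a new neuron (spawn) are illustrative assumptions, and a real implementation would place the inserted neuron more carefully; the absolute-difference d stands in for the graph edit distance.

```python
def adapt(neurons, patterns, d, c=2.0, spawn=lambda: 5.0):
    """One utility check of the adaptive scheme: compute each neuron's
    utility, then delete the least useful neuron if its utility falls
    below u_av / c, and otherwise insert a new one (c and the spawn
    rule are illustrative choices)."""
    U = {y: 0.0 for y in neurons}
    for x in patterns:
        ranked = sorted(neurons, key=lambda y: d(y, x))
        U[ranked[0]] += d(ranked[1], x) ** 2 - d(ranked[0], x) ** 2
    u_av = sum(U.values()) / len(neurons)
    y_min = min(U, key=U.get)
    if U[y_min] < u_av / c:
        return [y for y in neurons if y != y_min]   # delete y_min
    return neurons + [spawn()]                       # insert a new neuron

d = lambda y, x: abs(y - x)   # stand-in for the graph edit distance
```

Calling such a check every N iterations, and resetting γ after each change, yields the behavior described in the text.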
It assumes that a number of neurons greater than, or equal to, the optimal number is given in the initialization phase. During execution of the algorithm the number of neurons is never changed. That is, each time a neuron is deleted because its utility is too small, a new neuron is inserted. Upon termination, i.e. after a given number of iterations through the repeat loop (lines 5-11 in Fig. 1) has been executed, the postprocessing procedure described above is executed, removing all neurons with a utility smaller than u_av/c. From now on, the static version of the new graph clustering algorithm described in Section 3 will be called GraSom, while the two adaptive
versions will be called AdaGraSom1 and AdaGraSom2. The first one, AdaGraSom1, is the algorithm that can be initialized with any number of neurons, while AdaGraSom2 needs a value upon initialization that is greater than, or equal to, the correct number of neurons.

5. Experimental Results

The graph clustering algorithm GraSom and its adaptive versions AdaGraSom1 and AdaGraSom2, described in Sections 3 and 4, respectively, were experimentally evaluated. In the experiments, graph representations of capital characters were used. In Fig. 2, 15 characters are shown, each representing a different class. The characters are composed of straight line segments. In the corresponding graphs, each line segment is represented by a node with the coordinates of its endpoints in the image plane as attributes. No edges are included in this kind of graph representation. The edit costs are defined as follows. The cost of deleting or inserting a line segment is proportional to its length, while the cost of substituting one line segment by another corresponds to the difference in length of the two considered segments. This kind of representation will be called R1 in the following.
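The edit costs under representation R1 can be sketched as follows. The proportionality constants (set to 1 below) and the function names are illustrative assumptions; the chapter does not fix them.

```python
import math

def seg_length(seg):
    """Length of a straight line segment given by its two endpoints."""
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)

def del_cost(seg):
    # Deleting a line segment: cost proportional to its length.
    return seg_length(seg)

def ins_cost(seg):
    # Inserting a line segment: cost proportional to its length.
    return seg_length(seg)

def sub_cost(seg_a, seg_b):
    # Substitution: cost equal to the difference in length.
    return abs(seg_length(seg_a) - seg_length(seg_b))
```

For instance, substituting a segment of length 3 by one of length 4 costs 1, regardless of their orientation, since R1 attaches no cost to direction.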
Fig. 2. 15 characters, each representing a different class.
In addition to representation R1, a second representation, called R2 below, was considered. Under R2, the nodes represent locations where either a line segment ends, or where the end points of two different line segments coincide. The attributes of a node represent its location in the image. There is an edge between two nodes if the corresponding
locations are connected by a line in the image. No edge attributes are used in this representation. The deletion and insertion cost of a node is a constant, while the cost of a node substitution is proportional to the distance between the corresponding points in the image plane. The deletion and insertion of an edge also have constant cost. As there are no edge labels, edge substitutions will never be needed under representation R2. For each of the 15 prototypical characters shown in Fig. 2, ten distorted versions were generated. Examples of distorted A's and E's are shown in Figs. 3 and 4, respectively. The degree of distortion of the other characters is similar to that in Figs. 3 and 4. As a result of the distortion procedure, a sample set of 150 characters was obtained. Although the identity of each sample was known, this information was not used in the experiments described below, i.e., only unlabeled samples were used in the experiments.
Fig. 3. Ten distorted versions of character A.
Fig. 4. Ten distorted versions of character E.
The graph clustering algorithm GraSom described in Section 3 was run on a set of 150 graphs representing the (unlabeled) sample set of characters, with the number of clusters set to 15. As the algorithm is non-deterministic, a total of 10 runs were executed. The cluster centers obtained in one of these
Fig. 5. Cluster centers obtained in one of the experimental runs.
runs are depicted in Fig. 5. Obviously, all cluster centers are correct in the sense that they represent meaningful prototypes of the different character classes. Similar results were obtained in all other runs for both representations R1 and R2, i.e., in none of the runs was an incorrect prototype generated. Also, all of the 150 given input patterns were assigned to their correct cluster center. From these experiments it can be concluded that GraSom is able to produce a meaningful partition of a given set of graphs into clusters and to find an appropriate prototype of each cluster, if the correct number of clusters is known. The purpose of the experiments described below is to analyze the behavior of the adaptive versions, AdaGraSom1 and AdaGraSom2, in case the number of clusters is not known. The same set of data as for the experiments conducted with GraSom was used, i.e., ten unlabeled distorted versions of each of the 15 characters were generated. Then AdaGraSom1 and AdaGraSom2 were executed on these data. Each run was repeated ten times to account for the random nature of these algorithms. A summary of the experimental results is shown in Table 1. AdaGraSom1 was run with the initial number of clusters set equal to 1, 15, and 25, while AdaGraSom2 was run only with this parameter set equal to 25. Both graph representations, R1 and R2, were used. The numbers in Table 1 denote the relative rate of error in the number of clusters produced. This quantity is defined as the number of spurious or missing clusters relative to the correct number, accumulated over all experimental runs. For example, the instance of 0% in the first row and
Table 1. Summary of experimental results, see text (i denotes the number of initial neurons; r denotes the type of graph representation).

            AdaGraSom1                     AdaGraSom2
 r \ i      1         15        25         25
 R1         0%        0%        0%         0.67%
 R2         7.33%     4%        2%         0%
first column of the table means that when AdaGraSom1 was run on graph representation R1 with one initial neuron, it produced 15 clusters, the correct number, in each of the ten runs. By contrast, the number 0.67 in the first row and last column means there was one spurious or missing cluster in one out of ten runs, while in the other nine runs the correct number of 15 clusters was produced (i.e. 0.67% = (1/150) · 100%). Note that the relative rate of error for GraSom is 0%. From Table 1 we conclude that AdaGraSom2 produced good results for both R1 and R2. Actually, there was only one cluster out of 300 either missing or spurious. The behavior of AdaGraSom1 is more sensitive to the type of graph representation used. For R1 perfect results were obtained, but for R2 several errors occurred. The most stable version of AdaGraSom1 is the one where we start with a number of neurons higher than the true value. A second measure for assessing the results produced by AdaGraSom1 and AdaGraSom2 is the average graph edit distance of a pattern to its cluster center, i.e. to its nearest neuron. The smaller this number is, the more compact, i.e. the better, is the corresponding clustering. This quantity is shown in Table 2.

Table 2. Summary of experimental results, see text (i denotes the number of initial neurons; r denotes the type of graph representation).

It was computed only for those cases where the
algorithm produced the correct number of clusters. As none of the runs corresponding to the first column of the second row produced the correct number of clusters, this position is left empty. The two corresponding values obtained under GraSom for R1 and R2 are 0.59 and 0.5, respectively. It is evident from Table 2 that the values obtained under AdaGraSom1 and AdaGraSom2 compare very favorably with those obtained with GraSom. This can be explained by the fact that, due to the variable number of neurons, AdaGraSom1 gives the cluster centers more flexibility to adapt to the actual input pattern population. Similarly, due to the higher number of neurons, cluster centers have a higher chance of migrating to the 'true' positions under AdaGraSom2.

6. Conclusions
Clustering has become a mature discipline that is important to a number of areas, including pattern recognition and related fields. But the clustering of graphs is still widely unexplored. This is an unfortunate restriction because graphs and other symbolic data structures have a much higher representational power than object representations in terms of feature vectors, which prevail in today's clustering algorithms. In the present paper a new graph clustering algorithm, GraSom, is described. It is an extension of the well-known som-algorithm into the domain of graphs. Moreover, two adaptive versions of GraSom, called AdaGraSom1 and AdaGraSom2, are introduced. These algorithms are able to overcome one of the shortcomings of the classical som-algorithm, and of GraSom. While in the som-algorithm and in GraSom the number of clusters is static and needs to be defined beforehand, AdaGraSom1 and AdaGraSom2 can find the optimal number of clusters automatically. Unlike cluster validation indices, which require a clustering algorithm to be repeatedly executed for each possible number of clusters, these algorithms dynamically adapt the number of clusters to the structure of the input data. AdaGraSom1 can be initialized with any number of neurons, while AdaGraSom2 requires an initial number of neurons at least as large as the true number. The applicability of the new graph clustering algorithms has been demonstrated through a number of experiments. From these experiments and studies reported elsewhere (Ref. 11), the following conclusions can be drawn. If the true number of clusters is known, then it is most straightforward to apply GraSom. This algorithm produces good results and is more efficient than
AdaGraSom1 and AdaGraSom2. In case the number of clusters is approximately known, both AdaGraSom1 and AdaGraSom2 qualify for application. If AdaGraSom2 is selected, it has to be started with a number of neurons that is guaranteed not to be below the true number of clusters. Also for AdaGraSom1 it is advisable to run it with an initial number of neurons slightly higher than the expected true value. As a third alternative, GraSom in combination with one or more of the cluster validation indices considered in Ref. 11 can be applied. However, to keep the overhead manageable, this alternative should be chosen only if the range of possible values of the number of clusters is restricted. Finally, if the number of clusters is totally unknown, AdaGraSom1 qualifies for application. In this case it is advisable to run the algorithm multiple times with different numbers of initial neurons and select the best result based on validation indices as discussed in Ref. 11. Alternatively, several runs of AdaGraSom2 can be executed with a variable number of initial neurons; also in this case the final result has to be selected using validation indices (Ref. 11). It can finally be concluded that a variety of different clustering algorithms have become available. These algorithms cover a broad spectrum of potential applications. Future research will consider clustering algorithms other than som, and their adaptation to the domain of graphs.
References

1. T. Kohonen. Self-Organizing Maps, Springer Verlag, 1997.
2. S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1998.
3. A. K. Jain, M. N. Murty and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31, pp. 264-323, 1999.
4. R. Englert and R. Glantz. Towards the clustering of graphs, in Proc. 2nd IAPR-TC-15 Workshop on Graph-based Representations, Austria, Austrian Computer Society, pp. 125-133, 2000.
5. D. S. Seong, H. S. Kim and K. H. Park. Incremental clustering of attributed graphs, IEEE Trans. on Systems, Man and Cybernetics, pp. 1399-1411, 1993.
6. D. Riano and F. Serratosa. Unsupervised synthesis of function described graphs, in Proc. 2nd IAPR-TC-15 Workshop on Graph-based Representations, Austria, Austrian Computer Society, pp. 165-171, 2000.
7. S. Günter. Graph clustering using Kohonen's method, Diploma Thesis, University of Bern, 2000 (in German).
8. S. Günter and H. Bunke. Self-organizing map for clustering in the graph domain; to appear in Pattern Recognition Letters.
9. B. T. Messmer and H. Bunke. A new algorithm for error tolerant subgraph isomorphism. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, pp. 493-505.
10. B. Fritzke. Growing self-organizing networks - history, status quo and perspectives. In E. Oja and S. Kaski, editors, Kohonen Maps, Elsevier, 1999, pp. 131-144.
11. S. Günter and H. Bunke. Validation indices for graph clustering, Proc. 3rd IAPR-TC-15 Workshop on Graph-based Representations in Pattern Recognition, May 23-25, 2001, Ischia (Italy), pp. 229-238.
CHAPTER 4

FROM NUMBERS TO INFORMATION GRANULES: A STUDY IN UNSUPERVISED LEARNING AND FEATURE ANALYSIS

Andrzej Bargiela
The Nottingham Trent University, Burton Street, Nottingham NG1 4BU, England
E-mail:
[email protected]
Witold Pedrycz
University of Alberta, Edmonton, Canada, T6G 2G7
E-mail: [email protected]
and Systems Research Institute, Polish Academy of Sciences, 01-447 Warsaw, Poland

This chapter focuses on granular clustering: a way of finding structure in heterogeneous data and representing the data in the form of information granules. The main features of the proposed granular clustering approach are: (a) a noninvasive exploration of data carried out under weak assumptions as to the nature of the data, and (b) transparency of the constructed information granules, which assume the form of hyperboxes in the problem space. We introduce a compatibility measure that expresses a degree of "similarity" between two information granules and takes into account both the distance between the granules and their size. We show how to "grow" clusters through a process of merging existing data points that exhibit high values of the compatibility measure. The clustering algorithm is discussed along with a comprehensive validation mechanism for the resulting structures (collections of information granules). We formulate a problem of feature analysis in the setting of information granules and introduce some quantitative measures describing each feature. Numerical experiments use two-dimensional synthetic data and the multivariable Boston housing data.
1. Introduction

Combining patterns into some form of structure is the fundamental underpinning of Pattern Recognition (PR). Any in-depth analysis of patterns leads to optimal and interpretable classifiers. Interestingly, with the increasing heterogeneity of available data (patterns) and the steadily growing complexity of real-life classification problems, one has to look at a uniform and general treatment of various PR scenarios. In one way or another, a need arises for the formulation of classification problems in the language of information granules: conceptual entities that capture the essence of the overall data set while retaining the character of individual patterns. It is worth stressing that information granulation can be seen as a vehicle of abstraction supporting the transition from clouds of numeric data and small information granules to larger and more general information granules (Refs. 2, 3, 5, 12, 13, 16-18). The area of clustering (unsupervised learning), with its long history, has been an important endeavor in PR. Various algorithms for finding structures in data and representing the essence of such structures in terms of prototypes, dendrograms, self-organizing maps (Refs. 8, 9) and the like (Refs. 1, 4) have been used for a long time within the research community and industry. Commonly, if not exclusively, the direct aspect of granulation has not been tackled. The intent of this study is to address this important problem by introducing the idea of granular clustering. The simplest scenario looks like this: we start from a collection of numeric data (points in R^n) and form information granules whose distribution and size reflect the essence of the data and reveal its structure. Forming the clusters (information granules) may be treated as a process of growing information granules.
As the clustering progresses, we expand the clusters, enhancing the descriptive power of the granules while gradually reducing the amount of detail available to us. The information granules of interest in this study are represented as hyperboxes positioned in a high-dimensional data space. The mathematical formalism of interval analysis provides a robust framework for the analysis of the information density of these granular structures. The study's intuitive objective is to match the granularity of the data items used to describe physical systems to the structure of these systems. In this sense, the granulation process attempts to achieve the highest possible generalization while maintaining the character of individual data structures. Hybrid pattern classifiers are used in the context of this study differently from what is commonly encountered in the literature. While most
hybrid systems allude to some sort of neurofuzzy architecture, sometimes augmented by evolutionary mechanisms, in this study we are concerned with hybridization occurring at the level of information granules. In other words, a hybrid system operates on a spectrum of information granules ranging from numeric data through to more general (less specific) information granules. The chapter is organized into 7 sections. In Section 2, we introduce information granules and the rationale behind them. As we are concerned with information granulation carried out in terms of sets (hyperboxes), we also provide all pertinent set notation in this section. The principle of granular clustering is covered in Section 3 and the granular clustering algorithm is presented in Section 4. Feature analysis completed in the framework of information granules is studied in Section 5. Experimental studies are discussed in Section 6 and the conclusions are given in Section 7.

2. Information Granules and Information Granulation

Information granulation is a process of data organization and data comprehension. Interestingly, humans granulate information almost subconsciously. This makes the ensuing cognitive processes so effective and far superior to machine intelligence. Two representative categories of problems in which information granulation plays a prominent role involve the processing of one- and two-dimensional signals. The first case primarily concerns temporal signals; the latter pertains to image processing and image analysis. In the processing, analysis and interpretation of signals, information granules arise as a result of temporal sampling and aggregation. Several samples in the same time window can be represented as a single information granule. In the simplest case, such an interval can be formed by taking the minimal and maximal value of the signal occurring in this window of granulation (refer to Fig. 1).
Some other ways of forming information granules may rely on statistical analysis; one determines a mean or median as a representative of the numeric data points and then builds a confidence interval around it (obviously, the use of this mechanism requires assumptions about the statistical properties of the population contained in the window as well as the numeric representative itself). Similarly, in image processing one combines pixels within some spatial neighborhood. Again, various features of an image can be granulated, such as brightness, texture, color, etc.
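The min/max window granulation just described can be sketched in a few lines (the function name and signature are ours, not from the chapter):

```python
def granulate(signal, window):
    """Turn a numeric time series into interval granules by taking the
    min and max over consecutive windows of `window` samples, as in the
    sampling-based granulation of Fig. 1."""
    granules = []
    for start in range(0, len(signal), window):
        chunk = signal[start:start + window]
        granules.append((min(chunk), max(chunk)))
    return granules
```

Each resulting (min, max) pair is an interval granule; a window containing a single sample yields a degenerate interval, i.e. a numeric datum.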
Fig. 1. A fragment of a time series and its granulation through sampling (T_s denotes a sampling interval).
Information granulation has been studied in Refs. 2, 3, 10 and 12, in terms of the concept itself, its computational aspects and the resulting structures. While this chapter is written as a self-contained unit, the reader may be interested in a broader discussion of the information granulation issues that can be found in the above publications.

2.1. Set-based Framework of Information Granules: The Language of Hyperboxes

In the overall presentation we adhere to a standard notation. A hyperbox defined in R^n, denoted by B, is fully described by its lower (l^B) and upper (u^B) corner. To use explicit notation, we write B(l^B, u^B), where l^B, u^B ∈ R^n. An evident inclusion relationship holds true; we can express it as l_i^B ≤ u_i^B for i = 1, ..., n, where l^B = [l_1^B, l_2^B, ..., l_n^B] and u^B = [u_1^B, u_2^B, ..., u_n^B]. If l^B = u^B, then the hyperbox reduces to a single point (numeric datum), B(l^B, l^B) = {l^B}. Hyperboxes are elements of a family of sets defined in R^n. More specifically, we state that B ∈ P(R^n), with P(·) being the power set of R^n. The volume of B, denoted by V(B), is viewed as a measure of specificity of the information granule. The point B(l^B, l^B) has the highest specificity. As the volume increases, the specificity of the information granule decreases correspondingly. Computationally, it is advantageous to consider the expression exp(-V(B)), which captures this aspect of granularity and is normalized, i.e. it attains 1 for a numeric datum and tends to zero once the hyperbox starts growing. It is instructive to elaborate on the use of such information granules in the realm of PR (in which we are quite commonly confined to the language
Fig. 2. Expressing data as hyperboxes (a) and ellipsoidal information granules (b); note a reconstruction deficiency caused by the dependency between the features.
of probability and probabilistic granules, articulated as probability functions or probability density functions). Transparency of results is a key factor here. To illustrate this point, Fig. 2 shows two granular constructs. In the first case, a two-dimensional box captures the essence of the data: we may state that the Cartesian product of [a, b] and [c, d], B = [a, b] x [c, d], "covers" the data. Moreover, both features (intervals) maintain their identity. In contrast, ellipsoidal information granules (which can otherwise be quite expressive) do not provide the same transparency as hyperboxes. Obviously one can project the ellipsoid on the corresponding features. Note, however, that the reconstruction (the Cartesian product C = [e, f] x [g, h]) could be quite different from the original granule.
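The hyperbox notation of Section 2.1 translates directly into code. The following sketch (class and method names are ours) records the lower and upper corners l^B and u^B, the volume V(B), and the specificity exp(-V(B)):

```python
import math

class Hyperbox:
    """A hyperbox B(l, u) in R^n as defined in Section 2.1 (a sketch)."""

    def __init__(self, l, u):
        # The inclusion relationship l_i <= u_i must hold in every dimension.
        assert all(li <= ui for li, ui in zip(l, u))
        self.l, self.u = list(l), list(u)

    def is_point(self):
        # B(l, l) = {l}: the hyperbox degenerates to a numeric datum.
        return self.l == self.u

    def volume(self):
        v = 1.0
        for li, ui in zip(self.l, self.u):
            v *= (ui - li)
        return v

    def specificity(self):
        # exp(-V(B)): 1 for a numeric datum, tending to 0 as the box grows.
        return math.exp(-self.volume())
```

A point thus has specificity 1, while any box of positive volume has specificity strictly below 1, matching the normalization described above.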
3. The Principle of Granular Clustering

Before we proceed with the details of the granular clustering technique, it is instructive to discuss the underlying principle, learn how the process proceeds, and concentrate on the interpretation of some results generated by the proposed clustering mechanism. As emphasized in the literature (Refs. 1, 4), the essence of clustering (unsupervised learning) is to discover a structure in data. It is generally true that almost all existing clustering techniques operate on numeric objects (vectors in R^n) and produce representatives (prototypes) that are again entirely numeric. In this sense, their form does not reflect how many data points they represent or what the distribution of these data points is. In the design of the clustering method, we add an
extra dimension of granularity that helps sense the structure in the data as it becomes unveiled during the formation of the clusters. Without loss of generality, we focus our attention on the subspace of R^n and concern ourselves with granular clustering algorithms defined on the unit hyperbox [0, 1]^n. Consequently, as a pre-processing step, we normalize all input data to such a hyperbox. This pre-processing ensures that the granular clustering algorithm has a simpler mathematical formulation while retaining generality for all data in R^n.

3.1. The Design

The approach introduced here differs in many ways from other approaches (Refs. 1, 5, 14, 19-21). The leitmotiv is the following:

An abstraction (no matter whether dealing with numeric or granular elements) is achieved through the condensation of original data elements into granules, whose location and granularity reflect the essence of the structure of the data. The more condensation, the larger the sizes of the information granules that realize this aggregation.

The granular clustering is carried out as the following iterative process:

• Find the two most compatible information granules (where the idea of compatibility guiding this search will be quantified later on) and on this basis build a new granule embracing them. In this way, one condenses the data while reducing the size of the data set.
• Repeat the first step until enough data condensation has been accomplished (here one has to come up with a termination criterion or introduce a sound validation mechanism).

Figure 3 illustrates how the clustering algorithm works. We start from a collection of small information granules (the original data) and grow larger information granules. Noticeably, through their growth they tend to reflect the essential characteristics of the original data. The size of the granules reflects how much of the original data they have incorporated and conveys extra information about its distribution. This approach resembles techniques of aggregative hierarchical clustering. There is a striking difference though: in hierarchical clustering we deal with point-size data and the clusters are sets of homogeneous objects. No conceptually new entities are formed. Here we deal with a heterogeneous
Fig. 3. Several snapshots of cluster growing over the clustering process; observe the small information granules forming at the initial stage (first iteration) that are grouped in some well-confined regions and give rise to three apparent large information granules at the later stage of clustering.
mix of data items and "grow" larger information granules from the smaller granules and/or the individual point-size data. It should be stressed that the nature of an information granule is significantly richer compared to that of point data. It involves additional attributes such as "shape" (u^B − l^B) and "size" (||u^B − l^B||) in addition to the "position" (u^B) attribute that is associated with both granules and point data. By monitoring the attributes of the hyperboxes we can oversee the clustering process more effectively. Essentially, once we find that the attributes of individual boxes indicate that they are incompatible with each other (the notion is explained in Section 4), the process of clustering is terminated. By the same token, this concept should be contrasted with the idea of min-max clustering discussed by Simpson (Refs. 14, 15), as this technique seems to bear some resemblance to the method studied here. The similarity is only superficial though. First, Simpson's method deals with point-size data while we consider data that are represented by either points or hyperboxes in pattern space. Second, the fuzzy membership functions of the information granules proposed by Simpson promote the formation of clusters whose size varies greatly in various dimensions. This is exactly the opposite of what we are trying to promote through the "compatibility measure" (discussed in Section 4). To emphasise the latter point, we present in Fig. 4 a representative of the class of membership functions proposed by Simpson and refer the reader to Fig. 10 for comparison with the functions that have been utilised in the proposed clustering algorithm.
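The two-step iterative process of Section 3.1 can be sketched as follows. This is a simplification of the method, not the authors' implementation: granules are plain (l, u) corner pairs, the stopping rule is a fixed target count rather than the validation mechanism of Section 3.2, and any compatibility measure (such as Eq. (4) of Section 4) can be passed in.

```python
def cluster(granules, compat, target):
    """Grow information granules by repeatedly merging the most
    compatible pair until `target` granules remain (a sketch).
    Each granule is an (l, u) pair of corner vectors."""
    granules = list(granules)
    while len(granules) > target:
        # Step 1: find the two most compatible granules.
        best = max(((i, j) for i in range(len(granules))
                    for j in range(i + 1, len(granules))),
                   key=lambda ij: compat(granules[ij[0]], granules[ij[1]]))
        a = granules.pop(best[1])
        b = granules.pop(best[0])
        # Build the smallest hyperbox embracing both granules.
        l = [min(x, y) for x, y in zip(a[0], b[0])]
        u = [max(x, y) for x, y in zip(a[1], b[1])]
        granules.append((l, u))
        # Step 2: repeat until enough condensation is accomplished.
    return granules
```

With a toy compatibility measure that simply prefers nearby granules, two close points are merged into one box while an outlier is left alone.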
Fig. 4. Simpson's membership function (as presented in Ref. 15) for the hyperbox defined by the min point V = [0.2 0.2] and max point W = [0.4 0.4]. The sensitivity parameter γ is equal to 4.
3.2. Interpretation and Validation of Granular Clustering
In the literature there are a number of cluster validity indexes whose role is to assess the "goodness" of clusters and, as a consequence, to identify the most "plausible" number of clusters. Validity indexes help guide the clustering process by implying what the number of clusters should be. Commonly, their behavior does not lead to clear conclusions though. Even worse, they may generate conflicting suggestions as to the proper number of clusters. In granular clustering we take another position. As the clusters capture the core of the data (and, obviously, this is regarded as an important benefit of the method), our conjecture is that such a core should help establish a sound platform for assessing clusters. When growing the information granules, a criterion worth investigating is the volume of the smallest granule (V_min) that needs to be constructed, at this particular step, in order to cluster two component granules (more specifically, we determine e^(-αV_min); the details will be covered in Section 4.1). If that minimal volume grows quickly, it can be deduced that the compatibility of the component granules is low and the clustering process can be terminated. Again, it is worth emphasizing that the granularity of data adds an extra dimension to any processing. Not only is the location of an information granule essential; its size and shape also play a crucial role in the process of clustering and afterwards during the validation of the clusters.
4. The Computational Aspects of Granular Computing

There are two essential functional elements of granular clustering that need to be described prior to presenting the detailed algorithm. These concern how the distance between two information granules is determined and how we compute an inclusion relation between granules. While the definitions generalize to the multidimensional case, we focus here on a two-dimensional case. Note also that these two concepts work for heterogeneous data, that is, granules and numeric entities in the same feature space.

4.1. Defining Compatibility Between Information Granules
In this section, we discuss how compatibility and inclusion between two information granules are computed. The issue is more complicated than in a numeric case, as these notions are granular and therefore the definitions of compatibility and inclusion should reflect this fact. Consider two information granules A and B. More explicitly, we follow the full notation A(l^A, u^A) and B(l^B, u^B) to point at their location in the space. The compatibility, compat(A, B), involves two components: a distance between A and B, d(A, B), and the size of the information granule that would be formed by merging A and B. The distance d(A, B) between A and B is defined on the basis of the distance between their extreme vertices:

    d(A, B) = (||l^B − l^A|| + ||u^B − u^A||)/2.    (1)
Obviously || · || is a distance defined between two numeric vectors. To make the framework general enough, we treat || · || as an L_p distance, p ≥ 1. By changing the value of "p" we sweep across a spectrum of well-known distances that depend upon the particular value of "p". For instance, p = 1 yields the Hamming distance, L_1. The value p = 2 produces the well-known Euclidean distance, L_2. For p = ∞ we obtain the Tchebyschev distance, L_∞. Once A and B have been combined, giving rise to a new information granule C, the granularity of C can be captured by a volume V(C) computed in the standard fashion

    V(C) = Π_{i=1..n} length_i(C)    (2)

where

    length_i(C) = max(u_i^B, u_i^A) − min(l_i^B, l_i^A)    (3)

for i = 1, 2, ..., n (refer to Fig. 5).
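Expressions (1)-(3) can be written down directly. In this sketch, granules are (l, u) corner pairs and the helper names are ours:

```python
def lp_dist(x, y, p=2):
    """L_p distance between two numeric vectors (p = 2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def d(A, B, p=2):
    """Eq. (1): average of the corner-to-corner distances."""
    (lA, uA), (lB, uB) = A, B
    return (lp_dist(lB, lA, p) + lp_dist(uB, uA, p)) / 2

def volume_of_merge(A, B):
    """Eqs. (2)-(3): volume of the smallest hyperbox embracing A and B."""
    (lA, uA), (lB, uB) = A, B
    v = 1.0
    for i in range(len(lA)):
        v *= max(uA[i], uB[i]) - min(lA[i], lB[i])
    return v
```

For two unit squares [0, 1] x [0, 1] and [2, 3] x [0, 1], the corner distances are both 2, so d(A, B) = 2, while the embracing box [0, 3] x [0, 1] has volume 3.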
Fig. 5. Information granule C as a result of combining A and B.
The two expressions (1)-(2) are the elements of the compatibility measure, compat(A, B), defined as

    compat(A, B) = 1 − d(A, B) e^(−αV(C)).    (4)
The rationale behind this form of the compatibility measure is as follows. In clustering we aggregate the two information granules that are closest together, i.e., whose compatibility measure is highest. In light of this criterion, the candidate granules to be clustered should not only be "close" enough (which is reflected by the distance component) but the resulting granule should also be "compact" (meaning that the size of the granule in every dimension is approximately equal). The second requirement favors such A and B that give rise to a maximum volume for a given d(A, B); in other words, it stipulates the formation of hyperboxes that are as similar to hypercubes as possible. The exponential term in this expression normalizes all values to the unit interval. In particular, the volume of a point produces e^0 = 1. When the volume increases, the exponential function goes to zero. The parameter α balances the two concerns in the compatibility measure and is chosen so as to control the extent to which the volume impacts the compatibility measure. The compactness factor e^{-αV(C)} introduced in the compatibility measure is critical to the overall processing of information granules. By contrast, it is not essential and does not play any role when we cluster point-size data instead of granules. To constrain the values of the compatibility measure to the unit interval, we consider the data to lie in the unit hypercube [0, 1]^n ⊂ R^n (in other words, we normalize the data before computing the value of (4)) and consider a normalized distance assuming values in the unit interval.
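Putting the pieces together, the compatibility measure (4) can be sketched as below; the names are again illustrative and α is the balancing parameter discussed in the text.

```python
import math

# Hedged sketch of Eq. (4): compat(A, B) = 1 - d(A, B) * exp(-alpha * V(C)),
# with a granule stored as a pair (l, u) of corner vectors and the data
# assumed normalized to the unit hypercube.

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def compat(A, B, alpha=1.0):
    (lA, uA), (lB, uB) = A, B
    d = 0.5 * (euclid(lA, lB) + euclid(uA, uB))   # Eq. (1) with p = 2
    lC = [min(a, b) for a, b in zip(lA, lB)]      # merged granule C
    uC = [max(a, b) for a, b in zip(uA, uB)]
    vC = 1.0
    for lo, hi in zip(lC, uC):
        vC *= hi - lo                             # Eq. (2)
    return 1.0 - d * math.exp(-alpha * vC)
```

Two coincident points yield compat = 1; for a fixed distance, the measure is higher when the merged box is closer to a square, which is exactly the preference analyzed below.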
From Numbers to Information Granules
To gain a better insight into what is really accomplished by the above compatibility measure, let us study two points (numeric values) A and B situated in R^2. Furthermore, let A be fixed and located at the origin of the coordinates, while we allow B some flexibility. Here d(A, B) is just the standard Euclidean distance. It becomes obvious that all elements (Bs) located on a circle of a fixed radius exhibit the same distance value. Restrict now the choice of B to this pool. If we connect A with any such B, the resulting volume changes depending upon the location of B. Interestingly, out of all Bs, there are four locations on the circle for which the volume of the resulting granule attains its maximum. This happens when that box (the information granule formed by clustering A and B) is a square. In other words, the compatibility measure attains a maximal value when C is a hypercube.
Fig. 6. The calculations of the compatibility measure; note that there are four possible candidates (Bs) on the circle that maximize this measure.
If we plot the compatibility measure as a function of φ (where φ is the angular position of B), we can easily see that the value of the compatibility measure is modulated by the angle (or, equivalently, the shape of the resulting information granule C); see Fig. 7. More importantly, these graphical considerations shed light on the geometry of the information granules that are preferred by the introduced compatibility measure. This preference reflects a principle that may be termed the principle of balanced information granularity. In a nutshell, in building new information granules, we prefer entities whose granularity is balanced along all dimensions (variables) rather than constructing
Fig. 7. The compatibility measure expressed as a function of φ (the plot is restricted to the first 90°); β = e^{-α/2}.
Fig. 8. Examples of information granules characterized by various degrees of balance of information granularity; note that C1 and C2 are highly unbalanced (they have high levels of information specificity along only one of the dimensions) while C3 is well balanced.
Fig. 9. Identification of Bs leading to the highest value of the compatibility measure calculated with the Euclidean distance (a), Hamming distance (b) and Tchebyschev distance (c); the contours for α > 0 and α = 0 are shown in each case.
granules that are highly unbalanced. A number of selected examples of varying granularity are portrayed in Fig. 8. When we change the distance function to the Hamming (p = 1) or Tchebyschev (p = ∞) distance, we still have a number of Bs to choose from, yet this selection is made from different geometrical figures (a diamond and a square, respectively), Fig. 9. Moving on to the case where both A and B are information granules, the resulting plots visualizing the compatibility measure are collected in Fig. 10.
(a) Two hyperboxes representing information granules in a unit box in R^2.

(b) Compatibility measure with the L2 distance measure.
(c) Compatibility measure with the L1 distance measure.

(d) Compatibility measure with the L∞ distance measure.

Fig. 10. Comparison of compatibility measures obtained with various distance measures. Note the preference that the compatibility measure gives to hyperboxes that are well balanced in all dimensions. This contrasts with the membership function proposed in Ref. 15 and illustrated in Fig. 4.
As the clustering proceeds (refer to Fig. 3), the process of merging progressively less closely associated patterns is reflected in the gradual reduction of the compatibility measure (4). A typical plot of the evolution of the compatibility measure over the complete clustering cycle is shown in Fig. 11. The proximity of patterns that are merged into granules at the early stages of the clustering process is reflected in the relatively small gradient of the compatibility measure curve. In contrast, the large gradient of the curve, at the final stages of clustering, indicates the merging of highly incompatible clusters. The compatibility measure curve therefore provides
Fig. 11. An example of the evolution of the compatibility measure over the full cycle of the clustering process (p is the initial number of patterns); the intersection of the two gradient lines indicates the optimal number of clusters.
a convenient reference for identifying how many clusters are needed to capture the essential characteristics of the input data, while providing the best generalization. The intersection of the two gradient lines (as visualized in Fig. 11) can be used as an approximation to the optimal number of clusters. This number provides a good starting point in the subsequent optimization of the overlap of the identified clusters, as discussed below. Referring to the compatibility index, we can also consider a modified form based on the sum of the sides (edges) of the hyperbox:

compat(A, B) = 1 - d(A, B) e^{-αL(C)}
(5)
with

L(C) = Σ_{i=1}^{n} length_i(C).    (6)
Considering the nature of these indexes, we refer to the first index (4) as volume-driven and to the second (5) as edge-driven. To compare the two forms of the compatibility index, we consider a simple two-dimensional case in which both A and X are numeric. We allow X to move on a unit circle while A is located at the origin of the coordinates (see Fig. 12). In this way the distance is always equal to 1 and the compatibility can be expressed as a function of a single angle φ: for the volume-driven version compat(A, X) = 1 - e^{-sin φ cos φ}, and for the edge-driven version compat(A, X) = 1 - e^{-(sin φ + cos φ)}. The plots of the compatibility measures
Fig. 12. Computing the compatibility measure for A and X, expressed as a function of φ.
Fig. 13. Compatibility measures as a function of φ: (a) volume-driven, (b) edge-driven; φ is constrained to [0, π/2].
are shown in Fig. 13. It becomes obvious that the highest compatibility value is achieved for the same value of the angle, i.e., φ = π/4. The two compatibility measures exhibit a visible difference when we look at their sensitivity, defined as

sens(φ) = ∂compat(A, X)/∂φ.

Figure 14 reveals that the compatibility based on the volume of the information granule has higher sensitivity.
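The two closed forms are easy to check numerically. The sketch below (our own, with the parameter α absorbed into the exponent) scans φ over [0, π/2] and confirms that both variants peak at φ = π/4.

```python
import math

# Volume-driven and edge-driven compatibility for A at the origin and X on
# the unit circle, so that d(A, X) = 1 (see Fig. 12).
def compat_volume(phi):
    return 1.0 - math.exp(-math.sin(phi) * math.cos(phi))

def compat_edge(phi):
    return 1.0 - math.exp(-(math.sin(phi) + math.cos(phi)))

# Scan the first quadrant on a regular grid that contains pi/4 exactly.
phis = [i * (math.pi / 2) / 100 for i in range(101)]
best_vol = max(phis, key=compat_volume)
best_edge = max(phis, key=compat_edge)
# Both maxima sit at phi = pi/4, i.e. a square granule.
```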
Fig. 14. Sensitivity of the compatibility measures as a function of φ: (a) volume-driven, (b) edge-driven; φ is constrained to [0, π/2].
4.2. Expressing Inclusion and Overlap of Information Granules
The inclusion relation expressing the extent to which A is included in B is defined as a ratio of two volumes:

incl(A, B) = V(A ∩ B)/V(A).
(7)
It is clear from the above that the inclusion measure is monotonic and non-commutative, and that it satisfies the boundary conditions incl(A, X) = 1 and incl(A, ∅) = 0, where X and ∅ are the unit hyperbox and the empty set in R^n, respectively. The calculations are straightforward; Fig. 15 enumerates all cases for one-dimensional granules along with the pertinent values of this measure.
Fig. 15. Computing the inclusion for two information granules A and B; the one-dimensional cases range from incl(A, B) = 1 to incl(A, B) = 0.
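A sketch of the inclusion measure (7) for hyperboxes stored as (lower, upper) corner pairs follows; the names are ours, and the guard for a degenerate (zero-volume) A is an implementation choice, not from the chapter.

```python
def box_volume(l, u):
    """Volume of a hyperbox; empty (inverted) intersections clip to zero."""
    v = 1.0
    for lo, hi in zip(l, u):
        v *= max(hi - lo, 0.0)
    return v

def incl(A, B):
    """Eq. (7): incl(A, B) = V(A ∩ B) / V(A)."""
    (lA, uA), (lB, uB) = A, B
    li = [max(a, b) for a, b in zip(lA, lB)]   # intersection, lower corner
    ui = [min(a, b) for a, b in zip(uA, uB)]   # intersection, upper corner
    vA = box_volume(lA, uA)
    return box_volume(li, ui) / vA if vA > 0 else 0.0
```

The boundary conditions of the text hold: incl(A, X) = 1 for the unit hyperbox X, and incl(A, B) = 0 for disjoint boxes.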
It is worth mentioning that the value of the inclusion measure drops rapidly (at a rate of a^n, where a ∈ (0, 1)) with increasing dimension of the feature space. For example, if there is a 50% overlap (a = 0.5) in each variable of an n-dimensional space, the inclusion level is expressed as 0.5^n. Clearly, the objective of effective information abstraction through clustering of information granules translates into identifying granules for which there is minimum overlap. To encourage the merging of granules that have significant overlap, we calculate the average of the maximum inclusion rates of each granule in every other granule:

overlap(c) = (1/c) Σ_{i=1}^{c} max_{j≠i} incl(A(i), A(j))    (8)
where c is the current number of granules and A(i) and A(j) are the i-th and j-th granules, respectively. However, we must point out that, while the measure (7) is monotonic for any two pairs of granules (i.e., if A ⊂ B and C ⊂ D, then incl(A, C) ≤ incl(B, D)), the change of the number and size of granules during clustering results in various local optima of (8). We illustrate this effect in Fig. 16.
Fig. 16. Progression from 5 to 2 granules involves stage (b), during which granules overlap. This is reflected in overlap(3) > 0 while overlap(5) = 0 and overlap(2) = 0; panel (d) plots the local minima of the overlap between granules against the number of granules.
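Equation (8) can be sketched on top of an incl() of the kind used for Eq. (7); all names below are illustrative.

```python
def box_volume(l, u):
    """Volume of a hyperbox; empty intersections clip to zero."""
    v = 1.0
    for lo, hi in zip(l, u):
        v *= max(hi - lo, 0.0)
    return v

def incl(A, B):
    """Eq. (7): incl(A, B) = V(A ∩ B) / V(A)."""
    (lA, uA), (lB, uB) = A, B
    li = [max(a, b) for a, b in zip(lA, lB)]
    ui = [min(a, b) for a, b in zip(uA, uB)]
    vA = box_volume(lA, uA)
    return box_volume(li, ui) / vA if vA > 0 else 0.0

def overlap(granules):
    """Eq. (8): average of each granule's maximum inclusion in any other."""
    c = len(granules)
    total = 0.0
    for i, A in enumerate(granules):
        total += max(incl(A, B) for j, B in enumerate(granules) if j != i)
    return total / c
```

Disjoint granules give overlap = 0, mirroring the overlap(5) = 0 and overlap(2) = 0 stages of Fig. 16.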
Because of the local minima of the overlap(.) function, it is important to have a good initial estimate of the target number of clusters as a starting point for the local minimization of this function. Such an estimate is provided by our earlier analysis of the compatibility measure, as discussed in the previous section. Having completed clustering, the quality of data abstraction afforded by the given set of clusters is measured using an independent validation data set. The generality of each cluster is quantified by the sum of the inclusion rates of the validation data items in the respective cluster:

INCL(i) = Σ_{j=1}^{M} incl(V(j), A(i)),   i = 1, ..., c    (9)

where c is the number of clusters, M is the cardinality of the validation data set, V(j) are the validation patterns and A(i) are the clusters. As well as indicating whether a given cluster is representative of a large proportion of the data, the INCL(.) measure can be used to assess how representative the training and the validation data sets are. If the sets are representative, then INCL(.) should correlate closely with the cardinality of the individual clusters.
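The validation-stage measure (9) is then a straightforward sum. The sketch below reuses the incl() style of Eq. (7) and treats validation items as (small) hyperboxes; the names are illustrative.

```python
def box_volume(l, u):
    v = 1.0
    for lo, hi in zip(l, u):
        v *= max(hi - lo, 0.0)
    return v

def incl(A, B):
    """Eq. (7): incl(A, B) = V(A ∩ B) / V(A)."""
    (lA, uA), (lB, uB) = A, B
    li = [max(a, b) for a, b in zip(lA, lB)]
    ui = [min(a, b) for a, b in zip(uA, uB)]
    vA = box_volume(lA, uA)
    return box_volume(li, ui) / vA if vA > 0 else 0.0

def INCL(cluster, validation):
    """Eq. (9): summed inclusion of the validation granules V(j) in A(i)."""
    return sum(incl(V, cluster) for V in validation)
```

A validation granule entirely inside the cluster contributes 1 to the sum; one that straddles the cluster boundary contributes its included fraction.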
5. The Granular Analysis

The hyperboxes constructed during the design phase are helpful in a thorough analysis of the data set. They shed light on the nature of the data as perceived from the standpoint of information granularity. We discuss two aspects of this analysis. First, we characterize the hyperboxes themselves. Second, we analyze the properties of the variables (features) forming the data space. We should emphasize that the granular analysis follows the clustering phase and does not impact it in any way. To maintain the conciseness of the presentation, we consider that each of the c hyperboxes located in the n-dimensional space is fully described by the vectors of its lower and upper corners (coordinates), B(k) = {l_B(k), u_B(k)}, k = 1, 2, ..., c, where l_B(k) and u_B(k) are vectors of the corresponding coordinates, that is l_B(k) = [l_B1(k), l_B2(k), ..., l_Bn(k)] and u_B(k) = [u_B1(k), u_B2(k), ..., u_Bn(k)].
5.1. Characterization of Hyperboxes
The most evident characterization of the hyperboxes is provided by their volumes, V(B(k)). The computations are straightforward. First, we determine a ratio (normalized length)

norm_length_i(B(k)) = (u_Bi(k) - l_Bi(k)) / range_i(B(k))    (10)

where range_i(B(k)) is the range of the i-th feature (variable). Since the data is normalized to a unit hypercube, range_i(B(k)) = 1 for all i. Second, the volume is taken as the product

V(B(k)) = ∏_{i=1}^{n} norm_length_i(B(k)).    (11)
The volume quantifies the granularity of the hyperboxes. Intuitively, it states how "large" (detailed) the hyperboxes are and how much detail each captures. One can take the average of the volumes of the hyperboxes, which gives a general summary:

(1/c) Σ_{k=1}^{c} V(B(k)).    (12)
If one side of the hyperbox has zero length, then the volume measure returns zero. This occurs because of the multiplicative nature of the volume. To alleviate this problem, we may also introduce an additive measure. A plausible descriptor of a hyperbox reflects its "circumference" and reads as follows:

Σ_{i=1}^{n} norm_length_i(B(k)).    (13)
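For data normalized to the unit hypercube (range_i = 1), the descriptors (10)-(13) reduce to the following sketch; the function names are ours.

```python
def norm_lengths(box):
    """Eq. (10) with range_i = 1: the plain side lengths of the box."""
    l, u = box
    return [hi - lo for lo, hi in zip(l, u)]

def box_volume(box):
    """Eq. (11): multiplicative descriptor."""
    v = 1.0
    for s in norm_lengths(box):
        v *= s
    return v

def avg_volume(boxes):
    """Eq. (12): average volume over the c hyperboxes."""
    return sum(box_volume(b) for b in boxes) / len(boxes)

def circumference(box):
    """Eq. (13): additive descriptor; non-zero even for flat boxes."""
    return sum(norm_lengths(box))
```

A flat box such as ([0, 0], [0.5, 0]) has zero volume but circumference 0.5, which is exactly the failure mode the additive measure (13) addresses.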
5.2. Granular Feature Analysis
The granulation of the data space (and of each feature) provides an interesting insight into the nature of the variables occurring in the problem. In what follows, we describe the variables in terms of their sparsity and discriminative power. These two descriptors are implied by the granular nature of the hyperboxes.
5.2.1. Sparsity

When looking at a certain variable across the hyperboxes, we can visualize how much of the entire range of the variable is occupied by the hyperboxes (i.e., how sparse the boxes are in the given space). Take the i-th feature and calculate the sum of the lengths of the corresponding sides of the hyperboxes

tot_length_i = Σ_{k=1}^{c} length_i(B(k))    (14)

where length_i(B(k)) = u_Bi(k) - l_Bi(k) and i = 1, 2, ..., n. The sparsity, defined in the form

sparsity_i = tot_length_i / range_i,    (15)
assumes values in the unit interval. If sparsity_i is less than 1, this represents a situation where all hyperboxes (more precisely, their i-th coordinates) occupy only a portion of the entire range of the feature. We may say that the variable is "underutilized"; in other words, we witness a highly localized usage of this feature. A sparsity value near 1 means a complete utilization of the variable. Overutilization happens when the sparsity achieves values higher than 1 (in this case some hyperboxes overlap along this variable). The sparsity measure does not capture the entire picture. The situation illustrated in Fig. 17 shows two cases where the distribution of the hyperboxes along the given feature is very different, yet we end up with the same value of the sparsity. This leads us to another index that describes the overlap between the hyperboxes.
Fig. 17. Two different distributions of hyperboxes (i-th feature) producing the same value of the sparsity index; in both cases the sparsity is equal to 0.3.
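A sketch of Eqs. (14)-(15); with normalized data, range_i = 1, and the division is shown explicitly even though it equals 1 here. The name is illustrative.

```python
def sparsity(boxes, i, range_i=1.0):
    """Eqs. (14)-(15): summed i-th side lengths over the feature's range."""
    tot_length = sum(u[i] - l[i] for l, u in boxes)    # Eq. (14)
    return tot_length / range_i                        # Eq. (15)
```

Two very different one-dimensional layouts whose side lengths sum to 0.3 both give sparsity 0.3, matching the value quoted in the caption of Fig. 17 and showing why sparsity alone cannot distinguish them.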
5.2.2. Overlap Index

We define the following index, called the coordinate overlap:

c_overlap_i = (2/(c(c-1))) Σ_{k=1}^{c-1} Σ_{l=k+1}^{c} length_i(I(k) ∩ I(l)) / length_i(I(k) ∪ I(l))    (16)
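One way to sketch a coordinate-overlap index of this kind is shown below; the pairwise intersection-over-union of the i-th sides follows Eq. (16), while the averaging factor 2/(c(c-1)) and the names are our assumptions.

```python
def interval_iou(a, b):
    """length(I(k) ∩ I(l)) / length(I(k) ∪ I(l)) for intervals (lo, hi)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def c_overlap(intervals):
    """Eq. (16): pairwise-averaged coordinate overlap along one variable."""
    c = len(intervals)
    total = sum(interval_iou(intervals[k], intervals[l])
                for k in range(c) for l in range(k + 1, c))
    return 2.0 * total / (c * (c - 1))
```

Pairwise-disjoint intervals give 0 (a highly discriminative feature); identical intervals give 1.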
for i = 1, 2, ..., n. In this definition, I(k) and I(l) are the intervals (sides) of the hyperboxes for the i-th variable. The higher the value of this index, the more the hyperboxes overlap when projected on the given variable. When the I(k) and I(l) are pairwise disjoint, the overlap is equal to zero. This means that the feature is highly discriminative, as it separates the hyperboxes. The higher the overlap measure, the lower the discriminative power of the feature. Each of the measures leads to a linear ordering of the features. We can easily state which of the features is highly "utilized" and which of them comes with the most significant discriminative properties. To form a comprehensive picture, one can localize each feature in the sparsity-overlap space. By doing this, one can distinguish the variables that are essential to the PR problem. More specifically, we prefer features that exhibit low overlap (as those come with strong discriminative properties) along with low values of sparsity (localized usage of the variable). It should be stressed that these descriptors (sparsity and overlap) emerge as important quantifiers because of the existence of the information granules forming the hyperboxes.

6. Experimental Studies

The series of experiments is aimed at visualizing the most essential features of granular clustering. We consider both a synthetic data set and a real-life data set available on the WWW (Boston housing data).

6.1. Synthetic Data

The synthetic data sets consist of 3 groups of information granules (hyperboxes), A_i ∈ [0, 1] × [0, 1], generated by a random number generator with a uniform distribution. Each group comprises 20 granules dispersed around the pre-defined points c_1 = [0.4, 0.4], c_2 = [0.5, 0.6] and c_3 = [0.8, 0.3]. The dispersion factor σ is varied between 0.08 and 0.15 to establish the sensitivity of the clustering process to the dispersion of the data. The clustering
Fig. 18. Compatibility measure for a single clustering process.
process is governed by the compatibility measure (4), with the distance defined according to the L2 norm and the "compactness" factor α = 0.5. An example of the evolution of the compatibility measure throughout the clustering process is shown in Fig. 18. The intersection of the two asymptotes to the compatibility measure, traced at the beginning and at the end of the clustering process, indicates that 3 clusters (iteration 57) mark a natural 'change-over' point in the behavior of the system. So, the clustering process should terminate with 3 clusters, provided that the degree of overlap of the clusters is also minimized for this number of clusters. The degree of overlap of the clusters was evaluated at each of the 59 iterative steps of the clustering process, according to equation (16), and is depicted in Fig. 19. As expected, the results of the cluster overlap analysis confirm that the test data naturally falls into 3 clusters, since the overlap function assumes a local minimum for 4 or fewer clusters. The quality of data abstraction achieved through clustering is assessed by evaluating the inclusion rate (9) of independently generated data (with the same statistical properties) in the clusters that have been identified. The change of the overall inclusion rate of the validation data throughout the clustering process is illustrated in Fig. 20. It is not surprising to see that the high value of the average inclusion rate for 3 or fewer clusters confirms that 3 clusters capture the essential features of the data while the
Fig. 19. An average degree of overlap of clusters.
Fig. 20. Average inclusion rate for the validation data set.
high value of the compatibility measure confirms that the clusters retain high specificity. Should the number of clusters be reduced to 2 or 1, the inclusion rate of the validation data set would only be improved marginally while there would be a very significant reduction in the specificity of the cluster(s).
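The agglomerative process these experiments follow (merge the most compatible pair, record the compatibility, repeat) can be sketched as below. The function names, the target-count stopping rule and α = 0.5 are ours; the L2 distance and Eq. (4) follow the experimental setup.

```python
import math

# Minimal sketch of granular agglomerative clustering driven by Eq. (4).
# A granule is a pair (l, u) of corner vectors.

def dist(A, B):
    (lA, uA), (lB, uB) = A, B
    d = lambda x, y: math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 0.5 * (d(lA, lB) + d(uA, uB))                 # Eq. (1), p = 2

def merge(A, B):
    (lA, uA), (lB, uB) = A, B
    return ([min(a, b) for a, b in zip(lA, lB)],
            [max(a, b) for a, b in zip(uA, uB)])

def volume(C):
    l, u = C
    v = 1.0
    for lo, hi in zip(l, u):
        v *= hi - lo
    return v

def compat(A, B, alpha=0.5):
    return 1.0 - dist(A, B) * math.exp(-alpha * volume(merge(A, B)))

def granular_clustering(granules, target, alpha=0.5):
    granules = list(granules)
    trace = []                 # compatibility at each merge (cf. Fig. 11/18)
    while len(granules) > target:
        pairs = [(compat(granules[i], granules[j], alpha), i, j)
                 for i in range(len(granules))
                 for j in range(i + 1, len(granules))]
        best, i, j = max(pairs)
        trace.append(best)
        merged = merge(granules[i], granules[j])
        granules = [g for k, g in enumerate(granules) if k not in (i, j)]
        granules.append(merged)
    return granules, trace
```

Run on two well-separated groups of points, the loop merges within each group first, so the recorded trace mimics the slow initial decline and late drop of the compatibility curve.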
In order to achieve a degree of independence from the statistical characteristics of the random number generator, the evaluation of the inclusion of the validation data sets in the clusters was repeated 100 times for each value of σ ∈ {0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15} and for the number of clusters varying from 1 to 10. A total of 8000 training sets and 8000 validation sets were processed. Figure 21 illustrates how the inclusion measure (7) depends on the data dispersion parameter σ and the number of
Fig. 21. Average inclusion measure evaluated for 8000 training and validation sets.
Fig. 22. 2-D projection of the surface from Figure 21, resulting in a family of curves illustrating the average inclusion rates of the validation data in clusters for the various values of σ.
Fig. 23. Edge-driven compatibility measure.
Fig. 24. Average degree of overlap of clusters throughout the clustering process.
clusters. It is interesting to note that σ has little influence on the value of the inclusion measure. This is a very desirable characteristic of the clustering process, since it suggests that the precise statistical properties of the data sets do not need to be known for the clustering to be effective. It is easy to note, from Figs. 21 and 22, that an inclusion rate of 0.9 or higher is consistently attained with 3 or fewer clusters.
Fig. 25. Average inclusion rate of the validation data in the clusters.
We now assess the progression of the clustering process using the edge-based compatibility measure defined by (5). Figure 23 illustrates a typical evolution of the compatibility measure. As expected, the asymptotic change of character of this function occurs at iteration 57, indicating that there are 3 significant clusters. The results illustrated in Figs. 23-25 are directly comparable to those obtained in the earlier experiments. The asymptotic behaviour of the compatibility measure (Figs. 18 and 23) is nearly identical, and the only noticeable difference in the progression of clustering occurs at the intermediate stages.

6.2. Boston Housing Data
Although for 2-dimensional data sets B ∈ P(R^2) the number of clusters can easily be established by visual inspection, higher-dimensional data presents a significant challenge. We have therefore applied the algorithm to a realistic 14-dimensional data set representing factors affecting house prices in the Boston area (USA). The data set was originally compiled by Harrison and Rubinfeld [6] and is available from the Machine Learning Database at the University of California at Irvine (http://www.ics.uci.edu/~mlearn/MLSummary.html). The data set comprises 506 records.
The 14 attributes of each data record are as follows:

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centres
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per $10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
13. LSTAT: % lower status of the population
14. MEDV: median value of owner-occupied homes in $1000's
6.2.1. Study A

We divided the original set into two sets: the training set, comprising the 253 odd-numbered records, and the validation set, comprising the 253 even-numbered records. It should be noted that, as a pre-processing step, all data has been mapped into a 14-dimensional unit hyperbox. The compatibility measure provided direction for the clustering process, and the evolution of this measure throughout the whole process is presented in Fig. 26. The gradients of the compatibility measure at the beginning and the end of the process indicate that 7 clusters represent a good abstraction of the training data. In the vicinity of 7 clusters, the cluster overlap indicator is minimized for 7 and 8 clusters, as shown in Fig. 27. Of these two possible numbers of clusters we select the smaller one, so as to achieve greater granulation of the original data. The generality of the identified clusters was tested by evaluating the average inclusion of the validation data set (the even-numbered records of the original data set) in the sets of clusters identified in the last 50 steps of the clustering process. This is illustrated in Fig. 28. The value of over 90%, achieved for 7 clusters, indicates a good abstraction of the data.
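The Study A pre-processing (odd/even split of the 506 records and mapping into the unit hyperbox) can be sketched as follows; loading of the actual Boston data is omitted and the helper names are ours.

```python
def unit_normalize(records):
    """Map each of the n features linearly onto [0, 1] (min-max scaling)."""
    n = len(records[0])
    lo = [min(r[i] for r in records) for i in range(n)]
    hi = [max(r[i] for r in records) for i in range(n)]
    return [[(r[i] - lo[i]) / (hi[i] - lo[i]) if hi[i] > lo[i] else 0.0
             for i in range(n)] for r in records]

def odd_even_split(records):
    """Records 1, 3, 5, ... (1-indexed) for training; 2, 4, 6, ... for validation."""
    return records[0::2], records[1::2]
```

Applied to 506 records, the split yields the 253/253 partition used in the study.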
Fig. 26. Compatibility measure of clusters formed from the odd-numbered records in the Boston housing data set. Iteration no. 245 corresponds to 7 clusters.
Fig. 27. Degree of average overlap of clusters in the last 50 out of 252 iterations.
To gain a more detailed insight into the makeup of the 7 clusters, we evaluated the aggregate inclusion measure (9) using the validation set, and compared the results with the cardinality of each cluster. It is clear, from Fig. 29, that 3 of the 7 clusters have significant support in the two data sets, while the other 4 clusters represent data that could be described as significant exceptions. It is interesting to note, however, that the zero inclusion
Fig. 28. Average inclusion measure evaluated for 1 to 50 clusters.
Fig. 29. Cardinality (first bar) and the aggregate inclusion rate (second bar) for each of the 7 clusters.
rates of the validation data in clusters 3, 4 and 7 indicate that the small data sample makes it difficult to perform a proper evaluation of the clusters. The full description of the identified clusters is given in Table 1. The results of our feature analysis are summarized, in terms of sparsity and overlap values, in Table 2. This analysis provides an interesting observation about the discriminatory properties of the variables in the problem. The most dominant ones
Table 1. Description of the 7 clusters (each cell gives l_i/u_i, where l_i is the minimum and u_i the maximum coordinate of the i-th variable of the hyperbox).

Var  Cluster 1          Cluster 2          Cluster 3          Cluster 4          Cluster 5          Cluster 6          Cluster 7
 1   0.0063/0.0063      0.0686/2.7795      1.1265/3.3213      2.0099/2.0099      3.4744/8.9834      2.3783/73.5337     88.9762/88.9762
 2   0/0                0/0                0/0                0/0                0/0                0/0                0/0
 3   0.7399/0.7399      8.1399/27.7400     19.5800/19.5800    19.5800/19.5800    18.1001/18.1001    18.1001/18.1001    18.1001/18.1001
 4   0/0                0/0                1/1                0/0                1/1                0/0                0/0
 5   0.3850/0.3850      0.5200/0.8710      0.8710/0.8710      0.6050/0.6050      0.6310/0.7700      0.5320/0.7700      0.6710/0.6710
 6   4.9730/4.9730      4.9030/6.4580      5.0120/6.1290      7.9290/7.9290      5.8750/8.7800      4.1380/7.0610      6.9680/6.9680
 7   6.0004/6.0004      69.6999/100.0000   88.0004/100.0000   96.2005/96.2005    82.8997/97.4997    41.9002/100.0000   91.8999/91.8999
 8   1.7984/10.7103     1.3459/3.9900      1.3216/1.7494      2.0459/2.0459      1.1296/2.7227      1.1370/3.7240      1.4165/1.4165
 9   1.0000/8.0000      2.0000/4.9999      4.9999/4.9999      4.9999/4.9999      24.0000/24.0000    24.0000/24.0000    24.0000/24.0000
10   192.9998/469.0011  188.0008/711.0000  402.9980/402.9980  402.9980/402.9980  665.9989/665.9989  665.9989/665.9989  665.9989/665.9989
11   12.6000/22.0000    14.7000/21.2000    14.7000/14.7000    14.7000/14.7000    20.2000/20.2000    20.2000/20.2000    20.2000/20.2000
12   288.9906/396.9000  70.8002/396.9000   321.0184/396.9000  369.2980/369.2980  347.8787/395.4287  0.3200/396.9000    396.9000/396.9000
13   1.9199/30.8101     6.4300/29.6801     12.1200/26.8200    3.7000/3.7000      2.9600/17.5999     3.2601/37.9700     17.2099/17.2099
14   12.7000/50.0000    8.1000/24.3000     13.4002/17.0002    50.0000/50.0000    5.0000/17.7998     50.0000/50.0000    10.4000/10.4000
Table 2. Characterisation of the 7 clusters in terms of sparsity and c-overlap for each of the 14 variables (dimensions).

Variable no.   sparsity   c-overlap
 1             0.135      0.1826
 2             0.136      0.7143
 3             0.201      0.2194
 4             0.143      0.3333
 5             0.291      0.1933
 6             0.326      0.3255
 7             0.307      0.3759
 8             0.210      0.3397
 9             0.062      0.2109
10             0.218      0.2381
11             0.241      0.2234
12             0.344      0.4357
13             0.458      0.4399
14             0.426      0.3759
are: crime rate (1), nitric oxide concentration (5), index of accessibility to radial highways (9), and proportion of non-retail business acres (3). In other words, these are the variables that discriminate between the hyperboxes (we stress that these discriminatory aspects were found in the setting of the information granules, rather than classes).

6.2.2. Study B

In order to ascertain whether the selection of records for the training and the validation data sets had significantly influenced the conclusions regarding the number of clusters in the original data set, we repeated the clustering process with the training and validation sets swapped. Again, the compatibility measure directed the clustering process, and the asymptotic evolution of the measure, at the initial and final stages of the process, indicated that 6 data clusters mark a 'change-over' point in the clustering process (Fig. 30). The curve showing the average degree of overlap between the clusters, illustrated in Fig. 31, indicates that a minimum overlap is achieved with 6, 7 and 8 clusters. For ease of comparison with Study A, we select 7 clusters for the validation stage. The average inclusion rate of the
Fig. 30. Compatibility measure of clusters formed from the even-numbered records in the Boston housing data set. Iteration no. 246 corresponds to 6 clusters.
Fig. 31. Degree of average overlap of clusters in the last 50 iterations.
validation data set (the odd-numbered records of the original data set) in the 7 clusters is slightly worse than in the previous case, averaging 86%. This is illustrated in Fig. 32. The reduction of the average inclusion rate in this case suggests that the training and validation sets contain a small number of unique patterns that do not have counterparts in the other set.
Fig. 32. Inclusion measure evaluated for 1 to 50 clusters.
Fig. 33. Cardinality (first bar) and the aggregate inclusion rate (second bar) for each of the 7 clusters.
The result is that, although the distinctiveness of these patterns warrants their inclusion in separate clusters, the cross-comparison of these minority clusters is very limited. This is further verified by the inspection of Fig. 33, which shows that clusters 3, 5 and 7 represent 1, 1 and 2 patterns respectively, and have no corresponding patterns in the validation set. It is also interesting to note that, compared to Study A, there is a greater discrepancy between the cardinality of the clusters and the
inclusion rate. We conclude therefore that the size of the data supports firm conclusions about only 2 clusters, and that the characterization of further clusters requires an order of magnitude larger data sample. The sparsity and c-overlap of the features (variables), given in Table 3, are very similar to those in Study A, meaning that some global properties discovered in the data set have been retained.

Table 3. Characterisation of the 7 clusters in terms of sparsity and c-overlap for each of the 14 variables (dimensions).

Variable no.   sparsity   c-overlap
 1             0.117      0.1414
 2             0.229      0.2667
 3             0.284      0.1432
 4             0.143      0.5238
 5             0.258      0.0985
 6             0.348      0.3391
 7             0.393      0.3144
 8             0.221      0.1674
 9             0.075      0.1769
10             0.155      0.1560
11             0.228      0.0760
12             0.303      0.3276
13             0.443      0.3762
14             0.412      0.3175
7. Conclusions

The study has articulated an alternative view of unsupervised pattern recognition by providing a constructive method of forming information granules that capture the essence of large collections of heterogeneous numeric data. In this sense, the original data are compressed down to a few information granules whose location in the data space and granularity reflect the structure in the data. The approach promotes data-driven problem solving by emphasizing the transparency of the results (hyperboxes). The formation of information granules is guided by two aspects: the distance between information granules, and the size (granularity) of the potential information granule formed through merging two other granules. These two aspects
are encapsulated in the form of the compatibility measure. Moreover, we discussed a number of indexes describing the hyperboxes and expressing relationships between such information granules. We showed how to validate the granular structure. The resulting family of information granules is a concise descriptor of the structure of the data — we may call them a granular signature of the data. Some further extensions of the hyperbox approach may deal with more detailed instruments of information granulation such as fuzzy sets.7,11 It should be stressed that the proposed approach to data analysis is noninvasive, meaning that we have not attempted to formulate specific assumptions about the distribution of the data but rather allow the data to "speak" freely. This is accomplished in two main ways:

• First, the hyperboxes are easily understood by a user, as each dimension (variable) comes as a part of the construct.

• Second, the approach finds relationships that are direction-free, meaning that we do not distinguish between input and output variables (which could be quite restrictive, as we may not know in advance what implies what). Obviously, this feature is common to all clustering methods.

Furthermore, the granulation mechanism puts the variables (features) existing in the problem in a new perspective. The two indexes, sparsity and overlap, are useful in understanding the relevance of the variables, in particular their discriminatory abilities. While the study was concerned with the development of information granules (hyperboxes), there are interesting inquiries into their use in granular modeling. In particular, we are concerned with the fundamental inference problem: given an input datum (an information granule or, in particular, a numeric datum) X defined in a certain subspace of dimension n' of the original space R^n (R^n' ⊂ R^n), and a collection of information granules B = {B(1), B(2), ..., B(c)}, determine the corresponding information granule Y. The current paper provides a basis for this investigation.

Acknowledgments

Support from the Engineering and Physical Sciences Research Council (UK), the Natural Sciences and Engineering Research Council of Canada
(NSERC) and the Alberta Consortium of Software Engineering (ASERC) is gratefully acknowledged.

References

1. A. Baraldi, P. Blonda, A survey of clustering algorithms for pattern recognition, IEEE Trans. on Syst. Man and Cybernetics: Part B, Vol. 29, 6, 1999, pp. 778-785.
2. A. Bargiela, Interval and ellipsoidal uncertainty models, in: W. Pedrycz (ed.), Granular Computing, Springer Verlag, 2001.
3. A. Bargiela, W. Pedrycz, Information granules: Aggregation and interpretation issues, submitted to IEEE Trans. on Syst. Man and Cybernetics.
4. J. C. Bezdek, J. M. Keller, R. Krishnapuram, N. R. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer, 1999.
5. B. Gabrys, A. Bargiela, General fuzzy min-max neural network for clustering and classification, IEEE Trans. on Neural Networks, Vol. 11, No. 3, pp. 769-783, 2000.
6. D. Harrison, D. L. Rubinfeld, Hedonic prices and the demand for clean air, J. Environ. Economics & Management, Vol. 5, pp. 81-102, 1978.
7. A. Kandel, Fuzzy Mathematical Techniques with Applications, Addison-Wesley, Reading, MA, 1986.
8. T. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics, 43, 1982, pp. 59-69.
9. T. Kohonen, Self-Organizing Maps, Springer Verlag, Berlin, 1995.
10. W. Pedrycz, Computational Intelligence: An Introduction, CRC Press, Boca Raton, FL, 1997.
11. W. Pedrycz, F. Gomide, An Introduction to Fuzzy Sets, MIT Press, Cambridge, MA, 1998.
12. W. Pedrycz, Fuzzy equalization in the construction of fuzzy sets, Fuzzy Sets and Systems, Vol. 119, 2, 2001, pp. 329-335.
13. W. Pedrycz, M. H. Smith, A. Bargiela, A granular signature of data, Proc. 19th Int. (IEEE) Conf. NAFIPS'2000, Atlanta, July 2000, pp. 69-73.
14. P. K. Simpson, Fuzzy min-max neural networks — Part 1: Classification, IEEE Trans. on Neural Networks, Vol. 3, No. 5, pp. 776-786, September 1992.
15. P. K. Simpson, Fuzzy min-max neural networks — Part 2: Clustering, IEEE Trans. on Neural Networks, Vol. 4, No. 1, pp. 32-45, February 1993.
16. L. A. Zadeh, Fuzzy sets and information granularity, in: M. M. Gupta, R. K. Ragade, R. R. Yager (eds.), Advances in Fuzzy Set Theory and Applications, North Holland, Amsterdam, 1979, pp. 3-18.
17. L. A. Zadeh, Fuzzy logic = Computing with words, IEEE Trans. on Fuzzy Systems, Vol. 4, 2, 1996, pp. 103-111.
18. L. A. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems, 90, 1997, pp. 111-117.
19. M. Meneganti, F. S. Saviello, R. Tagliaferri, Fuzzy neural networks for classification and detection of anomalies, IEEE Trans. on Neural Networks, Vol. 9, 5, 1998, pp. 848-861.
20. A. Joshi, N. Ramakrishman, E. N. Houstis, J. R. Rice, On neurobiological, neuro-fuzzy, machine learning, and statistical pattern recognition techniques, IEEE Trans. on Neural Networks, Vol. 8, 1, 1997, pp. 18-31.
21. L. I. Kuncheva, J. C. Bezdek, Presupervised and post-supervised prototype classifier design, IEEE Trans. on Neural Networks, Vol. 10, 5, 1999, pp. 1142-1152.
CHAPTER 5

COMBINATION OF HIDDEN MARKOV MODELS AND NEURAL NETWORKS FOR HYBRID STATISTICAL PATTERN RECOGNITION

Gerhard Rigoll
Department of Computer Science, Faculty of Electrical Engineering
Gerhard-Mercator-University Duisburg, Germany
E-mail: rigoll@fb9-ti.uni-duisburg.de
In this chapter, several methods for combining Hidden Markov Models (HMMs) and Neural Networks (NNs) for hybrid HMM/NN systems are presented. These systems can be used in a variety of pattern recognition applications that are most successfully handled by statistical recognition methods. Typical applications include speech and handwriting recognition, as well as face identification or similar computer vision problems. Hidden Markov Models as well as Neural Networks are both very powerful pattern recognition paradigms with different advantages and disadvantages, and therefore their combination seems to be an attractive option in order to improve current state-of-the-art pattern recognition systems. It turns out that the combination of HMMs and NNs is a challenging and interesting subject that leads to a variety of different approaches for the realization of hybrid systems. Besides architectural issues, which are concerned with finding the optimal structure for combining both paradigms, the joint estimation of system parameters - so that both paradigms can interact in an optimal way and support each other - is the major problem. The feasibility and success of hybrid systems are confirmed by numerous applications mentioned in this chapter, where hybrid systems have proved their superior performance compared to standard approaches.
1. Introduction

The technique of Hidden Markov Models (HMMs) has emerged as one of the dominant pattern recognition technologies since the late 1970's.9 This is especially the case for time-varying, dynamic patterns, such as in
speech recognition or handwriting recognition. Today, almost every successful speech recognition system - ranging from commercial products for telephone applications to highly advanced and optimized prototypes with vocabularies up to 100,000 words and millions of parameters - is based on HMM technology.16 In handwriting, a similarly increasing dominance of HMM-based systems could be observed during the last decade.13 Neural Networks (NNs) became very popular in the late 1980's,12 but it quickly became clear that pure neural approaches cannot cope with several of the most important requirements for state-of-the-art speech or handwriting recognition systems, especially the requirements of continuous recognition and very large vocabularies. Today, there are almost no successful speech or handwriting recognition systems that are based on pure neural network technology.16 From the early 1990's on, a third type of recognition architecture for dynamic patterns has been developed by a few researchers. This technology is the so-called hybrid technology, where HMMs and NNs are combined to construct a system with improved classification capabilities.25 The reasons for combining HMMs and NNs into hybrid systems are quite obvious. Hidden Markov Models represent probably the most powerful tool for recognition of dynamic, time-varying patterns, where warping in time or space is one of the most important requirements. Neural Networks are superior static pattern classifiers and function approximators, and are furthermore known for their discriminative capabilities. HMMs are mainly trained as non-discriminative classifiers with very time-efficient Maximum-Likelihood-based training methods, whereas the training methods for neural nets are known to be effective but rather time consuming. By combining both powerful paradigms, the superior warping capabilities of HMMs can be combined with the discriminative characteristics of NNs.
Thus, the resulting hybrid classifier will consist of one part that can model the section of the recognition task that requires discriminative training between the classes and another part that derives additional classification parameters through fast ML-based estimation methods. In order to design hybrid HMM/NN systems, two major problems have to be solved. The first problem concerns the architectural aspect, i.e. how the structure of the HMM (e.g. discrete or continuous) should be chosen in order to fit to the NN and what NN paradigm should be used. The second problem is how both components can best be trained together. An analysis of the mathematical foundations of both paradigms, and the mathematical model of the combined
hybrid system, is required. In this chapter, a variety of different hybrid approaches are presented, most of them developed by our research group. They include the classical hybrid HMM/NN approach using neural nets as posterior probability estimators, the combination of discrete HMMs with Maximum Mutual Information Neural Networks, the use of neural nets as nonlinear feature extractors for continuous density HMMs, and a recently developed approach called Tied-Posteriors, related to the popular classical Tied-Mixture HMMs. It will also be shown that in many cases, it is possible to present hybrid HMM/NN systems in the framework of a pure neural implementation, and that this consideration leads to the derivation of relationships between all the different hybrid systems as well as to their non-hybrid counterparts. The aim of this chapter is to present an overview of several different hybrid approaches and to give the reader a brief description of their functionalities. What is considered to be even more important is the establishment of relationships between these different approaches and relationships to the traditional HMM approaches. For most readers, a hybrid system may look and function completely differently from a conventional system based on HMMs. For people having worked on hybrid systems for a long time, the difference between hybrid and traditional systems may have almost vanished. One of the goals of this chapter is to show the reader the close relationships between both technologies, and to determine if hybrid recognition systems are a real alternative to traditional approaches. This chapter demonstrates that all popular statistical pattern recognition techniques are more or less related to each other, but that hybrid approaches are still an interesting alternative to traditional techniques, which will take an important role in future pattern recognition research.

2. Traditional HMM-based Systems

It is (of course) not the goal of this section to present a complete overview of HMM systems. Instead, only the equations that are of interest for the establishment of relationships to hybrid approaches later on are given. It is assumed that most readers are specialists in pattern recognition, and thus may be at least partially familiar with the basic functionalities of HMMs. Therefore, typical issues such as estimation techniques (e.g. the FB-algorithm) or decoding techniques (e.g. the Viterbi search) are not the major subject of this chapter. Figure 1 gives a very compact explanation of the basic functionality
Fig. 1. General structure of a Hidden Markov Model.
of a Hidden Markov Model. It can be seen that the HMM is in fact a stochastic finite state automaton with several states, usually denoted as s_i, i = 1, ..., I, where I is the total number of states. Transitions between states i and j are possible with certain transition probabilities a_ij, which are usually summarized in a state transition matrix A = {a_ij}. In addition to these transition probabilities, there is an emission probability assigned to each state, which yields the probability of an observed feature vector x - denoted as p(x|s_i) - while the HMM is in state s_i. In Fig. 1, it can also be seen that the HMM processes a pattern that is represented as a sequence of feature vectors below the HMM. This pattern, denoted as X, consists of the feature vectors x(k), k = 1, ..., K, where K is the length of the feature vector sequence X. The pattern recognition process basically consists of aligning this feature vector sequence to the HMM states, as can be seen in Fig. 1. This is usually accomplished by the Viterbi algorithm. Application of this algorithm does not only lead to such an alignment, but additionally also to the computation of the observation probability that the observed pattern X has been generated by the underlying Hidden Markov Model M. This probability p(X|M) is computed by the Viterbi algorithm by exploiting the transition and emission probabilities of the model M.
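As an illustration (not code from the chapter), the Viterbi alignment and the accompanying score can be sketched in Python; the 2-state left-to-right model, its transition matrix and the emission probabilities below are invented for the example, and a uniform initial state distribution is assumed:

```python
import numpy as np

def viterbi(log_A, log_emis):
    """Align a feature-vector sequence to HMM states.

    log_A[i, j]   : log transition probability from state i to state j.
    log_emis[k, i]: log p(x(k) | s_i) for frame k and state i.
    Returns (best state path, log-probability of that path).
    """
    K, I = log_emis.shape
    delta = np.full((K, I), -np.inf)    # best log-score ending in state i at frame k
    psi = np.zeros((K, I), dtype=int)   # backpointers
    delta[0] = log_emis[0] + np.log(1.0 / I)  # uniform initial distribution (assumption)
    for k in range(1, K):
        scores = delta[k - 1][:, None] + log_A       # scores for every predecessor
        psi[k] = np.argmax(scores, axis=0)           # best predecessor per state
        delta[k] = scores[psi[k], np.arange(I)] + log_emis[k]
    path = [int(np.argmax(delta[-1]))]
    for k in range(K - 1, 0, -1):                    # trace the backpointers
        path.append(int(psi[k, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))

# Toy 2-state left-to-right model with 4 frames.
A = np.array([[0.7, 0.3],
              [0.0, 1.0]])
emis = np.array([[0.9, 0.1],    # early frames favour state 1
                 [0.8, 0.2],
                 [0.2, 0.8],    # later frames favour state 2
                 [0.1, 0.9]])
path, logp = viterbi(np.log(A + 1e-300), np.log(emis))
print(path)   # [0, 0, 1, 1]: each frame aligned to a state
```

The returned path is the frame-to-state alignment described above, and the score is the log-probability of the best path, i.e. the Viterbi approximation of p(X|M).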
It turns out that the most important link between HMMs and Neural Networks is the emission probability component. In fact, all hybrid approaches are based on an attempt to replace the HMM output probability distribution by a neural component. This is because the output probability modeling component is clearly the component with the largest impact on the recognition performance of an HMM-based system. Therefore, in the following subsections, the traditional output modeling techniques are briefly reviewed.
2.1. Discrete Hidden Markov Models
The basic functionality of an HMM output modeling component is to calculate the probability of a feature vector x if it is assumed that the underlying HMM is in the state s while x is being generated.18 This probability is denoted as p(x|s). The first approach used in the early generation of HMM-based systems is to quantize the vector x into discrete labels y_n and to use the frequency of those labels as a discrete probability distribution. The quantization is usually performed with a k-means vector quantizer (VQ) that assigns a discrete label y_n to each vector x according to

x → y_n    (1)
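The VQ assignment of Eq. 1, together with the label-frequency estimate of the discrete distribution mentioned above, can be sketched as follows; the 2-D codebook, the training vectors and their assignment to a single state s are all invented for illustration:

```python
import numpy as np

# Hypothetical codebook of N = 3 prototypes (e.g. from k-means) in a 2-D feature space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])

def quantize(x):
    """Eq. 1: map a continuous vector x to the label y_n of the nearest prototype."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

# Invented training vectors observed while the HMM was in state s; their label
# counts yield the discrete emission distribution p(y_n | s).
train_vectors = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [3.9, 4.2]])
counts = np.zeros(len(codebook))
for x in train_vectors:
    counts[quantize(x)] += 1
p_y_given_s = counts / counts.sum()

x_new = np.array([1.0, 0.9])
print(p_y_given_s[quantize(x_new)])   # table lookup: p(x|s) = p(y_n|s)
```

The final line is the fast table-lookup evaluation that the chapter attributes to discrete systems: quantize once, then read the probability from the table.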
Then, the state-conditional probability of this vector is expressed as

p(x|s) = p(y_n|s)    (2)

and the p(y_n|s) are obtained from a lookup table that has been computed from the counts of the various labels assigned to state s during training. It should be noted that the k-means clustering algorithm does not take into account the class labels of the training vectors x, and therefore this information is missing in the feature processor of a discrete system. Another problem is the quantization error resulting from the VQ process. The advantages of discrete systems include the speed of the table lookup procedure for calculating the output probabilities, and the fact that this method does not rely on any a priori assumptions about the distribution of the feature vectors. Figure 2 shows the basic structure of a discrete HMM. The feature vectors resulting from the input signal are processed by a vector quantizer, which could also be implemented in the form of a 1-layer neural network with winner-take-all characteristics in its output layer (e.g. a Kohonen-type neural network), resulting in the firing of the one single neuron that
Fig. 2. Basic structure of a discrete HMM.
has the largest (or smallest, depending on the propagation function) activation. This activation is set to 1.0, whereas all other activations take the value of 0.0. At the bottom of Fig. 2, the discrete emission probabilities for each state are shown as vertical bars. If they are multiplied with the activations of the neural network output layer, the resulting emission probability of the firing neuron (which is equivalent to the resulting VQ prototype) is generated.

2.2. Continuous Density Hidden Markov Models
Contrary to discrete systems, the state-conditional probability is computed for continuous HMMs as a sum of weighted Gaussians according to:

p(x|s) = Σ_n p(n|s) · G(x|n, s)    (3)

where G(x|n, s) is the nth Gaussian for state s, p(n|s) is the weight (or occurrence probability) of the nth Gaussian, and the summation is over all n_s Gaussians that belong to state s. A very popular approach is the creation of one large pool of N Gaussians which are shared by all states s of the HMM system. In this way, the Gaussians become independent of
the state, and are denoted as G(x|n). Each state still keeps its individual weights for the N Gaussians, and if these weights are expressed as p(y_n|s), the state-conditional probability for such a "Tied-Mixture system" can be expressed as

p(x|s) = Σ_{n=1}^{N} p(y_n|s) · G(x|n).    (4)
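Equation 4 can be illustrated with a small numeric sketch; the pool of spherical Gaussians and the state weights below are invented (a real system would use full or diagonal covariances estimated by EM):

```python
import numpy as np

def gaussian(x, mean, var):
    """Spherical Gaussian density G(x|n) in d dimensions (simplifying assumption)."""
    d = len(mean)
    diff = x - mean
    return float(np.exp(-0.5 * diff @ diff / var) / (2 * np.pi * var) ** (d / 2))

# Shared pool of N = 3 Gaussians, independent of the HMM state (tied mixtures).
means = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
variances = np.array([1.0, 0.5, 1.5])

def p_x_given_s(x, weights):
    """Eq. 4: p(x|s) = sum_n p(y_n|s) * G(x|n), with state-specific weights."""
    G = np.array([gaussian(x, m, v) for m, v in zip(means, variances)])
    return float(weights @ G)

weights_s1 = np.array([0.6, 0.3, 0.1])   # invented p(y_n|s1), summing to 1
x = np.array([0.5, 0.5])
print(p_x_given_s(x, weights_s1))
```

Note that only the weight vector changes from state to state; the Gaussian pool is evaluated once per frame and reused by every state, which is the computational appeal of the tied-mixture design.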
This formula has some similarity to Eq. 2 for the discrete case, and it turns out that the Gaussian weights p(y_n|s) can be interpreted as discrete emission probabilities for a codebook of size N, which are smoothed by the Gaussian factors G(x|n) in Eq. 4. Therefore, this approach is also denoted as semi-continuous HMMs. In Eq. 4, both the parameters p(y_n|s) and the parameters of the Gaussians G(x|n) are learned with the EM-algorithm, while extensively exploiting the class information of the training vectors. This is one of the major differences from discrete systems. This improved feature modeling component, and the smoothing capabilities of the continuous probability density functions, made the continuous HMMs the most popular and powerful approach, especially in the area of speech recognition, but also in handwriting recognition and in other applications.

2.3. Neural Implementation of Hidden Markov Models
It is interesting to see that the output modeling component of both the traditional discrete as well as the continuous HMM can be implemented in a neural architecture. A first indication of this fact is already given in Fig. 2, since the discrete HMM displayed in this figure already contains a neural network as one of its major components. Based on this fact, the discrete system shown in Fig. 2 can be further transformed into a pure neural architecture as shown in Fig. 3. In Fig. 3, the feature vectors extracted from the input signal are presented to the input layer of the vector quantizer. As already mentioned in the subsection on discrete HMMs, the architecture of this VQ is identical to a Self-Organizing Map, trained with the k-means algorithm. Again, the hidden layer in Fig. 3 has a winner-take-all characteristic, and the activation of the neuron with the smallest output is set to 1.0, while all other activations are set to 0.0. The output layer size in Fig. 3 is identical to the total number of HMM states. The N weights connecting the N hidden layer neurons to one single neuron in the output layer are identical to the
Fig. 3. Neural implementation of a discrete HMM.
vector of discrete output probabilities p(y_n|s), n = 1, ..., N, for the corresponding HMM state s. If only 1-state HMMs are assumed, the system has an output layer whose size is identical to the number of classes in the HMM system. Consequently, the activations of the output neurons are directly the class-conditional probabilities p(x|class). If HMMs with M states each were used, the M times larger output layer would deliver the state-conditional probabilities p(x|s_i), where s_i is one of the P × M possible states if P is the number of classes in the system. The dotted lines in Fig. 3 in the area of the output layer denote the transition probabilities of the HMM. A good way to think about their functionality is to first imagine that they are not present in Fig. 3, and to first feed all feature vectors to the neural network in Fig. 3, in order to compute all output probabilities for all possible states. These state-conditional probabilities can then be evaluated jointly with the transition probabilities for decoding with the Viterbi algorithm, which finds the optimal state sequence using dynamic programming. Figure 4 shows an equivalent neural structure for a tied-mixture HMM. Note that the difference to the structure in Fig. 3 is not very large. The only major difference is the fact that the hidden layer in Fig. 4 has a nonlinear activation function (Gaussian in this case), and that it does not have a winner-take-all
Fig. 4. Neural implementation of a tied-mixture HMM.
characteristic. Instead, all activations are propagated to the output layer. This results in a typical Radial-Basis-Function (RBF) neural network, and shows that discrete and continuous systems can both be implemented in a very similar neural network architecture.

3. Hybrid Systems

3.1. Neural Networks as Posterior Probability Estimators
Figures 3 and 4 already indicate that the difference between traditional and hybrid system architectures might be smaller than probably anticipated. Although both figures show a typical neural network, the weights of these nets are still trained according to classical Maximum Likelihood principles, which are not used in most neural net paradigms. As already mentioned, the hybrid approach is typically used to replace the output probability modeling component of the HMM. This component is implemented in the neural network layers in Figs. 3 and 4. It has been shown in Ref. 2 that instead of using the RBF structure in Fig. 4 for a continuous modeling of the
output probabilities, it is also possible to use a multilayer perceptron (MLP) for this task. In this case, it is also possible to use the well-known error-backpropagation algorithm to train this MLP. This algorithm is typically used for training neural networks for classification tasks. Alternative but similar training procedures are available in Ref. 10, for example. In this case, the neuron representing the class of the presented input vector is given a target value of 1.0, and all other neurons receive a target value of 0.0. This results in target values δ_ji for the jth output neuron, if the presented vector x belongs to class i. Typically, the activation function denoted as f(x) is chosen to be a sigmoid function for all neurons of the MLP except those in the output layer. In the output layer, the activation of the jth neuron y_j is additionally smoothed by the softmax function o_j(y_j) according to:

o_j(y_j) = e^{y_j/T} / Σ_i e^{y_i/T}    (5)

with the parameter T set to unity. In this case, the error between the actual output values o_j and the target values δ_ji can be expressed as

E = (1/2) Σ_k Σ_j [o_j(x(k)) - δ_ji]²    (6)

and is minimized by the backpropagation algorithm. The derivative of Eq. 6 with respect to a weight w of the neural net leads to:14

∂E/∂w = Σ_k Σ_j R_j(k) · ∂y_j/∂w    (7)

with

R_j(k) = o_j(x(k)) · Σ_l [o_l(x(k)) - δ_li] · [δ_lj - o_l(x(k))].    (8)
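These gradient formulas can be checked numerically; the sketch below uses invented activations and targets, and compares the analytic per-neuron factor R_j (the derivative of the squared error of Eq. 6 taken through the softmax of Eq. 5) against a finite-difference estimate:

```python
import numpy as np

def softmax(y, T=1.0):
    """Eq. 5 with temperature T (set to unity in the chapter)."""
    e = np.exp((y - np.max(y)) / T)   # shifted for numerical stability
    return e / e.sum()

def R(y, delta):
    """Eq. 8: R_j = o_j * sum over neurons of (output - target) * (kronecker - output)."""
    o = softmax(y)
    n = len(y)
    return np.array([o[j] * np.sum((o - delta) * ((np.arange(n) == j) - o))
                     for j in range(n)])

y = np.array([0.3, -1.2, 0.8])       # toy pre-softmax activations for one frame
delta = np.array([0.0, 0.0, 1.0])    # target: the third class
analytic = R(y, delta)

# Finite-difference check of dE/dy_j for E = 0.5 * sum_j (o_j - delta_j)^2 (Eq. 6).
eps = 1e-6
E = lambda z: 0.5 * np.sum((softmax(z) - delta) ** 2)
numeric = np.zeros_like(y)
for j in range(len(y)):
    yp, ym = y.copy(), y.copy()
    yp[j] += eps
    ym[j] -= eps
    numeric[j] = (E(yp) - E(ym)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # agreement to numerical precision
```

The remaining factor ∂y_j/∂w of Eq. 7 is what standard backpropagation supplies layer by layer.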
The expression ∂y_j/∂w in Eq. 7 depends on the internal neural network structure and can be obtained by considering the actual layer of the weight w and the activation function f. It has been shown in Ref. 3 that if the error function in Eq. 6 is minimized in order to obtain a global minimum of this function, the activation of the jth neuron after presentation of the input vector x can be interpreted as the posterior probability of the jth class, i.e.

o_j(x) = p(s_j|x)    (9)
Fig. 5. Hybrid approach based on posterior probability estimation with an MLP.
where s_j indicates the jth category, i.e. either a class of the pattern recognition system or a certain state s of one of the HMMs used for representing a specific class. In the latter case, this state has been assigned to feature vector x by Viterbi alignment. The structure of such a system is shown in Fig. 5. Note that the structure of the system in Fig. 4 is not much different from the structure in Fig. 5, where the RBF neural network is replaced by the MLP. Obviously, in this hybrid system, the approximation of the output probability distributions is performed using a mixture of sigmoids rather than a mixture of Gaussians. The major reason for interpreting Fig. 5 as a hybrid system is the additional use of the typical neural network training procedures for training the MLP weights. In this system, discriminative training objectives are used for the hybrid system rather than Maximum Likelihood objectives. As explained before, this leads to
the generation of posterior probabilities at the MLP output layer, which have to be divided by the prior probabilities of the classes in order to obtain the class-conditional probabilities that are required by the HMMs.2 This is performed in the additional output layer shown in Fig. 5. Although the structural similarity to traditional HMM-based systems can still be established by comparing Figs. 4 and 5, such a hybrid system already has a lot of differences concerning its practical use as a pattern recognition system. These practical differences are briefly summarized in the following points:

• The system is not trained with the Forward-Backward algorithm. Instead, the MLP is trained separately with the backpropagation algorithm as explained previously. Since only 1-state HMMs are normally used, there is no HMM training procedure at all. For recognition, the sequence of feature vectors is presented to the MLP, which generates a sequence of emission probabilities for all the frames. This sequence is then evaluated for decoding using the Viterbi algorithm.

• In most cases, the system is not used in a context-dependent mode. A context-dependent system requires the consideration of several thousand different HMM states for various realizations of the class units in their different possible contexts (e.g. phones in speech recognition or graphemes in handwriting). Such a system would require the training of a very large MLP with an output layer of several thousand neurons, which is impractical. However, this disadvantage is somewhat compensated for by the use of multiple frame inputs, where at least 5-7 adjacent frames are presented as inputs to the MLP. Such an input can be interpreted as context dependency at the frame level, and can be a substitute for context dependency at the unit level.
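The division of the MLP posteriors by the class priors mentioned above can be sketched as follows; the posterior vector for one frame and the priors are invented for the example:

```python
import numpy as np

# Invented MLP outputs for one frame: posterior probabilities p(s_j | x), Eq. 9.
posteriors = np.array([0.70, 0.20, 0.10])

# Invented priors p(s_j), e.g. relative state frequencies from a training alignment.
priors = np.array([0.50, 0.30, 0.20])

# Scaled likelihoods p(x|s_j) / p(x) = p(s_j|x) / p(s_j); the unknown factor p(x)
# is constant within a frame and therefore cancels in Viterbi decoding.
scaled_likelihoods = posteriors / priors
print(scaled_likelihoods)
```

The resulting values are not probabilities themselves, but they rank the states exactly as the class-conditional probabilities would, which is all the Viterbi comparison needs.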
This hybrid HMM/NN architecture based on posterior probabilities is by far the most popular hybrid approach, and has led to the development of several very successful systems - particularly in the area of speech recognition - especially considering that such a hybrid system based on context-independent units uses only a fraction of the number of parameters of a context-dependent traditional system and has an almost equivalent recognition performance. This approach has been the basis for several other hybrid systems, which all use the basic principle of the neural posterior probability estimation. The most popular one is the replacement of
the MLP by a recurrent neural network26 that avoids the use of the multi-frame input, thus further reducing the number of system parameters. Other extensions of the baseline posterior approach are mainly aimed at the efficient introduction of context dependency to the hybrid system. This can be obtained by the introduction of context classes, for which the context probability can also be estimated with a number of neural networks.11 A more recent method is the use of a hierarchical splitting of context states, where each hierarchy level can be characterized by a neural network estimating the posterior probability of this level.6 The entire probability of a context state can then be expressed as the product of the posteriors of all associated hierarchy levels. In this way, the complexity of the neural system can be kept relatively small despite the large number of context states. Although these systems are clearly different from the traditional system in Fig. 4, it would be an interesting experiment to train the RBF network in Fig. 4 on the estimation of posterior probabilities with error-backpropagation, and to use multiple frame inputs in combination with 1-state HMMs. Since RBF neural nets should in principle be capable of modeling posteriors very well, results similar to the MLP posterior approach could be expected. This demonstrates again that the classical hybrid approach based on posteriors and the traditional tied-mixture approach are different, but not extremely far from each other. Hybrid systems based on posterior probability estimation have demonstrated their capabilities in numerous evaluations using a variety of popular databases, especially in the area of speech recognition. The most popular system that emerged from these activities is the ABBOT system,5 which has obtained the best recognition results, and occasionally even outperformed the best continuous HMM systems on specific tasks.
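A small sketch of the hierarchical factorization of a context-state probability described above (the hierarchy levels, their names, and the per-level posterior values are all invented):

```python
# Each hierarchy level has its own neural estimator; here the posterior that each
# level assigns to the branch leading to one particular context state is invented.
level_posteriors = {
    "broad-class": 0.8,      # hypothetical NN for the top-level split
    "left-context": 0.5,     # hypothetical NN for the next split
    "right-context": 0.25,   # hypothetical NN for the finest split
}

# The context-state posterior is the product over all associated hierarchy levels.
p_context_state = 1.0
for level, p in level_posteriors.items():
    p_context_state *= p

print(p_context_state)   # 0.8 * 0.5 * 0.25 = 0.1
```

Each network only has to discriminate among the few branches of its own level, which is why the overall neural system stays small even for thousands of context states.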
Another interesting outcome of the hybrid posterior probability approach is the fact that posteriors can be used very effectively in the decoding process for speech recognition. In Ref. 19, a decoder has been introduced that is especially designed for use in conjunction with hybrid systems and exploits the posterior probabilities delivered by the acoustic processor during decoding for efficient pruning strategies, where the use of posteriors is much more advantageous than the use of class-dependent probabilities. In Ref. 33, it has also been shown that the availability of posterior probabilities can be directly evaluated, in order to obtain confidence measures that indicate the suitability of a classified acoustic pattern for use in adaptation or retrieval tasks.
G. Rigoll
3.2. Mutual Information Neural Networks
Although it has been shown in the previous section that the classical hybrid systems based on posterior probabilities are still related to traditional HMM systems, it is obvious that the HMM part of the hybrid system does not play a very large role in the system. In fact, it is almost possible to interpret that system as a pure neural system that only uses very simple 1-state HMMs for the sake of a dynamic programming implementation of the decoding phase. As already mentioned, one of the consequences of this is the fact that some of the useful features of HMM-based systems are difficult to implement for these hybrid systems. Examples of this are triphones, crossword triphones, or fast EM-based parameter estimation methods. In the following, an alternative hybrid system architecture is introduced that can be derived mathematically from the theory of traditional HMMs, and therefore retains many of the typical HMM characteristics, while using a neural feature processing module in conjunction with a complex HMM system. One of the interesting characteristics of this alternative hybrid approach is the fact that it can be derived from both the classical discrete as well as the continuous HMM approach. It will be shown that this approach combines the advantages of discrete and continuous HMMs, and can be interpreted as lying in between both architectures. This can be achieved by considering the probabilistic relation between the generation of the discrete prototype y_n (see Eq. 1) and the presentation of the continuous feature vector x. This probabilistic relationship can be expressed using Bayes' law as:

p(y_n|x) · p(x) = p(x|y_n) · p(y_n).   (10)
The probability p(x) can thus be expressed as

p(x) = [p(x|y_n) / p(y_n|x)] · p(y_n).   (11)
This equation establishes a relationship between the probability of the continuous vector x and the probability of the prototype yn. Obviously, the quotient on the right hand side of Eq. 11 must be a probabilistic model of the vector quantization process, describing the probability of generating prototype yn when vector x is presented to the VQ. This VQ process is independent of the state s of the HMM, whereas the probabilities of the vector x and the prototype yn are always expressed as state-dependent probabilities
Combination of Hidden Markov Models and Neural Networks
in HMM systems. Therefore, the following equation must hold:

p(x|s) = [p(x|y_n) / p(y_n|x)] · p(y_n|s)   (12)
and this equation can be further simplified by using Eq. 10 in order to modify the state-independent VQ term in Eq. 12, yielding

p(x|s) = [p(x) / p(y_n)] · p(y_n|s).   (13)
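The difference between the exact discrete model of Eq. 13 and the simplified model of Eq. 2 can be illustrated with a small computation (all probability values invented for the sketch):

```python
# Numerical illustration of Eq. 13 vs. the simplified model of Eq. 2
# (all probability values invented for the sketch).
p_yn_given_s = 0.05   # discrete emission probability p(y_n|s)
p_x = 0.002           # total probability of the continuous vector x
p_yn = 0.04           # total probability of prototype y_n

p_x_given_s_exact = (p_x / p_yn) * p_yn_given_s  # Eq. 13
p_x_given_s_simple = p_yn_given_s                # Eq. 2: assumes p(x) = p(y_n)
```

The correction factor p(x)/p(y_n) is exactly what the simplified discrete model discards when it assumes that no quantization error exists.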
Again, the quotient in this equation can be interpreted as a model of the vector quantization process, thus giving some insightful information about the relationship between the continuous model p(x|s) and the discrete probability distribution p(y_n|s). If the quotient of the total probabilities p(x) and p(y_n) was known, then the state-conditional continuous probability p(x|s) could be exactly predicted from the state-conditional discrete probability p(y_n|s). Therefore, Eq. 13 is a discrete model that enables the calculation of continuous probabilities and therefore lies between the discrete and the continuous HMM approach. It also shows how discrete HMMs are used to approximate continuous systems: setting p(x) = p(y_n) in Eq. 13 leads to the discrete model of Eq. 2. But this means that it is simply assumed that no quantization error exists and that vector x is mapped into a label y_n that is unique for x. Therefore, Eq. 13 can be considered as the exact discrete model that takes the quantization error of the VQ into account, whereas Eq. 2 is the simplified discrete model. So far, the vector quantizer of an HMM has mostly been realized by the k-means algorithm and therefore was always designed in unsupervised mode. The exact VQ model of Eq. 13 can now be used to derive a new training criterion for an improved supervised vector quantizer, which can be implemented through a neural network. Therefore, the resulting system is a hybrid HMM/NN system, consisting of discrete HMMs and a neural network acting as neural vector quantizer. For this purpose, one has to recall that traditional HMMs are usually trained with the Maximum Likelihood (ML) principle, i.e. the parameters summarized in vector θ of the complete system are chosen to maximize the likelihood of the training data:

θ* = argmax_θ { (1/K) Σ_k log p(x(k)|s(k)) }   (14)
where again the state s(k) for every feature vector x(k) is known from a Viterbi alignment. Inserting the exact probabilities delivered by the VQ in Eq. 13 into Eq. 14 yields:

θ* = argmax_θ { (1/K) [ Σ_k log p(x(k)) − Σ_k log p(y_n(k)) + Σ_k log p(y_n(k)|s(k)) ] }.   (15)
The terms in Eq. 15 are expectation values of log probabilities and can also be written as entropies, yielding:

θ* = argmax_θ { −H(X) + H(Y) − H(Y|S) }   (16)
where H(Y) is the entropy of the string of prototypes resulting from quantizing all available training feature vectors x and H(X) is the entropy of the feature stream X. The expression on the right-hand side of Eq. 16 has to be maximized in order to fulfill the ML criterion for the entire pattern recognition system. If the parameters of a neural vector quantizer have to be optimized according to that criterion, it is obvious that these parameters can only affect the terms that contain the label sequence Y. The entropy H(X) cannot be affected by a VQ. Therefore, the remaining criterion is

I(Y, S) = H(Y) − H(Y|S) = H(S) − H(S|Y)   (17)
which is the mutual information between the string of firing neurons Y of a neural vector quantizer and the string S of the state sequence associated with the original stream of feature vectors x presented for training.21 It is possible to train neural networks with arbitrary topology and architecture in order to fulfill this Maximum Mutual Information (MMI) criterion. Such a neural net is then used as a neural codebook replacing the original k-means-clustered codebook in the discrete HMM. Such networks may also be called Maximum Mutual Information Neural Networks.20,22 In order to obtain such a network, we may simply consider a multi-layer perceptron (MLP) equivalent to the type that was introduced in Section 2.1, with a softmax activation function o_j(y_j) in the output layer, as already described in Eq. 5. In Eq. 17, the state sequence string S always remains constant for training with the same data. Therefore maximization of the left-hand side of Eq. 17 is equivalent to the minimization of the entropy
H(S|Y), which can also be rewritten as an expectation over the distribution p(s_i, y_j) in the following manner:

H(S|Y) = −Σ_i Σ_j p(s_i, y_j) · log p(s_i|y_j).   (18)
Using Eq. 18, the problem is reduced to the derivation of the joint probabilities p(s_i, y_j), i.e. the joint probabilities that the jth neuron fires at a discrete time step that has been assigned to the HMM state s_i. This can be calculated by simply summing the activations o_j of the jth neuron y_j over all discrete time steps k to which state s_i has been assigned, i.e.

p(s_i, y_j) = (1/K) Σ_{k=1}^{K} δ_{s(k),s_i} · o_j(k).   (19)
In this case, o_j(k) is the output of the softmax function in Eq. 5 with a very small value of the parameter T in order to generate a crisp winner-take-all behavior. In this way, it is possible to approximate the firing behavior of the neural network while still having a differentiable model for calculating derivatives. The expression δ_{s(k),s_i} is equal to 1.0 if the HMM state s(k) at discrete time step k is equal to the specific state s_i. Through Eqs. 18, 19, and 5, it is now possible to use the chain rule in order to compute the derivatives of the expression H(S|Y) with respect to any weight w of a neural network with arbitrary topology.14 Surprisingly, the result is relatively similar to the ordinary backpropagation algorithm; it consists of the expression in Eq. 7 (which remains unchanged), and Eq. 8 is replaced by14

δ_j = (1/K) · o_j(x(k)) · { log p(s(k)|y_j) − Σ_i log p(s(k)|y_i) · o_i(x(k)) }.   (20)
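The estimation of the joint probabilities of Eq. 19 and the entropy criterion of Eq. 18 can be sketched with toy data (activations and state alignment are random stand-ins):

```python
import numpy as np

# Sketch of Eqs. 18 and 19: estimating the joint p(s_i, y_j) from softmax
# activations and a given state alignment, then the criterion H(S|Y).
# Activations and alignment are random toy data.
rng = np.random.default_rng(1)
K, n_states, n_neurons = 200, 3, 5

logits = rng.normal(size=(K, n_neurons)) / 0.1   # small T -> crisp softmax
o = np.exp(logits - logits.max(axis=1, keepdims=True))
o /= o.sum(axis=1, keepdims=True)                # o_j(k) as in Eq. 5
s = rng.integers(0, n_states, size=K)            # aligned state s(k)

# Eq. 19: p(s_i, y_j) = (1/K) sum_k delta_{s(k), s_i} * o_j(k)
joint = np.stack([o[s == i].sum(axis=0) for i in range(n_states)]) / K

cond = joint / joint.sum(axis=0, keepdims=True)  # p(s_i|y_j)
H_S_given_Y = -(joint * np.log(np.clip(cond, 1e-12, None))).sum()  # Eq. 18
```

Minimizing H(S|Y) with respect to the network weights (via the gradient of Eq. 20) pushes the firing behavior towards high mutual information with the state sequence.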
Equation 7 can be taken over from the classical MLP case and mainly depends on the internal NN topology. A major difference to the MLP posterior approach is the fact that the MMINN does not need the explicit formulation of target values for each neuron in the output layer, but instead develops its firing behavior towards a maximum mutual information value in a self-organizing way. In the case of multiple features, those features are usually quantized in separate codebooks and consequently will
also be quantized by separate neural networks in the hybrid MMINN approach. In this case, the various neural nets generate different neural firing strings denoted as Y_1, Y_2, ..., Y_x. It is possible to optimize each neural codebook separately by minimizing H(S|Y_x) according to Eq. 18. However, it is also possible to evaluate the correlations between the different features by considering the joint information of all neural firing strings and by minimizing the joint entropy H(S|Y_1, Y_2, ..., Y_x), leading to the considerably more complicated equations given in Ref. 15. Such a joint feature optimization is not possible for ordinary multiple codebooks based on the k-means algorithm. The MMINN approach is obviously different from all other traditional or hybrid approaches. However, it is possible to establish interesting relationships to all those other recognition architectures. Figure 6 shows the relationship between the MMINN approach and the well-known traditional HMM techniques. The illustration in the upper part of this figure shows the conventional continuous HMM system, where each HMM state has a fixed number of Gaussians assigned to it. From there, the tied-mixture system can be easily generated by making available a large pool of Gaussians commonly shared by all states. The discrete HMM architecture can be directly derived from the tied-mixture system by replacing the pool of Gaussians by a pool of regions with a discrete firing behavior characterized by a vector
Fig. 6. Relationship between the MMINN and the traditional HMM approaches.
quantizer. This is shown by replacing the smooth Gaussians by the rectangular areas indicating the binary behavior of these units in the lower right part of Fig. 6. From the discrete system, the MMINN approach evolves directly by replacing the k-means VQ by an MMI neural network. This is indicated by the shifted rectangular areas in the lower left part of Fig. 6, showing the modified firing behavior of the neurons in order to fulfill the MMI criterion. It can therefore be seen that the MMINN approach can be considered to lie somewhere in between the continuous and discrete HMM paradigms; the structure of the system is basically identical to that of a discrete system, but now the discrete neural firing areas can be trained using the class information of the training samples, similar to the way the Gaussian parameters (means and variances) are estimated for the continuous HMMs using the EM algorithm. Therefore, this hybrid system can also be considered as a continuous system, where the trained Gaussians are replaced by trained neural firing areas. The relationship to the classical hybrid approach based on posterior probability estimation can be established by comparing Figs. 3 and 5. Figure 3 shows the structure of the hybrid MMINN approach, which is basically identical to the structure of the discrete HMM, except that the input layer in Fig. 3 may be replaced by an arbitrary neural network trained according to the MMI principle. This structure is not very different from the hybrid structure in Fig. 5, and the differences to Fig. 3 are similar to the differences between Figs. 3 and 4. Therefore, it is possible to establish a link from the hybrid MMINN approach through the tied-mixture system to the hybrid posterior probability estimator. Another link between both approaches is given by the similar training algorithms in Eqs. 7, 8, and 20. The MMINN approach has been tested with several popular speech databases.
In this case, it turned out to be very advantageous that the MMINN approach can be combined with HMMs of arbitrary complexity. This has been exploited in Refs. 14 and 21 for speech recognition, where a complex hybrid MMI-connectionist/HMM system for the RM database has been developed, based on context-dependent HMMs using triphones and tree-based state clustering. In Ref. 27, a similar system has been developed for the WSJ database,17 where the use of crossword triphones led to a system with recognition performance close to the best continuous systems available for the WSJ database. Due to the basic discrete structure
of the MMI-connectionist/HMM approach, the decoding time of this system is close to real time. Recently, the MMINN approach has also been tested for very large vocabulary handwriting recognition. In Ref. 24, it is demonstrated that for a 30,000 word handwriting recognition task, the hybrid approach has outperformed competing discrete and continuous density HMM systems. It has also been used in a very demanding handwriting recognition task (involving the recognition of 200,000 words) and clearly improved the recognition rate of a baseline discrete system in that case. An application of hybrid modeling techniques to on-line as well as off-line handwriting recognition has also been extensively investigated in Ref. 4.

3.3. Nonlinear Discriminant Feature Transformation Hybrids
As has been shown in Section 3.1, the most popular hybrid approach based on posterior probability estimation relies on the continuous HMM approach and mainly aims at replacing the Gaussian modeling component by a neural network. In contrast to that approach, the MMINN approach in Section 3.2 is mainly based on the discrete HMM architecture. However, it is also possible to extend the MMINN principle to continuous HMMs in order to obtain a hybrid system that uses a continuous neural modeling technique based on information theory principles. A very obvious and natural way to achieve this is the addition of a neural network to the continuous tied-mixture architecture in Fig. 4. Such a neural network can be placed as an additional hidden layer just in front of the RBF component of the system shown in Fig. 4. It can even be implemented in a recurrent structure, as suggested in Fig. 7. The justification for such an extension can be given as follows. Obviously, one of the major reasons for using hybrid HMM/NN systems is the fact that the discriminative power of neural networks can have a positive influence on the system's recognition performance. Since the tied-mixture system in Fig. 4 can already be considered as a neural network, it would be theoretically possible to train the weights of this RBF network directly. It is however easy to show that this would be directly equivalent to the standard MMI principle for Hidden Markov Models1 and would therefore destroy the Maximum Likelihood (ML) objective on which the tied-mixture system was originally trained. In contrast to that, it has been shown that the MMI training of the neural vector quantizer in Section 3.2 corresponds directly to this ML principle.
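The recurrent feature-transformation front-end placed before the RBF component can be sketched as follows (weights are random stand-ins; in the real system they would be trained with the MMI gradient described below):

```python
import numpy as np

# Sketch of the recurrent feature-transformation front-end suggested
# above: each frame x_k is mapped to x*_k, which also depends on the
# previous transformed frame. Weights are random stand-ins; in the real
# system they would be trained with the MMI gradient.
rng = np.random.default_rng(2)
d, K = 13, 10
W = rng.normal(scale=0.1, size=(d, d))   # input weights (hypothetical)
V = rng.normal(scale=0.1, size=(d, d))   # recurrent weights (hypothetical)

x = rng.normal(size=(K, d))              # input feature vectors
x_star = np.zeros((K, d))                # transformed feature vectors x*
h = np.zeros(d)
for k in range(K):
    h = np.tanh(W @ x[k] + V @ h)
    x_star[k] = h                        # would feed the frozen RBF part
```

The recurrence lets each transformed vector x* carry temporal context without enlarging the input dimensionality of the frozen tied-mixture stage.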
Fig. 7. Nonlinear discriminant feature transformation hybrid: the input feature vectors x are processed by a recurrent Maximum Mutual Information Neural Network into transformed feature vectors x*, which are passed to the RBF network (Gaussian parameters G(x*|n), discrete probabilities p(y_n|s)) that delivers the class-conditional probabilities p(x|s_i).
The tied-mixture system and the recurrent neural network now form one single neural network, where the newly introduced neural net represents the hidden units of this combined neural net. These units can now be trained using the MMI objective function, while the weights of the RBF neural network are frozen. Since this is now a continuous system, the mutual information has to be expressed using the entropy H(X) of the stream X of continuous feature vectors in the following way:

I(X, S) = H(X) − H(X|S)
        = −(1/K) Σ_k log p(x(k)) + (1/K) Σ_k log p(x(k)|s(k))
        = (1/K) Σ_k log [ p(x(k)|s(k)) / p(x(k)) ]
        = (1/K) Σ_k log [ p(x(k)|s(k)) / Σ_i p(x(k)|s_i) · p(s_i) ]   (21)
where the probabilities p(x|s) can be computed from the tied Gaussians by:

p(x|s) = Σ_{n=1}^{N} a_{sn} · (2π)^{−d/2} |C_n|^{−1/2} · exp( −(1/2) (x* − m_n)^T C_n^{−1} (x* − m_n) ).   (22)
In this case, x(k) and s(k) denote the kth training vector and its associated state or class, respectively, while s_i denotes any state of the HMM. The vector x* is the output vector of the additional MMI neural network shown in Fig. 7. A relationship between this vector and the original input vector x can be established via the weights of this additional neural network. Using this relation and Eqs. 21 and 22, a gradient of the mutual information expression in Eq. 21 with respect to the weights of the MMINN can be derived, which can be used for training the weights of this network. This training procedure with the frozen tied-mixture weights will have the effect of generating more discriminative features x* at the output layer of this MMINN. Therefore, the system will increase its discriminative behavior under consideration of the active HMM parameters. Of course, this also leads to an increased mutual information of the overall system. But if in a subsequent step the HMM parameters are retrained with the MMINN weights held constant, the system regains its Maximum Likelihood characteristics, while now using features with improved discriminative characteristics. Such a procedure can be interpreted in at least three different ways: (1) As a hybrid system combining MMI neural nets and continuous HMMs (2) As a discriminant feature extraction method taking the parameters of the baseline HMMs into account (3) As an LDA-like (Linear Discriminant Analysis) feature transformation that incorporates nonlinear and recurrent feature transformations. As with an LDA, this approach also allows multiple frame input and can extract the relevant information out of a high-dimensional feature vector. The advantage of this new hybrid approach is the fact that it can be implemented directly on top of a highly optimized tied-mixture system, which can also be a context-dependent system and already has an extremely good recognition performance.
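The objective of Eq. 21 can be evaluated directly for a given set of transformed features; a minimal sketch with one unit-covariance Gaussian per state, equal state priors and toy data (a trained front-end would drive this value up):

```python
import numpy as np

# Sketch evaluating the objective of Eq. 21 for given transformed
# features x*: one unit-covariance Gaussian per state, equal state
# priors, toy data; a trained front-end would increase this value.
rng = np.random.default_rng(3)
K, d, n_states = 100, 2, 3
means = rng.normal(scale=3.0, size=(n_states, d))
p_s = np.full(n_states, 1.0 / n_states)

x_star = means[rng.integers(0, n_states, K)] + 0.3 * rng.normal(size=(K, d))
s = np.argmin(((x_star[:, None] - means) ** 2).sum(-1), axis=1)  # alignment

lik = np.exp(-0.5 * ((x_star[:, None] - means) ** 2).sum(-1)) / (2 * np.pi)
p_x = lik @ p_s                                    # denominator in Eq. 21
I_XS = np.mean(np.log(lik[np.arange(K), s] / p_x)) # mutual information estimate
```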
Such a system can then be further improved by the neural discriminant feature transformation. In this case, the MMINN can be trained in conjunction with a context-independent tied-mixture system and the trained MMINN can later be successfully used as a discriminant neural feature extractor in a complex context-dependent tied-mixture system. This has been demonstrated in Refs. 23 and 31, where a context-dependent
tied-mixture system for the RM database has been optimized so intensively that it was practically impossible to further improve its performance with any conventional method or LDA. It was, however, possible to further reduce the relative error rate by 10% using the new proposed hybrid approach. In such cases, the improvement will always be relatively small since the baseline system is already very good, but in contrast to most other conventional methods, an improvement is still possible. The relation of this approach to all other traditional and hybrid modeling techniques can easily be seen by the strong relation to the tied-mixture approach and the relation of the tied-mixture approach to the other approaches as outlined in Sections 3.1 and 3.2. Another interesting relationship between this approach and the hybrid posterior probability approach of Refs. 2 and 3 has been established in Ref. 32. It is based on the fact that posterior probabilities can be computed efficiently from tied-mixture systems in the following manner: the posterior probability p(s|x) of a state s can be computed from the class-dependent probability according to Bayes' formula as:
p(s|x) = p(x|s) · p(s) / p(x).   (23)
In this case, the crucial point is the computation of the probability p(x), which can be computed as

p(x) = Σ_s p(s) · p(x|s).   (24)
Unfortunately, this requires a summation over all states, which is infeasible especially for large context-dependent systems with thousands of states. By exploiting the fact that the state-conditional probabilities are modeled by Gaussian mixtures according to Eq. 4, this equation can be inserted into Eq. 24 in order to obtain the following relationship:

p(x) = Σ_s p(s) · p(x|s) = Σ_s p(s) · Σ_{n=1}^{N} p(y_n|s) · G(x|n)
     = Σ_{n=1}^{N} [ Σ_s p(y_n|s) · p(s) ] · G(x|n) = Σ_{n=1}^{N} p(y_n) · G(x|n)   (25)

with

p(y_n) = Σ_s p(y_n|s) · p(s).   (26)
The quantities in Eq. 26 can be easily precomputed from a trained HMM system and stored in a table. Eq. 25 thus shows that the probability p(x) can be obtained in a tied-mixture system simply by another summation over all N Gaussians. Using Eq. 23, it is now easy to make use of posterior probabilities even in traditional tied-mixture systems, and it can be shown that a similar procedure is also possible for discrete systems, including hybrid MMI-connectionist/HMM systems. In Ref. 32 it has been shown how these posteriors are used for efficient decoding, but they can also be used for other purposes, e.g. for confidence measures. Therefore, Eq. 25 establishes another close link to the hybrid posterior probability approach from Section 3.1 and shows that posteriors can also be used effectively in all the other hybrid paradigms presented in Sections 3.2 and 3.3 of this paper.
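The precomputation of Eq. 26 and the cheap evaluation of p(x) via Eq. 25 can be sketched and checked against the exhaustive state sum of Eq. 24 (sizes and values invented):

```python
import numpy as np

# Sketch: precompute p(y_n) (Eq. 26) once, then get p(x) by a single
# sum over N tied Gaussians (Eq. 25) instead of over all S states.
rng = np.random.default_rng(4)
S, N = 1000, 8                                    # many states, few Gaussians
p_s = rng.dirichlet(np.ones(S))                   # state priors p(s)
p_yn_given_s = rng.dirichlet(np.ones(N), size=S)  # tied-mixture weights

p_yn = p_s @ p_yn_given_s                         # Eq. 26, stored in a table

G_x = rng.random(N)                               # Gaussian values G(x|n)
p_x_fast = p_yn @ G_x                             # Eq. 25: sum over N only
p_x_slow = sum(p_s[i] * (p_yn_given_s[i] @ G_x) for i in range(S))  # Eq. 24
```

Both routes give the same p(x), but the fast one costs N multiplications per frame instead of S·N.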
3.4. The Tied-Posterior Approach
One of our most recently developed hybrid approaches is focused on an extension of the "classical" hybrid approach based on the use of posterior probabilities, presented first in Ref. 3 and previously outlined in Section 3.1 of this contribution. The major limitations of that approach have already been mentioned, namely that it is very difficult to use an arbitrary number of HMM states for such a system, and that this also makes the use of context-dependent approaches for this hybrid architecture very difficult. Therefore, the use of context-dependent models needs special algorithms and structures in these hybrid systems (see e.g. Ref. 6). In this section we present an extension of the hybrid posterior probability approach to overcome the problems of the limited structure and the context-dependent modeling described above. The approach presented here is based on tied-mixture HMM technology. If we recall Eq. 4 from Section 2.2, we can express the Gaussian weights p(y_n|s) also as a weighting factor c_in and the Gaussian factors G(x|n) as an occurrence probability p(x|n) of the nth Gaussian, resulting in the following equation for the emission probability p(x|s_i) for the ith state of the HMM7:

p(x|s_i) = Σ_{n=1}^{N} c_in · p(x|n).   (27)
The weighting factors c_in are usually estimated using the Maximum Likelihood (ML) algorithm. In Eq. 27 the sum is computed from the conditional
probabilities p(x|n) multiplied by the factors c_in. These conditional probabilities usually represent a Gaussian pdf. When using a neural network as a probability estimator to replace those Gaussian pdfs, the result will be some kind of hybrid recognition system.8 This replacement transforms a tied-mixture system into a hybrid "tied-posterior" system as proposed in this section. In order to replace the conditional probability used in the HMMs with the posterior probability p(n|x), which is the output of the neural network, a scaled likelihood (as in Ref. 2) can be used. This probability can be expressed using the posterior probabilities p(n|x) and the a priori class probabilities p(n), which can be estimated using the training data. Application of Bayes' law yields:

p(x|n) / p(x) = p(n|x) / p(n).   (28)
Using neural network posteriors as described in Section 3.1 together with Eq. 27 and Eq. 28 leads to the tied-posterior approach, in which the emission probabilities can be computed as:

p(x|s_i) = p(x) · Σ_{n=1}^{N} c_in · [ p(n|x) / p(n) ]   (29)
where the probabilities p(n|x) are now taken from the output of a neural network trained on the generation of posterior probabilities, and the probabilities p(n) are class priors that can be estimated from the training data. The factors c_in link the neural net output to the HMM structure and have to be estimated in the same way as for the traditional tied-mixture case using the Baum-Welch algorithm. This is shown in Fig. 8, which displays the traditional discrete and continuous HMM structures on the left side, and the classical hybrid posterior approach on the top right side. The tied-posterior approach pictured on the lower right is developed by weighting the network output with the weights c_in. The reason for calling this approach "tied-posteriors" is obvious if one considers that this approach can be obtained from the traditional tied-mixture system by replacing the tied Gaussians by the output of the posterior neural network. When, instead of the ML estimates, the c_in are chosen to be

c_in = 0 if i ≠ n   and   c_in = 1 if i = n   (30)
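The tied-posterior emission of Eq. 29 (up to the state-independent factor p(x)) and the special case of Eq. 30 can be sketched with invented numbers:

```python
import numpy as np

# Sketch of the tied-posterior emission of Eq. 29 (up to the common
# factor p(x)) and of the special case Eq. 30; all numbers invented.
posteriors = np.array([0.6, 0.3, 0.1])   # p(n|x) from the neural net
priors     = np.array([0.4, 0.4, 0.2])   # class priors p(n)
C = np.array([[0.7, 0.2, 0.1],           # weights c_in, one row per state
              [0.1, 0.8, 0.1]])

emissions = C @ (posteriors / priors)    # proportional to p(x|s_i), Eq. 29

# Eq. 30: with c_in = 1 only for i = n, the model collapses to the
# standard posterior hybrid (one state per class).
standard = np.eye(3) @ (posteriors / priors)
```

With a full weight matrix C, several HMM states can share the same network outputs while still being distinguishable through their state-dependent weight vectors.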
Fig. 8. Relation of the Tied-Posterior approach to other HMM paradigms (Gaussian mixture model and tied-mixture model on the left, hybrid system with posterior probabilities and tied-posterior system on the right).
the hybrid tied-posterior approach is transformed into a standard posterior hybrid recognition system equivalent to the system considered in Ref. 2. This shows that the standard hybrid system is a special case of the proposed tied-posterior recognizer. When using the weights c_in, one has the advantage that the estimation of the posterior probabilities becomes independent of the HMM structure. This means that, in contrast to the standard hybrid approach where the HMM only consists of a single state, it is now possible to use HMMs with more than one state. The difference between those states is then described by different state-dependent vectors of the weights c_in. This opens the opportunity to build multiple-state HMMs in a hybrid recognizer, as used in most standard HMM systems. The second advantage of this approach is that it is now possible to model context-dependent HMMs without changing the neural net used for the probability estimation. This extension is a straightforward adaptation of the context-dependency used in tied-mixture systems. First, all mono-class weights are copied to newly created context-dependent HMMs which are based on that mono-class. Then these context-dependent posterior weights can be retrained using the Baum-Welch algorithm, or the weights can be clustered, which can be done either with a data-driven or a tree-based clustering procedure. The neural network used as a probability estimator in this approach is the same network as would be used in standard hybrid approaches. This
means that these hybrid systems can be transformed into a tied-posterior recognizer without retraining the neural weights. The approach provides the most efficient way to combine the advantages of posterior-based hybrid recognition technology with the advantages of context-dependent modeling. This context-dependent modeling can be done with established tools, such as tree-based clustering, and can be extended to all levels. Additionally, this approach combines discriminative training techniques for the posterior probabilities with further optimization using Maximum Likelihood methods for the HMM parameters. Compared to a traditional tied-mixture system, the tied-posterior approach can exploit the well-known multi-frame technique much more efficiently. Typically, the use of multiple frames in standard (Gaussian) continuous systems does not lead to considerable improvements, and is thus usually implemented by keeping the original (low) dimensionality of the Gaussians and using an LDA-like procedure to merge the multiple frame dimension into the original dimension of one feature vector. However, it is well known that a multiframe input can be directly applied to hybrid systems by presenting the multiple frames directly to the neural net input, which typically leads to huge gains in system performance. This is also the case for our approach. Finally, another advantage of this approach lies in the fact that the system is very compact (and thus uses relatively few parameters) in the following sense: when multiple features, comprising a large feature vector (consisting of e.g. cepstral, delta cepstral and energy features in the case of speech recognition), are used in a tied-mixture system, there are basically two options. The first one is the separation of the features and the use of multiple tied-mixture codebooks, which, however, does not lead to optimal results.
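The multi-frame input mentioned above amounts to stacking 2p+1 neighbouring frames into one large vector before presenting it to the neural net; a minimal sketch (function name and padding scheme hypothetical):

```python
import numpy as np

# Sketch of the multi-frame input mentioned above: 2p+1 neighbouring
# frames are stacked into one large vector for the neural net input
# (edges padded by repeating the first/last frame; names hypothetical).
def stack_frames(X, p=3):
    K, d = X.shape
    padded = np.vstack([np.repeat(X[:1], p, axis=0), X,
                        np.repeat(X[-1:], p, axis=0)])
    return np.hstack([padded[i:i + K] for i in range(2 * p + 1)])

X = np.random.default_rng(5).normal(size=(100, 13))   # 100 frames, 13 dims
X_multi = stack_frames(X)     # shape (100, 7*13), fed to the MLP input layer
```

Only the network's input layer grows with the context width p; the output layer, and hence the number of HMM weights c_in, stays unchanged.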
More advantageous is the second option, where the feature vectors are left concatenated in order to exploit the correlations between the different features. This, however, implies the usage of very large tied-mixture codebooks (ranging from 500 to more than 1000 prototypes) in order to cover the entire feature space of the concatenated feature vector. This is implicitly done in conventional continuous (i.e. untied-mixture) systems using a huge total number of state-individual Gaussians in conjunction with large concatenated feature vectors. In a tied-mixture system, however, the usage of such large codebooks would lead to many HMM parameters, because the number of tied-mixture weights per state is equal to the tied-mixture codebook size. In our approach, we can afford to have the above-mentioned
large concatenated multiple feature vectors as input to our neural network, and we can still limit the size of our output layer to the number of class units present for the recognition task. This is due to the fact that models in this approach are largely stored in the hidden layer of the MLP, whereas the standard (Gaussian) tied-mixture approach is basically equivalent to an RBF neural network architecture. Experiments in the area of continuous speech recognition described in detail in Refs. 28 and 29 have shown that this approach can outperform most other approaches if only context-independent models are used in all cases, and comes close to the best comparable systems if context-dependent modeling is chosen.

4. Conclusion

The purpose of this chapter was mainly the presentation of the most popular hybrid modeling techniques for pattern recognition, and the provision of a unified framework for these approaches, which includes the traditional HMM techniques. Four major hybrid modeling techniques have been identified: the posterior probability approach using MLPs and recurrent neural networks, the MMI neural network approach using neural vector quantizers, the hybrid approach based on a nonlinear discriminant neural feature extraction method, and the tied-posterior approach. Although the various methods seem to be dissimilar, several interesting relationships between them can be found, resulting in common frameworks concerning architectural viewpoints and sometimes also common training issues. Although it is not easy to obtain better results with hybrid systems, it has been shown that this is possible in specific application areas. Independently of this issue, the question of whether hybrid approaches are real alternatives to traditional HMM approaches can be answered positively. Hybrid systems offer a variety of alternative modeling options that can become important depending on each single application.
This includes the possibility of using fewer parameters and the use of mono-class systems in the case of hybrids based on posterior probabilities or tied-posteriors. It also includes fast decoding algorithms for MMINN hybrids due to their discrete structure, or the option of further improving very sophisticated context-dependent systems using neural discriminant feature transformations. It can be expected that the idea of combining HMMs and neural nets for pattern recognition tasks will be further investigated and perfected, and will lead to even more sophisticated hybrid systems in the future.
Combination of Hidden Markov Models and Neural Networks
141
Acknowledgments

The author would like to thank all his colleagues who have actively contributed to the development of hybrid systems at Gerhard-Mercator-University Duisburg. He is especially grateful for the contributions of Ch. Neukirchen, J. Rottland and D. Willett to the use of hybrid systems in speech recognition, as well as to A. Kosmala and A. Brakensiek for introducing hybrid technology to handwriting recognition. F. Wallhoff has been investigating the use of hybrid pattern recognition techniques for face recognition. J. Stadermann has been mainly working on the tied-posterior approach for speech recognition, which has also been evaluated for handwriting recognition by A. Brakensiek.

References

1. L. Bahl, P. Brown, P. DeSouza, R. Mercer: "Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition", Proc. IEEE-ICASSP, Tokyo, 1986, pp. 49-52.
2. H. Bourlard, N. Morgan: Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, 1994.
3. H. Bourlard, C. J. Wellekens: "Links Between Markov Models and Multilayer Perceptrons", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 12, Dec. 1990, pp. 1167-1178.
4. A. Brakensiek, A. Kosmala, D. Willett, W. Wang, and G. Rigoll: "Performance Evaluation of a New Hybrid Modeling Technique for Handwriting Recognition Using Identical On-Line and Off-Line Data", 5th Int. Conference on Document Analysis and Recognition (ICDAR), Bangalore, India, 1999, pp. 446-449.
5. G. Cook, D. Kershaw, J. Christie, C. Seymour, S. Waterhouse: "Transcription of Broadcast Television and Radio News: The 1996 Abbot System", Proc. IEEE-ICASSP, Munich, 1997, pp. 723-726.
6. J. Fritsch, M. Finke: "ACID/HNN: Clustering Hierarchies of Neural Networks for Context-Dependent Connectionist Acoustic Modeling", Proc. IEEE-ICASSP, Seattle, 1998, pp. 505-508.
7. X. D. Huang and M. A. Jack: "Semi-Continuous Hidden Markov Models for Speech Signals", Computer Speech and Language, Vol. 3, No. 3, May 1989, pp. 239-251.
8. H. P. Hutter: "Comparison of a New Hybrid Connectionist-SCHMM Approach with Other Hybrid Approaches for Speech Recognition", Proc. ICASSP, 1995, pp. 3311-3314.
9. F. Jelinek: "Speech Recognition by Statistical Methods", Proc. IEEE, Vol. 64, No. 4, April 1976, pp. 532-556.
10. D. Johnson: ftp://ftp.icsi.berkeley.edu/pub/real/davidj/quicknetv0_96.tar.gz
G. Rigoll
11. D. J. Kershaw, M. Hochberg, A. J. Robinson: "Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System", Advances in Neural Information Processing Systems 8 (NIPS'95), 1996, pp. 750-756.
12. R. P. Lippmann: "An Introduction to Computing with Neural Nets", IEEE ASSP Magazine, April 1987, pp. 4-22.
13. S.-W. Lee: Advances in Handwriting Recognition, World Scientific Publishing, Series in Machine Perception and Artificial Intelligence, Vol. 34, 1999.
14. Ch. Neukirchen, G. Rigoll: "Advanced Training Methods and New Network Topologies for Hybrid MMI-Connectionist/HMM Speech Recognition Systems", Proc. IEEE-ICASSP, Munich, 1997, pp. 3257-3260.
15. Ch. Neukirchen, D. Willett, S. Eickeler, S. Müller: "Exploiting Acoustic Feature Correlations by Joint Neural Vector Quantizer Design in a Discrete HMM System", Proc. IEEE-ICASSP, Seattle, 1998, pp. 5-8.
16. D. S. Pallett, J. G. Fiscus, W. M. Fisher, J. S. Garofolo, B. A. Lund, M. A. Przybocki: "1993 Benchmark Tests for the ARPA Spoken Language Program", Proc. of the Human Language Technology Workshop, Plainsboro, New Jersey, March 1994, pp. 49-74.
17. D. B. Paul and J. M. Baker: "The Design for the Wall Street Journal-based CSR Corpus", Proc. of the DARPA Speech and Natural Language Workshop, Pacific Grove, CA, 1992, Morgan Kaufmann, pp. 357-362.
18. L. R. Rabiner, B. H. Juang: "An Introduction to Hidden Markov Models", IEEE ASSP Magazine, 1986, pp. 4-16.
19. S. Renals, M. Hochberg: "Efficient Search Using Posterior Phone Probability Estimates", Proc. IEEE-ICASSP, Detroit, 1995, pp. 596-599.
20. G. Rigoll: "Maximum Mutual Information Neural Networks for Hybrid Connectionist-HMM Speech Recognition Systems", IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 1, Special Issue on Neural Networks for Speech Processing, January 1994, pp. 175-184.
21. G. Rigoll, Ch. Neukirchen, J. Rottland: "A New Hybrid System Based on MMI-Neural Networks for the RM Speech Recognition Task", Proc. IEEE-ICASSP, Atlanta, 1996, pp. 865-868.
22. G. Rigoll, Ch. Neukirchen: "A New Approach to Hybrid HMM/ANN Speech Recognition Using Mutual Information Neural Networks", Advances in Neural Information Processing Systems 9 (NIPS'96), 1997, pp. 772-778.
23. G. Rigoll, D. Willett: "A NN/HMM Hybrid for Continuous Speech Recognition with a Discriminant Nonlinear Feature Extraction", Proc. IEEE-ICASSP, Seattle, 1998, pp. 9-12.
24. G. Rigoll, A. Kosmala, D. Willett: "An Investigation of Context-Dependent and Hybrid Modeling Techniques for Very Large Vocabulary On-Line Cursive Handwriting Recognition", Proc. 6th Int. Workshop on Frontiers in Handwriting Recognition (IWFHR), Taejon, Korea, 1998, pp. 429-438.
25. G. Rigoll: "Hybrid Speech Recognition Systems: A Real Alternative to Traditional Approaches?", Survey Lecture, Proc. International Workshop Speech and Computer (SPECOM'98), St. Petersburg, Russia, 1998, pp. 33-42.
26. A. J. Robinson: "An Application of Recurrent Nets to Phone Probability Estimation", IEEE Trans. on Neural Networks, Vol. 5, No. 2, 1994, pp. 298-305.
27. J. Rottland, Ch. Neukirchen, D. Willett: "Performance of Hybrid MMI-Connectionist/HMM Systems on the WSJ Database", Proc. IEEE-ICASSP, Munich, 1997, pp. 1747-1750.
28. J. Rottland, G. Rigoll: "Tied Posteriors: An Approach for Effective Introduction of Context-Dependency in Hybrid NN/HMM LVCSR", Proc. IEEE-ICASSP, Istanbul, 2000, pp. 1241-1244.
29. J. Stadermann, J. Rottland, G. Rigoll: "Tied Posteriors: A New Hybrid Speech Recognition Technology with Generic Capabilities and High Portability", Proc. ISCA Workshop ASR2000, Paris, 2000, pp. 24-28.
30. J. Wawrzynek, K. Asanovic, B. Kingsbury, D. Johnson, J. Beck, and N. Morgan: "Spert-II: A Vector Microprocessor System", Computer, 29(3):79-86, March 1996.
31. D. Willett, G. Rigoll: "Hybrid NN/HMM-Based Speech Recognition with a Discriminant Neural Feature Extraction", Advances in Neural Information Processing Systems 10 (NIPS'97), 1998, pp. 763-769.
32. D. Willett, Ch. Neukirchen, G. Rigoll: "Efficient Search with Posterior Probability Estimates in HMM-Based Speech Recognition", Proc. IEEE-ICASSP, Seattle, 1998, pp. 821-82.
33. G. Williams, S. Renals: "Confidence Measures for Hybrid HMM/ANN Speech Recognition", Proc. Eurospeech, Rhodes, 1997, pp. 1955-1958.
CHAPTER 6

FROM CHARACTER TO SENTENCES: A HYBRID NEURO-MARKOVIAN SYSTEM FOR ON-LINE HANDWRITING RECOGNITION*

T. Artieres, P. Gallinari, H. Li and S. Marukatat
LIP6, Universite Paris 6
8, rue du Capitaine Scott, 75015 Paris, France
E-mail: {Thierry.Artieres, Patrick.Gallinari, Haifeng.Li, Sanparith.Marukatat}@lip6.fr
B. Dorizzi
EPH, Institut National des Telecommunications
9 rue Charles Fourier, 91011 Evry, France
E-mail: [email protected]

In this chapter, we present a hybrid on-line handwriting recognition system based on Hidden Markov Models (HMMs) and Neural Networks (NNs). The system has been designed to recognize both isolated words and sentences. It is organized into three layers: letter models, word models and sentence models. Letter models are left-right HMMs in which the emission density of each state is approximated by a mixture of predictive multilayer perceptrons (MLPs). These perform local regression on the handwriting signal, and thus allow us to incorporate contextual information. At the word and sentence level, recognition is based on a tree-structured dictionary. The proposed tree search and pruning techniques reduce the search space considerably without losing recognition accuracy. The system's performance is evaluated on part of the UNIPEN international database. This chapter summarizes previous work done in our group on this topic and describes new results concerning the tree search structure for word and sentence recognition and the extension of the system to sentence recognition.
*Part of the work presented here was done under grant number 001B283 with France Telecom R&D. 145
1. Introduction

The development of new interface modalities is greatly accelerating due to the spread of mobile and portable devices (phones, personal digital assistants, electronic books, electronic tablets, etc.). In this framework, electronic pens drawing on tactile screens or tablets offer an attractive alternative to the traditional mouse and keyboard. These systems are equipped with sensors that allow the pen movements to be captured, buffered and analyzed. For the moment, the functionalities of pen readers are still very limited. It is therefore particularly interesting to develop new systems able to recognize natural handwriting, without demanding that the user learn a new writing style. In this work, we present a new system for handwritten sentence recognition, which is a first step towards the interpretation of note taking on an electronic tablet. Our objective is the recognition of words or sentences in a writer-independent framework with large vocabularies. To handle large vocabularies, we have developed Hidden Markov Models (HMMs) that model the input signal at the letter level. We have then built a cascade system where sentence models are composed of word models, which themselves derive from letter models. HMMs have emerged in the last decade as one of the most powerful techniques for building speech or handwriting recognition systems. More recently, hybrid systems have been proposed to overcome some of the limitations of traditional HMM models. Most often, these systems combine HMMs and neural networks (NNs). The latter allow us to explicitly introduce low-level context into the letter models. Different hybrids have already been used in handwriting recognition (HWR).3,11,19,21,34,35 In most systems, NNs compute the posterior probability that a portion of a word belongs to a certain letter class. The HMM then processes likelihoods, which are deduced from the posterior probabilities via Bayes' rule.
Our system is also a hybrid HMM/NN, but in it NNs are used as predictors for modeling the signal dynamics, extending linear auto-regressive HMMs to a non-linear framework. In addition, we have investigated the use of mixtures of predictive NNs to account more efficiently for handwriting variability due to different writing styles (script, cursive, mixed, etc.) and individual writers. Considering the difficulties encountered in the recognition of cursive handwriting, where ambiguities at the letter level are frequent, recognition
at the word level is driven by a dictionary implemented as a tree structure. At the sentence level, a decoding algorithm is used that simultaneously segments the sentence into words and performs word recognition. This algorithm processes a sentence in two steps. In the first step, it builds a word graph representing, in compact form, the most probable word sequences. In the second step, this word graph is processed using a language model to determine the decoded sentence. This chapter is organized as follows. In the first part, we present in detail an isolated word recognition system which is an extension of our older system REMUS.6,7 The improvements concern the preprocessing step, where baseline detection has been added; HMM modeling, where state models have been improved by considering a mixture of predictive multi-layer perceptrons (MLPs); and word recognition, where we have integrated a dictionary-driven search procedure using a tree-organized lexicon. In the second part we show how to perform sentence recognition as an extension of the isolated word recognition system. Segmentation of the sentence into words and word recognition are performed simultaneously. We first introduce the language constraints that are integrated into the system. These constraints improve the global recognition performance. We then describe the decoding algorithm. Finally, results are presented on datasets corresponding to parts of the UNIPEN database.15

2. Preprocessing

The preprocessing step transforms the signal into the most appropriate form for the recognition system. The handwriting signal is captured on a digitizing tablet that samples the pen trajectory at regular time intervals. The raw signal is thus a sequence of points. The preprocessing of this sampled signal includes a succession of processes, some of which are detailed in the following subsections. A smooth interpolating filter first filters the signal.
Then baseline detection is performed, which accomplishes two goals: it permits the normalization of the letter height in the signal and the detection of its general orientation, so that the signal may be unslanted and scaled to a standard size. These operations make the recognition phase more robust. Then a simple detection of diacritical marks is performed; the signal is normalized, unslanted and spatially resampled. Finally, feature extraction is performed at each point of the signal, so that a handwritten signal is finally encoded as a sequence of multi-dimensional vectors (subsequently called frames).
2.1. Baseline Detection, Normalization and Resampling
We use the baseline detection algorithm proposed by Bengio and LeCun,2 which attempts to identify four baselines corresponding to the core of the word and its lower and upper extensions. The four baselines are assumed to be polynomials of degree two. Although this assumption is restrictive and needs to be adapted for long sentences, it provides good results for sentences whose words are well aligned, as is most often the case in the UNIPEN database. Therefore, we use this modeling in our system. The parameters of the four baselines are estimated through an EM (Expectation-Maximization) procedure. An example of baseline detection is given in Fig. 1. Once the baselines have been detected, the signal is deskewed and a scale factor is defined so that the distance between the two center baselines corresponds to a certain constant, say H. Using this scale factor, the signal is spatially resampled so that the distance between successive points of the resampled signal is H/4. An effect of this resampling is to eliminate the effects of writing speed, which is commonly considered too variable to be used efficiently in a recognition system.7
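As an illustration of the resampling step, the following sketch resamples a pen trajectory at equal arc-length intervals of H/4. The function name and the linear-interpolation scheme are our own illustrative choices, not the chapter's implementation.

```python
import numpy as np

def resample_equidistant(points, step):
    """Resample a pen trajectory so consecutive points are `step` apart.

    points: (N, 2) array of (x, y) pen positions after scaling.
    Returns an (M, 2) array with (approximately) equal arc-length spacing.
    """
    points = np.asarray(points, dtype=float)
    # Cumulative arc length along the polyline.
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arclen = np.concatenate([[0.0], np.cumsum(seg)])
    # Target positions every `step` units of arc length.
    targets = np.arange(0.0, arclen[-1] + 1e-9, step)
    xs = np.interp(targets, arclen, points[:, 0])
    ys = np.interp(targets, arclen, points[:, 1])
    return np.stack([xs, ys], axis=1)

# Toy example: core height H = 20 units, so the resampling step is H/4 = 5.
H = 20.0
traj = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0]])
res = resample_equidistant(traj, H / 4.0)
```

After resampling, the spacing between successive points no longer depends on how fast the pen moved, which is the speed-normalization effect mentioned above.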
Fig. 1. Baseline detection on a UNIPEN sentence.

2.2. Feature Extraction
Although a handwritten word could be represented only by its corresponding sequence of points obtained after baseline detection and resampling, a richer representation leads to much better experimental results. Thus, at each point of the sequence, features are extracted, resulting in a frame of 15 components. Our feature extraction is similar to the one proposed by Guyon et al.13 We use six temporal and nine spatial features. The temporal features are the displacement along the x-axis, the absolute ordinate, the cosine and sine of the angle between the tangent to the pen trajectory and the x-axis, and the cosine and sine of the angle between the local curvature of the pen trajectory and the x-axis. These two angles are approximated using the coordinates of neighboring points. The spatial features correspond to
the grey levels of a bitmap centered on the point considered. This takes into account more global spatial information, including the presence of diacritical marks.20

3. Hybrid HMM-NN Modeling

In this section we present our hybrid system in more detail. The overall architecture of the system is first described. Then we show how predictive neural networks (PNNs) and mixtures of predictive neural networks may be used instead of the traditional Gaussian mixture probability density estimators.

3.1. HMM System Overview
In our system the handwritten signal is modeled at the letter level. This allows us to handle arbitrarily large vocabularies using a fixed number of models, one for each symbol in the alphabet. In this study we only consider lower-case words and sentences; our systems are thus based on 26 letter models. Each letter model is a left-right (Bakis) HMM, as illustrated in Fig. 2. The first state is the initial state, and the last state is the final state. The only authorized transitions are from one state to itself or to the next state. All transitions are assumed to be equally likely, for two reasons. First, it is well known that transition probabilities do not greatly improve the performance of an HMM system, because these probabilities are much less variable than the emission probabilities. Consequently, the most important point is to determine the allowed and forbidden transitions of the HMM architecture. Second, classical transition probability modeling leads to power law behavior with respect to the duration in a state, which is clearly unrealistic. The utility of such a left-right constrained architecture is that it forces the states to model successive parts of a letter, in the order they appear in the signal. In practice, the number of states of each letter model has been empirically chosen to be seven. Based on these letter models, a word model may be built by concatenating the HMMs corresponding to its letters, the final state of a letter model being connected to the initial state of the next letter's model. In a handwriting signal, especially in cursive handwriting, the drawing of a letter may vary depending on the previous and next letters. This is closely analogous to the coarticulation phenomenon in speech. To improve modeling accuracy
Fig. 2. Bakis architecture of an HMM letter model.
some authors have proposed modeling this phenomenon through ligature models5 (models of the transitions between letters) or context-dependent letter models, following work in speech processing. We did not use such techniques, but we observed that the initial and final states of the letter models automatically focus on such transition parts of the handwriting signal.

3.2. Predictive Neural Networks

In classical HMMs, probability densities are modeled using Gaussian or mixture-of-Gaussian models. These models are simple from a learning and computational point of view, and allow the use of very efficient training algorithms. However, this modeling is based on strong hypotheses, mainly the assumption of independence between successive frames. For several years, our team has been interested in the development of hybrid systems combining the temporal modeling of HMMs and the approximation power of neural networks. We have thus proposed new hybrid HMM/NN systems where emission probability densities are approximated using predictive multi-layer perceptrons, allowing us to relax the independence assumption of classical HMMs. Such systems have been applied to the modeling and classification of speech and handwriting signals.7,8,9,26,27 The main advantage of this approach lies in the natural modeling of the signal dynamics through the use of predictive models. Moreover, since MLPs are universal approximators, these predictive models are much more powerful than linear models.1 We now describe how MLPs may be used as emission probability density estimators. The main idea in using predictive models lies in the assumption that a frame depends on a number of preceding frames. That is, the multi-dimensional signal (the sequence of frames) is an auto-regressive process, where successive frames produced in a state obey the following law:

o_t = F_s(o_{t-1}, ..., o_{t-p}) + ε_t    (1)
where o_t is the t-th frame of the sequence, p is the model order, F_s is the prediction function and ε_t is an independently and identically distributed residual. This hypothesis means that:

p_s(o_t | o_{t-1}, ..., o_{t-p}) = p_ε(ε_t)    (2)

where p_ε denotes the residual probability distribution. If we approximate the prediction function F_s by a neural network, we can then approximate p_s(o_t | o_{t-1}, ..., o_{t-p}) by p_ε(o_t − F_NN(o_{t-1}, ..., o_{t-p})). The predictive neural network (PNN) is an MLP that implements this relationship. Two points remain: the choice of p_ε, and the learning algorithm for the PNN; the latter is discussed in Section 4. For simplicity, in the system at hand we have assumed that ε_t is white noise, although more sophisticated assumptions may be used.1 With this white noise assumption:

−log p_s(o_t | o_{t-1}, ..., o_{t-p}) ≈ ½ ||o_t − F_NN(o_{t-1}, ..., o_{t-p})||².    (3)
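To make Equations (1)-(3) concrete, here is a minimal sketch of a predictive network and its frame score for model order p = 1. The architecture sizes follow the chapter's setup (15 inputs, 15 outputs, a small hidden layer), but the random weights are illustrative stand-ins; an actual system would train the MLP as described in Section 4.

```python
import numpy as np

rng = np.random.default_rng(0)

class PredictiveMLP:
    """Tiny MLP F_s: predicts frame o_t from the previous frame o_{t-1}."""
    def __init__(self, dim=15, hidden=8):
        # Random stand-in weights, NOT a trained model.
        self.W1 = rng.normal(0.0, 0.1, (hidden, dim))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (dim, hidden))
        self.b2 = np.zeros(dim)

    def predict(self, o_prev):
        h = np.tanh(self.W1 @ o_prev + self.b1)
        return self.W2 @ h + self.b2

def frame_score(net, o_t, o_prev):
    """Minus log-likelihood of o_t up to an additive constant, Eq. (3)."""
    resid = o_t - net.predict(o_prev)
    return 0.5 * float(resid @ resid)

net = PredictiveMLP()
o_prev, o_t = rng.normal(size=15), rng.normal(size=15)
s = frame_score(net, o_t, o_prev)
```

Under the white-noise assumption, summing `frame_score` over a state sequence yields the sequence score used by the Viterbi decoding discussed later.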
Using this probabilistic interpretation of the residual, the minus log-likelihood of a sequence of frames may then be approximated by the sum of predictive quadratic residuals.1 In our experiments, we tested different context sizes, together with different context types (past frames, future frames, or mixed past and future frames). Although the various contexts produced noticeable changes in performance, an efficient and economical model is to consider only the preceding frame. In the systems presented in this chapter, the PNNs have 15 input units corresponding to the contextual frames, 15 output units for the approximated next frame, and a limited number of hidden units (5 to 10).

3.3. Mixture of Predictive Neural Networks

To handle the variability of the handwritten signal arising from different writing styles and multiple writers, we used mixture models where the emission probability density associated with a state is assumed to follow:

p_s(o_t | o_{t-1}, ..., o_{t-p}) = Σ_{i=1}^{K} w_{s,i} · p_{s,i}(o_t | o_{t-1}, ..., o_{t-p})    (4)

where the w_{s,i} are the weighting coefficients of the K mixture components. Using K PNNs (F^i_NN)_{i=1,...,K} to approximate the K mixture components as
discussed in the preceding section, we may approximate:

p_s(o_t | o_{t-1}, ..., o_{t-p}) ≈ Σ_{i=1}^{K} w_{s,i} · e^{−½ ||o_t − F^i_NN(o_{t-1}, ..., o_{t-p})||²}    (5)

As a simplification, and because this does not lead to decreased performance in our experiments, we replaced the summation in Equation (5) by a maximization and assumed uniform weights. Thus

−log p_s(o_t | o_{t-1}, ..., o_{t-p}) ≈ min_i (||o_t − F^i_NN(o_{t-1}, ..., o_{t-p})||²).    (6)
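Equation (6) can be sketched as follows. The K linear "predictors" below are random stand-ins for trained PNNs, used only to show the min-over-components scoring.

```python
import numpy as np

rng = np.random.default_rng(1)

def state_score(predictors, o_t, o_prev):
    """Eq. (6): state score = min over the K predictors of the squared residual."""
    residuals = [o_t - f(o_prev) for f in predictors]
    return min(float(r @ r) for r in residuals)

# Stand-in "networks": K = 3 linear predictors with random weights
# (a real system would use K trained predictive MLPs per state).
K, dim = 3, 15
mats = [rng.normal(0.0, 0.1, (dim, dim)) for _ in range(K)]
predictors = [lambda o, A=A: A @ o for A in mats]

o_prev, o_t = rng.normal(size=dim), rng.normal(size=dim)
score = state_score(predictors, o_t, o_prev)
```

Taking the minimum rather than the weighted sum means each frame is scored by the single mixture component that predicts it best, which matches the winner-take-all training described in Section 4.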
4. Training

The system (for isolated word recognition as for sentence recognition) is trained on isolated words without segmentation labels: the word labels are known, but not their segmentation into letters. The training phase of the HMM system aims at learning the parameters of the system (the weights of the PNNs) as well as the segmentation of words into letters and of letters into states. This is the classical setting in which the HMM framework is interesting. As is usual for HMM systems, training consists of iterating, over the whole training set, two steps corresponding to the EM (Expectation-Maximization) steps. First, words are segmented into letters and states using the current HMM system parameters. Second, the system parameters are re-estimated based on the word segmentations found in the first step. Two main algorithms may be used: the classical Baum-Welch algorithm, or the simpler and less computationally intensive Segmental K-Means algorithm.18 We used the second algorithm, in which the re-estimation of the parameters is performed after the presentation and segmentation of each word of the training set, as detailed in the algorithm below. It is thus a kind of stochastic version of the Segmental K-Means algorithm, which allows faster convergence of the NN weights. This is a well-known advantage of stochastic over batch gradient learning algorithms for NNs. The training algorithm is run until a stopping criterion is satisfied. We use two stopping criteria: the first is a maximum number of iterations; the second is the ratio between the mean quadratic residual at the present iteration and the mean quadratic residual at the preceding iteration. Here is a sketch of the training algorithm:

• Iterate until one of the stopping criteria is satisfied
— For all words w of the training set
  * Build the word model M(w) by concatenation of its letter models.
  * Determine the optimal segmentation into states S*_1^T = s*_1, s*_2, ..., s*_T of the signal O_1^T = o_1, ..., o_T with a Viterbi algorithm:

    S*_1^T = s*_1, s*_2, ..., s*_T = arg max_{S_1^T} p(O_1^T, S_1^T | M(w))    (7)

    Let c_t = o_{t-1}, ..., o_{t-p} be the contextual information for frame o_t. This segmentation associates each pattern (o_t, c_t) with a state of the HMM M(w). A pattern consists of a frame together with its contextual information.
  * Perform re-estimation of the parameters of each state s of the HMM M(w) using all the patterns associated with it. The re-estimation of the parameters in a state s is done in a winner-take-all manner. All PNNs in the mixture of the state s are put in competition. For each pattern associated with s, the best PNN is determined. Then, each PNN in the mixture is re-estimated using the patterns associated with it. This is done by minimizing the quadratic error ||o_t − F_NN(c_t)||² using gradient back-propagation. This quadratic criterion is justified since, as illustrated by Equations (3) and (6), minimizing the quadratic error corresponds to maximizing the conditional probability. Ideally, using this training strategy, a PNN learns to output the conditional expectation of a frame given its preceding frames, and thus implements a time-varying mean process.

The above procedure cannot work efficiently without an appropriate initialization of the NN weights. We therefore initialize the system with the following two-step procedure. In the first step, the patterns (o_t, c_t) of all words in the training set are assigned to a particular state of the 26 letter models. This is done as follows. For each word in the training database, the word model is built, then the sequence of observations is linearly segmented, which means that a constant number of patterns is associated with each state of the letter models of the word. After the segmentation of all the words in the training set, a set of patterns is associated with each state s of each of our 26 letter models. Then, in the second step, for each state, a clustering method (the K-Means algorithm) is run on the set of patterns associated with it to determine K clusters.
Finally, each of the K PNNs in the mixture associated with s is trained independently on one partition of this set.
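The segmentation step of the training loop (Equation (7)) can be sketched as a Viterbi alignment over a left-right model with equally likely transitions. The per-state frame scores below are a hypothetical toy input; in the actual system they would come from the PNN residuals of Equation (6).

```python
import numpy as np

def viterbi_segment(scores):
    """Align T frames to S left-right states (self-loop or advance only).

    scores[t, s] = minus log-likelihood of frame t in state s.
    Returns the state index of each frame on the best path.
    """
    T, S = scores.shape
    INF = np.inf
    D = np.full((T, S), INF)       # cumulated path scores
    back = np.zeros((T, S), dtype=int)
    D[0, 0] = scores[0, 0]         # must start in the initial state
    for t in range(1, T):
        for s in range(S):
            stay = D[t - 1, s]
            move = D[t - 1, s - 1] if s > 0 else INF
            if move < stay:
                D[t, s] = move + scores[t, s]
                back[t, s] = s - 1
            else:
                D[t, s] = stay + scores[t, s]
                back[t, s] = s
    # Backtrack from the final state.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Toy alignment: 6 frames, 3 states; low scores mark the "true" segmentation.
sc = np.full((6, 3), 5.0)
for t, s in enumerate([0, 0, 1, 1, 2, 2]):
    sc[t, s] = 0.1
seg = viterbi_segment(sc)
```

The resulting per-frame state labels are exactly what the winner-take-all re-estimation step consumes: each (o_t, c_t) pattern is handed to the state the path assigns it to.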
5. Lexicon-Driven Isolated Word Recognition

In the recognition phase, if no lexical constraint is considered, a global HMM is built by connecting the final states of the letter models to the initial states of all letter models. Then, the Viterbi decoding algorithm is used to determine the optimal path in the global HMM, which maximizes the likelihood of the input signal. The solution of the decoding step is thus a string w (a sequence of letters) corresponding to:

w* = arg max_{w ∈ V*} p(w | O_1^T) = arg max_{w ∈ V*} p(O_1^T | w)    (8)

where V* stands for all sequences of letters, and all strings w are equally likely since no lexical constraint is considered. In our system, where PNNs are trained to minimize the prediction error, i.e. the quadratic residuals, the above equation is implemented via:

w* = arg min_{w ∈ V*} [−log p(O_1^T | w)].    (9)

In the following, we will use the term score for the minus log-likelihood. It should be noted that this decoding procedure does not take into account the existence of the decoded string in the language, although this information is essential to achieve good recognition results. To illustrate this point, we observed in our experiments that without such knowledge, a recognizer having a character recognition rate of about 80% recognizes only 10 to 20% of words without any character error (substitution, deletion or insertion). Introducing a lexical constraint consists of decoding only words that respect the constraints of the language considered. This corresponds to recognizing sequences of letters while taking into consideration the a priori probabilities of such sequences in the language:

w* = arg max_{w} (p(O_1^T | w) p(w)).    (10)

There are several ways to incorporate such lexical constraints. Soft constraints can be considered through the use of probabilities on successions of characters (bigrams or trigrams of letters). The use of a dictionary of authorized words allows the introduction of hard constraints. In this case, all words in the dictionary are considered equally likely and all other words are considered impossible. In this work, we used the second approach, which has already been used, for example, in the speech recognition field with very
large dictionaries of 100k words or more. Let D be the dictionary of authorized words. Our decoding algorithm determines the word:

w* = arg max_{w ∈ D} p(O_1^T | w).    (11)
Using a dictionary leads to a dramatic increase in the cost of the decoding algorithm, since all words in the dictionary have to be considered and their probabilities computed. In order to limit the computational cost we used two optimization techniques: the organization of the dictionary in prefix form, and a frame-synchronous beam search strategy. The first optimization consists of organizing the dictionary so as to factorize computations common to different words, namely those sharing the same prefix. In such a lexical tree, a node corresponds to a letter (i.e. a seven-state HMM in our system) arising at a precise place in the prefix of one or more words of D. With this organization, the computation of the probabilities of two words such as 'art' and 'arc' is partially shared, because these two words have the same prefix 'ar' (see Fig. 3). Such a tree organization allows the search effort to be reduced by a factor of about 2 to 3 (depending on the dictionary) compared with a flat dictionary where all the words are scored independently.28
Fig. 3. Tree organization of the dictionary.
However, this search space is still too large to be explored exhaustively. We therefore use a frame-synchronous beam search strategy.22,28 Note that there are other approaches to avoiding exploration of the whole search space. An interesting alternative is to use frame-asynchronous search algorithms
such as A*, which are also used in continuous speech recognition.12,17 However, these algorithms generally involve a forward-backward procedure (the decoding cannot begin before the handwritten signal is complete), which is less appropriate for recognizing arbitrarily long sentences, as we wish to do later in this chapter. We now describe the frame-synchronous beam search. The sequence of frames is processed in time order, and all word hypotheses are propagated in parallel for each incoming frame. Since all hypotheses cover the same portion of the input signal, their scores can be compared directly. This allows a data-driven search to be performed, one which focuses on those hypotheses that are more likely to lead to the best state sequence, instead of an exhaustive search. This is made possible by a list organization of the search space. We introduce the notion of an active node, which is a node of the lexical tree corresponding to one of the current hypotheses.22,29 This list is initialized with the root node of the lexical tree and is updated for each incoming frame using an extension-pruning procedure. In the extension step, all nodes in the active node list and all their successor nodes are processed with a Viterbi step, i.e. new scores are computed based on the cumulated scores of these hypotheses and the local scores for the frame considered. Then a pruning step is performed on all these hypotheses. The best score among all the hypotheses is determined, and all hypotheses whose scores exceed this optimal score by more than a fixed value (called the beam) are removed from further consideration. This leads to a new active node list. This procedure is iterated for each frame of the sequence. The implementation of the beam search allows the search space to be built dynamically.
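The pruning step described above can be sketched as follows, assuming hypotheses are kept as a mapping from active node to cumulated score; the node identifiers and scores below are hypothetical.

```python
def beam_prune(hypotheses, beam):
    """One pruning step of the frame-synchronous beam search.

    hypotheses: dict mapping an active node id to its cumulated score
    (minus log-likelihood, lower is better). All hypotheses whose score
    exceeds the best score by more than `beam` are dropped.
    """
    best = min(hypotheses.values())
    return {node: s for node, s in hypotheses.items() if s - best <= beam}

# Hypothetical active list after a Viterbi extension step.
active = {"a": 10.0, "ar": 10.5, "an": 14.0, "b": 25.0}
pruned = beam_prune(active, beam=5.0)
```

Since all hypotheses cover the same prefix of the frame sequence, this score comparison is fair; widening the beam trades decoding time for a lower risk of pruning the correct word.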
Furthermore, by limiting the number of active nodes to a prespecified number, the computational cost of the search can be made independent of the overall size of the potential search space. An essential property of the search algorithm is that it produces a best word hypothesis together with a list of alternative words whose scores are close to the best score. This is essential in a user-oriented approach, where the system must offer alternatives in case of recognition errors.

6. Sentence Recognition

Now we focus on the decoding of sequences of words, each belonging to a given dictionary. Two main approaches may be employed to extend isolated word recognition to sentence, or more generally text, recognition: a two-step
A Hybrid Neuro-Markovian System for On-Line Handwriting Recognition
system composed of a segmentation module followed by a recognition module, or a one-step system where the two tasks are carried out simultaneously. In the first approach, the sentence signal is segmented into isolated words. Then an isolated word recognizer is run on each of the words, producing a list of word hypotheses for each word in the sentence. Finally, all these results are combined to determine the recognized text, using additional high-level knowledge such as a language model.14,24,37 The second approach consists of segmenting and recognizing words simultaneously.25,38 The main advantage of the first approach is its algorithmic simplicity; its main drawback is the classical weakness of cascade systems, where an error in a module is propagated to the following module and cannot be recovered. The second approach overcomes this error-propagation problem at the price of increased algorithmic complexity. It is clear from this discussion that an efficient decoding algorithm has to be a combination of these two approaches. A pre-segmentation module should be used whenever possible, to segment unambiguously separated parts of the text. To perform this pre-segmentation of a sentence, one can use techniques developed over many years in off-line document processing, which could be improved by taking into account the additional information in the on-line signal.32 Then, an integrated segmentation-recognition module could be used on the identified words or sequences of words. Finally, a postprocessor would exploit all these results, together with higher-level knowledge of the language, to output the decoded text. We are interested here in the development of the integrated segmentation and recognition module, which is, from our point of view, the most difficult part. In the following, we present the extension of our system to sentence recognition.
This extension includes two main features, which we describe successively: the integration of a language model and the extension of the decoding algorithm to sequences of words.

6.1. Language Model
At the sentence level, as at the word level, the sequences of words obey language constraints (grammatical constraints, semantic constraints, etc.). This knowledge must be integrated in the recognition process to achieve good performance.14,31,37 The aim of the decoding process is to determine the number of words N and the sequence of N words w_1, ..., w_N that maximizes the joint probability p(o_1, ..., o_T, w_1, ..., w_N). This probability
may be rewritten as

p(o_1, ..., o_T, w_1, ..., w_N) = p(o_1, ..., o_T | w_1, ..., w_N) p(w_1, ..., w_N)    (12)

where the term p(w_1, ..., w_N) is the prior probability of the sequence of words w_1, ..., w_N, which is independent of the observations and is estimated with high-level knowledge sources, i.e. the language model. This knowledge cannot be modeled with a dictionary of allowed sentences except in very specific applications. Most often, a language model is implemented via m-grams of words. The idea of using m-grams is to simplify the computation of the probabilities p(w_1, ..., w_N) by assuming that the probability of observing a word given its history depends only on the m-1 previous words and is independent of earlier words. The prior probability of a sequence of words may be expressed as

p(w_1, ..., w_N) = p(w_1) \prod_{i=2}^{N} p(w_i | w_1, ..., w_{i-1}).    (13)

Using the m-gram assumption, this probability may be rewritten as

p(w_1, ..., w_N) = p(w_1, ..., w_{m-1}) \prod_{i=m}^{N} p(w_i | w_{i-m+1}, ..., w_{i-1}).    (14)

Such language models (m-gram probabilities) are estimated on large corpora of text. In our system, we use a word-bigram language model, which means that the probability of a word given its history depends only on the preceding word. Thus:

p(w_1, ..., w_N) = p(w_1) \prod_{i=2}^{N} p(w_i | w_{i-1}).    (15)
Several algorithms exist to compute these bigrams from a large corpus. 17 The simplest way to do this is to compute relative frequencies of bigrams of words in a large corpus of texts. However, one must take into account that many bigrams may occur rarely or not at all in the corpus. Thus smoothing and/or discounting techniques must be used. In our present work, we use CMU's statistical language modeling toolkit 4 to estimate the bigrams from the corpus. CMU's toolkit provides several discounting strategies. Following the work from Marti and Bunke, 23 we choose to use the discounting strategy that leads to the lowest perplexity.
6.2. Decoding Algorithm
The extension of the decoding algorithm to a sequence of words is not straightforward. A number of alternatives exist, which differ in their computational cost, their ability to output alternative sentences, the possibility and ease of integrating a language model, etc.28 The potential search space is represented in Fig. 4(a). The main idea of this extension is to use copies of the lexical tree (the dictionary in prefix form) and to allow a transition from any terminal node of a tree (the end of a word) to the root of a new tree copy (the beginning of a new word). The tree in Fig. 4(a) is a prefix tree at the word level, where one can see copies of a lexical tree with three leaves corresponding to the three words in the dictionary, A, B and C. This structure is a direct generalization, at the word level, of the lexical tree at the letter level. Hence, it is clear that the extension to the sentence level entails a huge search space, since we wish to recognize arbitrarily long sentences. An exhaustive search is clearly unrealistic, and computations must be further factorized. Using only one lexical tree copy, with each terminal node connected to the root, would allow the decoding of sentences. However, it would not
Fig. 4. Tree-organized search space for sentence decoding using lexical tree copies. The lexicon includes three words: A, B and C. Overall search space (a) and word predecessor-conditioned organization (b).
easily offer the ability to provide word alternatives in ambiguous regions of the handwritten signal, which is a property that we are looking for. To obtain such a property, a very efficient method has been developed in the speech recognition field, based on the notion of word graphs.30 In this approach, the search is conducted through two successive levels. First, a dynamic programming search is performed to build a word graph, keeping only the best word hypotheses with their beginning and ending times (frame indices). Then, a dynamic programming algorithm is run on this word graph to output the best matching word sequence. The basic idea of a word graph is to represent all word sequences by a graph in which an arc represents a word hypothesis and a node is associated with an ending time. Compared to a classical N-best approach, a word graph is much more efficient, since word hypotheses need only be generated locally, whereas in the N-best method each local alternative requires a whole sentence to be added to the N-best list. In addition, a word graph allows higher-level language modeling to be integrated in a post-processing step. As with isolated word recognition, we use a frame-synchronous beam search with an active-node list to build the search space dynamically. The building of the word graph is integrated into the frame-synchronous search. Whenever the final state of a word (a terminal node of a lexical tree copy) is reached, the corresponding hypothesis (the word, its partial score, its ending time, its predecessor word and their time boundary) is added to the word graph. Following research by Ney et al.,29,28 we implemented a word graph building algorithm based on predecessor-conditioned lexical trees. In such an organization of the search space, there is a copy of the lexical tree for each predecessor word; i.e. there are as many lexical tree copies as there are words in the dictionary.
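The second of the two levels described above, running dynamic programming over the finished word graph, can be sketched as follows. The arc layout (start time, end time, word, score) and the optional bigram penalty are simplifying assumptions made for the example; an exact search would also condition the recursion on the predecessor word, as the word pair approximation suggests.

```python
def best_path(arcs, final_time, lm=None):
    """Find the best-scoring word sequence in a word graph.

    arcs is a list of (start, end, word, score) hypotheses with start < end,
    where score is a negative log-likelihood (lower is better) and the graph
    is assumed to begin at time 0.  lm(prev, word), if given, adds a
    language-model penalty.
    """
    # best[t] = (cumulated score, word sequence) of the best path ending at t.
    best = {0: (0.0, [])}
    # Sorting by end time gives a valid topological order because start < end.
    for start, end, word, score in sorted(arcs, key=lambda a: a[1]):
        if start not in best:
            continue  # no surviving path reaches this arc's start time
        prev_score, seq = best[start]
        s = prev_score + score
        if lm is not None:
            s += lm(seq[-1] if seq else None, word)
        if end not in best or s < best[end][0]:
            best[end] = (s, seq + [word])
    return best.get(final_time)
```

Keeping, for each ending time, the runners-up instead of only the single best entry is what makes alternative word sequences available as a side effect of the same pass.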
Each terminal node of these lexical trees is connected to the root of the lexical tree which has this word as predecessor. The need for predecessor-conditioned lexical trees comes from the formulation of the word pair approximation, which states that, given a word pair and its ending time, the word boundary between the two words is independent of earlier words. Under this assumption, we only need a copy of the lexical tree for each predecessor word to investigate all potential word sequences and keep track of all alternative word sequences. To illustrate the search procedure based on this organization, let us look at Fig. 4(b). The structure shown in this figure corresponds to a three-word dictionary (A, B and C). We can see three lexical trees, which are the
one corresponding to A as previous word, one corresponding to B as previous word and one corresponding to C as previous word. Assume that, when considering the t-th frame in the decoding process, there are two active nodes corresponding to the end of word B in the active-node list (the square and the circle in Fig. 4(b)). One of these nodes corresponds to the decoding of word B following word A (the square); the other corresponds to the decoding of word B following word C (the circle). The two hypotheses corresponding to these two active nodes are added to the word graph, but only the best one is considered further in the search process, i.e. extended to the root node of the lexical tree corresponding to B as previous word (the triangle in Fig. 4(b)). Once the decoding is complete, i.e. when the last frame has been processed, the word graph is pruned to eliminate less probable hypotheses.36 Then a simple Viterbi algorithm is performed on the remaining word graph to find the most likely word sequence. Alternative word sequences may easily be determined from the word graph as a side effect of this Viterbi step. Note that the post-processing of the word graph results in a negligible additional computational cost.

7. Experiments

7.1. Database
All the experiments reported in this study have been performed on the UNIPEN database.15 The aim of the UNIPEN project is to build an international database for the development and evaluation of on-line handwriting recognizers. This database contains isolated letters (in script, cursive or mixed forms), isolated words and sentences, about 5 million characters in all. The handwritten samples come from more than 2200 writers from many different countries. The signals are organized in directories corresponding to the various donors (universities and companies), and the samples have been acquired using various hardware (e.g. tablets with different characteristics). As a consequence, this database includes a large variety of writing styles (script, cursive and mixed) as well as signals of varied quality. The recognition systems for word and sentence recognition are trained on the same training database, composed of 30k isolated words written by 256 writers from various countries. We will present experimental results on multi-writer (MW) and writer-independent (WI) isolated word recognition.
Multi-writer experiments mean that all writers in the test set also appear in the training set. Writer-independent experiments mean that the writers in the test set do not appear in the training set. We built two databases for isolated word recognition of 2k words each, one for multi-writer tests and one for writer-independent tests. For sentence recognition experiments, we built only one test database, since there are not many sentences in UNIPEN. After removing non-English sentences, we collected a database of about 2600 sentences (3 words in length on average), written by 191 writers, half of whom were encountered during training. Compiling all words from the three test databases, we built a basic dictionary for our experiments, composed of 2540 different words. To study the behavior of our systems on larger dictionaries, we added words to this basic dictionary to build dictionaries of 4000, 5000 and 12000 words.

7.2. Performance Criteria
Here we define the evaluation measures used in our experiments. For the isolated word recognition task, we used the recognition rate as well as top-N rates, i.e. the percentage of words for which the correct word is among the N best hypotheses output by the recognizer. For the sentence recognition task, we computed three measures to characterize our systems' performance. The first is the letter-level accuracy (LLA), defined using the string edit distance, at the letter level, between the recognized sentence and the true sentence, both taken as sequences of letters. The second is the word-level accuracy (WLA), defined using an edit distance between the recognized sentence and the true sentence, taken as sequences of words. The third measure, called word graph accuracy (WGA), is based on the edit distance between the true sentence and the word graph: it is computed as the minimal edit distance between the true sentence and the sentences contained in the word graph (i.e. paths through this word graph).

7.3. Isolated Word Recognition
In this section, we provide some experimental results of our systems for isolated word recognition in various conditions. Table 1 shows comparative results for two systems, the first one based on one PNN per state, the
Table 1. Isolated word recognition and top-N rates for two systems using 1 or 3 PNNs per state. Results are given for each system on the two test data sets. MW stands for multi-writer experiments, WI for writer-independent experiments. The size of the dictionary is 2.5k.

#PNN per state   Test data set   1st    1-2    1-3    1-5    1-10   1-30
1                MW              77.0   85.1   87.9   90.0   93.8   96.7
1                WI              73.6   81.7   85.3   88.3   92.1   96.3
3                MW              80.1   88.2   91.0   93.6   95.5   97.5
3                WI              77.9   86.1   89.1   92.0   94.5   97.1
second one based on mixtures of 3 PNNs per state. Recognition results are given for multi-writer and writer-independent experiments with the 2.5k dictionary. For each experiment (i.e. row), the table shows the recognition rate and top-N rates for N = 2, 3, 5, 10 and 30. As expected, the system using mixtures of 3 PNNs outperforms the system with one PNN per state in both the MW and WI experiments. This follows from the fact that using mixtures of 3 PNNs accommodates more variability in the writing signal, and thus increases the modeling power of the system. Comparing MW and WI performances also shows that this system is more robust to signal variability. To study the behavior of the system relative to the size of the vocabulary, we performed experiments with dictionaries of various sizes. The results are compiled in Fig. 5, for the 3-PNN mixture system on the multi-writer and writer-independent tests. One can see that recognition rates remain high in the large-vocabulary experiments (12k words) and that the top-ten performance decreases relatively slowly. It is worth noting that the average recognition time is about 1 second per word on a 500MHz Pentium. It is independent of the size of the vocabulary, since we used a constant maximum number of hypotheses (3000 in our experiments) while performing the frame-synchronous beam search. It is not straightforward to compare these results with other published results, because experimental conditions differ. For example, Hu et al.16 report word recognition results on other parts of the UNIPEN database, achieving about 87.2% for a 2k lexicon and 79.8% with a 12k lexicon. Although these results are superior, it is clear from our experience that this difference may not be significant. Indeed, if we look at recognition rates obtained on parts of our test database each corresponding to a particular
Fig. 5. Recognition rate as a function of the dictionary size. The system is based on mixtures of 3 PNNs per state. The four curves correspond to recognition rates for multi-writer experiments (MW) and writer-independent experiments (WI), as well as top-10 recognition rates for MW and WI.
directory of the UNIPEN database (a directory includes signals coming from a particular donor), a great variance in recognition rates is observed.

7.4. Sentence Recognition
To explore the behavior of our system for sentence recognition, we compared the performance reached with and without our language model based on word bigrams. We computed three bigram sets. The first set (Bigram B1) was computed using the sentences in the test set; it provides an upper bound on how much bigram probabilities may improve recognition. The second set (Bigram B2) was computed on the Susanne corpus.33 This corpus is relatively small, containing about 127,000 word instances covering a vocabulary of 14,000 words, and its sentences are very different from those in our test set. Thus, this corpus does not include much information useful for decoding the sentences in our test set, and using Bigram B2 provides a lower bound on the expected improvements. Finally, we computed a third set (Bigram B3) on the Susanne corpus and our test set together. This gives a more realistic idea of the expected gain from using a bigram language model.
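Perplexity, the criterion used earlier to choose among discounting strategies, quantifies how well a bigram set predicts held-out text. A minimal sketch, assuming a conditional probability function p(w, prev) is available:

```python
import math

def perplexity(p, sentences):
    """Perplexity of a bigram model over a set of sentences.

    p(w, prev) returns the model's conditional probability of w given the
    preceding word; lower perplexity means the model predicts the text
    better.  Start-of-sentence handling is omitted for brevity.
    """
    log_prob, n_words = 0.0, 0
    for sent in sentences:
        for prev, w in zip(sent, sent[1:]):
            log_prob += math.log(p(w, prev))
            n_words += 1
    # Geometric mean of inverse probabilities.
    return math.exp(-log_prob / n_words)
```

Under this measure, a model that assigned every bigram probability 1/k would have perplexity exactly k, which is why perplexity is often read as an effective branching factor.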
Table 2 provides recognition results for our system using 3 PNNs per state. Word-level and letter-level accuracies are shown. First, one may notice that the use of a bigram language model leads to improvements on both criteria with any bigram set. Using such information allows a word-level accuracy of about 82% with the most realistic bigram set (Bigram B3) and 85% with a task-specific language model (Bigram B1).

Table 2. Performance results (letter-level, word-level and word-graph accuracy) for sentence recognition without a language model and with three different bigram sets. The system uses 3 PNNs per state. The lexicon size is 2.5k.

Language model   Letter-level accuracy   Word-level accuracy   Word graph accuracy
without bigram   84.4                    70.2                  93.2
Bigram B1        90.5                    84.7                  93.2
Bigram B2        85.6                    73.2                  93.2
Bigram B3        89.4                    81.9                  93.2
In addition, these results show a high WGA in all cases. This suggests that the word graph contains much information about the potential sequence of words. Furthermore, it means that the word accuracy could be improved by using a higher-level language model. From the computational point of view, the average recognition time is about 15 seconds per sentence on a 500MHz Pentium, or 5 seconds per word on average. This increased cost compared to isolated word recognition comes from the fact that, since the search space is larger, we have to propagate more hypotheses to achieve good recognition performance.

8. Conclusion

In this chapter we have presented an on-line handwritten sentence recognition system based on a hybrid Hidden Markov Model / Neural Network architecture. Our system is a first step towards the interpretation of note taking on electronic tablets. It is thus dedicated to writer-independent, multi-style, large-vocabulary handwriting recognition and has been designed with this aim. A few hybrid systems have been proposed for signal classification tasks such as speech or handwriting recognition. In these systems, neural networks are most
often used as classifiers, to take advantage of their discriminative power. In our approach, NNs (multi-layer perceptrons) are used to overcome some classical limitations of standard Gaussian HMMs: we exploit their approximation ability to model the dynamics of a signal considered as a non-linear auto-regressive process. To do this, NNs are used in a predictive way to estimate emission probability densities, instead of traditional local Gaussian models. NNs are thus embedded in the HMM framework, and the whole system is trained using an approximated maximum likelihood criterion. To handle the variability in the drawing of letters and words due to differences in writing styles and individual idiosyncrasies, the emission probability densities in a state are implemented through mixtures of predictive neural networks. This leads to increased robustness of the system with respect to signal variability. At the word level, the decoding strategy is based on a frame-synchronous beam search algorithm where the lexicon is organized as a tree in prefix form. The search space is built dynamically during the decoding process. By limiting the number of hypotheses to a prespecified threshold, the computational cost may be made independent of the size of the vocabulary, thus allowing the use of large vocabularies. The extension of this decoding scheme to word sequences has been designed so that it efficiently produces alternative solutions to the best decoded sentence. The algorithm uses a search space organized by word predecessor conditioning. This organization is derived from the word pair approximation proposed in the speech recognition literature. The search space is explored using a frame-synchronous beam-search strategy, which includes the construction of a word graph. This word graph is then processed using a dynamic programming algorithm in which higher-level knowledge sources may be integrated.
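The predictive use of NNs summarized above can be illustrated schematically. In this sketch the emission density of a state is a Gaussian centred on the prediction of that state's network, so the score is driven by the prediction error; the predict function, the context layout and the fixed sigma are illustrative assumptions, not the chapter's exact formulation.

```python
import math

def emission_log_prob(predict, context, frame, sigma=1.0):
    """Emission log-probability of a state from its predictive model.

    predict(context) stands in for the state's predictive neural network:
    it maps the previous frames to a prediction of the current frame.  The
    emission density is an isotropic Gaussian centred on that prediction.
    """
    pred = predict(context)
    err = sum((p - f) ** 2 for p, f in zip(pred, frame))  # squared prediction error
    d = len(frame)
    return -0.5 * err / sigma ** 2 - 0.5 * d * math.log(2 * math.pi * sigma ** 2)
```

With a mixture of predictive networks per state, the state score would combine the log-probabilities of the component predictors (for instance by taking the best one), which is how the mixture accommodates several writing styles.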
We investigated the use of a simple language model based on bigrams of words and showed how it may greatly improve recognition performance at the word and sentence levels. Various experiments have been carried out on the UNIPEN international database, which is a very realistic database. Word and sentence recognition performance under various experimental conditions shows the strong potential of our approach for general on-line text recognition. However, much work remains to be done for note-taking applications. In particular, we have to develop a pre-segmentation module, which segments
unambiguously separated parts of a handwritten page. We must also handle out-of-vocabulary words and integrate higher-level knowledge sources at the text level.

References

1. T. Artières and P. Gallinari. Multi-state predictive neural networks for text-independent speaker recognition. In European Conference on Speech Communication and Technology (EUROSPEECH), volume 1, pages 633-636, Madrid, Spain, September 1995.
2. Y. Bengio and Y. LeCun. Word normalization for online handwritten word recognition. In International Conference on Pattern Recognition (ICPR), volume 2, pages 409-413, Jerusalem, Israel, 1994.
3. Y. Bengio, Y. LeCun, C. Nohl, and C. Burges. LeRec: a NN/HMM hybrid for on-line handwriting recognition. Neural Computation, 7(5):1289-1303, 1995.
4. P. R. Clarkson and R. Rosenfeld. Statistical language modeling using the CMU-Cambridge toolkit. In European Conference on Speech Communication and Technology (EUROSPEECH), pages 2707-2710, Rhodes, Greece, September 1997.
5. J. Dolfing. A comparison of ligature and contextual models for hidden Markov model based on-line handwriting recognition. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 1073-1076, Seattle, Washington, USA, May 1998.
6. S. Garcia-Salicetti, B. Dorizzi, P. Gallinari, and Z. Wimmer. Maximum mutual information training for an on-line neural predictive handwritten word recognition system. International Journal on Document Analysis and Recognition (IJDAR), 2001 (to appear).
7. S. Garcia-Salicetti, P. Gallinari, B. Dorizzi, A. Mellouk, and D. Fanchon. A hidden Markov model extension of a neural predictive system for on-line character recognition. In International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 50-53, Montreal, Canada, August 1995.
8. S. Garcia-Salicetti, P. Gallinari, B. Dorizzi, A. Mellouk, and D. Fanchon. A neural predictive approach for on-line cursive script recognition. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pages 3463-3466, Detroit, Michigan, USA, May 1995.
9. S. Garcia-Salicetti, B. Dorizzi, P. Gallinari, and Z. Wimmer. Adaptive discrimination in an HMM-based neural predictive system for on-line word recognition. In International Conference on Pattern Recognition (ICPR), volume 4, pages 515-519, Vienna, Austria, August 1996.
10. S. Garcia-Salicetti. A neural lexical post-processor for improved neural predictive word recognition. In International Conference on Artificial Neural Networks (ICANN), pages 587-592, Germany, July 1996.
11. M. Gilloux, B. Lemarie, and M. Leroux. A hybrid radial basis function network/hidden Markov model handwritten word recognition system. In International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 394-397, Montreal, Canada, August 1995.
12. P. S. Gopalakrishnan, L. R. Bahl, and R. L. Mercer. A tree search strategy for large-vocabulary continuous speech recognition. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 572-575, Detroit, Michigan, USA, May 1995.
13. I. Guyon, P. Albrecht, Y. LeCun, J. Denker, and W. Hubbard. Design of a neural network character recognizer for a touch terminal. Pattern Recognition, 24(2):105-120, February 1991.
14. I. Guyon and F. Pereira. Design of a linguistic postprocessor using variable memory length Markov models. In International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 454-457, Montreal, Canada, August 1995.
15. I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and S. Janet. UNIPEN project of on-line data exchange and recognizer benchmarks. In International Conference on Pattern Recognition (ICPR), volume 2, pages 29-33, Jerusalem, Israel, October 1994.
16. J. Hu, S. G. Lim, and M. K. Brown. Writer independent on-line handwriting recognition using an HMM approach. Pattern Recognition, 33(1):133-148, January 2000.
17. F. Jelinek. Statistical Methods for Speech Recognition. MIT Press, 1997.
18. B.-H. Juang and L. R. Rabiner. The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(9):1639-1641, September 1990.
19. S. Knerr and E. Augustin. A neural network-hidden Markov model hybrid for cursive word recognition. In International Conference on Pattern Recognition (ICPR), volume 2, pages 1518-1520, Brisbane, Australia, August 1998.
20. S. Manke, M. Finke, and A. Waibel. Combining bitmaps with dynamic writing information for on-line handwriting recognition. In International Conference on Pattern Recognition (ICPR), volume 2, pages 596-598, Jerusalem, Israel, October 1994.
21. S. Manke, M. Finke, and A. Waibel. NPen++: a writer-independent, large vocabulary on-line handwriting recognition system. In International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 403-408, Montreal, Canada, August 1995.
22. S. Manke, M. Finke, and A. Waibel. A fast search technique for large vocabulary on-line handwriting recognition. In International Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 183-188, Essex, England, September 1996.
23. U. Marti and H. Bunke. Unconstrained handwriting recognition: language models, perplexity, and system performance. In International Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 463-468, Taipei, Taiwan, December 1994.
24. U. Marti and H. Bunke. Towards general cursive script recognition. In International Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 379-388, Taejon, Korea, August 1998.
25. U. Marti and H. Bunke. Handwritten sentence recognition. In International Conference on Pattern Recognition (ICPR), volume 3, pages 467-470, Barcelona, Spain, September 2000.
26. A. Mellouk and P. Gallinari. Discriminative training for improved neural prediction systems. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 233-236, Adelaide, South Australia, April 1994.
27. A. Mellouk and P. Gallinari. Global discrimination for neural predictive systems based on an N-best algorithm. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 465-468, Detroit, Michigan, USA, May 1995.
28. H. Ney and X. Aubert. Dynamic programming search strategies: from digit strings to large vocabulary word graphs. In Automatic Speech and Speaker Recognition, chapter 16, pages 385-411. Kluwer Academic Publishers, 1996.
29. H. Ney, D. Mergel, A. Noll, and A. Paeseler. Data driven search organization for continuous speech recognition. IEEE Transactions on Signal Processing, 40(2), February 1992.
30. M. Oerder and H. Ney. Word graphs: an efficient interface between continuous-speech recognition and language understanding. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 119-122, Minneapolis, USA, April 1993.
31. R. Plamondon, S. Clergeau, and C. Barriere. Handwritten sentence recognition: from signal to syntax. In International Conference on Pattern Recognition (ICPR), volume 2, pages 117-122, Jerusalem, Israel, October 1994.
32. E. H. Ratzlaff. Inter-line distance estimation and text line extraction for unconstrained online handwriting. In International Workshop on Frontiers in Handwriting Recognition (IWFHR), pages 33-4, Amsterdam, Holland, September 2000.
33. G. Sampson. The SUSANNE Corpus: Documentation. http://www.cogs.susx.ac.uk/users/geoffs/SueDoc.html.
34. M. Schenkel, I. Guyon, and D. Henderson. On-line cursive script recognition using time delay neural networks and hidden Markov models. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 637-640, Adelaide, South Australia, April 1994.
35. G. Seni, N. Nasrabadi, and R. Srihari. An on-line cursive word recognition system. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 404-410, Seattle, WA, USA, June 1994.
36. A. Sixtus and S. Ortmanns. High quality word graphs using forward-backward pruning. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages 593-596, Phoenix, Arizona, USA, March 1999.
37. R. Srihari and C. Baltus. Incorporating syntactic constraints in recognizing handwritten sentences. In International Joint Conference on Artificial Intelligence (IJCAI), volume 2, pages 1262-1267, Chambery, France, August 1993.
38. T. Starner, J. Makhoul, R. Schwartz, and G. Chou. On-line cursive handwriting recognition using speech recognition methods. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pages 125-128, Adelaide, South Australia, April 1994.
CHAPTER 7

MULTIPLE CLASSIFIER COMBINATION: LESSONS AND NEXT STEPS
Tin Kam Ho Bell Laboratories, Lucent Technologies 700 Mountain Avenue, 2C425, Murray Hill, NJ 07974, USA E-mail: [email protected]
During the 1990s many methods were proposed for combining multiple classifiers for a single recognition task. With these methods, the focus of the field shifted from the competition among specific statistical, syntactic, or structural approaches to the integration of all these as potential contributing components in a combined system. Deeper explorations of the combination methods revealed many links back to several fundamental issues of pattern recognition. Amid the excitement and confusion, there is a persistent uncertainty concerning the optimal match between a method and a problem, due to a strong dependence of classifier performance on the data. In this chapter I review several different motivations that have driven this development, summarize lessons learned in the exploration of combination methods, outline the difficulties encountered, and suggest ways to break out of the current plateau.

1. Introduction

During the past decade the method of multiple classifier systems was firmly established as a practical and effective solution for difficult pattern recognition tasks. The idea appeared under many names: hybrid methods, decision combination, multiple experts, mixture of experts, classifier ensembles, cooperative agents, opinion pool, sensor fusion, and more. In some areas it was motivated by an empirical observation that specialized classifiers often excel in different cases and make different errors. In other areas it arose naturally from the application context, such as the need to employ a variety of sensor types, which induces a natural decomposition of the problem. In these cases, the input can be considered as vectors projected to the subspaces of
the full feature space, which have to be matched by different procedures. An example is authentication by fingerprints together with voice, gestures etc., or retrieving a document by string matching on keywords and by matching word count vectors. There were also cases where the motivation was to avoid having to make a meaningful choice of some arbitrary initial condition, such as the initial weights for a neural network. Or, disturbingly, there were arguments that introducing some randomness, no matter where, in classifier training is a good thing to do, just to make a collection containing some differences that, when combined, will automatically perform better than a single element. There are many ways to use more than one classifier in a single recognition problem. A divide-and-conquer approach would isolate the types of input for which each classifier performs well, and direct new inputs of a particular type to the suitable classifier. A sequential approach would use one classifier first, and invoke others only if it fails to yield a decision with sufficient confidence. All these can be said to be multiple classifier strategies, and have been explored to a certain extent. However, motivated by the factors mentioned earlier, the majority of classifier combination research focuses on applying all the available classifiers in parallel to the same input and combining their decisions. Naturally one asks, what is gained and lost in a parallel combination? When is it preferable to those alternative approaches? As the idea prospered, many different combination techniques emerged. It almost feels like we are simply bringing the fundamental pursuit to a different level. Instead of looking for the best set of features and the best classifier, now we look for the best set of classifiers and then the best combination method. One can imagine that very soon we will be looking for the best set of combination methods and then the best way to use them all. 
If we do not take the chance to review the fundamental problems arising from this challenge, we are bound to be driven into such an infinite regress, dragging along more and more complicated combination schemes and theories, and gradually losing sight of the original problem. The trend of parallel combination of many classifiers deviates from, or even follows an opposite philosophy of, the traditional selection approach, where one evaluates the available classifiers against a representative sample, and chooses the best one to use. Here, in essence, one abandons the attempt to optimize an individual classifier, and instead, tries to use all the
available ones in an intelligent way. This is the antithesis of economical design. Introducing needless classifiers would harm more than just efficiency. The agreement of two poor classifiers does not necessarily yield more correct decisions. And very soon one can fall into a situation in which the same training data are used to estimate an increasing and potentially infinite number of classifier parameters, which is not an unfamiliar trap. So, is classifier combination a well-justified, systematic methodology, or is it a desperate attempt to make the most out of imperfect designs? What has been learned in the past decade, what is still missing, and what should we do next? In this chapter, I review the development of the relevant ideas, discuss the lessons and implications, and try to outline some potentially fruitful future directions. After a decade of development, the literature on classifier combination has become huge, especially as one recognizes in retrospect similar attempts under different names in different communities. The review in this chapter is not intended to be comprehensive. Rather, I attempt to point to works exemplifying several main ideas, with emphasis on distinguishing between methods appearing under apparently similar names, and pointing out their deficiencies and difficulties. For a more complete coverage, especially on successful, empirical examples, readers are referred to several recent survey articles, proceedings, and books.14,43,48,54,69 I will start with the historical context most familiar to researchers and practitioners in pattern recognition, and extend into similar attempts in related disciplines. I will focus on the methods first and analyses later, since in most cases discoveries of the methods occurred first, and theoretical analyses followed only after much empirical exploration.
There were notable exceptions to this where theoretical and algorithmic developments were tightly coupled and the two lines of discussion cannot be separated.

2. The Historical Context: Awareness of the Advantages of Multiple Approaches

By the end of the 1980s, the field of pattern recognition had developed into a rich area featuring several active branches. There were the syntactic methods integrating results from formal languages and automata theory. There were the structural approaches focusing on innovative feature representations and matching procedures. Statistical approaches continued to
provide a core depository of methods, and were greatly supplemented by vivid activities using the neural network paradigm or following the machine learning themes. For a practitioner facing the diverse possibilities, selecting among these methods became a major challenge. In the absence of a consistent and dominant winner, each method had to be evaluated in the context of each new problem, within the constraints of many practical concerns. Efforts in designing a recognition system emphasized selecting the best classifier for a given task, supplemented occasionally by divide-and-conquer strategies that invoke different classifiers for different inputs under the guidance of considerations such as confidence of decisions, cost of application, or other specific knowledge on the input. An early example is a two-classifier system that invokes a nearest neighbor classifier only when a linear classifier decides that the input falls in an ambiguous region.11 The complementary advantages of different classification methods were recognized early on. Aside from efficiency concerns, the potential of using different feature representations and decision rules for the same problem was appealing. A review by Bunke9 cited many of the early efforts in incorporating elements of one approach into another, such as combining statistical and syntactic methods through probabilistic and attributed grammars,23 matching structural prototypes by syntactic methods via graph grammars,27 or treating structural matching as a constraint satisfaction process. Influenced by popular ideas from artificial intelligence research, recognition was also seen as a process of inferences about class concepts from concrete examples. The inference steps can be organized as systems of production rules, which allows flexibility in including multiple sources of knowledge and multiple types of evidence. There, the emphasis was on knowledge representation methods and control procedures for the inference steps.
Some attempted to model recognition as constraint satisfaction processes. The awareness of the utility of diverse knowledge sources also led to studies on fusing sensory data of multiple types,58,64 or combining the opinions of multiple experts using consensus theory.3 The idea of mixing different classifiers for the same problem continued to evolve into hierarchical or multi-stage systems, where each stage identifies or rejects a subset of classes, or more sophisticated conditional systems where different classifiers are selectively applied or believed according to different characteristics of the input82 or agreement on decisions.34 In mixture of
experts62 a gating network was added to select among the outputs of a number of expert networks, each of which specializes in particular classes or subclasses by way of unsupervised learning. Whereas each expert network learns the association between an input vector and its correct class, a gating network learns the association between the correctness of the decision of those expert networks and features of the input. Though a number of such proposals made their way into practical systems, most remained interesting possibilities due to a lack of demonstrated, consistent success. Design of each hybrid system was highly specific to the particular application.

3. Reaching a Consensus: Optimization of Combined Decisions

Around the beginning of the 1990s, a more ambitious question emerged: if no classifiers are known to be best for all cases, is it possible to take advantage of the strengths of each method while avoiding their weaknesses? Different from most previous attempts, here the focus is on simultaneous applications of multiple classifiers to the same input, and the objective is to improve overall accuracy. In the early days, whether this could be achieved was by no means obvious. Why would one dilute the strengths of the best classifier with the often poor decisions of the inferior ones? How can one be sure that the correct decisions of the stronger classifiers are preserved when the weaker classifiers join in? Simple approaches such as the majority vote, though known to increase decision reliability, do not necessarily give a higher overall correct rate; if the system rejects a decision for cases lacking agreement by majority, the overall rate of successful recognition is actually lower than that of the best individual. Accuracy improvement becomes possible only when more information is taken from the classifiers.
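The majority rule with rejection just described is simple to state; a minimal sketch (the function and variable names are mine, not the chapter's):

```python
from collections import Counter

def majority_vote(decisions, reject_label=None):
    """Combine hard class decisions by majority vote.

    Returns the class chosen by more than half of the classifiers, or
    `reject_label` when no strict majority exists -- the rejection
    behavior whose cost is discussed above.
    """
    counts = Counter(decisions)
    label, votes = counts.most_common(1)[0]
    if votes * 2 > len(decisions):
        return label
    return reject_label

# Two of three classifiers agree: the decision is accepted.
print(majority_vote(['A', 'A', 'B']))                          # 'A'
# No strict majority: the case is rejected rather than guessed.
print(majority_vote(['A', 'B', 'C'], reject_label='reject'))   # 'reject'
```

Rejected cases count against the recognition rate, which is why the plain rule alone cannot beat the best individual classifier.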
Instead of just a single decision on one of the N classes assumed for the problem, one needs to consider the secondary choices, or have the classifiers score each class with a degree of belief for the respective input. Classifiers operating by different principles do not necessarily produce belief scores on a compatible scale, which presents a problem in decision combination. For example, an attempt to bring syntactic and statistical methods into the same system24 had to rely on an unnatural sum of the incommensurable syntactic distances between the symbol strings and the semantic distance between the associated parameter vectors. A more justified approach was adopted in evidence combination,59 where statistical models and the Bayes rule were used to transform the distance measures computed by the individual classifiers into belief scores for each class, and then the Dempster-Shafer theory of evidence68 was used to combine the scores. Explorations in applying the Bayesian formalism to combine belief scores continued, making use of detailed knowledge of the classifiers' behavior contained in confusion matrices,83 or accuracies associated with every pattern of joint decisions by all classifiers in the collection.41,80 Methods relying on such detailed knowledge of the classifiers' behavior run into difficulties when the number of classes is large, as the number of training samples required to obtain reliable probabilistic estimates becomes prohibitive. A different strategy is to convert the class decisions to a uniform linear scale by reducing all similarity scores to rank orderings of the classes, trading off some precision for robustness. Borda count or logistic regression were used to combine such rank scores, and the latter also helped to identify redundant classifiers in the collection.34 The problem appears simpler in the context where all classifiers are of the same type. If each classifier computes an estimate of the posterior probability of each class using each feature vector, such estimates can be directly combined by trainable linear or logarithmic functions called the opinion pool,2 or by fixed maximum, median, mean, sum, or product rules.47 Classifiers that decide by principles of hypothesis testing can be combined by methods of meta-analysis32 or logistic regression.42 At the extreme, the decisions of individual classifiers, in whatever form, can be considered as inputs to a second-level classifier.57
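The Borda count mentioned above can be sketched in a few lines (the function name and toy rankings are mine):

```python
def borda_count(rankings):
    """Combine rank-ordered class lists by Borda count.

    Each classifier contributes a ranking (best class first); among n
    classes, a class receives n - 1 points for first place, n - 2 for
    second, and so on. The class with the highest total wins.
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] = scores.get(label, 0) + (n - 1 - position)
    return max(scores, key=scores.get)

# 'B' is nobody's first choice but everybody's close second,
# and wins on aggregate rank (scores: A=3, B=4, C=2).
print(borda_count([['A', 'B', 'C'],
                   ['C', 'B', 'A'],
                   ['B', 'A', 'C']]))  # 'B'
```

This illustrates the robustness-for-precision trade: the incommensurable similarity scores never enter the combination, only their orderings do.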
The idea of employing multiple experts specialized for different aspects of a given task is probably as old as the history of human society. But such common wisdom does not necessarily and immediately apply to the context of pattern recognition, where concepts such as disagreement and cooperation take on a concrete meaning. The behavior of the individual classifiers can be mathematically characterized, and with accuracy being an objective measure of effectiveness, the benefits of any proposal can potentially be quantified. Reasons for having multiple sources of knowledge about an input pattern can be various, but whether they should be maintained in separate representations is never obvious. Even less clear is whether one should integrate the separate representations and compare them under a
single metric, or direct them to separate classifiers and defer the integration until after the classifiers have processed them. Regardless of the level where the integration is carried out, details of the integration procedure have to be stated in terms of a concise mathematical function and implemented in a well-defined algorithm. And there are vast differences in the effectiveness of each procedure as observed in the early explorations. From the engineering perspective, the discovery that the combined accuracy of several classifiers applied simultaneously can be made better than each of the individuals came as a surprise from experiments in the late 1980s and early 1990s. Numerous studies followed, confirming former observations or leading to new proposals, and by now combining classifiers has become almost routine. Still, accuracy improvement is not guaranteed, and not all combination schemes are successful. Have we fully understood the successes and failures and all the tradeoffs? In the following I outline some of the attempted analyses. An early result on committee votes is Condorcet's Jury Theorem from the 18th century,10,55 which asserts that, if each individual member of the committee has an independent opinion about the subject to be voted on, the probability of the majority vote being correct increases or decreases monotonically with the size of the committee, and converges to one or zero depending on whether each member has a chance of being correct that is greater or less than 50%. Engvall18 gave a least upper bound for the average accuracy of decisions agreed to by multiple classifiers to a given degree. Srihari72,73 related the reliability of the committee decision to the average individual reliabilities and individual differences, and gave the number of voters yielding an optimal combined reliability for some specific cases. These are purely probabilistic arguments. Mazurov et al.
60 gave a geometrical analysis where the voters and their decisions are modeled as a system of linear inequalities. Conditions on the consistency of the inequalities were derived for the existence of a minimal committee achieving a particular fraction of agreed decisions, and methods were given for constructing such minimal committees. Lam and Suen55 studied expected voting outcomes under different conditions on the committee. Simple voting schemes such as the majority or plurality rules are fixed regardless of the performance of the voters and the conditions of the input. So these studies focused on evaluation of such rules under different patterns of voter performance. The results show, under a specific performance goal,
whether one should use a particular rule or some specific number of voters, but the rules themselves are not trainable. Votes on non-binary choices can be given as a ranking of the choices (classes in the pattern recognition context). In the behavioral sciences, methods for combining such rankings are referred to as social choice functions or group consensus functions. Arrow's impossibility theorem1 generalized an observation by Condorcet on cyclic majorities occurring when there are transitive preferences on the choices among the individual voters, and asserted that there is no decision combination function for preferences among three or more items by two or more individuals that can satisfy all four axioms capturing the intuition of a reasonable combination. Goodman and Markowitz29 suggested that by modifying some of the conditions imposed by Arrow, some acceptable combination functions can be obtained. Fishburn reviewed these in detail19,20 and extended the theory to infinite sets of individuals. Related mathematical theories were also covered by Black.5 To quantify the differences between individual decision makers, Bogart6,7 presented a theory of distances between partial orders. In the social scientific theories of voting and elections, the focus is on obtaining a combined choice that best represents the voters' preferences. There is no notion of absolute correctness of the combined choice. The merit of a candidate in an election is solely determined by some specific characterization of voters' preferences. However, in classification, there is a true class associated with each input that is determined regardless of the combination mechanism. That makes a difference, since the combination function can be trained to optimize some objective accuracy measure. In classification, the prior performance of the voters can also be evaluated against the objective truth, and based on such evaluation the combination function can be tuned.
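The monotone behavior asserted by the Jury Theorem cited earlier is easy to verify numerically; a sketch for binary decisions with an odd committee (the function name is mine):

```python
from math import comb

def majority_correct_prob(n, p):
    """Probability that a majority of n independent voters, each correct
    with probability p, reaches the correct binary decision (n odd).
    This is the upper tail of a Binomial(n, p) above n/2."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With p > 0.5 the majority improves with committee size and tends to 1;
# with p < 0.5 it degrades toward 0 -- the two regimes of the theorem.
for n in (1, 5, 25):
    print(n, round(majority_correct_prob(n, 0.6), 3),
             round(majority_correct_prob(n, 0.4), 3))
```

The independence assumption is the catch: as noted below, classifiers responding to the same input are never independent, so this bound is only a starting point.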
These observations motivated the use of regression functions to combine the rank decisions,34 where merits of individual preferences are evaluated by their contribution to the correctness of decisions in the past. Degenerate forms of such tuned combination functions became various weighted voting schemes that can be applied to binary choices. In the context of pattern recognition, the individual decisions of the voters are never independent. They are intrinsically linked by the fact that they are responses to the same input pattern. The degree of agreement of the individual decisions on the same input case, besides characterizing
the amount of differences among the classifiers, is also a reflection of the relative difficulty of the case. With a well-designed set of classifiers, for an easy case, all or most of the decisions should be the same. Therefore, the pattern of agreement can be used to differentiate the types of input cases.34,35 Tubbs and Alltop75 derived measures of confidence associated with combining multiple ranked lists. More recently, Tumer and Ghosh77 quantified the improvement of classifier accuracy by biases in the ranked lists. Other inferences about rankings were pursued in the area of order statistics.33,61 For combining estimates of posterior probabilities or belief scores represented in a normalized, continuous scale, Bayes decision theory and the Dempster-Shafer theory of evidence68 dominated. Prior performances of individual classifiers can be embedded into the combination function in the form of accuracy estimates conditioned on the individual decisions. Simpler combination rules that do not take into account the classifiers' prior performance were studied by Kittler et al.,47 where a justification was given for the sum rule that chooses the class maximizing the sum of individual estimates of posterior probabilities. The justification stems from the sum rule's relative insensitivity to local estimation errors when compared to the product rule. There are also other methods that essentially treat the belief scores given by the individuals as input features for a classifier at another level. These derive support from the basic principles of classification. In statistics they are known as model-mix methods,56 and are justified by a reduction of the bias of the combined estimator. The combination methods summarized above all assume that the classifiers are given and unchangeable, or consider them already optimized for the task. The design of the combination function occurs after the classifiers are chosen and their training is completed.
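The fixed rules and the sum rule's insensitivity to a single bad estimate can be illustrated as follows (a sketch; the posterior values are invented for illustration):

```python
import numpy as np

def combine_posteriors(posteriors, rule="sum"):
    """Fixed rules for combining per-classifier posterior estimates.

    `posteriors` has shape (n_classifiers, n_classes); each row is one
    classifier's estimate of P(class | x). Returns the winning class
    index under the chosen rule.
    """
    p = np.asarray(posteriors, dtype=float)
    if rule == "sum":
        scores = p.sum(axis=0)
    elif rule == "product":
        scores = p.prod(axis=0)
    elif rule == "max":
        scores = p.max(axis=0)
    else:
        raise ValueError(rule)
    return int(np.argmax(scores))

# Two classifiers favor class 1, but a third gives it a near-zero score.
# The product rule is vetoed by that single low estimate and picks
# class 0; the sum rule is comparatively insensitive and picks class 1.
estimates = [[0.20, 0.80],
             [0.20, 0.80],
             [0.99, 0.01]]
print(combine_posteriors(estimates, "sum"),      # 1
      combine_posteriors(estimates, "product"))  # 0
```

One severe local estimation error flips the product rule's decision but not the sum rule's, which is the essence of the justification given by Kittler et al.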
The hope is to employ a combination function that maps the individual decisions to a final decision which, when evaluated over all inputs, has a better chance of being correct. Thus I refer to these as decision optimization methods. A system using several classifiers may not be able to achieve the highest accuracy for a problem if there are cases for which none of the classifiers' decisions is sufficiently close to being correct. A question then arose: is it possible to systematically create a collection of classifiers for a given problem, so that for each possible input pattern there exists at least one member that can correctly identify its true class? Such a classifier should
dominate the final decision of the combination to guarantee its effect. If this could be done, accuracy could be maximized for any recognition problem. This much more ambitious goal, together with other relatively modest hopes of getting more uniform coverage of the input cases and overcoming certain biases in the training process, motivated the development of another family of techniques that I refer to as coverage optimization methods. There, instead of the classifiers, the decision combination function is assumed to be fixed and unchangeable in form. The strategy is to create a set of classifiers, observing some specific notion of complementariness, such that they can yield a good final decision under the chosen combination function.

4. Forcing Complementariness: Generative Methods and Stochastic Searches

Some classifiers have an inherent stochastic component, or places where arbitrary decisions have to be made. Neural network methods, having many architectural and algorithmic possibilities but less well-understood effects, run into such difficulties more often than other techniques. Many neural networks need initial weights before training begins. Choosing one set of random values means committing to a specific starting point which, in bad cases, could lead to a local optimum. The sequential order in which training samples are presented leads to another line of possibilities. And there are even more subtle decisions like the type of architecture, the number of hidden layers, or the number of units in each layer. The performances of individual networks trained for the same problem can differ by a large amount due to such design choices, and it was observed that they can excel in different areas of the input space.
Such observations led to the idea of combining the network decisions by majority or plurality votes,31,63 where the hope is that the overlap of the correct decisions would dominate the coincidence of errors. Stacked generalization81 introduced another possibility. The role of a classifier is understood to be generalizing class concepts from known examples to the unknown. Assuming each particular generalizer carries a certain bias, the method attempts to deduce such biases by way of cross-validation, a well established statistical procedure that, at each pass, reserves random parts of the training data for model evaluation. Such biases are then modeled and reduced by a second-level classifier trained on the raw decisions. The idea of learning the classifier bias from within the training set was
carried further in the method of bootstrap aggregation, or bagging,8 where each classifier is trained by a bootstrap replicate of the training sample, i.e., sampling the training set to the same size with replacement. The positive effect of bagging is considered to be a result of variance reduction. The pursuit of complementary errors took an explicit form in the method of boosting,21 a sequential procedure where later classifiers are trained by a subsample of the training set, biased towards the errors committed by the collection of earlier classifiers. Later44,71 similar ideas were tried with linear discriminators. Explanations attempted for the effect of the method include distribution of margins and error bounds derived from the VC dimension theory,67 and fitting an additive model by forward stepwise optimization on a criterion similar to binomial log-likelihood.22 However, as we will see shortly, these attempted explanations are incomplete or almost irrelevant, as they fail to address, simultaneously, all the important factors involved in the process: the discriminating power and generalization ability of individual classifiers, and the correlation among them. Despite observed successes in many practical experiments, such training set subsampling methods suffer from a logical paradox. Weakening the individual classifiers by not training on or equally weighing in all available data is said to help avoid overfitting. Boosting simply cannot run on classifiers perfect for the training data, as there are no errors to train additional components. But the design is intended to make the entire system work well on the full training set. So do we want the classifiers to adapt perfectly to the training set or not? If we do, what is the point of deferring this adaptation to the level of decision combination? If not, what is the point of adding more and more components to improve accuracy on the full training set?
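The bootstrap replicate at the heart of bagging can be sketched in a few lines (names are mine):

```python
import random

def bootstrap_replicate(training_set, rng):
    """Sample the training set to its own size, with replacement --
    the replicate that each bagged classifier is trained on."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)
data = list(range(1000))
replicate = bootstrap_replicate(data, rng)
# Each replicate omits roughly a (1 - 1/n)^n, or about 36.8%, share of
# the original points, so the bagged classifiers see different data
# even though every replicate has the full training-set size.
print(len(replicate), len(set(replicate)) / len(data))
```

The deliberate omission of about a third of the data in each component is precisely the point the paradox above turns on.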
Why should the full system treat the training set differently from the component classifiers? If the training set is assumed to faithfully represent the unseen cases, why would one believe that by sacrificing training set accuracy one can gain testing set accuracy? If the feature space is small and the training sample is dense, the training set could overlap with the testing set perfectly. In such a case, what good will it do to deliberately sacrifice accuracy on the training set? On the other hand, without involving the generalization performance in the analysis, the argument that the methods can, eventually, do perfectly on the training set is useless; template matching can do the job, so there is no reason to bother with such elaborate
training procedures and sacrifices. Without a thorough understanding of how overfitting is avoided or controlled within the training process, there is no guarantee on the results, and empirical evidence does show that these methods do not always work. Then there is the question of the form of the component classifiers. All these methods are known to work well with decision trees, though the specific way data are split at each internal node matters. The much-used notion of weak classifiers is not well defined. Fully split decision trees are very strong classifiers, though pruned or forced shallow versions have been used with some success. With others, like linear discriminators, things are less clear. If the component classifiers are too weak, given the simple decision combination function of weighted or unweighted averaging, the decisions of many bad classifiers can easily outweigh the good ones, especially in methods like boosting that focus on early errors. And how about mixing in different types of classifiers? Such fundamental issues are in the midst of confusion in several communities. Nevertheless, a rigorous analysis of these issues has been given in Kleinberg's theory and method of stochastic discrimination,49,50 which historically preceded most of the above developments. The analysis uses a set theoretic abstraction to remove all the algorithmic details of classifiers, features, and training procedures. It considers only the classifiers' decision regions in the form of point sets, called weak models, in the feature space. A collection of classifiers is thus just a sample from the power set of the feature space. If the sample satisfies a uniformity condition, i.e., if its coverage is unbiased for any local region of the feature space, then a symmetry is observed between two probabilities (w.r.t. the feature space and w.r.t. the power set, respectively) of the same event that a point of a particular class is covered by a component of the sample.
Discrimination between classes is achieved by requiring some minimum difference in each component's inclusion of points of different classes, which is trivial to satisfy. The symmetry translates such differences across different points in the space to differences among the models on a single point. Accuracy in classification is then governed by the law of large numbers. If the sample of weak models is large, the discriminant function, defined on the coverage of the models on a single point and the class-specific differences within each model, converges to poles distinct by class with diminishing variance. Moreover, it is proved that the combined system maintains the projectability of the single weak models,
i.e., if each model is thick enough with respect to the spatial continuity of the classes, so that estimates of the point inclusion probabilities from the training set are close to the true probabilities, then the combined system would retain the same goodness of the estimate, which translates, by the symmetry, to accuracy in classifying unseen points. Berlind4 analyzed an alternative discriminant for multi-class problems. The theory of stochastic discrimination is complete. It identifies three and only three sufficient conditions for a classifier to achieve maximum accuracy for a problem. Each of the conditions is spelled out in concrete definitions and can be measured in precise terms. Nothing more is needed to reach optimal accuracy. What is good about building the classifier on weak models instead of strong models? Because weak models are easier to obtain, and their smaller capacity renders them less sensitive to sampling errors in small training sets.78,79 Why are many models needed? Because the method relies on the law of large numbers to reduce the variance of the discriminant on each single point. The uniformity condition specifies exactly what kind of correlation is needed among the individual models. Moreover, accuracy is not achieved by intentionally limiting the VC dimension of the complete system; the combination of many weak models can have a very large VC dimension. It is a consequence of the symmetry relating probabilities in the two spaces, and the law of large numbers. It is a structural property of the topology. The theory includes explicitly each of the three elements long believed to be important in pattern recognition: discrimination power, complementary information, and generalization ability. Here the projectability of the weak models is a key element in the proof, and not an implicit side effect.
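A toy illustration, not Kleinberg's full construction (in particular, the uniformity condition is not enforced and the data are invented), of how averaging over many weak, slightly class-enriched models yields a discriminant that separates the classes as the collection grows:

```python
import random

rng = random.Random(1)
# Two 1-D classes: class 0 concentrated near 0.3, class 1 near 0.7.
class0 = [rng.gauss(0.3, 0.1) for _ in range(200)]
class1 = [rng.gauss(0.7, 0.1) for _ in range(200)]

def make_weak_model():
    """A 'weak model' here is just a random interval, kept only if it
    covers a slightly larger fraction of class-1 than class-0 training
    points -- the minimum-difference requirement, trivial to satisfy."""
    while True:
        a, b = rng.uniform(0, 1), rng.uniform(0, 1)
        lo, hi = min(a, b), max(a, b)
        in0 = sum(lo <= x <= hi for x in class0) / len(class0)
        in1 = sum(lo <= x <= hi for x in class1) / len(class1)
        if in1 - in0 > 0.05:
            return lo, hi

models = [make_weak_model() for _ in range(500)]

def discriminant(q):
    # Fraction of weak models covering q; by the law of large numbers
    # it settles higher for class-1-like points than class-0-like ones.
    return sum(lo <= q <= hi for lo, hi in models) / len(models)

print(round(discriminant(0.3), 2), round(discriminant(0.7), 2))
```

Each individual interval is a terrible classifier, yet the average of 500 of them cleanly ranks a class-1 point above a class-0 point; the full theory adds the uniformity and projectability conditions that make this a guarantee rather than a tendency.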
Methods like boosting, when put in this framework, can be understood as heuristics for improving (but not guaranteeing) the uniformity of model coverage; the lack of an explicit treatment of projectability leads to the overfitting observed in some experiments. Bootstrapping focuses only on reducing the variance due to small training samples, and its usefulness diminishes as the training sample grows. Finally, the notion of uniform coverage in stochastic discrimination is rigorously defined, in contrast to vaguer concepts such as diversity, independence, and correlation. Without arguments like these addressing all three aspects, other schemes for introducing randomness into a collection are, at most, counting on a blind hope.
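Bootstrapping's variance-reduction role can be seen in a minimal bagging sketch (an illustrative setup of my own, after Breiman8): a greedy one-threshold learner is unstable on small noisy samples, and averaging its predictions over bootstrap replicates steadies the chosen cut.

```python
import random

random.seed(1)

def make_data(n, noise=0.25):
    # 1-D two-class data with 25% label noise
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

def fit_stump(data):
    # greedy single-threshold learner: on noisy samples the chosen
    # cut jumps around, which is what bagging averages away
    def hits(t, s):
        return sum(((s if x > t else 1 - s) == y) for x, y in data)
    return max(((t, s) for t, _ in data for s in (0, 1)),
               key=lambda p: hits(*p))

def bagged_stumps(train, rounds):
    stumps = []
    for _ in range(rounds):
        boot = [random.choice(train) for _ in train]  # bootstrap replicate
        stumps.append(fit_stump(boot))
    return stumps

def vote(stumps, x):
    # majority vote over the bootstrap-trained stumps
    v = sum((s if x > t else 1 - s) for t, s in stumps)
    return int(2 * v >= len(stumps))

train, test = make_data(60), make_data(500)
stumps = bagged_stumps(train, 50)
bag_acc = sum(vote(stumps, x) == y for x, y in test) / len(test)
```

With a large training sample the stump's cut barely moves between replicates, and the averaged vote adds little, which matches the diminishing returns noted above.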
184
T. K. Ho
As a constructive procedure, the method of stochastic discrimination depends on detailed control of the uniformity of model coverage, which is outlined but not fully published in the literature.51 The random subspace method followed these ideas but took a different approach. Instead of obtaining weak discrimination and projectability through simplicity of the model form, and forcing uniformity by sophisticated algorithms, it uses complete, locally pure partitions, as given in fully split decision trees36 or nearest neighbor classifiers,37 to achieve strong discrimination and uniformity, and then explicitly forces different generalization patterns on the component classifiers. This is done by training large-capacity component classifiers, such as nearest neighbors and decision trees, to fully fit the data, but restricting the training of each classifier to a coordinate subspace of the feature space, so that classifications remain invariant in the complement subspace. If there is no ambiguity in the subspaces, the individual classifiers maintain maximum accuracy on the training data, with no cases deliberately sacrificed, and thus the method does not run into the training set paradox.

However, the tension among the three factors persists. There is another difficult tradeoff in how much discriminating power to retain for the component classifiers. Can each one use only a single feature dimension, so as to maximize invariance in the complement dimensions? Also, projection onto coordinate subspaces sets parts of the decision boundaries parallel to the coordinate axes. Augmenting the raw features by simple transformations36 introduces more flexibility, but it may still be insufficient for an arbitrary problem. Optimization of generalization performance will continue to depend on a detailed control of the projections to suit a particular problem.
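A minimal sketch of the subspace idea (the data layout and sizes are my own assumptions, following the construction in36,37): each 1-NN component fully fits the training data but measures distance only in a random coordinate subspace, so its decision is invariant in the complementary dimensions.

```python
import random

random.seed(2)

DIMS = 6  # dims 0 and 1 carry the class; dims 2-5 are pure noise

def make_data(n):
    data = []
    for _ in range(n):
        x = tuple(random.random() for _ in range(DIMS))
        data.append((x, int(x[0] + x[1] > 1.0)))
    return data

def nn_label(train, x, dims):
    # a fully-fit 1-NN classifier restricted to the coordinate
    # subspace `dims`; it keeps maximum accuracy on the training set
    return min(train,
               key=lambda p: sum((p[0][i] - x[i]) ** 2 for i in dims))[1]

def ensemble_label(train, subspaces, x):
    # majority vote over the subspace-restricted components
    votes = sum(nn_label(train, x, dims) for dims in subspaces)
    return int(2 * votes >= len(subspaces))

train, test = make_data(100), make_data(100)
subspaces = [random.sample(range(DIMS), 3) for _ in range(25)]
acc = sum(ensemble_label(train, subspaces, x) == y
          for x, y in test) / len(test)
```

Each component still classifies its own training points perfectly, while the differing invariant dimensions give the components different generalization patterns for the vote to exploit.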
Another interesting idea is to build different classifiers by partitioning the set of classes in different ways.15 This works directly only for a large number of classes, but the idea may be extensible to subclasses within a smaller number of classes. The method trains each component classifier on only part of the desired decision boundary, thereby maintaining some generalization power in the unadapted regions. However, it is unclear whether, or how, discrimination between all pairs of classes can be fully recovered in the combination. There have also been attempts to mix and match all these different ways of generating component classifiers. This, like a few other ideas bearing names similar to those mentioned above, is not backed by any serious theory
and does not have a clear motivation. On the other hand, the complicated tradeoffs involved in the better motivated techniques remain open for investigation. The results obtained using decision trees as component classifiers do not always generalize to other types of classifiers, for example, linear discriminators. And what is the exact role of randomization? Can these methods use deterministic algorithms in selecting or omitting specific training cases or feature dimensions? In these methods, the combination of decisions is usually through simple or weighted averaging of the individual probability estimates. Can these methods work with more sophisticated decision combination mechanisms like those explored for given, fixed classifiers?

5. The Lessons

The possibility, by now well supported by empirical evidence, of going beyond the power of traditional classifiers is exciting. But more questions are raised with the proposal of each new method and its associated experiments. Theoretical explanations are either logically incomplete or stop at a level where the combinatorics defy detailed modeling. Empirical evidence is often specific to particular applications and is seldom systematically documented. So, after more than a decade, where do we stand?

On the choice of methods, we have learned that there are two approaches: decision optimization and coverage optimization. The decision combination methods are at a higher level, since they can also be applied to classifiers obtained by coverage optimization. The choice of a decision combination method is dictated by several factors: what type of output the classifiers can produce, how many classes there are, and whether sufficient training data are available to tune the combination function. Table 1 summarizes the best known decision combination methods under joint consideration of two of these factors. Because of the different contextual requirements, not all methods can be used with all problems.
General performance claims about a particular combination strategy are thus difficult to make. Evaluation of the methods is further complicated by the fact that, by and large, only successful experiments are published, making it difficult to find the limits of a method's applicability. Nevertheless, it is obvious that very little can be tuned in the simple voting schemes, and without extensive training data, this may well
Table 1. Contextual requirements of various decision combination methods.

                         Resolution of belief scores
Trainable     binary or one of N    ranked lists of     prob. estimates or
with data     decisions             classes             continuous scores
--------------------------------------------------------------------------
No            majority or           Borda count         sum, median, or
              plurality vote                            product rules
Yes           weighted vote         logistic            Bayes, Dempster-
                                    regression          Shafer rules
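The untrainable row of Table 1 is simple to state in code. A sketch of the three fixed combiners follows; the function names and data layouts are my own, and the scores need not be calibrated posteriors.

```python
from collections import Counter

def plurality(decisions):
    # decisions: one class label per classifier
    return Counter(decisions).most_common(1)[0][0]

def borda(rankings):
    # rankings: per classifier, a list of classes ordered best first;
    # a class earns (n - 1 - position) points from each ranking
    scores = Counter()
    for r in rankings:
        for pos, cls in enumerate(r):
            scores[cls] += len(r) - 1 - pos
    return max(scores, key=scores.get)

def sum_rule(score_dicts):
    # score_dicts: per classifier, a map from class to belief score;
    # Counter.update adds scores class by class
    totals = Counter()
    for d in score_dicts:
        totals.update(d)
    return max(totals, key=totals.get)
```

For example, `plurality(["a", "b", "a"])` returns `"a"`, while the sum rule can overturn a plurality when one classifier's scores are emphatic; the median and product rules differ from `sum_rule` only in the reduction applied to the per-class scores.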
be about all that can be done. With sophisticated output, such as estimates of posterior probabilities, more elaborate combination-level classifiers can be applied. However, for problems with a large number of classes, the availability of good estimates of the posteriors is a very strong assumption. Without sufficient training data, the estimates given by the individual classifiers are inaccurate, and so are the estimates at the combination level. Applying overly sophisticated combination methods is thus a dangerous game. The rank combination schemes weaken the requirement to the availability of preferences only, which are always there as long as the classifiers compute any numerical measure of similarity. For very large numbers of classes and classifiers of mixed type, this is an interesting middle ground. But the linear scale of the ranks may be too crude an approximation, and to simplify the model, the combinations may have to be restricted to a small number of top ranks. The coverage optimization methods suffer less from these problems, as they depend on more or less similar assumptions about the form of the component classifiers and their decisions.

Still, the relative merits of each method differ from dataset to dataset, and it is difficult to predict how a particular method will behave on a problem before actually constructing the classifier collection. As a result, every claim has to rely on empirical testing, and for each new problem the whole trial-and-error process has to be repeated. Hopes for a better understanding are placed on the development of good theories. But on the theory side, several problems persist. Most theoretical works suffer from a failure to model various details of a classification problem. For instance, general evidence combination methods, as given in the broader context of artificial intelligence research, are relatively weak because the special structure in the decisions of classifiers is not modeled.
There is more knowledge to be exploited here than in, say, the subjective judgements of human experts. Numerous arguments on decision combination use some notion of complementariness among the component classifiers, but a precise definition of this complementariness is seldom given. A very common assumption is that the classifiers' decisions are statistically independent, in the sense that the probability of joint success can be modeled as the product of each individual classifier's probability of success. But this is an imposed, very strong assumption that can be far from the truth. In a recognition system, classifier decisions are intrinsically correlated, as they respond to the same input. The correlation among the classifiers has to be measured from the data; until measurements confirm the assumption of zero correlation, theories that depend on it do not necessarily apply. This fact is well known at the level of feature representations, as anyone who has compared a decision tree to a simplified Bayes classifier assuming feature independence would know, but it is often neglected at the level of classifier decisions.

There is also the dilemma of choosing between a probabilistic view and a geometrical view of classification. Many theories model a classifier's decision as a probabilistic event, and assume that decisions about each test case are unrelated to those about others. However, in most application contexts there is some geometrical continuity in the feature space, such that classifiers' decisions on neighboring points are likely to be similar. Some classifiers, such as decision trees, rely explicitly on this fact to partition the feature space into contiguous regions of the same class. Yet, with the exception of stochastic discrimination and consistent systems of inequalities, the notion of neighborhood almost never enters decision combination theories.
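Such a measurement is straightforward once the classifiers' per-case correctness is recorded. One common choice in the later ensemble literature (not prescribed by this chapter) is Yule's Q statistic, which is zero for statistically independent correctness patterns and approaches plus or minus one for strongly dependent ones:

```python
def yule_q(correct1, correct2):
    # correct1, correct2: aligned booleans saying whether each of two
    # classifiers got each test case right
    pairs = list(zip(correct1, correct2))
    n11 = sum(a and b for a, b in pairs)           # both right
    n00 = sum(not a and not b for a, b in pairs)   # both wrong
    n10 = sum(a and not b for a, b in pairs)       # only the first right
    n01 = sum(not a and b for a, b in pairs)       # only the second right
    den = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / den if den else 0.0
```

Two copies of the same classifier give Q = 1; a measured Q near zero, rather than an imposed assumption, is what would license the product-of-successes model discussed above.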
Discussions on the optimal size of component decision trees touch lightly on this, but are never followed through. A precise characterization of the problem geometry will involve descriptors of the fragmentation of the Bayes-optimal decision regions, global or local linear separability, and the convexity and smoothness of boundaries. Many of these depend on the properties of the specific metric with which the classifier operates. Better modeling of the geometrical behavior of classifiers is attempted in some neural network studies, where classifier training is seen as finding a good approximation to the desired decision boundary, but the integration of such models with the probabilistic view is incomplete.
The probabilistic view is based on the need to study a problem by random sampling, owing to the difficulty of obtaining complete data coverage. There are then issues of sampling density and sample representativeness, which are intimately related to a classifier's generalization ability4,52 and in turn to its accuracy. If the training sample densely covers all the relevant regions of the feature space, many classifiers, as long as they are trainable and expandable in capacity, will work perfectly; the competition among methods then becomes mostly about representation and efficiency. For example, a decision tree may be preferable to nearest neighbors because it is more efficient. So the difficulty of classification lies mostly with sparse samples, and for this reason all theories depending on assumptions of infinite sample size are useless. Those relying on a vague definition of the representativeness of the training sample are not much better, as quite typically such "representativeness" is not even parameterized by the sample size relative to the size of the underlying problem.

The differences due to sample size can be vast. Consider a space where each point is randomly labeled as one of two classes. Whereas a dense sample may reveal the randomness to some extent, a two-point sample may suggest that the problem is linearly separable. In other, less radical problems, the sampling density affects the exact ways in which isolated subclasses become connected and boundaries constrained, far beyond what can be captured in a collective description by a simple count of points. Such problems can occur regardless of the dimensionality of the feature space, though they are more apparent in high-dimensional spaces, where the decision boundary can vary with a larger degree of freedom. Empirical evidence25,65,66 strongly suggests that a shortage of samples ruins most promises of the classical approaches.
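The two-point example is easy to reproduce (a toy of my own construction): on a problem whose labels are pure coin flips, the best single-threshold rule always explains a two-point sample perfectly, while a dense sample exposes the randomness.

```python
import random

random.seed(3)

def random_problem(n):
    # every label is a fair coin flip: the problem has no structure
    return [(random.random(), random.randint(0, 1)) for _ in range(n)]

def best_threshold_acc(data):
    # accuracy of the best single-threshold rule on this sample,
    # searching all cuts at data points plus the two constant rules
    best = 0
    for t in [x for x, _ in data] + [float("inf")]:
        for s in (0, 1):
            hits = sum(((s if x >= t else 1 - s) == y) for x, y in data)
            best = max(best, hits)
    return best / len(data)

tiny = random_problem(2)     # a threshold always "explains" two points
dense = random_problem(400)  # a dense sample reveals the randomness
tiny_acc, dense_acc = best_threshold_acc(tiny), best_threshold_acc(dense)
```

`tiny_acc` is exactly 1.0 for any two points, whereas `dense_acc` stays near one half; nothing about the problem changed, only the sampling density.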
This fact was addressed in many studies on error rate estimation as well.30,46,74 Vapnik's capacity theory78,79 is among the first to directly face the reality of small-sample effects. It provides a link between the interacting effects of classifier geometry and training sample size. But the VC dimension theory is not constructive: it gives only a loose, worst-case bound on the error rate given the geometrical characteristics of the classifier and the sample size. The difficulty in tightening the bounds stems from the distribution-free arguments.12,13 Nevertheless, as we have seen in the theory of stochastic discrimination, by using a different characterization of training set representativeness it is possible to show tighter error bounds without
involving the VC dimension argument. Also, by using specific geometrical models matched to the problem, it is possible to overcome the infamous curse of dimensionality.17

The theory is difficult because these factors interact with one another. With regard to the problem geometry, the classifier geometry, and the sampling and training processes, what exactly do we mean by saying that two or more classifiers are independent? What about related concepts such as correlation, diversity, collinearity, coincidence, and equalization? What do they mean in each context where the decisions are represented as one out of many classes, as permutations of class preferences, or as continuous belief scores that are not necessarily probability estimates? Kleinberg's notion of uniformity offers a rigorous definition under the set-theoretic abstraction; how can it be generalized to other models of classifier decisions? The bias/variance decomposition26 gives another way to relate geometry and probabilities, and has been used in analyzing the decision boundaries of combined systems.76 However, such analyses are often flawed by inadequate assumptions of decision independence. How can one relate local, point-wise measures of classification error and their correlation to collective measures over the entire training set? How do we relate the error rates of individual classifiers to their degree of agreement? If two classifiers agree only on cases that are easy for both of them, can we still tell whether they are independent? How do we know whether the agreement is due to the ease of the cases or because the classifiers simply decide by the same mechanism? Detailed studies of the patterns of correlation among the classifiers are necessary to answer these questions.28,53

As one compares different approaches to combination, and considers combinations of combinations, a few more intriguing questions arise:

• If one defers the final decision and uses the output of the individual classifiers only as scores describing the input, are those scores different in nature from feature measurements on the input? Are there intrinsic differences between the mapping from the input to the representation given by a classifier and that of a feature extractor?

• Is combination a necessity or a convenience? Is complementariness intrinsic to some chosen classifier training process? Or is it just an easy way to derive a desired decision boundary?
• Are there any commonalities among all the combination approaches? If many of them are found to be similarly effective, are they essentially doing the same thing despite superficial differences?

• Does the hierarchy of combinations converge to a limit at which one has exhausted the knowledge contained in the training set, such that no further improvement in accuracy is possible?

6. Precise Characterization of Data Dependences of Performance

Many of the above questions remain open because we do not yet have a scientific understanding of the classifier combination mechanisms. We have not taken a close enough look at what happens in a particular problem. Most empirical studies stop at some collective accuracy claim. But trying a method on 100 published problems and counting how many it wins on does not mean much, because these problems may all be very similar in certain aspects and may not be typical of reality. On the other hand, we will never have a fair sample of realistic problems, because that set is hopelessly vague. So what can we do? If we have a way to characterize problems that is strongly relevant to classifier accuracy, we may hope to find rules relating those characteristics to the behavior of classifiers or systems generated and combined in a specific way. There may be empirical observations that point to opportunities for detailed analysis where theorists can contribute.

Here I am advocating a realist's approach, in which the selection of a classifier or a combination method is guided by the characteristics of the data. And the data characteristics must include the effects of the problem geometry and the sampling density. Statements like "method X is of no help when the training sample is large" are overly simplified. How large is large? An absolute number on the sample size means little without knowledge of the size of a class boundary.
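As one concrete example of such a descriptor (a crude probe in the spirit of the complexity measures in38, not a measure taken from there): the fraction of points whose nearest neighbor carries a different label reflects boundary length and class interleaving, and is cheap to compute.

```python
import random

random.seed(4)

def nearest_enemy_fraction(data):
    # fraction of points whose nearest neighbor has a different label:
    # high values indicate a long or fragmented class boundary
    bad = 0
    for i, (x, y) in enumerate(data):
        j = min((k for k in range(len(data)) if k != i),
                key=lambda k: sum((a - b) ** 2
                                  for a, b in zip(x, data[k][0])))
        if data[j][1] != y:
            bad += 1
    return bad / len(data)

# well separated: the two classes sit in disjoint vertical strips
easy = ([((random.uniform(0.0, 0.4), random.random()), 0) for _ in range(50)]
        + [((random.uniform(0.6, 1.0), random.random()), 1) for _ in range(50)])
# structureless: every label is a fair coin flip
hard = [((random.random(), random.random()), random.randint(0, 1))
        for _ in range(100)]

easy_frac = nearest_enemy_fraction(easy)
hard_frac = nearest_enemy_fraction(hard)
```

The same absolute sample size, 100 points, means very different things for these two problems; descriptors of this kind, rather than raw counts, are what "how large is large" should be answered with.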
We need much more systematic ways to characterize problems. We need a language that describes problems in ways more relevant to the actions of classifiers, i.e., not merely collective descriptors such as the number of classes, samples, and dimensions. We need a better understanding of the geometry and topology of point sets in high-dimensional spaces, of the preservation of such characteristics under feature transformations and sampling processes, and of their interaction with the primitive geometrical
models used in most well-known classifiers. We need to measure or estimate the length and curvature of the class boundaries, the fragmentation of the decision regions in terms of the existence, size, and connectedness of subclasses, and the stability of these characteristics as the sampling density changes. Some recent attempts16,38,39,40,70 are possible starting points. Once we find ways to characterize problems, we can then ask: for what type of problems does a given method work?

We have to accept that a probabilistic flavor will remain whenever we deal with unseen data. This uncertainty is intrinsic; it can be reduced by knowledge of the structural regularity of the problem, but never removed. Given a problem in a fixed feature space, is there a limit on what automatically trainable methods can do? Recall that all such methods are based on particular geometrical primitives, such as convex regions, axis-parallel cuts, rectangular boxes, Gaussian kernels, and piecewise linear boundaries. It remains to be established that such models can fit decision regions of arbitrary shape and degree of connectedness. At what point should we say that it is meaningless to continue training, and that any further improvement in accuracy will come from luck rather than effort? VC dimension theories give us a certain limit, but it may be very remote from what is achievable. By exploiting the structural regularities of the problems and matching them to appropriate classifiers, we should be able to do better than that.

7. Conclusions

I have reviewed several lines of ideas developed over the past decade on methods for combining parallel classifiers. In summary, the main results include:

1. Numerous empirical, quantitative results showing that the accuracy of a combined system can be better than that of each individual classifier.
2. Dozens of practical algorithms for combining the decisions of a small number of classifiers.
3. Several ways to generate a collection of classifiers with complementary strengths.
4. A formal analysis of the relationship between individual classifier accuracy and generalization ability, their mutual agreement, and combined accuracy.
Among the many open questions raised in these studies, those involving the interaction of the geometric and probabilistic characteristics of a problem are the most intriguing. They hold the key to a better understanding and further improvement of the methods. An essential need in this direction is to find a better set of descriptors for the structure of a problem in the feature space, and to describe the behavior of classifiers in corresponding terms. Such descriptors could be used to categorize real-world problems, which would permit the study and prediction of the behavior of various classifiers and combination methods on whole classes of problems. Besides detailed studies of the behavior of each combination technique and its interaction with problem characteristics, several methodological directions are also worth pursuing:

• Ingenious designs of feature extractors and similarity measures that can simplify the class boundary will continue to play an important role in real applications. Systematic searches with the same goal are even more interesting.

• Unsupervised learning will play an increasing role in the context of supervised learning. Clustering methods will be applied more extensively and systematically to better understand the geometry of the class boundaries and their sensitivity to sampling density. Other ways of describing data, such as estimation of the intrinsic and extrinsic dimensionalities of the classes or subclasses, or probabilistic mixture decomposition models, will also be helpful.

• More emphasis should be put on localized (or dynamically selected) classification methods. A blind application of everything to everything will prove inferior to localized methods. Systematic strategies should be developed to fine-tune classifiers to the characteristics of local regions and to associate them with the corresponding input.
• A better understanding is needed of the differences between deterministic and stochastic classifier generation methods. This will require a careful study of the exact role of randomization in various classifier or combination tuning processes, and of the corresponding geometrical effects.

• New methods can come from a merger of the decision optimization and coverage optimization strategies, such that collections of fixed classifiers are enhanced by introducing additional components with enforced complementariness, and coverage optimization methods use more sophisticated decision combination schemes.
By now, classifier combination has become a rich and exciting area with much proven success. This review is biased towards criticisms rather than celebrations of the ideas explored over its brief history. It is my hope that these discussions can call attention to some of the confusion and missing links in the methodology, and point out the more fruitful directions for further research. We can see that some of the recent developments are already moving towards these directions. This is very encouraging.

Acknowledgments

The author wishes to thank George Nagy and the anonymous reviewers for detailed comments on the manuscript, and Eugene Kleinberg for many discussions over the past ten years on the theory of stochastic discrimination, its comparison to other approaches, and perspectives on the fundamental issues in pattern recognition.

References

1. K. J. Arrow, Social Choice and Individual Values, John Wiley & Sons, Inc., New York, 1951; 2nd ed., 1963.
2. J. A. Benediktsson, "Consensus theoretic classification methods," IEEE Transactions on Systems, Man, and Cybernetics, SMC-22, 4, July/August 1992, 688-704.
3. C. Berenstein, L. N. Kanal, D. Lavine, "Consensus and evidence," in E. S. Gelsema, L. N. Kanal (eds.), Pattern Recognition and Artificial Intelligence II, North Holland, 1986, 523-546.
4. R. Berlind, An Alternative Method of Stochastic Discrimination with Applications to Pattern Recognition, Doctoral Dissertation, Department of Mathematics, State University of New York at Buffalo, 1994.
5. D. Black, The Theory of Committees and Elections, Cambridge University Press, London, 1958; 2nd ed., 1963.
6. K. P. Bogart, "Preference structures I: Distances between transitive preference relations," Journal of Mathematical Sociology, 3, 1973, 49-67.
7. K. P. Bogart, "Preference structures II: Distances between asymmetric relations," SIAM Journal of Applied Mathematics, 29, 2, September 1975, 254-262.
8. L. Breiman, "Bagging predictors," Machine Learning, 24, 1996, 123-140.
9. H. Bunke, "Hybrid methods in pattern recognition," in P. A. Devijver, J. Kittler (eds.), Pattern Recognition Theory and Applications, Springer-Verlag, 1987, 367-382.
10. N. C. de Condorcet, Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix, Imprimerie Royale, Paris, 1785.
11. B. V. Dasarathy, B. V. Sheela, "A composite classifier system design: concepts and methodology," Proceedings of the IEEE, 67, 5, May 1979, 708-713.
12. L. Devroye, "Any discrimination rule can have an arbitrarily bad probability of error for finite sample size," IEEE Transactions on Pattern Analysis and Machine Intelligence, 4, 2, March 1982, 154-157.
13. L. Devroye, "Automatic pattern recognition: a study of the probability of error," IEEE Transactions on Pattern Analysis and Machine Intelligence, 10, 4, July 1988, 530-599.
14. T. G. Dietterich, "Machine-learning research: four current directions," AI Magazine, 18, 4, Winter 1997, 97-135.
15. T. G. Dietterich, G. Bakiri, "Solving multiclass learning problems via error-correcting output codes," Journal of Artificial Intelligence Research, 2, 1995, 263-286.
16. R. P. W. Duin, "Compactness and complexity of pattern recognition problems," in C. Perneel (ed.), Proc. Int. Symposium on Pattern Recognition "In Memoriam Pierre Devijver", Royal Military Academy, Brussels, Feb 12, 1999, 124-128.
17. R. P. W. Duin, "Classifiers in almost empty spaces," Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, September 3-8, 2000, II, 1-7.
18. J. L. Engvall, "A least upper bound for the average classification accuracy of multiple observers," Pattern Recognition, 12, 415-419.
19. P. C. Fishburn, The Theory of Social Choice, Princeton University Press, Princeton, 1972.
20. P. C. Fishburn, Interprofile Conditions and Impossibility, Fundamentals of Pure and Applied Economics 18, Harwood Academic Publishers, 1987.
21. Y. Freund, R. E. Schapire, "Experiments with a new boosting algorithm," Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, July 3-6, 1996, 148-156.
22. J. Friedman, T. Hastie, R. Tibshirani, "Additive logistic regression: a statistical view of boosting," Annals of Statistics, 28, 2, April 2000, 337-374.
23. K. S. Fu, Syntactic Pattern Recognition and Applications, Prentice-Hall, 1982.
24. K. S. Fu, "A step towards unification of syntactic and statistical pattern recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-5, March 1983, 200-205.
25. K. Fukunaga, D. L. Kessell, "Estimation of classification error," IEEE Transactions on Computers, 20, 12, December 1971, 1521-1527.
26. S. Geman, E. Bienenstock, R. Doursat, "Neural networks and the bias/variance dilemma," Neural Computation, 4, 1992, 1-58.
27. D. Gernert, "Distance or similarity measures which respect the internal structure of objects," Methods of Operations Research, 43, 1981, 329-335.
28. G. Giacinto, F. Roli, "A theoretical framework for dynamic classifier selection," Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, September 3-8, 2000, II, 8-11.
29. L. A. Goodman, H. Markowitz, "Social welfare functions based on individual rankings," The American Journal of Sociology, 58, 1952, 257-262.
30. D. J. Hand, "Recent advances in error rate estimation," Pattern Recognition Letters, 4, October 1986, 335-346.
31. L. K. Hansen, P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-12, 10, October 1990, 993-1001.
32. L. V. Hedges, I. Olkin, Statistical Methods for Meta-Analysis, Academic Press, 1985.
33. T. P. Hettmansperger, Statistical Inference Based on Ranks, John Wiley & Sons, 1984.
34. T. K. Ho, J. J. Hull, S. N. Srihari, "Decision combination in multiple classifier systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-16, 1, January 1994, 66-75.
35. T. K. Ho, "Adaptive coordination of multiple classifiers," in J. J. Hull, S. L. Taylor (eds.), Document Analysis Systems II, World Scientific Publishing Co., 1997, 371-384.
36. T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 8, August 1998, 832-844.
37. T. K. Ho, "Nearest neighbors in random subspaces," Proceedings of the Second International Workshop on Statistical Techniques in Pattern Recognition, Sydney, Australia, August 11-13, 1998, 640-648.
38. T. K. Ho, M. Basu, "Measuring the complexity of classification problems," Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, September 3-8, 2000, II, 43-47.
39. T. K. Ho, "Complexity of classification problems and comparative advantages of combined classifiers," in J. Kittler, F. Roli (eds.), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Springer, 2000, 97-106.
40. A. Hoekstra, R. P. W. Duin, "On the nonlinearity of pattern classifiers," Proc. of the 13th ICPR, Vienna, August 1996, D271-275.
41. Y. S. Huang, C. Y. Suen, "A method of combining multiple experts for the recognition of unconstrained handwritten numerals," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-17, 1, January 1995, 90-94.
42. A. K. Jain, S. Prabhakar, S. Chen, "Combining multiple matchers for a high security fingerprint verification system," Pattern Recognition Letters, 20, 1999, 1371-1379.
43. A. K. Jain, R. P. W. Duin, J. Mao, "Statistical pattern recognition: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-22, 1, January 2000, 4-37.
44. C. Ji, S. Ma, "Combinations of weak classifiers," IEEE Transactions on Neural Networks, 8, 1, January 1997, 32-42.
45. L. Kanal, B. Chandrasekaran, "On dimensionality and sample size in statistical pattern classification," Pattern Recognition, 3, 1971, 225-234.
46. J. Kittler, P. A. Devijver, "Statistical properties of error estimators in performance assessment of recognition systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, 4, 2, March 1982, 215-220.
47. J. Kittler, M. Hatef, R. P. W. Duin, J. Matas, "On combining classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-20, 3, March 1998, 226-239.
48. J. Kittler, F. Roli (eds.), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Springer, 2000.
49. E. M. Kleinberg, "Stochastic discrimination," Annals of Mathematics and Artificial Intelligence, 1, 1990, 207-239.
50. E. M. Kleinberg, "An overtraining-resistant stochastic modeling method for pattern recognition," Annals of Statistics, 24, 6, December 1996, 2319-2349.
51. E. M. Kleinberg, "On the algorithmic implementation of stochastic discrimination," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-22, 5, May 2000, 473-490.
52. E. M. Kleinberg, "A mathematically rigorous foundation for supervised learning," in J. Kittler, F. Roli (eds.), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Springer, 2000, 67-76.
53. L. I. Kuncheva, C. J. Whitaker, C. A. Shipp, R. P. W. Duin, "Is independence good for combining classifiers?" Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, September 3-8, 2000, II, 168-171.
54. L. Lam, "Classifier combinations: implementations and theoretical issues," in J. Kittler, F. Roli (eds.), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Springer, 2000, 77-86.
55. L. Lam, C. Y. Suen, "Application of majority voting to pattern recognition," IEEE Transactions on Systems, Man, and Cybernetics, SMC-27, 5, September/October 1997, 553-568.
56. M. LeBlanc, R. Tibshirani, "Combining estimates in regression and classification," Journal of the American Statistical Association, 91, 436, December 1996, 1641-1650.
57. D. S. Lee, A Theory of Classifier Combination: The Neural Network Approach, Doctoral Dissertation, Department of Computer Science, State University of New York at Buffalo, 1995.
58. R. C. Luo, "Multisensor integration and fusion in intelligent systems," IEEE Transactions on Systems, Man, and Cybernetics, SMC-19, 5, September/October 1989, 901-931.
59. E. Mandler, J. Schuermann, "Combining the classification results of independent classifiers based on the Dempster/Shafer theory of evidence," in E. S.
Multiple Classifier Combination: Lessons and Next Steps
60.
61. 62.
63. 64.
65.
66.
67.
68. 69. 70.
71.
72. 73. 74. 75.
197
Gelsema, L. N. Kanal, (eds.), Pattern Recognition and Artificial Intelligence, North Holland, 1988, 381-393. V. D. Mazurov, A. I. Krivonogov, V. L. Kazantsev, "Solving of Optimization and Identification Problems by the Committee Methods," Pattern Recognition, 20, 4, 1987, 371-378. R. Meddis, Statistics Using Ranks, A Unified Approach, Basil Blackwell, 1984. S. J. Nowlan, "Competing experts: an experimental investigation of associative mixture models," Technical Report CRG-TR-90-5, Department of Computer Science, University of Toronto, September 1990. D. Partridge, W. B. Yates, "Engineering multiversion neural-net systems," Neural Computation, 8, 4, 1996, 869-893. L. F. Pau, "Fusion of multisensor data in pattern recognition," in J. Kittler, K. S. Fu, and L. F. Pau, (eds.), Pattern Recognition Theory and Applications, Reidel, 1982, 189-201. S. Raudys, V. Pikelis, "On dimensionality, sample size, classification error, and complexity of classification algorithm in pattern recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 3, May 1980, 242-252. S. Raudys, A. K. Jain, "Small sample size effects in statistical pattern recognition: Recommendations for practitioners," IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 3, 1991, 252-264. R. E. Schapire, Y. Freund, P. Bartlett, W. S. Lee, "Boosting the margin: a new explanation for the effectiveness of voting methods," Annals of Statistics, 26, 5, October 1998, 1651-1686. G. Shafer, A Mathematical Theory of Evidence, Princeton University Press, 1976. A. J. C. Sharkey, Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, Springer-Verlag, 1999. S. Y. Sohn, "Meta analysis of classification algorithms for pattern recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2 1 , 11, 1999, 1137-1144. M. Skurichina, R. P. W. Duin, "Boosting in linear discriminant analysis," in J. Kittler, F. 
Roli, (eds.), Multiple Classifier Systems, Lecture Notes in Computer Science 1857, Springer, 2000, 190-199. S. N. Srihari, "Reliability analysis of majority vote systems," Information Sciences, 26, 1982, 243-256. S. N. Srihari, "Reliability analysis of biased majority-vote systems," IEEE Transactions on Reliability, R-31, 1, April 1982, 117-118. G. T. Toussaint, "Bibliography on estimation of misclassification," IEEE Transactions on Information Theory, 20, 4, July 1974, 472-479. J. D. Tubbs, W. O. Alltop, "Measures of confidence associated with combining classification results," IEEE Transactions on Systems, Man, and Cybernetics, SMC-21, 3, May/June 1991, 690-692.
198
T. K. Ho
76. K. Turner, J. Ghosh, "Analysis of decision boundaries in linearly combined neural classifiers," Pattern Recognition, 29, 1996, 341-348. 77. K. Turner, J. Ghosh, "Linear and order statistics combiners for pattern recognition," in A. Sharkey, (ed.), Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, Springer-Verlag, 1999, 127-162. 78. V, Vapnik, Estimation of Dependences Based on Empirical Data, SpringerVerlag, 1982. 79. V. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998. 80. K. D. Wernecke, "A coupling procedure for the discrimination of mixed data," Biometrics, 48, June 1992, 497-506. 81. D. H. Wolpert, "Stacked generalization," Neural Networks, 5, 1992, 241-259. 82. K. Woods, W. P. Kegelmeyer Jr., K. Bowyer, "Combination of multiple classifiers using local accuracy estimates," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-19, 4, April 1997, 405-410. 83. L. Xu, A. Krzyzak, C. Y. Suen, "Methods of combining multiple classifiers and their applications to handwriting recognition," IEEE Transactions on Systems, Man, and Cybernetics, SMC-22, 3, May/June 1992, 418-435.
CHAPTER 8
DESIGN OF MULTIPLE CLASSIFIER SYSTEMS
Fabio Roli and Giorgio Giacinto
Department of Electrical and Electronic Engineering
University of Cagliari
Piazza d'Armi, 09123, Cagliari, Italy
E-mail: {roli,giacinto}@diee.unica.it
In the field of pattern recognition, multiple classifier systems based on the combination of outputs of a set of different classifiers have been proposed as a method for the development of high performance classification systems. In this chapter, the problem of multiple classifier system design is discussed and the reader is provided with a critical survey of the state of the art. A formulation of the design problem that provides motivations for the different design methods described in the literature is proposed. In particular, such a formulation points out the rationale behind the so-called overproduce and choose design paradigm. Six design methods based on this paradigm are described and compared by experiments with three different data sets. Though these design methods have some interesting features, they do not guarantee an optimal multiple classifier system design for the classification task at hand. Accordingly, the main conclusion in this chapter is that optimal design is still an open problem.
1. Introduction

In the past decade, a number of papers 19,28 have proposed the combination of multiple classifiers for designing high performance pattern classification systems. The rationale behind the growing interest in multiple classifier systems (MCSs) is that the classical approach to designing a pattern recognition system, which focuses on the search for the best individual classifier, has some serious drawbacks. 18 The main drawback is that the best individual classifier for the classification task at hand is very difficult to identify,
unless deep prior knowledge is available for such a task. 3,8 In addition, with a single classifier it is not possible to exploit the complementary discriminatory information that other classifiers may encapsulate. It is worth noting that the motivations in favour of MCSs strongly resemble those of a "hybrid" intelligent system. 15,23 The obvious reason for this is that an MCS can be regarded as a special-purpose hybrid intelligent system. A great number of methods for combining multiple classifiers have been proposed in the past ten years. 19 Several techniques for designing complementary classifiers, which when combined improve classification performance, have also been proposed. 19 An overview of works that have focused on the problem of MCS design is given in Section 2. A broader overview can be found in Refs. 19, 20 and 24. Roughly speaking, an MCS includes both an ensemble of different classification algorithms and a decision function for combining classifier outputs. Therefore, the design of MCSs involves two main phases: the design of the classifier ensemble, and the design of the combination function. Though this formulation of the design problem leads one to think that effective design should address both phases, most design methods described in the literature have focused on only one phase. We discuss the rationale behind this design philosophy in Section 3. In particular, methods that focus on classifier ensemble design 19,24 assume a simple, fixed decision function and aim to generate a set of mutually complementary classifiers that achieve optimal accuracy using that decision function. A common approach to the generation of such classifier ensembles is to use some form of "sampling" technique, 1 such that each classifier is trained on a different subset of the training data.
On the other hand, methods that focus on combination function design assume a given set of carefully designed classifiers and aim to find an optimal combination of decisions from those classifiers. In order to perform such an optimization, a large set of combination functions of increasing complexity, ranging from simple voting rules to "trainable" combination functions, is available to the designer. 5,9,19,20 In spite of the fact that some design methods proved very effective, and that the comparative advantages of different methods have been investigated, 12 no clear guidelines are available for choosing the best design method for the classification task at hand. The designer of an MCS has a toolbox containing a large number of instruments for generating and combining classifiers. A myriad of different MCSs can be designed by coupling
different techniques for creating classifier ensembles with different combination functions. However, the best MCS can only be determined by performance evaluation. Accordingly, to design the most appropriate MCS for the task at hand, some researchers 6,7,22 have proposed the so-called "overproduce and choose" paradigm (also called the "test and select" approach 25 ). The basic idea is to produce an initial large set of "candidate" classifier ensembles, and then to select the ensemble whose classifiers can be combined to achieve optimal accuracy. Typically, constraints and heuristic criteria are used to limit the computational complexity of the "choice" phase (e.g., the performances of a limited number of candidate ensembles are evaluated with a simple combination function such as the majority voting rule 22,25 ). This chapter opens with an overview of works dealing with MCS design (Section 2). The pros and cons of the main design approaches are discussed. In Section 3, a formulation of the problem of MCS design is proposed, which allows us to discuss the rationale behind the design approaches in the literature, as well as the limits of these approaches. In addition, the design paradigm based on the concept of "overproduction and choice" is introduced. Section 4 presents some design methods based on such a paradigm. Two methods proposed by Partridge and Yates 22 (Section 4.1), and other methods developed by the authors (Sections 4.3 and 4.4), are described. The measures of classifier "diversity" used for MCS design are discussed in Section 4.2. The performances of the design methods described in this chapter have been assessed and compared using three different data sets. The results are reported in Section 5. Conclusions are drawn in Section 6.

2. Related Work

As pointed out in the previous section, two main design approaches have been proposed.
Following the definitions given by Ho, 12 we refer to methods that focus on the design of the classifier ensemble as "coverage optimization" methods, because they aim to generate a set of mutually complementary classifiers whose combination optimally "covers" the data set (i.e., the combination will classify the data set optimally). Analogously, we refer to methods based on the other design approach as "decision optimization" methods, because they aim to find an optimal decision function to combine classifier decisions. With regard to coverage optimization methods, it is easy to see that the required mutually complementary classifiers should have a certain degree
of "error diversity"; 24 that is, they should make different classification errors. Ideally, classifiers that exhibit high accuracy and great diversity are required. The degree of error diversity obviously depends on the combination function used. As an example, coincident errors can be tolerated if the majority voting rule is used, provided that the majority is always correct. Several techniques have been proposed to generate such mutually complementary classifiers. In the following, we briefly review the main techniques that manipulate the training set to generate complementary classifiers. An overview of alternative techniques can be found in Refs. 2 and 24. Such techniques try to generate complementary classifiers by training them with different data sets. To this end, the most straightforward way is the use of disjoint training sets obtained by splitting the original training set (this technique is called sampling without replacement). Training set re-sampling is used by two very popular techniques called Bagging and Boosting. 1,4 Bagging trains each classifier with a training set that consists of m patterns drawn randomly with replacement from the original training set of m patterns. While Bagging samples each training pattern with equal probability, Boosting focuses on the training patterns that are most often misclassified. Essentially, a set of weights is maintained over the training set and adaptive re-sampling is performed, such that the weights are increased for those patterns that are misclassified. A quite different sampling technique, called the Random Subspace Method, has recently been proposed by Ho. 11 The feature space is randomly sampled instead of the training data, so that complementary classifiers are obtained that have been trained with different feature sets. It should be noted that coverage optimization methods usually generate large classifier ensembles (e.g., ensembles made up of a hundred classifiers).
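As an illustration, the two sampling ideas above can be sketched as follows. This is a minimal sketch, not the algorithms' reference implementations: the base learner (a nearest-centroid rule), the toy data representation, and all function names are our own placeholders.

```python
import random
from collections import Counter

def bootstrap_sample(train, rng):
    """Bagging: draw m patterns with replacement from a training set of m patterns."""
    return [rng.choice(train) for _ in train]

def random_subspace(n_features, k, rng):
    """Random Subspace Method: randomly sample k of the available feature indices."""
    return rng.sample(range(n_features), k)

class NearestCentroid:
    """A deliberately simple base learner, used only for illustration."""
    def fit(self, data, feats):
        self.feats = feats
        sums, counts = {}, Counter()
        for x, y in data:
            counts[y] += 1
            s = sums.setdefault(y, [0.0] * len(feats))
            for i, f in enumerate(feats):
                s[i] += x[f]
        self.centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}
        return self
    def predict(self, x):
        dist = lambda c: sum((x[f] - c[i]) ** 2 for i, f in enumerate(self.feats))
        return min(self.centroids, key=lambda y: dist(self.centroids[y]))

def bagging_ensemble(train, n_classifiers, n_features, rng):
    """Train each classifier on a different bootstrap replicate of the training set."""
    return [NearestCentroid().fit(bootstrap_sample(train, rng), list(range(n_features)))
            for _ in range(n_classifiers)]

def majority_vote(ensemble, x):
    """Combine the crisp decisions of the ensemble by majority voting."""
    return Counter(c.predict(x) for c in ensemble).most_common(1)[0][0]
```

The same `NearestCentroid` could instead be trained on the full data restricted to `random_subspace(n_features, k, rng)`, yielding the Random Subspace variant.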
In practical applications, techniques for selecting the subset of the most diverse classifiers are therefore required. Bagging and Boosting work well for "unstable" classifiers, that is, classifiers whose outputs substantially change in response to small changes in the training data (e.g., decision trees and neural nets). Different techniques should be used for stable algorithms such as the nearest neighbor classifier. The Random Subspace Method is expected to work well when there is a redundancy in the feature set. With regard to decision optimization methods, the MCS designer can search for the optimal combination function within a very large set. This set contains functions designed for classifier fusion or selection. 20,28 Functions
for classifier fusion range from very simple combination rules to sophisticated fusion architectures. Simple fusion functions such as the majority rule, linear combination, etc., require strong assumptions on the degree of error diversity among classifiers (e.g., error independence is assumed). More complex functions, like the so-called Behaviour Knowledge Space, 13 relax these assumptions at the cost of increasing the required size of the training set. Huge training sets are required in order to use the so-called "trainable" fusion strategies, where the outputs of the component classifiers are provided as inputs to another classifier that performs the fusion. Functions for classifier selection are designed to choose, for each pattern, the classifier that is most likely to classify it correctly. 5,9,20 Roughly speaking, each classifier has a domain of superior competence, and is selected when the input pattern falls into this domain. Classifier selection functions do not require strong assumptions. On the other hand, large training sets are required. To sum up, decision optimization methods either require strong assumptions on the degree of error diversity among classifiers, or large training sets. However, it should be noted that coverage optimization methods, too, are effective only if the designer can create an ensemble of classifiers that satisfies strong assumptions on error diversity. Therefore, decision optimization methods should be applied when the designer intends to use a set of highly specialized classifiers that show complex error-correlation patterns. Techniques for removing highly correlated classifiers should be used to simplify the decision optimization task, since the complexity of the required combination function increases with the size of the ensemble. Finally, it should be noted that Kittler 18 has recently proposed a classifier fusion architecture that partially merges the coverage and decision optimization design approaches.
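The simple fixed fusion rules mentioned above, together with the sum and product rules studied by Kittler, can be sketched as follows. This is an illustrative sketch only; the representation of per-classifier posterior estimates as dictionaries, and all function names, are our own assumptions.

```python
from collections import Counter
from functools import reduce

def majority_rule(decisions):
    """Combine crisp class labels by majority voting."""
    return Counter(decisions).most_common(1)[0][0]

def sum_rule(posteriors):
    """Sum rule: pick the class that maximises the sum of the estimated posteriors."""
    classes = posteriors[0].keys()
    return max(classes, key=lambda c: sum(p[c] for p in posteriors))

def product_rule(posteriors):
    """Product rule: pick the class that maximises the product of the estimated posteriors."""
    classes = posteriors[0].keys()
    return max(classes, key=lambda c: reduce(lambda acc, p: acc * p[c], posteriors, 1.0))
```

Note that the sum and product rules operate on soft outputs (posterior estimates), while the majority rule needs only crisp decisions.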
In fact, coverage optimization methods can be embodied within such a fusion architecture. With regard to the problem of combination function optimization, Kittler showed hypotheses under which classifier fusion can be effectively performed by the product function or the sum function. Although many design methods have been proposed, none can be claimed to be the best. In 1992, Wolpert 27 described the guidance available in choosing the appropriate design method as a "black art". Very recently, some quantitative criteria have been proposed. 12 However, clear guidelines are not yet available to choose the best design method for the classification task at hand. Therefore, as pointed out in Section 1, some researchers
proposed the overproduce and choose paradigm. Partridge and Yates 22 described design methods based on this paradigm. They introduced some interesting error diversity measures, which can be used to choose a subset of complementary classifiers. However, they did not propose a systematic method to choose such a subset, but only described an experimental investigation of some heuristic techniques. Sharkey et al. 25 proposed an approach that involves testing candidate ensembles on a validation set, and selecting the best performing ensemble. This approach can be applied effectively by limiting the number of candidate ensembles. In a recent work, Kang and Lee 16 also described a design approach based on the assumption that the number of candidate ensembles is small. Giacinto and Roli 6,7 proposed a method that does not need this assumption, because the choice phase is implemented by a classifier "clustering" that selects a subset of complementary classifiers without having to provide an exhaustive enumeration of all possible candidate ensembles. Other works 26,17 also follow the overproduce and choose paradigm.

3. Design of a Multiple Classifier System: Problem Formulation

In Section 1, we pointed out that MCS design involves two main phases: the design of the classifier ensemble, and the design of the combination function. In order to discuss the rationale behind the design methods in the literature and the limits of these methods, we reformulate the problem of MCS design by pointing out a possible analogy with the design of a pattern recognition system based on a single classifier. The design cycle of a single-classifier pattern recognition system can be modeled by three main phases: the feature design phase (including feature extraction and selection), the classifier design phase, and the performance evaluation phase. (It should be noted that this simplified description of the design cycle is strongly biased in view of our discussion.
For a detailed description of the design cycle of a single-classifier pattern recognition system, the reader is referred to Ref. 3.) MCS design involves phases that can be related to the above-mentioned ones. Figure 1 points out this analogy and the relations among the phases. The feature design phase in a single-classifier pattern recognition system can be related to the ensemble design phase in an MCS, because the goal of both phases is to design a set of "information components" used to perform the final classification. This analogy is
Fig. 1. Design cycles of single classifier (Feature Design → Classifier Design → Performance Evaluation) and multiple classifier (Ensemble Design → Combiner Design → Performance Evaluation) systems.
very clear for MCSs that use a classifier as the combination function. For such MCSs, the outputs of the classifier ensemble are themselves a set of features. The analogy between single-classifier design and combination-function design is quite clear. In MCSs, the "combiner" plays the role of the classical single-classifier decision module. The analogy among the performance evaluation phases is straightforward. Finally, it is worth noting that the analogy between the feature design phase, including feature selection, and the ensemble design phase suggests that a phase of classifier "selection" must be included in the MCS design. This is in agreement with the overproduce and choose design paradigm (Section 2). It is well known that in single-classifier pattern recognition systems an ideal feature set greatly simplifies the task of the classifier, while a very powerful classifier can work well even with a feature set of very poor quality. 3 Based on the above analogy, it is easy to see that an ideal classifier ensemble made up of very accurate and diverse classifiers greatly simplifies the task of the combination function. Analogously, a very powerful combiner can work well even with poorly complementary classifiers. The above leads to the rationale behind the two main design approaches proposed in the literature, namely, the coverage and decision optimization methods (Section 2). Coverage optimization methods attempt to design an ideal classifier ensemble. Accordingly, they assume that simple combination functions are used. On the other hand, decision optimization methods attempt to design a powerful combination function capable of handling poorly complementary classifiers. Nevertheless, as discussed in Section 2, the optimality of either coverage or decision optimization methods is not guaranteed. Therefore, as shown in Fig. 1, feedback from later design phases to the earlier ones may be
Fig. 2. MCS design cycle based on the overproduce and choose paradigm (Ensemble Overproduction → Ensemble Choice → Combiner Design → Performance Evaluation).
necessary. Classifier ensembles or combination functions must be redesigned when the output of the performance evaluation phase is not satisfactory. (It should be noted that the design cycle of a single-classifier pattern recognition system contains a similar feedback loop.) Since a large number of techniques are available to generate different classifier sets (Section 2), the ensemble design phase should produce many candidate ensembles. Such overproduction can help to reduce the number of design iterations. On the other hand, classifier selection should be applied to identify the most appropriate classifier subset for the task at hand, and to simplify the subsequent combiner design phase. Based on the above, the MCS design cycle can be reformulated as shown in Fig. 2. The ensemble design phase has been subdivided into the overproduction and the choice phases. It is worth noting that this formulation of the MCS design problem provides an additional motivation for the overproduce and choose design paradigm proposed in the literature.

4. Design Methods Based on the Overproduce and Choose Paradigm

According to the problem formulation proposed in the previous section, the MCS design cycle can be subdivided into the following phases:

1. Ensemble Overproduction
2. Ensemble Choice
3. Combiner Design
4. Performance Evaluation
The overproduction design phase produces a large set of candidate classifier ensembles. To this end, techniques such as Bagging and Boosting that manipulate the training set can be adopted (Section 2). Different classifiers can also be designed using different initializations of the respective learning parameters, different classifier types, or different classifier architectures. In practical applications, variations of the classifier parameters based on designer expertise can provide very effective candidate classifiers. The choice phase selects the subset of classifiers that can be combined to achieve optimal accuracy. It is easy to see that such an optimal subset could be obtained by exhaustive enumeration. Such a performance evaluation should be performed with a given combination function (e.g., the majority voting rule). Unfortunately, if N is the size of the set produced by the overproduction phase, the number of possible subsets is equal to $\sum_{i=1}^{N} \binom{N}{i} = 2^N - 1$. Therefore, different strategies have been proposed to limit the computational complexity of the choice phase (Section 2). Although the choice phase usually assumes a given combination function to evaluate the performances of classifier ensembles, there is a strong interest in techniques that choose effective classifier ensembles without hypothesizing a specific combination rule. This is suggested by the analogy with the feature selection problem, where techniques have been developed to choose the features that best preserve class separability. 3 Accordingly, techniques to evaluate the error diversity of the classifiers that make up an ensemble have been used for classifier selection purposes. We review some of these techniques in Section 4.2. With regard to combiner design, theoretically speaking, the choice of the combination function should take into account the dependency among classifiers.
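For small N, the exhaustive choice phase can be written down directly. The following sketch (our own toy setting, not from the chapter) assumes each classifier is summarised by a pre-computed 0/1 correctness vector on a validation set, and uses majority voting as the fixed combiner; it makes the $2^N - 1$ cost of enumeration explicit.

```python
from itertools import combinations

def majority_correct(subset_outputs):
    """Fraction of validation patterns on which a strict majority of the subset is correct."""
    n_patterns = len(subset_outputs[0])
    hits = sum(1 for j in range(n_patterns)
               if sum(o[j] for o in subset_outputs) * 2 > len(subset_outputs))
    return hits / n_patterns

def exhaustive_choice(outputs):
    """Evaluate all 2^N - 1 non-empty subsets and return the best one found."""
    n = len(outputs)
    best, best_acc = None, -1.0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            acc = majority_correct([outputs[i] for i in subset])
            if acc > best_acc:
                best, best_acc = subset, acc
    return best, best_acc
```

Even this tiny routine visits every subset, which is why the heuristic and search-based choice strategies discussed below are needed for realistic N.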
In actual practice, a trial and error procedure is performed, because a clear model of the dependency among classifiers is difficult to obtain. Performance evaluation is performed by assessing the classification accuracy of the selected classifier ensemble, using the combination function designed in the previous phase. Though feedback from the performance evaluation phase to earlier phases may be necessary (Fig. 2), it should be noted that the overproduction phase can reduce the number of design iterations, because a large set of candidates is initially available. In addition, the choice phase can simplify the combiner design, because it selects an ensemble of classifiers that can be combined effectively. In the following, without losing generality, we shall assume that the overproduction and choice design phases are so effective that they allow the
designer to use a fixed combination function (e.g., the majority voting rule). Therefore, the proposed design methods do not include a combiner design phase. For simplicity, and without losing generality, feedback from the performance evaluation phase to the earlier phases will be disregarded. Finally, we shall describe the design methods by focusing on the choice phase, because several well-known techniques are available to implement the overproduction phase (e.g., training data re-sampling, the use of different classifiers, etc.). Accordingly, in the following, we shall assume that a large ensemble C made up of N classifiers was created in the overproduction phase:

C = {c_1, c_2, ..., c_N}.    (1)
The choice phase selects the subset C* of classifiers that can be combined to achieve optimal accuracy.
4.1. Methods Based on Heuristic Rules
Partridge and Yates 22 proposed a few design methods following the overproduce and choose paradigm. In particular, they proposed some techniques that exploit heuristic rules to choose classifier ensembles. One technique can be named "choose the best". It assumes an a priori fixed size M of the "optimal" subset C*. Then, it selects the M classifiers from the set C with the highest classification accuracy in order to create the subset C*. The rationale behind this heuristic choice is that all the classifier subsets possess similar degrees of error diversity. Accordingly, the choice is based only on classifier accuracy. The other choice technique proposed by Partridge and Yates can be named "choose the best in the class". For each classifier "class", it chooses the classifier with the highest accuracy. Therefore, if the initial set C is made up of classifiers belonging to three classifier types (e.g., a multilayer perceptron neural net, a k-nearest neighbors classifier, and a radial basis function neural net), a subset C* made up of three classifiers will be created. With respect to the "choose the best" rule, the "choose the best in the class" rule also acknowledges that classifiers of different types should be more error independent than classifiers of the same type. It should be noted that, thanks to the heuristic rules, the computational complexity of the choice phase can be greatly reduced, because it is not necessary to evaluate different classifier subsets. On the other hand, it is obvious that the general validity of such heuristics is not guaranteed.
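These two heuristic rules can be sketched in a few lines. The representation of each candidate classifier as a (name, type, validation accuracy) record is our own assumption, used only to make the rules concrete.

```python
def choose_the_best(candidates, M):
    """'Choose the best': keep the M individually most accurate classifiers."""
    return sorted(candidates, key=lambda c: c[2], reverse=True)[:M]

def choose_the_best_in_the_class(candidates):
    """'Choose the best in the class': keep, for each classifier type, its most accurate member."""
    best = {}
    for name, ctype, acc in candidates:
        if ctype not in best or acc > best[ctype][2]:
            best[ctype] = (name, ctype, acc)
    return list(best.values())
```

Neither rule inspects error diversity explicitly, which is exactly the limitation noted above.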
4.2. Diversity Measures
As pointed out previously, a few measures of error diversity for classifier ensembles have been proposed. Partridge and Yates 22 presented one named "within-set generalization diversity", or simply GD. This measure is computed as follows:

GD = 1 − p(2 both fail) / p(1 fails)    (2)

where p(2 both fail) indicates the probability that two randomly selected classifiers from the set C will both fail on a randomly selected input, and p(1 fails) indicates the probability that one randomly selected classifier will fail on a randomly selected input. GD takes values in the range [0, 1] and provides a measure of the diversity of the classifiers forming the ensemble. Partridge and Yates also proposed a measure named "between-set generalization diversity", or simply GDB. This measure is designed to evaluate the degree of error diversity between two classifier ensembles A and B:

GDB = 1 − p(1 fails in A and 1 fails in B) / max[p(1 fails in A), p(1 fails in B)]    (3)

where the single failure probabilities are defined as above. GDB takes values in the range [0, 1]. Details on the above diversity measures can be found in Ref. 22. Another diversity measure was proposed by Kuncheva et al. 21 Let X = {X_1, X_2, ..., X_M} be a labeled data set. For each classifier c_i, we can define an M-dimensional output vector O_i = [O_{1,i}, ..., O_{M,i}], such that O_{j,i} = 1 if c_i correctly classifies the pattern X_j, and 0 otherwise. The Q statistic allows us to evaluate the diversity of two classifiers c_i and c_k:

Q_{i,k} = (N^{11} N^{00} − N^{01} N^{10}) / (N^{11} N^{00} + N^{01} N^{10})    (4)

where N^{ab} is the number of elements X_j of X for which O_{j,i} = a and O_{j,k} = b (M = N^{00} + N^{01} + N^{10} + N^{11}). Q varies between −1 and 1. Classifiers that tend to classify the same patterns correctly, that is, positively correlated classifiers, will have positive values of Q. Classifiers that make errors on different patterns will have negative values of Q. For statistically independent classifiers, Q_{i,k} = 0. The average Q computed over all possible classifier couples is used to evaluate the diversity of a classifier ensemble.
Giacinto and Roli 7 proposed a simple diversity measure, named "compound diversity", or simply CD, based on the compound error probability for two classifiers c_i and c_j:

CD = 1 − prob(c_i fails, c_j fails).    (5)
As with Q, the average CD computed over all the possible classifier couples is used to evaluate the diversity of a classifier ensemble. Giacinto and Roli also proposed a measure of the degree of error diversity between two classifier ensembles, A and B:

diversity(A, B) = max_{c_i ∈ A, c_j ∈ B} {CD(c_i, c_j)}.    (6)
It should be noted that the above measures, as well as those proposed by Partridge and Yates, are based on similar concepts. Since none of the above measures is demonstrably superior, we have used them all in Sections 4.3 and 4.4, and compared their performances in Section 5.

4.3. Methods Based on Search Algorithms
It is easy to see that search algorithms are the most natural way of implementing the choice phase required by the overproduce and choose design paradigm. Sharkey et al. 25 proposed an exhaustive search algorithm based on the assumption that the number of candidate classifier ensembles is small. In order to avoid the computational burden of exhaustive search, we developed three choice techniques based on search algorithms that were originally developed for feature selection purposes (forward search and backward search 3 ), and for the solution of complex optimization tasks (tabu search 10 ). All these search algorithms use an evaluation function to assess the effectiveness of candidate ensembles. The diversity measures above and the classification accuracy assessed by the majority voting rule have been used as evaluation functions. It should be noted that the following search algorithms avoid exhaustive enumeration, but the selection of the optimal classifier ensemble is not guaranteed.

Forward search

In order to illustrate our search algorithms, let us use a simple example in which the set C created by the overproduction phase is made up of four
Design of Multiple Classifier Systems

Fig. 3. An example of forward search for classifier choice.
classifiers (Fig. 3). The choice phase based on the forward search algorithm starts by creating an ensemble made up of a single classifier (the classifier c
Fig. 4. An example of backward search for classifier choice.
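The greedy forward search illustrated in Fig. 3 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `evaluate` stands for any of the evaluation functions mentioned above (a diversity measure or the majority-voting accuracy on a validation set), and the stopping rule shown (stop when no addition improves the evaluation function) is one of several possible criteria. The backward variant is symmetric, starting from the full set and removing one classifier at a time.

```python
def forward_search(n_classifiers, evaluate, max_size=None):
    """Greedy forward selection of ensemble members.

    `evaluate` maps a list of classifier indices to a score to be
    maximized (e.g. a diversity measure or validation accuracy)."""
    remaining = set(range(n_classifiers))
    # start from the singleton ensemble with the best evaluation value
    best = max(remaining, key=lambda i: evaluate([i]))
    ensemble, score = [best], evaluate([best])
    remaining.discard(best)
    while remaining and (max_size is None or len(ensemble) < max_size):
        cand = max(remaining, key=lambda i: evaluate(ensemble + [i]))
        cand_score = evaluate(ensemble + [cand])
        if cand_score <= score:
            break  # no improvement: the search stops (not exhaustive)
        ensemble.append(cand)
        remaining.discard(cand)
        score = cand_score
    return ensemble, score
```

Because only one classifier is added per step, the search explores O(L^2) candidate ensembles instead of the 2^L required by exhaustive enumeration, which is the motivation stated above.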
a non-monotonic behavior, it can be effective to continue the search process even if the evaluation function is decreasing. Tabu search is based on this concept. In addition, it implements both forward and backward search strategies. The search starts from the full classifier set. At each step, new subsets are created by adding or eliminating one classifier. Then, the subset that shows the highest evaluation function value is selected to create new subsets. It should be pointed out that this subset is selected even if its evaluation function value is lower than in the previous step. In order to avoid creating the same subsets in successive search steps (i.e., in order to avoid "cycles" in the search process), a classifier that has been added or eliminated cannot be inserted/deleted again for a certain number of search steps. Different stop criteria can be used. For example, the search can stop after a certain number of steps, and the best subset created during the search process is returned.

4.4. A Method Based on Clustering of Classifiers
We have developed a choice phase approach that identifies an effective subset of classifiers with a limited computational effort. This approach is based on a few hypotheses on the set C created by the overproduction phase (equations 7-9). Let us assume that this set C is made up of the union of M subsets C_i:

C = C_1 ∪ C_2 ∪ ... ∪ C_M (7)

where the subsets C_i meet the following assumptions:

∀ i, j, i ≠ j: C_i ∩ C_j = ∅ (8)
and the classifiers forming the above subsets satisfy the following conditions:

∀ c_l, c_m, c_n, i ≠ j, c_l, c_m ∈ C_i, c_n ∈ C_j:
prob(c_l fails, c_m fails) > prob(c_l fails, c_n fails). (9)
In the above equation, the terms prob(c_l fails, c_m fails) and prob(c_l fails, c_n fails) are the compound error probabilities of the related classifier couples. Equation 9 states that the compound error probability between any two classifiers belonging to the same subset is greater than that between any two classifiers belonging to different subsets. It is easy to see that such a condition provides a useful guide for the choice of MCS members. In fact, according to equation 9, effective members can be extracted from different subsets C_i. It is clear that the more correlated the classifier errors are within the same subset, and the more independent the classifier errors are in different subsets, the more effective this strategy will be. The extent to which our hypotheses (equations 7-9) can be deemed realistic, and the proposed approach effective, is discussed in Ref. 7. Under the hypotheses of equations 7-9, we define a choice phase made up of the following steps:

• identification of the subsets C_i by clustering of classifiers;
• extraction of the classifiers from the above subsets in order to create an effective classifier ensemble C*.

Clustering of classifiers for subset identification

This step is implemented by the hierarchical agglomerative clustering algorithm.14 The classifiers belonging to set C play the role of the "data", and the subsets C_i represent the data "clusters". The compound error probability among couples of classifiers plays the role of the distance measure used in data clustering. In particular, we define a distance measure between two classifiers based on compound error diversity (Section 4.2):

∀ c_s, c_t ∈ C: d(c_s, c_t) = 1 - prob(c_s fails, c_t fails). (10)
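Equations 10 and 11 together amount to a complete-linkage agglomerative clustering of the classifiers, since the inter-cluster distance of equation 11 is exactly the complete-linkage rule. A minimal pure-Python sketch of this step, under the assumption that each classifier is represented by a boolean vector of its failures on the validation set (the function names are ours):

```python
def cluster_classifiers(errors, n_clusters):
    """Agglomerative clustering of classifiers with the complete-linkage
    rule of equation 11 and the pairwise distance of equation 10.
    `errors` is a list of equal-length boolean error vectors."""
    def pair_dist(ei, ej):
        both = sum(a and b for a, b in zip(ei, ej))
        return 1.0 - both / len(ei)  # equation 10, estimated on validation data

    def cluster_dist(a, b):  # equation 11: max distance over classifier pairs
        return max(pair_dist(errors[i], errors[j]) for i in a for j in b)

    clusters = [[i] for i in range(len(errors))]
    while len(clusters) > n_clusters:
        # merge the two clusters at minimum complete-linkage distance
        a, b = min(((x, y) for x in range(len(clusters))
                    for y in range(x + 1, len(clusters))),
                   key=lambda xy: cluster_dist(clusters[xy[0]], clusters[xy[1]]))
        clusters[a].extend(clusters.pop(b))
    return clusters
```

Classifiers whose errors frequently coincide have a small distance and are merged first, which matches the intended grouping described below.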
According to equation 10, the further apart two classifiers are, the fewer the coincident errors between them. Therefore, the above distance measure assigns classifiers that make a large number of coincident errors to the same cluster, while it assigns classifiers that make few coincident errors to
different clusters. It is easy to see that, in order to perform such clustering of classifiers, a distance measure between two clusters of classifiers is also necessary. We defined the distance between any two clusters C_i and C_j as the maximum distance between two classifiers belonging to such clusters:

∀ C_i, C_j, i ≠ j: d(C_i, C_j) = max_{c_s ∈ C_i, c_t ∈ C_j} {d(c_s, c_t)}. (11)

It is worth noting that, to avoid "overfitting" problems, our method computes all the above distance measures with a validation set.

Creation of subset C*

At each iteration of the clustering algorithm, a candidate ensemble C* = {c*_1, c*_2, ..., c*_M} is created by taking one classifier from each of the M clusters. In particular, for each cluster, the classifier with the maximum average distance from all the other clusters is chosen. The distance between a classifier and a cluster is computed according to equation 11. For each candidate ensemble C*, the M classifiers are then combined by majority voting, and the classification accuracy is computed on a validation data set. Finally, the performances of all the ensembles created during the clustering process are compared, and the one with the highest performance is chosen. It is worth noting that N candidate ensembles are created during the clustering process. Therefore, our approach shows low computational complexity. Further details on this design method can be found in Refs. 6 and 7.

5. Experimental Results

5.1. Data Sets
The following data sets were used in our experiments:

• the Feltwell data set, generated from multisensor remote-sensing image data;
• the Phoneme_CR data set and the Satimage_CR data set, contained in the ELENA database.

The Feltwell data set consists of a set of multisensor remote-sensing images related to an agricultural area near the village of Feltwell (UK). A section (250 x 350 pixels) of a scene acquired by an optical sensor (an Airborne Thematic Mapper scanner) and a radar sensor (a NASA/JPL synthetic aperture radar) was used. Our experiments characterized each
pixel by a fifteen-element feature vector containing brightness values in the six optical bands and the nine radar channels. We selected 10944 pixels belonging to five agricultural classes (i.e., sugar beet, stubble, bare soil, potatoes, and carrots), and randomly subdivided them into a training set (5124 pixels), a validation set (528 pixels), and a test set (5238 pixels). We used a small validation set to simulate real cases where validation data are difficult to obtain. A detailed description of this data set can be found in Refs. 8 and 23. The ELENA (Enhanced Learning for Evolutive Neural Architecture) database consists of various data sets designed for testing and benchmarking classification algorithms. We used two of these data sets: the Phoneme_CR data set and the Satimage_CR data set. The Phoneme_CR data set consists of 5404 phonemes belonging to two classes: nasal and oral vowels. Five numerical attributes are used to characterize each vowel. The Satimage_CR data set was generated from Landsat Multi-Spectral-Scanner satellite image data. This data set contains 6435 patterns belonging to six land cover classes. Each pattern is characterised by 36 numerical attributes. For each data set, the data were randomly subdivided into a training set (50% of the whole data set), a validation set (20%), and a test set (30%). Further details on these data sets can be obtained via anonymous ftp at ftp.dice.ucl.ac.be in the directory pub/neuralnets/ELENA/databases.

5.2. Experiments with the Feltwell Data Set
Our experiments were mainly aimed at:

• assessing the performances of the proposed design methods (Sections 4.3 and 4.4);
• comparing our methods with other design methods proposed in the literature (Section 4.1).

To this end, we performed a number of different overproduction phases, thus creating different initial ensembles C (see equation 1). Such sets were created using different classifier types, namely, Multilayer Perceptrons (MLPs), Radial Basis Function (RBF) neural networks, Probabilistic Neural Networks (PNNs), and the k-nearest neighbor classifier (k-nn). For each classifier type, ensembles were created by varying some design parameters (e.g., the network architecture, the initial random weights, the value of the
k parameter for the k-nn classifier, and so on). In the following, we report the results related to three initial sets C, here referred to as sets C1, C2, and C3, generated by distinct overproduction phases:

• set C1 contains fifty MLPs. Five architectures, with one or two hidden layers and different numbers of neurons per layer, were used. For each architecture, ten training phases with different initial weights were performed. All the networks had fifteen input units and five output units, corresponding to the input features and data classes, respectively (see Section 5.1);
• set C2 contains the same MLPs belonging to C1 and fourteen k-nn classifiers. The k-nn classifiers were obtained by varying the value of the k parameter in the following two ranges: (15, 17, 19, 21, 23, 25, 27) and (75, 77, 79, 81, 83, 85, 87);
• set C3 contains thirty MLPs, three k-nn classifiers, three RBF neural networks, and one PNN. For the RBF neural networks, three different architectures were used.

Experiments with set C1

First, we evaluated the performance of the whole set C1, of the best classifier in the ensemble, and of the ensembles designed by the two methods based on heuristic rules (see Section 4.1). These performances are reported in Table 1 in terms of accuracy, rejection rates, and differences between accuracy and rejection values. The sizes of the selected ensembles are also shown. The classifiers were always combined by the majority-voting rule. A pattern was rejected when the classifiers assigning it to the same data class were not the majority. All the values reported in Table 1 refer to the test set. For the method named "choose the best" (indicated with the term "Best" in Table 1), the performances of ensembles of sizes ranging from 3 through 15 were assessed.
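The majority-voting rule with rejection used throughout the experiments can be sketched as follows (an illustrative sketch; the function name and the `None` reject marker are our assumptions): a pattern is accepted only when the most-voted class is supported by a strict majority of the classifiers.

```python
from collections import Counter

def majority_vote(labels, reject=None):
    """Combine the class labels assigned by the ensemble members.

    Returns the winning class, or `reject` when the classifiers
    voting for the most-voted class are not the majority."""
    winner, votes = Counter(labels).most_common(1)[0]
    return winner if votes > len(labels) / 2 else reject
```

For example, three classifiers voting (1, 1, 2) yield class 1, while (1, 2, 3) yields a rejection, which is how the rejection rates in the tables below arise.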
The size of the ensemble designed by the method named "choose the best in the class" (indicated with the term "Best-class") is five, because five types of classifiers (namely, five types of net architectures) were used to create the ensemble C1 (Section 4.1). For each ensemble, in order to show the degree of error diversity among the classifiers (equation 2), the value of the Generalisation Diversity measure (GD) is reported. Table 1 shows that the design methods based on heuristic rules can improve on the accuracy of the initial ensemble C1, and on that of the best
Table 1. Performances of the whole set C1, of the best classifier in the ensemble, and of the ensembles designed by the two methods based on heuristic rules, namely, the "choose the best" method (indicated with the term "Best") and the "choose the best in the class" method (indicated with the term "Best-class"). For the method "choose the best", the performances of ensembles of sizes ranging from 3 through 15 were assessed. For each ensemble, the value of the GD measure is also reported. The sizes of the selected ensembles are shown.

Ensemble          Size  Accuracy  Rejection  Acc-Rej  GD
Initial set C1     50   89.8357   1.2027     88.6330  0.2948
Best classifier     1   89.2516   0.0000     89.2516  N/A
Best                3   90.2565   0.2673     89.9892  0.2399
Best                5   90.4278   0.4773     89.9505  0.1937
Best                7   90.0134   0.4009     89.6125  0.1801
Best                9   90.0459   0.2673     89.7786  0.1783
Best               11   90.0747   0.3627     89.7120  0.1935
Best               13   89.9732   0.2291     89.7441  0.2008
Best               15   89.9712   0.4391     89.5321  0.2063
Best-class          5   89.9847   0.4964     89.4883  0.2617
Table 2. Performances of the ensembles generated by the design method based on the backward search algorithm. Diversity measures and the accuracy value assessed by the majority-voting rule were used as evaluation functions to guide the search. The evaluation function used is indicated within brackets. The sizes of the selected ensembles are reported.

Choice Method        Size  Accuracy  Rejection  Acc-Rej  GD
Initial set C1        50   89.8357   1.2027     88.6330  0.2948
Best classifier        1   89.2516   0.0000     89.2516  N/A
Backward (GD)          3   89.9981   1.6991     88.2990  0.4752
Backward (CD)          3   90.4890   0.8400     89.6490  0.3573
Backward (Accuracy)   45   89.8517   0.8591     88.9926  0.2950
Backward (Q)           3   88.6901   0.7446     87.9455  0.4129
classifier. However, such improvements are slight. It should be noted that these design methods do not provide improvements in terms of error diversity as assessed by the GD measure. This can be explained by observing that such methods select classifiers on the basis of accuracy, and they do not take error diversity explicitly into account. Tables 2, 3, and 4 report results obtained by design methods based on search algorithms (Section 4.3). The classifiers were always combined by the majority-voting rule. A pattern was rejected when the classifiers
Table 3. Performances of the ensembles generated by the design method based on the forward search algorithm. Diversity measures and the accuracy value assessed by the majority-voting rule were used as evaluation functions to guide the search. The evaluation function used is indicated within brackets. The sizes of the selected ensembles are reported. It should be noted that the search process was carried out starting either from the best classifier or from a randomly selected classifier. The rows from the third to the sixth report results obtained starting from the best classifier. The last four rows show results obtained starting from a randomly selected classifier.

Choice Method        Size  Accuracy  Rejection  Acc-Rej  GD
Initial set C1        50   89.8357   1.2027     88.6330  0.2948
Best classifier        1   89.2516   0.0000     89.2516  N/A
Forward (GD)           3   89.5669   1.6991     88.2990  0.4752
Forward (CD)           3   90.0499   0.8400     89.6490  0.3573
Forward (Accuracy)    11   90.3965   0.8591     88.9926  0.2950
Forward (Q)            3   88.2387   0.7446     87.9455  0.4129
Forward (GD)           3   89.0993   1.2218     87.8775  0.3958
Forward (CD)           7   90.2866   0.7446     89.5420  0.3346
Forward (Accuracy)     7   90.2420   0.6109     89.6311  0.2589
Forward (Q)            3   87.0609   0.5536     86.5073  0.3845
Table 4. Performances of the ensembles generated by the design method based on the tabu search algorithm. Diversity measures and the accuracy value assessed by the majority-voting rule were used as evaluation functions to guide the search. The evaluation function used is indicated within brackets. The sizes of the selected ensembles are reported.

Choice Method     Size  Accuracy  Rejection  Acc-Rej  GD
Initial set C1     50   89.8357   1.2027     88.6330  0.2948
Best classifier     1   89.2516   0.0000     89.2516  N/A
Tabu (GD)           3   89.9459   1.2600     88.6859  0.4613
Tabu (CD)           3   90.1425   0.8400     89.3025  0.3806
Tabu (Accuracy)     9   90.1156   0.9164     89.1992  0.3416
Tabu (Q)            3   89.8180   1.3746     88.4434  0.4826
assigning it to the same data class were not the majority. It should be noted that these design methods improve error diversity, that is, the ensembles are characterised by GD values higher than the ones reported in Table 1. However, the improvements in accuracy with respect to the initial ensemble C1 and the best classifier are similar to those provided by the methods based on heuristic rules.
Table 5. Performances of the ensembles generated by the design method based on classifier clustering. Different diversity measures were used as evaluation functions to guide the search. The evaluation function used is indicated within brackets. The sizes of the selected ensembles are reported.

Choice Method    Size  Accuracy  Rejection  Acc-Rej  GD
Initial set C1    50   89.8357   1.2027     88.6330  0.2948
Best classifier    1   89.2516   0.0000     89.2516  N/A
Cluster (CD)       7   90.5294   0.8209     89.7085  0.3193
Cluster (Q)       49   89.6592   0.8591     88.8001  0.2962
Cluster (GD)       9   89.6179   1.0691     88.5488  0.3788
Table 5 shows the results obtained by our method based on classifier clustering (Section 4.4). Conclusions similar to those for the design methods based on search algorithms can be drawn. It is worth noting that the performances of the various design methods are slightly better than those of the initial ensemble C1 and of the best classifier, but the differences are small. However, it should be noted that the methods based on search algorithms and classifier clustering improve classifier error diversity.

Experiments with sets C2 and C3

The same experiments previously described for set C1 were performed for sets C2 and C3. For the sake of brevity, for each design method, we report the average performances in terms of accuracy and error diversity values. Table 6 shows the average accuracy values of the different design methods applied to sets C2 and C3. Table 7 shows the average error diversity values of the different design methods applied to sets C2 and C3. With regard to the experiments performed on sets C2 and C3, the performances of the various design methods are close to, or better than, those of the initial ensembles and of the best classifier. Significant improvements were obtained for some experiments performed on set C2.

5.3. Experiments with the Satimage_CR Data Set
As with the Feltwell data set, we performed different overproduction phases, thus creating different initial ensembles C. Such sets were created using different classifier types. For each classifier type, ensembles were created by
Table 6. For each design method, the average percentage accuracy value is reported for the experiment with set C2 and for the experiment with set C3.

Choice Method                  Set C2   Set C3
Initial set                    90.4918  89.4645
Best classifier                90.0916  88.2016
Choose the best                90.1090  91.1097
Choose the best in the class   89.9847  92.0613
Backward                       89.8945  92.3871
Forward from the best          89.9024  93.2471
Forward                        89.7408  91.5023
Tabu                           90.0931  93.5092
Clustering                     89.9013  92.1911
Table 7. For each design method, the average error diversity value is reported for the experiment with set C2 and for the experiment with set C3.

Choice Method                  Set C2  Set C3
Initial set                    0.3170  0.3819
Choose the best                0.1989  0.3279
Choose the best in the class   0.2617  0.4905
Backward                       0.3488  0.5851
Forward from the best          0.3400  0.5917
Forward                        0.3270  0.5969
Tabu                           0.3631  0.6225
Clustering                     0.3410  0.5383
varying some design parameters (e.g., the network architecture, the initial random weights, the value of the k parameter for the k-nn classifier, and so on). In the following, we report the average results relating to three initial sets C, here referred to as sets C1, C2, and C3, generated by distinct overproduction phases:

• set C1 contains ninety-two classifiers: sixty MLPs, twenty-six k-nearest neighbor classifiers, five RBF nets, and one PNN;
• set C2 contains sixty MLPs;
• set C3 contains thirty MLPs, five k-nn classifiers, five RBF neural networks, and one PNN.
Table 8. For each design method, the average percentage accuracy values are reported for the three experiments with sets C1, C2, and C3 applied to the Satimage_CR data set.

Choice Method                  Set C1   Set C2   Set C3
Initial set                    91.1734  91.4815  91.1780
Best classifier                90.4145  88.9637  88.9119
Choose the best                91.0452  91.4399  91.1250
Choose the best in the class   91.8678  91.2382  90.7943
Backward                       91.0552  90.9133  90.9714
Forward from the best          91.5933  91.0161  91.1590
Forward                        91.3593  91.1110  90.9963
Tabu                           91.3552  91.0325  91.1627
Clustering                     91.7811  91.4740  91.9537
Table 9. For each design method, the average error diversity values are reported for the three experiments with sets C1, C2, and C3 applied to the Satimage_CR data set.

Choice Method                  Set C1  Set C2  Set C3
Initial set                    0.3284  0.3218  0.3412
Choose the best                0.2829  0.3266  0.3385
Choose the best in the class   0.3472  0.3143  0.3426
Backward                       0.3715  0.3384  0.3616
Forward from the best          0.3842  0.3431  0.3860
Forward                        0.3957  0.3472  0.3782
Tabu                           0.3949  0.3518  0.3878
Clustering                     0.3834  0.3575  0.3722
Table 8 shows the average accuracy values of the different design methods applied to sets C1, C2, and C3 created for the Satimage_CR data set. Table 9 shows the average error diversity values of the different design methods applied to sets C1, C2, and C3 created for the Satimage_CR data set.
5.4. Experiments with the Phoneme_CR Data Set

The same initial sets described in Section 5.3 were used for the experiments with the Phoneme_CR data set. Table 10 shows the average accuracy values of the different design methods applied to sets C1, C2, and C3. Table 11 shows the average error diversity values of the different design methods applied to sets C1, C2, and C3.
Table 10. For each design method, the average percentage accuracy values are reported for the three experiments with sets C1, C2, and C3 applied to the Phoneme_CR data set.

Choice Method                  Set C1   Set C2   Set C3
Initial set                    85.2195  85.0560  84.2690
Best classifier                87.2918  85.3177  85.8112
Choose the best                87.3887  86.4105  86.7895
Choose the best in the class   86.6132  85.2560  84.7625
Backward                       85.3717  84.6468  84.9553
Forward from the best          86.9756  85.3948  85.7496
Forward                        86.5361  84.9090  85.4180
Tabu                           85.6416  85.3563  85.8344
Clustering                     86.7058  86.0117  85.9192
Table 11. For each design method, the average error diversity values are reported for the three experiments with sets C1, C2, and C3 applied to the Phoneme_CR data set.

Choice Method                  Set C1  Set C2  Set C3
Initial set                    0.3989  0.3995  0.4041
Choose the best                0.3400  0.3363  0.4018
Choose the best in the class   0.4593  0.4143  0.4184
Backward                       0.4600  0.4453  0.4606
Forward from the best          0.4907  0.4270  0.4883
Forward                        0.4939  0.4348  0.4751
Tabu                           0.5005  0.4442  0.4798
Clustering                     0.4550  0.4252  0.4870
6. Discussion and Conclusions

Although final conclusions cannot be drawn on the basis of this limited set of experiments, the following observations can be made:

• the design methods based on the overproduce and choose paradigm allow us to create small ensembles of classifiers whose performances are close to, or higher than, those of the large ensembles C created by the overproduction phases;
• in some experiments, the performances of the selected ensemble were significantly higher than those of the initial ensemble C;
• in some experiments, the performances of the selected ensemble were higher than those of the best classifier in the ensemble C;
• in most experiments, it was possible to generate small classifier ensembles without a significant loss of classification accuracy;
• no choice method is demonstrably superior, because the superiority of one over another depends on the classification task at hand.

On the basis of the above observations, some preliminary conclusions can be drawn:

• the overproduce and choose paradigm does not guarantee optimal MCS design for the classification task at hand. Accordingly, optimal MCS design is still an open issue;
• the main motivation behind the use of the overproduce and choose paradigm is that, at present, clear guidelines for choosing the best design method for the classification task at hand are lacking;
• thanks to this design paradigm, it is possible to exploit the large set of tools developed to generate and combine classifiers. The designer can create a myriad of different MCSs by coupling different techniques to create classifier ensembles with different combination functions. Then, the most appropriate MCS can be selected by performance evaluation. It is worth noting that this approach is commonly used in engineering fields where optimal design methods are not available (e.g., software engineering);
• the overproduce and choose paradigm allows us to create MCSs made up of small sets of classifiers. This is a very important feature for practical applications.

To sum up, in this chapter we discussed the problem of MCS design in order to provide the reader with a critical survey of the state of the art. We also proposed a formulation of the MCS design problem that provides motivations for the different design methods proposed in the literature. In particular, our formulation pointed out the rationale behind the overproduce and choose design paradigm. We described and experimentally assessed different design methods based on this paradigm.
Although these design methods show interesting features, they do not guarantee optimal MCS design for the classification task at hand. Accordingly, the main conclusion of this chapter is that optimal MCS design is still an open issue. In conclusion, it is worth remarking that MCS design shows analogies with the classical pattern recognition system design based on a single classifier. We pointed out some of these analogies in Section 3. Based on this work, it seems to the authors that the problem with both designs is the
current lack of clear guidelines for choosing the best design method for the classification task at hand. In the case of single-classifier systems, a possible solution could be the combination of different classifiers. With MCSs, the solution could be to combine different MCSs. However, we think that the overproduce and choose design paradigm provides a more practical and effective solution, because the combination of different MCSs is computationally expensive. Future work should clearly address the problem of defining guidelines in order to:

• assess when the use of MCSs can improve accuracy compared to the use of individual classifiers;
• choose the best MCS design method for the classification task at hand.

We also believe that, since it is difficult to define clear guidelines for all practical situations, the overproduce and choose design paradigm may also be useful for future MCS designers.

Acknowledgments

This work was supported by the Italian Space Agency, within the framework of the project "Metodologie innovative di integrazione, gestione, analisi di dati da sensori spaziali per l'osservazione della idrosfera, dei fenomeni di precipitazione e del suolo" (Innovative methods of integration, management, and analysis of data from space sensors for the observation of the hydrosphere, rainfall, and soil).
References

1. L. Breiman, "Bagging Predictors", Machine Learning 24, 123-140 (1996).
2. T. G. Dietterich, "Ensemble methods in machine learning", in Multiple Classifier Systems, LNCS 1857, J. Kittler and F. Roli, Eds., 1-15 (Springer-Verlag, 2000).
3. R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification (John Wiley & Sons, 2000).
4. Y. Freund and R. Schapire, "Experiments with a new boosting algorithm", Proc. of the Thirteenth Int. Conf. on Machine Learning, 148-156 (1996).
5. G. Giacinto and F. Roli, "Dynamic Classifier Selection based on Multiple Classifier Behaviour", Pattern Recognition 34, 179-181 (2001).
6. G. Giacinto and F. Roli, "Design of effective neural network ensembles for image classification purposes", Image and Vision Computing Journal 19, 697-705 (2001).
7. G. Giacinto and F. Roli, "An approach to the automatic design of multiple classifier systems", Pattern Recognition Letters 22, 25-33 (2001).
8. G. Giacinto, F. Roli and L. Bruzzone, "Combination of Neural and Statistical Algorithms for Supervised Classification of Remote-Sensing Images", Pattern Recognition Letters 21, 385-397 (2000).
9. G. Giacinto, F. Roli and G. Fumera, "Selection of Image Classifiers", Electronics Letters 36, 420-422 (2000).
10. F. Glover and M. Laguna, Tabu Search (Kluwer Academic Publishers, 1997).
11. T. K. Ho, "The random subspace method for constructing decision forests", IEEE Trans. on Pattern Analysis and Machine Intelligence 20, 832-844 (1998).
12. T. K. Ho, "Complexity of classification problems and comparative advantages of combined classifiers", in Multiple Classifier Systems, LNCS 1857, J. Kittler and F. Roli, Eds., 97-106 (Springer-Verlag, 2000).
13. Y. S. Huang and C. Y. Suen, "A method of combining multiple experts for the recognition of unconstrained handwritten numerals", IEEE Trans. on Pattern Analysis and Machine Intelligence 17, 90-94 (1995).
14. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data (Prentice Hall, 1988).
15. A. Kandel and G. Langholz, Hybrid Architectures for Intelligent Systems (CRC Press, 1992).
16. H. Kang and S. Lee, "An information-theoretic strategy for constructing multiple classifier systems", Proc. 15th Int. Conference on Pattern Recognition, Barcelona, Spain, 2, 483-486 (2000).
17. J. Kim, K. Seo and K. Chung, "A systematic approach to classifier selection on combining multiple classifiers for handwritten digit recognition", Proc. Int. Conf. on Document Analysis and Recognition, 459-462 (1997).
18. J. Kittler, "A framework for classifier fusion: is it still needed?", in Advances in Pattern Recognition, LNCS 1876, F. J. Ferri, J. M. Inesta, A. Amin and P. Pudil, Eds., 45-56 (Springer-Verlag, 2000).
19. J. Kittler and F. Roli, Eds., Multiple Classifier Systems, LNCS 1857, 404 pp. (Springer-Verlag, 2000).
20. L. I. Kuncheva, "Combinations of multiple classifiers using fuzzy sets", in Fuzzy Classifier Design, 233-267 (Springer-Verlag, 2000).
21. L. I. Kuncheva, C. A. Whitaker, C. A. Shipp and R. P. W. Duin, "Is independence good for combining classifiers?", Proc. of 15th Int. Conference on Pattern Recognition, Barcelona, Spain, 2, 168-171 (2000).
22. D. Partridge and W. B. Yates, "Engineering multiversion neural-net systems", Neural Computation 8, 869-893 (1996).
23. F. Roli, S. B. Serpico and G. Vernazza, "A hybrid system for 2D image recognition", Proceedings of the IEEE, Special Issue on "Signals and Symbols" 84, 1659-1681 (1996).
24. A. J. C. Sharkey, "Multi-Net Systems", in Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, 1-27 (Springer-Verlag, 1999).
25. A. J. C. Sharkey, N. E. Sharkey, U. Gerecke and G. O. Chandroth, "The 'test and select' approach to ensemble combination", in Multiple Classifier Systems, LNCS 1857, J. Kittler and F. Roli, Eds., 30-44 (Springer-Verlag, 2000).
26. W. Wang, P. Jones and D. Partridge, "Diversity between neural networks and decision trees for building multiple classifier systems", in Multiple Classifier Systems, LNCS 1857, J. Kittler and F. Roli, Eds., 240-249 (Springer-Verlag, 2000).
27. D. H. Wolpert, "Stacked generalisation", Neural Networks 5, 241-259 (1992).
28. L. Xu, A. Krzyzak and C. Y. Suen, "Methods for combining multiple classifiers and their applications to handwriting recognition", IEEE Trans. on Systems, Man, and Cybernetics 22, 418-435 (1992).
CHAPTER 9

FUSING NEURAL NETWORKS THROUGH FUZZY INTEGRATION
A. Verikas Intelligent Systems Laboratory, Halmstad University Box 823, S-301 18, Halmstad, Sweden Department of Applied Electronics, Kaunas University of Technology 3031 Kaunas, Lithuania E-mail: [email protected]
A. Lipnickas and M. Bacauskiene Department of Applied Electronics, Kaunas University of Technology 3031 Kaunas, Lithuania E-mail: [email protected], [email protected]
K. Malmqvist Intelligent Systems Laboratory, Halmstad University Box 823, S-301 18, Halmstad, Sweden E-mail: [email protected]
To improve recognition results, decisions of multiple neural networks can be aggregated into a committee decision. An efficient committee should consist of networks that are not only very accurate, but also diverse in the sense that the network errors occur in different regions of the input space. One issue investigated in this chapter is the effectiveness of the half & half sampling approach in creating accurate neural network committees fused through the Choquet integral. The second issue investigated is the influence of different types of fuzzy measures on the classification accuracy obtained from neural network committees fused by a fuzzy integral. The ordinary fuzzy measure, the λ-fuzzy measure, the cardinal fuzzy measure, and the 2-order additive fuzzy measure are investigated. Parameters determining the fuzzy measures are learned from a training data set. Learning the fuzzy measures through minimization
A. Verikas et al.
228
of the classification error rate resulted in a significant improvement of the accuracy of the committees as compared to learning the measures by minimizing the sum-of-squared error.
1. Introduction

It is well known that a combination of many different neural networks can improve classification accuracy. A variety of schemes have been proposed for combining multiple classifiers. The approaches used most often include the majority vote,[2,28,37,62] averaging,[23,45,55,57] weighted averaging,[22,24,27,41,48,60] the Bayesian approach,[37,63] the fuzzy integral,[9,10,17,18,21,43,53,56,60] the Dempster-Shafer theory,[3,12,50,63] the Borda count,[25] aggregation through order statistics,[8,58,59] probabilistic aggregation,[31,32,33] fuzzy templates,[35,36] and aggregation by a neural network.[7,26,29] In this study, we consider networks that use a 1-of-N coding scheme for outputs.

In practice, outputs from multiple neural networks are usually highly correlated. It is therefore desirable to assign weights not only to individual networks, but also to groups of networks, so as to express the correlations between different networks. Aggregation based on fuzzy integrals possesses this valuable property. In such schemes, outputs of different networks are fused into a final decision by a fuzzy integral with respect to a fuzzy measure. However, to utilize this property, we need to construct fuzzy measures that express the actual interaction among networks with respect to classification performance. The fuzzy measures represent the weights on each network and also the weights on each group of networks. Most often, a separate fuzzy measure is defined for each decision class; the number of weights (coefficients) defining the fuzzy measures is therefore Q·2^L, where Q is the number of decision classes and L is the number of networks. Even for moderate values of L, the number of coefficients is very large.

This large number of coefficients makes procedures for learning the fuzzy measures from examples very expensive in time and memory. Moreover, since the fuzzy measures are monotonic set functions, the Q·2^L coefficients must satisfy Q·L·2^(L-1) constraints, so only constrained optimization techniques can be applied. To get around the problem, a simplified version of the fuzzy integral with a single fuzzy measure for all the decision classes is used, or some simplified type of fuzzy measure (such as the λ-fuzzy measure or the
2-order additive fuzzy measure) is applied. To cope with the computation and memory problems, we use one fuzzy measure that is common to all decision classes.

One issue investigated in this chapter is the influence of different types of fuzzy measures on the classification accuracy obtained from neural network committees fused by a fuzzy integral. The ordinary fuzzy measure, the λ-fuzzy measure, the cardinal fuzzy measure, and the 2-order additive fuzzy measure are investigated.

Numerous previous works on neural network committees have shown that an efficient committee should consist of networks that are not only very accurate, but also diverse in the sense that the network errors occur in different regions of the input space.[39,47,62] Bootstrapping,[4,14,24,65] Boosting,[1,13,62] and AdaBoosting[16,49,51,52] are the most often used approaches for data sampling when training members of neural network committees. It has recently been shown that half & half bagging with majority voting is capable of creating very accurate committees of decision trees.[6] The second issue investigated in this chapter is the effectiveness of the half & half sampling approach in creating accurate neural network committees fused through the Choquet integral.

2. Half & Half Bagging

It has been demonstrated that the AdaBoost algorithm[16] generates classifiers with a low generalization error.[13,15,52] AdaBoost is, however, a complex algorithm. Breiman has recently proposed a very simple alternative, the so-called half & half bagging approach.[6] When tested on decision trees, the approach was competitive with the AdaBoost algorithm.

The basic idea of half & half bagging is very simple. It is assumed that the training set contains N data points. Suppose that k classifiers have already been constructed. To obtain the next training set, randomly select a data point x and present x to the subset of the k classifiers that did not use x in their training sets.
Use the majority vote of that subset of classifiers to predict the class of x. If x is misclassified, put it in the set MC; otherwise, put it in the set CC. Stop when the sizes of both MC and CC are equal to M, where 2M < N. In Ref. 6, M = N/4 has been used. The next training set is the union of the sets MC and CC. In this chapter, we investigate the effectiveness of the half & half bagging approach in the fuzzy-integral-based fusion of neural networks for classification, and we compare the approach with the bootstrapping technique.
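The sampling round described above can be sketched in Python as follows. This is only an illustration of the scheme, not Breiman's code: the function and variable names are ours, and the committee is assumed, for simplicity, to be a list of (training-set, predict) pairs.

```python
import random
from collections import Counter

def half_and_half_sample(n, labels, committee, m, seed=0):
    """One round of half & half sampling (a sketch of Breiman's scheme).

    n         -- number of training points (indices 0..n-1)
    labels    -- labels[i] is the true class of point i
    committee -- list of (train_set, predict) pairs for the classifiers
                 built so far; predict(i) returns a class label for point i
    m         -- target size of both MC (misclassified) and CC (correct)

    Returns the next training set: the union of MC and CC (size 2m).
    """
    rng = random.Random(seed)
    mc, cc = set(), set()
    while len(mc) < m or len(cc) < m:
        i = rng.randrange(n)
        # Out-of-sample subset: classifiers that did not train on point i.
        votes = [predict(i) for train_set, predict in committee
                 if i not in train_set]
        if not votes:
            continue  # every classifier has seen this point; resample
        # Majority vote of the out-of-sample subset.
        predicted = Counter(votes).most_common(1)[0][0]
        if predicted != labels[i]:
            if len(mc) < m:
                mc.add(i)
        elif len(cc) < m:
            cc.add(i)
    return mc | cc
```

Note that the loop only terminates when at least m misclassified and m correctly classified points exist; a production version would bound the number of draws.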
3. Fuzzy Measure and Fuzzy Integral

Let Z be a non-empty finite set.

Definition 1. A set function g : 2^Z → [0,1] is a fuzzy measure if

1) g(∅) = 0; g(Z) = 1,
2) if A, B ⊆ Z and A ⊆ B, then g(A) ≤ g(B),
3) if A_n ⊆ Z for 1 ≤ n < ∞ and the sequence {A_n} is monotonic in the sense of inclusion, then lim_{n→∞} g(A_n) = g(lim_{n→∞} A_n).

Definition 2. Let g be a fuzzy measure on Z. The discrete Choquet integral of a function h : Z → R+ with respect to g is defined as

    C_g(h(z_1), ..., h(z_L)) = Σ_{i=1}^{L} [h(z_i) − h(z_{i−1})] g(A_i)    (1)

where the indices i have been permuted so that 0 ≤ h(z_1) ≤ ... ≤ h(z_L) ≤ 1, A_i = {z_i, ..., z_L}, h(z_0) = 0, and L is the number of elements in the set Z.[19]

4. The Fuzzy Measures Used

We investigate four types of fuzzy measures, namely the ordinary fuzzy measure, the λ-fuzzy measure, the cardinal fuzzy measure, and the 2-order additive fuzzy measure.

The ordinary fuzzy measure (see Definition 1). Since g(∅) = 0 and g(Z) = 1, the ordinary fuzzy measure is defined by 2^L − 2 real coefficients, where L is the number of elements in the set Z.

The λ-fuzzy measure. In general, the ordinary fuzzy measure of a union of two disjoint subsets cannot be directly computed from the ordinary fuzzy measures of the subsets. Sugeno[54] introduced the so-called λ-fuzzy measure, which satisfies the additional property

    g(A ∪ B) = g(A) + g(B) + λ g(A) g(B)    (2)

for all A, B ⊆ Z with A ∩ B = ∅, and for some λ > −1. Let Z = {z_1, z_2, ..., z_L} be a finite set (a set of committee members in our case) and let g^i = g({z_i}). The values g^i are called the densities of the λ-fuzzy measure; the value of λ itself follows from the condition g(Z) = 1. When g is the λ-fuzzy measure, the values g(A_i), with A_i = {z_i, ..., z_L} as in Eq. 1, can be computed recursively:

    g(A_L) = g({z_L}) = g^L,
    g(A_i) = g^i + g(A_{i+1}) + λ g^i g(A_{i+1}),  for 1 ≤ i < L.
The cardinal fuzzy measure. The fuzzy measure can also be chosen so that the measure of a set depends only on the cardinality of the set[46]:

    g(A) = Σ_{j=0}^{i−1} w_{L−j},  ∀A such that |A| = i    (3)

where the w_i are coefficients. If the fuzzy measure is cardinal, the Choquet integral reduces to an ordered weighted averaging (OWA) operator[64] (see Section 5.5).

4.1. The 2-Order Additive Fuzzy Measure
A pseudo-Boolean function is a real-valued function f : {0,1}^L → R. A fuzzy measure can be viewed as a particular case of a pseudo-Boolean function, defined for any A ⊆ Z, such that A is equivalent to a point (z_1, ..., z_L) in {0,1}^L, where z_i = 1 iff z_i ∈ A.[20] It can be shown that any pseudo-Boolean function can be expressed as a multilinear polynomial in L variables

    f(z) = Σ_{T⊆Z} a(T) Π_{i∈T} z_i    (4)

with a(T) ∈ R and z ∈ {0,1}^L. The coefficients a(T), T ⊆ Z, can be interpreted as the Möbius transform of a set function.[42] If a measure is additive and expressed by the coefficients g^i, i = 1, ..., L, then the corresponding pseudo-Boolean function is linear: f(z) = Σ_{i=1}^{L} a_i z_i. Note that g^i = a_i. By extension, the k-order additive fuzzy measure, having a polynomial representation of degree k, can be defined.

Definition 3.[20] A fuzzy measure g defined on Z is said to be k-order additive if its corresponding pseudo-Boolean function is a multilinear polynomial of degree k, i.e. a(T) = 0 for all T such that |T| > k, and there exists at least one T of k elements such that a(T) ≠ 0.

For any K ⊆ Z with |K| ≥ 2, the 2-additive fuzzy measure is defined by

    g(K) = Σ_{i∈K} a_i + Σ_{{i,j}⊆K} a_{ij} = Σ_{{i,j}⊆K} g^{ij} − (|K| − 2) Σ_{i∈K} g^i    (5)

with |K| being the cardinality of K and g^{ij} = g({z_i, z_j}) = a_i + a_j + a_{ij} = g^i + g^j + a_{ij}. The 2-additive fuzzy measure is thus determined by the coefficients g^i and g^{ij}.
To construct the 2-additive fuzzy measure, only L(L + 1)/2 coefficients g^i and g^{ij}, i, j ∈ Z, have to be determined from training data. In order to obtain a monotonic fuzzy measure, the coefficients g^i and g^{ij} must satisfy particular conditions. The monotonicity constraints on the coefficients of the 2-additive fuzzy measure, expressed through g^i and g^{ij}, can be formulated as follows[43]:

    Σ_{j∈K} g^{ij} − Σ_{j∈K} g^j − (|K| − 1) g^i ≥ 0,  ∀i ∈ Z, K ⊆ Z \ {i}    (6)

where |Z| = L. To obtain a fuzzy measure normalized to the interval [0,1], the coefficients g^i and g^{ij} must also satisfy the normalization condition, which can be derived from Eq. 5 for K = Z:

    Σ_{{i,j}⊆Z} g^{ij} − (L − 2) Σ_{i∈Z} g^i = 1.    (7)

4.2. Constructing the 2-Additive Fuzzy Measure
To construct the 2-order additive fuzzy measure, Grabisch proposed the use of the Shapley values and interaction indices,[19] determining the sets of Shapley values and interaction indices manually. Such a manual determination of the indices is, however, not an easy task. In Ref. 43, the 2-additive fuzzy measure is identified by the so-called heuristic HLMS algorithm.

We determine the coefficients of the 2-additive fuzzy measure by minimizing a cost function, and consider two forms of the cost function. The first is the quadratic criterion (Eq. 21), minimized using the quadratic programming technique. The second is the classification error, which we minimize through stochastic optimization.[61] We compute the coefficients of the 2-additive fuzzy measure by using the general formula (Eq. 5), for example, g^{ijl} = g({z_i, z_j, z_l}) = g^i + g^j + g^l + a_{ij} + a_{il} + a_{jl}, and requiring that

    g(Z) = Σ_{i∈Z} g^i + Σ_{{i,j}⊆Z} a_{ij} = 1.    (8)
We ensure the monotonicity of the 2-additive fuzzy measure by imposing the set of constraints given in Eq. 6. In the case of the Möbius representation, the monotonicity and normalization constraints are given by

    a(∅) = 0,
    Σ_{i∈Z} a_i + Σ_{{i,j}⊆Z} a_{ij} = 1,
    a_i ≥ 0,  ∀i ∈ Z,    (9)
    a_i + Σ_{j∈T} a_{ij} ≥ 0,  ∀i ∈ Z, ∀T ⊆ Z \ {i}.
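As an illustration of Eq. 5 and the conditions in Eqs. 6-7, a 2-additive measure given by singleton and pair measures can be evaluated and checked as sketched below. The names are ours, and the enumeration-based monotonicity check is only practical for small L; a real identification would feed these constraints to a quadratic programming solver instead.

```python
from itertools import combinations

def two_additive_g(K, g1, g2):
    """Measure of coalition K under a 2-additive fuzzy measure (Eq. 5).

    g1[i]     -- g({z_i}), singleton measures
    g2[(i,j)] -- g({z_i, z_j}) for i < j, pair measures
    """
    K = sorted(K)
    if len(K) == 0:
        return 0.0
    if len(K) == 1:
        return g1[K[0]]
    pair_sum = sum(g2[p] for p in combinations(K, 2))
    return pair_sum - (len(K) - 2) * sum(g1[i] for i in K)

def is_normalized(L, g1, g2, tol=1e-9):
    """Normalization condition of Eq. 7: g(Z) must equal 1."""
    return abs(two_additive_g(range(L), g1, g2) - 1.0) < tol

def is_monotone(L, g1, g2, tol=1e-9):
    """Monotonicity constraints of Eq. 6, checked by brute-force enumeration."""
    idx = list(range(L))
    for i in idx:
        rest = [j for j in idx if j != i]
        for r in range(1, len(rest) + 1):
            for K in combinations(rest, r):
                pairs = sum(g2[tuple(sorted((i, j)))] for j in K)
                if pairs - sum(g1[j] for j in K) - (len(K) - 1) * g1[i] < -tol:
                    return False
    return True
```

An additive measure (a_{ij} = 0, so g^{ij} = g^i + g^j) passes both checks whenever the singleton measures sum to one.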
5. Aggregation Schemes Used

In our investigations, we used seven aggregation schemes, namely the majority vote, averaging, the median rule, weighted averaging, fusion by the Choquet integral, the linear combination of order statistics (LOS), and aggregation via Zimmermann's compensatory operator. The majority vote rule, averaging, and the median rule were included in the comparisons as simple aggregation rules that do not require any training. The weighted averaging approach serves as the overall reference combination scheme in our comparisons, since the weighted sum is recovered by the Choquet integral with respect to an additive measure. The LOS was included because it can be viewed as a fuzzy integral with a special type of fuzzy measure. Like fuzzy integrals, Zimmermann's compensatory operator is an adaptive fuzzy aggregation operator with trainable aggregation weights, so we have also used this operator in our comparisons. We now briefly describe the aggregation schemes used.
5.1. Simple Aggregation Schemes
Majority vote. The correct class is the one chosen by the most classifiers. If all the classifiers indicate different classes, then the class with the overall maximum output value is selected. Ties can be broken randomly.

Averaging. This approach simply averages the individual classifier outputs. The class yielding the maximum of the averaged output values is chosen as the correct class c:

    c = arg max_{j=1,...,Q} ȳ_j(x),  with ȳ_j(x) = (1/L) Σ_{i=1}^{L} y_{ij}(x)    (10)
where Q is the number of classes, L is the number of classifiers, and y_{ij}(x) represents the jth output of the ith classifier given an input pattern x.

Median rule. It is well known that the median is a robust estimate of the mean.[33] Median aggregation leads to the following rule:

    c = arg max_{j=1,...,Q} MED_{i=1,...,L} ( y_{ij}(x) ).    (11)

5.2. Weighted Averaging
In the first three approaches, each classifier has the same influence on the final decision. Equal weights represent just a single point in the space of possible weights; by exploring other weight combinations we may achieve better performance. For notational convenience, we consider networks with a single output y_i. We denote the desired output of a network by d(x). Thus, the actual output of each network can be written as the desired output plus an error: y_i(x) = d(x) + ε_i(x). The average squared error for the ith network can be written as

    e_i = E[(y_i(x) − d(x))²] = E[ε_i²]    (12)

where E[·] denotes the expectation. A weighted combination of the outputs of a set of L networks (a committee output) can be written as

    y(x) = Σ_{i=1}^{L} w_i y_i(x) = d(x) + Σ_{i=1}^{L} w_i ε_i(x)    (13)

where the weights w_i need to be determined. The error due to the committee can be written as

    e = E[(y(x) − d(x))²] = Σ_{i=1}^{L} Σ_{j=1}^{L} w_i w_j C_{ij}    (14)

where C is the error covariance matrix given by

    C_{ij} = E[(y_i(x_n) − d(x_n))(y_j(x_n) − d(x_n))].    (15)

Optimal values for the weights w_i can be determined by minimizing e. A non-trivial minimum can be found by requiring Σ_{i=1}^{L} w_i = 1.
The solution for w_i is

    w_i = Σ_{j=1}^{L} (C⁻¹)_{ij} / Σ_{k=1}^{L} Σ_{j=1}^{L} (C⁻¹)_{kj}    (16)

where C⁻¹ is the inverse of the error covariance matrix C.
5.3. Aggregation by Zimmermann's Compensatory Operator
Fuzzy integrals are weighted aggregation operators whose weights are defined not only on the individual elements being aggregated, but also on all subsets of them. It is interesting, therefore, to compare aggregation by fuzzy integrals with aggregation by fuzzy operators of another type. We have chosen Zimmermann's compensatory operator for this purpose[66]:

    y(x) = ( Π_{i=1}^{L} y_i(x)^{w_i} )^{1−γ} ( 1 − Π_{i=1}^{L} (1 − y_i(x))^{w_i} )^{γ}    (17)

where Σ_{i=1}^{L} w_i = L, 0 ≤ γ ≤ 1, and y_i(x) ∈ [0,1]. The parameter γ controls the degree of compensation between the union and intersection parts of the operator. The parameter w_i represents the weight associated with the ith network (classifier). Since the operator is continuous and differentiable with respect to γ and the w_i, gradient descent methods can be used to obtain the parameter values that best match the given inputs and the corresponding desired outputs. The constraints on γ and w_i can be eliminated by redefining the parameters as follows[34]: γ = a²/(a² + b²) and w_i = L d_i² / Σ_{k=1}^{L} d_k². Now a, b, and the d_i can be chosen without any constraints.

5.4. Fusion by the Choquet Integral
We assume that the committee members have Q outputs representing Q classes, and that a data point x is to be assigned to one of the classes. The class label c of the data point x is then determined as follows:

    c = arg max_{q=1,...,Q} C_g(q)    (18)

where C_g(q) is the Choquet integral for class q with respect to the fuzzy measure g. The values of the function h(z) appearing in the Choquet integral (see Definition 2) are given by the output values of each member of the committee.
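A sketch of the decision rule in Eq. 18, with the (ordinary) fuzzy measure stored as a dictionary over subsets of committee members. The representation and names are ours; note that with L networks this dictionary has 2^L entries, which is precisely the cost the simplified measures above are meant to avoid.

```python
from itertools import combinations

def choquet(h, g):
    """Choquet integral of h w.r.t. a measure g given as {frozenset: value}."""
    order = sorted(range(len(h)), key=lambda i: h[i])        # ascending h
    total, prev = 0.0, 0.0
    for pos, i in enumerate(order):
        # A_i = {z_(i), ..., z_(L)}: the members with the pos largest supports
        total += (h[i] - prev) * g[frozenset(order[pos:])]
        prev = h[i]
    return total

def fuse(committee_outputs, g):
    """Class label by Eq. 18: argmax over classes of the Choquet integral.

    committee_outputs[i][q] is the output of network i for class q.
    """
    Q = len(committee_outputs[0])
    scores = [choquet([out[q] for out in committee_outputs], g)
              for q in range(Q)]
    return max(range(Q), key=scores.__getitem__)
```

With the cardinal measure g(A) = |A|/L this fusion reduces to plain output averaging.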
5.5. Linear Combination of Order Statistics (LOS)
The discrete Choquet integral can be written as

    C_g = Σ_{i=1}^{L} h(z_i) [g(A_i) − g(A_{i+1})]    (19)

where g(A_{L+1}) = 0. The fuzzy measure can be chosen so that the measure of a set depends only on the cardinality of the set, as shown in Eq. 3. Then g(A_i) does not depend on the ordering h(z_1) ≤ h(z_2) ≤ ... ≤ h(z_L), and the differences can be written as w_i = g(A_i) − g(A_{i+1}). Therefore C_g = Σ_{i=1}^{L} w_i h(z_i).

Let z = (z_1, z_2, ..., z_L) be a vector. The ith order statistic z_(i) of z is the ith smallest element of z, where z_(1) ≤ z_(2) ≤ ... ≤ z_(L). Let w = (w_1, w_2, ..., w_L) be a weight vector constrained so that Σ_{i=1}^{L} w_i = 1 and 0 ≤ w_i ≤ 1, ∀i = 1, 2, ..., L. The linear combination of order statistics of z with the weight vector w is defined as

    LOS(z, w) = Σ_{i=1}^{L} w_i z_(i).    (20)
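Eq. 20 in code (a one-line sketch of ours; the assertion mirrors the weight constraints stated above):

```python
def los(z, w):
    """Linear combination of order statistics, Eq. 20: sum_i w_i * z_(i)."""
    assert abs(sum(w) - 1.0) < 1e-9 and all(0.0 <= wi <= 1.0 for wi in w)
    # Special cases: w = (0,...,0,1) gives the maximum, a single one in the
    # middle position gives the median, and uniform weights give the mean.
    return sum(wi * zi for wi, zi in zip(w, sorted(z)))
```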
Thus, the LOS operator is, in fact, an ordered weighted averaging (OWA) operator,[64] i.e. the Choquet integral with the cardinal fuzzy measure.

6. Learning the Fuzzy Measures

In the first set of experiments, the quadratic programming technique was used to determine the weights of the LOS, the 2-additive fuzzy measure, and the ordinary fuzzy measure. To learn the densities of the λ-fuzzy measure, the random search procedure was utilized,[61] with the quadratic error between the expected and actual output values of the committee as the cost function. We learned the weights of the ordinary fuzzy measure and the 2-additive fuzzy measure by minimizing the quadratic criterion proposed in Ref. 21. The quadratic criterion for two classes with one common fuzzy measure g takes the following form:

    J = Σ_{n=1}^{N_1} [C¹_g(x¹_n) − C²_g(x¹_n) − 1]² + Σ_{n=1}^{N_2} [C²_g(x²_n) − C¹_g(x²_n) − 1]²    (21)
where N_i stands for the number of training samples x^i_1, x^i_2, ..., x^i_{N_i}, i = 1, 2, from the ith class, and C¹_g(x^i_n) and C²_g(x^i_n) are the Choquet integrals for the first and second class, respectively. The learning algorithm for the fuzzy measure can be reduced to a quadratic programming problem.[21] The criterion generalizes to Q classes in the following way:

    J = Σ_{i=1}^{Q} Σ_{n=1}^{N_i} Σ_{j=1, j≠i}^{Q} |ΔC^{ij}_g(x^i_n) − d^{ij}(x^i_n)|²    (22)
where A C ^ ( x j J = C£(xj,) - CjCxjJ, < # ' « ) is the desired value for AC'^xJ,), and Ni is the number of training samples from class i. The optimal weights for the LOS operator have been found by minimizing the following quantity N
Q
EE n—1j=i
' L
(23) .8 = 1
where Σ_{i=1}^{L} w_i = 1 and 0 ≤ w_i ≤ 1, ∀i = 1, 2, ..., L; N is the number of data samples, Q is the number of classes, and z and d stand for the neural network output value and the target value, respectively. This is a quadratic objective function with linear constraints.

Our experiments showed that the generalization (Eq. 22) of the quadratic criterion (Eq. 21) to the multiclass case as a sequence of two-class problems does not work very well in practice. Moreover, the criteria do not reflect the classification error. Therefore, in the second set of experiments, we learned the fuzzy measures by minimizing the classification error rate. Since the error rate is not a differentiable function, we used a random search technique for minimizing the error,[61] chosen for the good convergence properties reported for it. However, any other stochastic optimization algorithm could be applied.

7. Diversity of Networks

One way to measure the diversity of a group of neural networks is to construct κ-error diagrams, as suggested in Ref. 40. The diagrams display the accuracy and diversity of the individual networks. For each pair of networks, the accuracy is measured as the average error rate on the test data set, while the diversity is evaluated by computing the so-called degree-of-agreement statistic κ. Each point in the diagrams corresponds to a pair
of networks and illustrates their diversity and their average accuracy. The κ statistic is computed as

    κ = (θ₁ − θ₂) / (1 − θ₂)    (24)
with θ₁ = Σ_{i=1}^{Q} C_{ii}/N and θ₂ = Σ_{i=1}^{Q} (Σ_{j=1}^{Q} C_{ij}/N)(Σ_{j=1}^{Q} C_{ji}/N), where Q is the number of classes, C is a Q × Q square matrix with C_{ij} containing the number of test data points assigned to class i by the first network and to class j by the second network, and N stands for the total number of test data points. The statistic κ = 1 when two networks agree on every data point, and κ = 0 when the agreement equals that expected by chance. We used the statistic κ to compare the diversities of neural networks obtained through half & half sampling and bootstrapping.

8. Data

The ESPRIT Basic Research Project Number 6891 (ELENA) provides databases and technical reports designed for testing both conventional and neural classifiers.[30] All the databases and technical reports are available via anonymous ftp from ftp.dice.ucl.ac.be in the directory pub/neural-net/ELENA/databases. From the ELENA project we have chosen the artificial data set Clouds and two real data sets, Phoneme and Satimage. Two additional databases have been taken from other sources. One is an artificial database called Ringnorm; the other is the Thyroid database, taken from the PROBEN1 collection, which represents a medical diagnosis task.

The Ringnorm database is a 20-dimensional two-class database.[5] Each class is drawn from a multivariate normal distribution. Breiman reports the theoretical classification error rate to be 1.3%. This database is available from http://www.cs.toronto.edu/~delve/.[11]

The aim of the Thyroid medical database is to diagnose thyroid hyper- or hypo-function. The task is to decide whether the patient's thyroid has overfunction, normal function, or underfunction. It is a 21-dimensional, three-class database containing 7200 examples. The class probabilities are 5.1%, 92.6%, and 2.3%, respectively.

The data sets used are summarized in Table 1. The benchmark (BM) errors presented in Table 1 are taken from the ELENA project and the PROBEN1 database.
Table 1. Summary of the data sets used.

                 Clouds   Phoneme   Satimage   Ringnorm   Thyroid
  # classes         2        2          6          2          3
  # features        2        5          5         20         21
  # samples      5000     5404       6435       5000       7200
  BM Error %     12.3     16.4       11.9        --         1.31
  Bayes %         9.66      --         --        1.30        --

In the ELENA project, the errors presented are the
average errors obtained when using an MLP with two hidden layers of 20 and 10 units, respectively. To solve the Thyroid task, an MLP with two hidden layers of 16 and 8 units, respectively, has been employed.

9. Experimental Testing

All comparisons between the different aggregation schemes presented here have been performed by dividing the data sets into training and testing sets of equal size. We used two sampling techniques when training the neural network committee members, namely half & half bagging and bootstrapping. In all the tests, we trained a set of one-hidden-layer MLPs with 10 sigmoidal hidden units. This architecture was adopted after some experiments, which showed that the network used in the ELENA project was too large. Since we only investigate different aggregation schemes, we have not performed expensive experiments to find the optimal network size for each data set used. Two training techniques have been employed: 1) the Bayesian inference technique[38] to obtain regularized networks and 2) standard backpropagation training with a high number of training epochs (without regularization).

We ran each experiment seven times, and the mean errors and standard deviations presented are calculated from these seven trials. In each trial, the data set used is randomly divided into training and testing parts of the same size. In the half & half bagging approach, the sizes of the data sets MC and CC were set to M = N_learn/4, where N_learn is the size of the learning set. Since one member of a half & half sampled committee was trained on 2M = N_learn/2 data points, the same number of data points was also used to train each member of the bootstrapped committee; the data set of N_learn/2 points was collected by bootstrapping the original training data set.

In the first set of experiments, we obtained the parameters of the aggregation schemes by minimizing the sum-of-squared error, as discussed in
Fig. 1. Classification error as a function of the committee size for the different data sets (curves: BSR, BS, H&HR, H&H).
Section 6. Our first series of experiments investigates the trade-off between the accuracy and diversity of the networks in committees obtained using the different sampling and training techniques. Figure 1 illustrates the test-set classification error of the committees for the different databases as a function of the committee size. Aggregation by the majority vote rule has been used in these experiments. Figures 2-6 present κ-error diagrams illustrating the diversity of the networks for the different data sets.
Fig. 2. κ-error diagrams for the Phoneme data set using bootstrapped (top) and half & half sampled (bottom) committees.
Fig. 3. κ-error diagrams for the Clouds data set using bootstrapped (top) and half & half sampled (bottom) committees.
Fig. 4. κ-error diagrams for the Satimage data set using bootstrapped (top) and half & half sampled (bottom) committees.
Fig. 5. κ-error diagrams for the Thyroid data set using bootstrapped (top) and half & half sampled (bottom) committees.
Fig. 6. κ-error diagrams for the Ringnorm data set using bootstrapped (top) and half & half sampled (bottom) committees.
The following notations are used in the figures: BSR stands for bootstrap sampling with regularized training, BS for bootstrap sampling with training without regularization, H&HR for the half & half sampling approach with regularized training, and H&H for the half & half sampling approach with training without regularization.

After analyzing Figs. 1-6, the following observations can be made. The half & half sampling technique outperforms the bootstrapping approach by creating more accurate neural network committees. As can be seen from the κ-error diagrams, the networks created by bootstrapping form a much tighter cluster than those created by half & half sampling. This is expected, since with the bootstrapping technique each network is trained on a sample drawn from the same distribution. This also explains why half & half sampling outperforms bootstrapping: the lower accuracy of the individual networks produced by the half & half sampling approach is compensated for by their increased diversity. Committees created by the different sampling techniques exhibited approximately the same standard deviation of the classification error. For
both sampling techniques, the regularized committees are, on average, more accurate than the non-regularized ones. However, the accuracy of a committee depends on the trade-off between the accuracy and diversity of its networks. For example, for the Thyroid data set the non-regularized committees are more accurate than the regularized ones. This is not surprising, since, as can be seen from Fig. 5, the slightly lower accuracy of the networks produced by non-regularized learning is compensated for by their noticeably increased diversity.

In the second set of experiments, we investigated the influence of the different types of fuzzy measures on the classification accuracy of committees fused through the Choquet integral. First, the fuzzy measures were learned through minimization of the sum-of-squared error, as explained in Section 6. All the committees were made of 9 members trained with the Bayesian inference technique. The left-hand parts of Tables 2-6 summarize the results of these tests. The following notations are used in the tables: Mean stands for the average test-set classification error in percent, Std is the standard deviation of the error, Best stands for the single neural network with the best average performance, MV stands for the majority vote, AV means averaging, MED stands for the median rule, WA means weighted averaging, LOS means the linear combination of order statistics, ZIM stands for Zimmermann's compensatory operator, λCI stands for the Choquet integral with the λ-fuzzy measure, 2CI means the

Table 2. Performance of the neural network committees for the Phoneme data set.

                   Sum-of-Squared Error            Classification Error Rate
                 Bootstrap      Half & Half      Bootstrap      Half & Half
  Scheme        Mean    Std    Mean    Std      Mean    Std    Mean    Std
  1. Best      14.58   0.38   15.07   0.86     14.58   0.38   15.07   0.86
  2. MV        12.54   0.33   12.02   0.14     12.54   0.33   12.02   0.14
  3. AV        12.73   0.30   11.92   0.35     12.73   0.30   11.92   0.35
  4. MED       12.54   0.33   12.02   0.14     12.54   0.33   12.02   0.14
  5. WA        12.49   0.35   12.07   0.29     12.49   0.35   12.07   0.29
  6. LOS       12.40   0.38   11.94   0.18     12.19   0.38   11.48   0.21
  7. ZIM       12.68   0.26   12.42   0.21     12.29   0.27   11.78   0.28
  8. λCI       12.54   0.35   12.29   0.46     12.13   0.37   11.23   0.34
  9. 2CI       12.49   0.41   12.29   0.41     12.03   0.37   11.20   0.33
  10. OCI      12.07   0.40   11.37   0.32     11.68   0.43   10.84   0.25
Table 3. Performance of the neural network committees for the Clouds data set.

                   Sum-of-Squared Error            Classification Error Rate
                 Bootstrap      Half & Half      Bootstrap      Half & Half
  Scheme        Mean    Std    Mean    Std      Mean    Std    Mean    Std
  1. Best      10.58   0.09   10.77   0.18     10.58   0.09   10.77   0.18
  2. MV        10.44   0.14   10.37   0.14     10.44   0.14   10.37   0.14
  3. AV        10.42   0.18   10.50   0.11     10.42   0.18   10.50   0.11
  4. MED       10.44   0.14   10.37   0.13     10.44   0.14   10.37   0.13
  5. WA        10.40   0.13   10.56   0.25     10.40   0.13   10.56   0.25
  6. LOS       10.36   0.15   10.40   0.17     10.20   0.12   10.11   0.09
  7. ZIM       10.45   0.22   10.50   0.16     10.30   0.14   10.21   0.12
  8. λCI       10.38   0.13   10.40   0.12     10.00   0.11    9.87   0.10
  9. 2CI       10.40   0.13   10.41   0.21      9.98   0.11    9.85   0.07
  10. OCI      10.24   0.05   10.40   0.22      9.97   0.09    9.85   0.07
Table 4. Performance of the neural network committees for the Satimage data set.

                   Sum-of-Squared Error            Classification Error Rate
                 Bootstrap      Half & Half      Bootstrap      Half & Half
  Scheme        Mean    Std    Mean    Std      Mean    Std    Mean    Std
  1. Best      11.88   0.25   11.87   0.21     11.88   0.25   11.87   0.21
  2. MV        11.37   0.19   10.59   0.32     11.37   0.19   10.59   0.32
  3. AV        11.22   0.23   10.59   0.32     11.22   0.23   10.59   0.32
  4. MED       11.33   0.20   10.59   0.36     11.33   0.20   10.59   0.36
  5. WA        11.13   0.21   11.01   0.14     11.13   0.21   11.01   0.14
  6. LOS       11.58   0.24   11.51   0.36     10.93   0.24   10.03   0.22
  7. ZIM       11.10   0.16   11.33   0.15     10.99   0.23   10.11   0.19
  8. λCI       11.13   0.24   11.07   0.16     10.73   0.17    9.80   0.22
  9. 2CI       11.15   0.25   10.96   0.22     10.70   0.18    9.79   0.21
  10. OCI      11.15   0.24   10.64   0.28     10.66   0.24    9.61   0.25
Choquet integral with the 2-additive fuzzy measure, and OCI stands for the Choquet integral with the ordinary fuzzy measure. As can be seen from the tables, except for the Clouds data set, there is an obvious improvement in classification accuracy when combining networks. For the Clouds data set the classification error is quite close to the theoretical limit.
Table 5. Performance of the neural network committees for the Thyroid data set.

                   Sum-of-Squared Error            Classification Error Rate
                 Bootstrap      Half & Half      Bootstrap      Half & Half
  Scheme        Mean    Std    Mean    Std      Mean    Std    Mean    Std
  1. Best       1.51   0.06    1.55   0.18      1.51   0.06    1.55   0.18
  2. MV         1.19   0.08    0.79   0.09      1.19   0.08    0.79   0.09
  3. AV         1.28   0.09    0.75   0.08      1.28   0.09    0.75   0.08
  4. MED        1.20   0.09    0.84   0.12      1.20   0.09    0.84   0.12
  5. WA         1.16   0.14    0.89   0.15      1.16   0.14    0.89   0.15
  6. LOS        1.24   0.09    0.77   0.09      1.13   0.08    0.64   0.07
  7. ZIM        1.25   0.13    0.84   0.09      1.08   0.12    0.62   0.09
  8. λCI        1.18   0.15    0.77   0.11      0.98   0.11    0.57   0.10
  9. 2CI        1.20   0.12    0.82   0.30      0.97   0.11    0.56   0.10
  10. OCI       1.05   0.10    0.68   0.09      0.96   0.10    0.56   0.10
Table 6. Performance of the neural network committees for the Ringnorm data set.

                                1.Best  2.MV   3.AV   4.MED  5.WA   6.LOS  7.ZIM  8.λCI  9.2CI  10.OCI
Sum-of-Squared Error Scheme
  Bootstrap     Mean             7.07   2.43   2.15   2.43   2.14   2.32   2.18   2.13   2.14   2.01
                Std              0.31   0.21   0.15   0.21   0.14   0.20   0.11   0.15   0.15   0.19
  Half & Half   Mean             7.01   1.99   1.70   1.99   1.67   1.93   1.85   1.65   1.64   1.58
                Std              0.46   0.12   0.21   0.12   0.22   0.16   0.28   0.22   0.22   0.13
Classification Error Rate Scheme
  Bootstrap     Mean             7.07   2.43   2.15   2.43   2.14   1.97   1.90   1.86   1.85   1.81
                Std              0.31   0.21   0.15   0.21   0.14   0.11   0.11   0.08   0.08   0.10
  Half & Half   Mean             7.01   1.99   1.70   1.99   1.67   1.53   1.44   1.42   1.41   1.40
                Std              0.46   0.12   0.21   0.12   0.22   0.17   0.16   0.17   0.16   0.16
Aggregation by the Choquet integral with the ordinary fuzzy measure, on average, provided the best overall performance. However, this aggregation scheme outperforms the others only slightly, and the improvement is achieved at the expense of high computing time and memory requirements. As can be seen from the tables, on average, the half & half sampling technique creates more accurate committees than the bootstrapping approach. Examining the
tables, we see that there is no significant difference in accuracy between the committees produced by the simple combining techniques, namely the majority vote, averaging and the median rule, and those produced by the trainable ones. This observation indicates that neither learning the fuzzy measures by minimizing the sum-of-squared error nor the generalization (Eq. 22) of the quadratic criterion (Eq. 21) to the multiclass case works very well in practice. Therefore, in the last series of experiments we learned the aggregation weights (the parameters of the fuzzy measures) through minimization of the classification error rate. A random search technique was used for the minimization. 61 Regularized training was used in these experiments. The right-hand parts of Tables 2-6 summarize these results. Comparing the left- and the right-hand parts of Tables 2-6, we find that for all the trainable aggregation schemes, minimizing the classification error improved committee performance. In all the tests performed, the half & half sampled committees significantly outperformed the bootstrapped ones. Tables 2-6 also show that, in spite of the huge difference in the number of parameters defining the λ-fuzzy measure, the 2-additive fuzzy measure and the ordinary fuzzy measure, the three measures provided approximately the same classification accuracy. The LOS and Zimmermann's compensatory operators performed slightly worse. Our results therefore indicate that the λ-fuzzy measure, which requires by far the fewest parameters, would be the best choice.

10. Conclusions

In this chapter, we used the half & half bagging approach for creating neural network committees for classification. The half & half sampling approach was compared with the bootstrapping technique. In all the tests performed, the half & half bagging approach outperformed the bootstrapping technique. On average, half & half bagging created less accurate but more diverse networks than bootstrapping.
However, the lower accuracy of the networks produced by half & half bagging was compensated for by their increased diversity. The regularized committees provided a higher classification accuracy than the unregularized ones. The influence of different types of fuzzy measures on the classification accuracy of neural network committees fused through the Choquet integral has also been investigated. Four types of fuzzy measures, namely the ordinary fuzzy measure, the λ-fuzzy measure, the 2-additive fuzzy measure,
and the cardinal fuzzy measure have been tested. The aggregation schemes based on the Choquet integral provided the best performance. However, we found no significant difference in the classification accuracy of committees fused through the Choquet integral with respect to the λ-fuzzy measure, the 2-additive fuzzy measure and the ordinary fuzzy measure. These results indicate that the very simple λ-fuzzy measure would be the best choice. Learning the fuzzy measures through minimization of the classification error rate resulted in a significant improvement in the accuracy of the committees as compared to learning the measures by minimizing the sum-of-squared error.
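The half & half sampling scheme used throughout the chapter can be sketched as follows. This follows the idea of Breiman's half & half bagging (Ref. 6): half of each training sample is drawn from cases the current committee misclassifies and half from cases it classifies correctly. The function names and the guard on the number of draws are our assumptions for illustration.

```python
import random

def half_and_half_sample(X, y, committee_predict, n_sample, rng=None,
                         max_draws=100_000):
    """Sketch of half & half sampling: draw cases at random, keep
    misclassified ones until half the sample is full, and correctly
    classified ones until the other half is full."""
    rng = rng or random.Random()
    half = n_sample // 2
    miss, hit = [], []
    for _ in range(max_draws):          # guard, e.g. against a perfect committee
        if len(miss) >= half and len(hit) >= half:
            break
        i = rng.randrange(len(X))
        if committee_predict(X[i]) != y[i]:
            if len(miss) < half:
                miss.append(i)
        elif len(hit) < half:
            hit.append(i)
    chosen = miss + hit
    return [X[i] for i in chosen], [y[i] for i in chosen]
```

Each new committee member is then trained on such a sample, so later networks concentrate on the hard boundary points, which is one plausible source of the diversity observed above.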
References

1. R. Avnimelech and N. Intrator, Boosting regression estimators. Neural Computation 11, 499-520 (1999).
2. R. Battiti and A. M. Colla, Democracy in neural nets: Voting schemes for classification. Neural Networks 7(4), 691-707 (1994).
3. I. Bloch, Some aspects of Dempster-Shafer evidence theory for classification of multi-modality medical images taking partial volume effect into account. Pattern Recognition Letters 17(8), 905-919 (1996).
4. L. Breiman, Bagging predictors. Technical report 421 (Statistics Department, University of California, Berkeley, 1994).
5. L. Breiman, Bias, variance and arcing classifiers. Technical report 460 (Statistics Department, University of California, Berkeley, 1996).
6. L. Breiman, Half & half bagging and hard boundary points. Technical report 534 (Statistics Department, University of California, Berkeley, 1998), <www.stat.berkley.edu/users/breiman>.
7. M. Ceccarelli and A. Petrosino, Multi-feature adaptive classifiers for SAR image segmentation. Neurocomputing 14, 345-363 (1997).
8. W. Chen, P. D. Gader and H. Shi, Improved dynamic programming-based handwritten word recognition using optimal order statistics. In Proceedings of the International Conference "Statistical and Stochastic Methods in Image Processing II", 246-256 (San Diego, 1997).
9. Z. Chi, H. Yan and T. D. Pham, Fuzzy Algorithms: With Applications to Image Processing and Pattern Recognition (World Scientific Publishing, 1996).
10. S. B. Cho and J. H. Kim, Combining multiple neural networks by fuzzy integral for robust classification. IEEE Trans. Systems, Man, and Cybernetics 25(2), 380-384 (1995).
11. Delve,
12. T. Denoeux, A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Trans. Systems, Man, and Cybernetics B 25(5), 804-813 (1995).
Fusing Neural Networks Through Fuzzy Integration
13. H. Drucker and C. Cortes, Boosting decision trees. In Advances in Neural Information Processing Systems 8, ed. D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, 479-485 (MIT Press, 1996).
14. B. Efron and R. Tibshirani, An Introduction to the Bootstrap (Chapman and Hall, London, 1993).
15. Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, 148-156 (1996).
16. Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 119-139 (1997).
17. P. D. Gader, M. A. Mohamed and J. M. Keller, Fusion of handwritten word classifiers. Pattern Recognition Letters 17, 577-584 (1996).
18. M. Grabisch, Fuzzy integral in multicriteria decision making. Fuzzy Sets and Systems 69, 279-298 (1995).
19. M. Grabisch, The representation of importance and interaction of features by fuzzy measures. Pattern Recognition Letters 17, 567-575 (1996).
20. M. Grabisch, k-order additive discrete fuzzy measures and their representation. Fuzzy Sets and Systems 92, 167-189 (1997).
21. M. Grabisch and J.-M. Nicolas, Classification by fuzzy integral: Performance and tests. Fuzzy Sets and Systems 65, 255-271 (1994).
22. S. Hashem, Optimal linear combinations of neural networks. Neural Networks 10(4), 599-614 (1997).
23. J. B. Hampshire and A. H. Waibel, The meta-pi network: Building distributed representations for robust multisource pattern recognition. IEEE Trans. Pattern Analysis and Machine Intelligence 14(7), 751-769 (1992).
24. T. Heskes, Balancing between bagging and bumping. In Advances in Neural Information Processing Systems 9, ed. M. C. Mozer, M. I. Jordan and T. Petsche, 466-472 (MIT Press, 1997).
25. T. K. Ho, J. J. Hull and S. N. Srihari, Decision combination in multiple classifier systems. IEEE Trans. Pattern Analysis and Machine Intelligence 16(1), 66-75 (1994).
26. Y. S. Huang and C. Y. Suen, A method of combining multiple classifiers - a neural network approach. In Proceedings of the 12th International Conference on Pattern Recognition, ATi-A7h (Jerusalem, Israel, 1994).
27. R. A. Jacobs, Methods for combining experts' probability assessments. Neural Computation 7(5), 867-888 (1995).
28. C. Ji and S. Ma, Combined weak classifiers. In Advances in Neural Information Processing Systems 9, ed. M. C. Mozer, M. I. Jordan and T. Petsche, 494-500 (MIT Press, 1997).
29. M. I. Jordan and L. Xu, Convergence results of the EM approach to mixtures of experts architectures. Neural Networks 8, 1409-1431 (1995).
30. C. Jutten, A. Guerin-Dugue, C. Aviles-Cruz, J. L. Voz and D. Van Cappel, ESPRIT basic research project number 6891 ELENA (1995).
31. H.-J. Kang, K. Kim and J. H. Kim, Optimal approximation of discrete probability distribution with kth-order dependency and its application to combining multiple classifiers. Pattern Recognition Letters 18, 515-523 (1997).
32. J. Kittler, A. Hojjatoleslami and T. Windeatt, Strategies for combining classifiers employing shared and distinct pattern representations. Pattern Recognition Letters 18, 1373-1377 (1997).
33. J. Kittler, M. Hatef, R. P. W. Duin and J. Matas, On combining classifiers. IEEE Trans. Pattern Analysis and Machine Intelligence 20(3), 226-239 (1998).
34. R. Krishnapuram and J. Lee, Fuzzy-set-based hierarchical networks for information fusion in computer vision. Neural Networks 4, 335-350 (1992).
35. L. I. Kuncheva and L. C. Jain, Designing classifier fusion systems by genetic algorithms. IEEE Trans. Evolutionary Computation 4(4), 327-336 (2000).
36. L. I. Kuncheva, J. C. Bezdek and R. P. W. Duin, Decision templates for multiple classifier fusion. Pattern Recognition 34(2), 299-314 (2001).
37. L. Lam and C. Y. Suen, Optimal combination of pattern classifiers. Pattern Recognition Letters 16, 945-954 (1995).
38. D. J. MacKay, Bayesian interpolation. Neural Computation 4, 415-447 (1992).
39. R. Maclin and J. W. Shavlik, Combining the predictions of multiple classifiers: Using competitive learning to initialize neural networks. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (1995).
40. D. Margineantu and T. G. Dietterich, Pruning adaptive boosting. In Proceedings of the Fourteenth International Conference on Machine Learning, 211-218 (Morgan Kaufmann, San Francisco, 1997).
41. C. J. Merz and M. J. Pazzani, Combining neural network regression estimates with regularized linear weights. In Advances in Neural Information Processing Systems 9, ed. M. C. Mozer, M. I. Jordan and T. Petsche, 564-570 (MIT Press, 1997).
42. R. Mesiar, Generalizations of k-order additive discrete fuzzy measures. Fuzzy Sets and Systems 102, 423-428 (1999).
43. L. Mikenina and H.-J. Zimmermann, Improved feature selection and classification by the 2-additive fuzzy measure. Fuzzy Sets and Systems 107, 197-218 (1999).
44. A. R. Mirhosseini, H. Yan, K.-M. Lam and T. Pham, Human face recognition: An evidence aggregation approach. Computer Vision and Image Understanding 71(2), 213-230 (1998).
45. P. W. Munro and B. Parmanto, Competition among networks improves committee performance. In Advances in Neural Information Processing Systems 9, ed. M. C. Mozer, M. I. Jordan and T. Petsche, 592-598 (MIT Press, 1997).
46. T. Murofushi and M. Sugeno, Some quantities represented by the Choquet integral. Fuzzy Sets and Systems 56, 229-235 (1993).
47. D. W. Opitz and J. W. Shavlik, Generating accurate and diverse members of a neural-network ensemble. In Advances in Neural Information Processing Systems 8, ed. D. S. Touretzky, M. C. Mozer and M. E. Hasselmo, 535-541 (MIT Press, 1996).
48. M. P. Perrone and L. N. Cooper, When networks disagree: Ensemble method for neural networks. In Neural Networks for Speech and Image Processing, ed. R. J. Mammone (Chapman-Hall, 1993).
49. G. Ratsch, T. Onoda and K. R. Muller, Regularizing AdaBoost. In Advances in Neural Information Processing Systems 11, ed. M. S. Kearns, S. A. Solla and D. A. Cohn, 564-570 (MIT Press, 1999).
50. G. Rogova, Combining the results of several neural network classifiers. Neural Networks 7(5), 777-781 (1994).
51. R. E. Schapire, Using output codes to boost multiclass learning problems. In Proceedings of the Fourteenth International Conference on Machine Learning, 313-321 (Morgan Kaufmann, San Francisco, 1997).
52. H. Schwenk and Y. Bengio, Training methods for adaptive boosting of neural networks. In Advances in Neural Information Processing Systems 10, ed. M. I. Jordan, M. S. Kearns and S. A. Solla, 647-653 (MIT Press, 1998).
53. H. Shi, P. D. Gader and W. Chen, Fuzzy integral filters: Properties and parallel implementation. Real-Time Imaging 4(4), 233-241 (1998).
54. M. Sugeno, Fuzzy measures and fuzzy integrals: A survey. In Automata and Decision Making, 89-102 (North Holland, Amsterdam, 1977).
55. M. Taniguchi and V. Tresp, Averaging regularized estimators. Neural Computation 9, 1163-1178 (1997).
56. H. Tahani and J. M. Keller, Information fusion in computer vision using the fuzzy integral. IEEE Trans. Systems, Man and Cybernetics 20(3), 733-741 (1990).
57. K. Tumer and J. Ghosh, Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition 29(2), 341-348 (1996).
58. K. Tumer and J. Ghosh, Classifier combining through trimmed means and order statistics. In Proceedings of the International Joint Conference on Neural Networks (Anchorage, Alaska, 1998).
59. K. Tumer and J. Ghosh, Linear and order statistics combiners for pattern classification. In Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, ed. A. J. C. Sharkey, 127-162 (Springer-Verlag, 1999).
60. A. Verikas, A. Lipnickas, K. Malmqvist, M. Bacauskiene and A. Gelzinis, Soft combination of neural classifiers: A comparative study. Pattern Recognition Letters 20, 429-444 (1999).
61. A. Verikas and A. Gelzinis, Training neural networks by stochastic optimisation. Neurocomputing 30, 153-172 (2000).
62. S. Waterhouse and G. Cook, Ensemble methods for phoneme classification. In Advances in Neural Information Processing Systems 9, ed. M. C. Mozer, M. I. Jordan and T. Petsche, 800-806 (MIT Press, 1997).
63. L. Xu, A. Krzyzak and C. Y. Suen, Methods for combining multiple classifiers and their applications to handwriting recognition. IEEE Trans. Systems, Man, and Cybernetics 22(3), 418-435 (1992).
64. R. R. Yager, On ordered weighted averaging aggregation operators in multicriteria decision making. IEEE Trans. Systems, Man, and Cybernetics 18, 183-190 (1988).
65. J. Zhang, Developing robust non-linear models through bootstrap aggregated neural networks. Neurocomputing 25, 93-113 (1999).
66. H.-J. Zimmermann and P. Zysno, Decisions and evaluations by hierarchical aggregation of information. Fuzzy Sets and Systems 10(3), 243-260 (1984).
CHAPTER 10

HYBRID DATA MINING METHODS IN IMAGE PROCESSING
Aljoscha Klose and Rudolf Kruse Dept. of Knowledge Processing and Language Engineering Otto-von-Guericke-University of Magdeburg D-39106 Magdeburg, Germany E-mail: [email protected]
In many applications, the analysis of images is a valuable source of information. However, advances in technology lead to increasing amounts of image data that cannot be analyzed manually and that make an automatic analysis more and more important. The automatic extraction of knowledge from images is often a difficult and complex process. We present applications from remote sensing, where we use hybrid Data Mining methods to analyze the machine vision process. The gained knowledge is exploited to improve the image processing algorithms.
1. Introduction

The advances in computers and sensor technology have led to the collection of huge amounts of data by companies and scientific or governmental institutions. However, even though data is available, it is often difficult to extract knowledge and useful patterns from it. In response to these problems, "Data Mining" has emerged as a new area of research over the last several years. Data Mining — also known as "Knowledge Discovery in Databases" — is an interdisciplinary approach to the exploration and exploitation of information that is hidden in the archives of tables, images, sounds or texts. One explicit aim of data mining is the extraction of patterns that can be understood by humans. The extracted patterns can then help a user to better understand the problem domain. To accomplish this goal, we
need techniques that can extract knowledge from large, dynamic, multirelational, multi-media information sources, and which close the semantic gap between structured data and human notions and concepts. In other words, we must be able to translate computer representations into human notions and concepts, and vice versa. We think that hybrid combinations of the techniques usually referred to as "Soft Computing" are an appropriate means to achieve these aims. In the sections below, we outline the key concepts of the Data Mining techniques used in our work, and the role that images play in Data Mining as heterogeneous sources of information. In our application, we deal with remotely sensed images. We concentrate on approaches towards the understanding and support of image processing algorithms. In our first study, described in Sec. 2, we analyze information gathered during a machine vision process and use the gained knowledge to improve the efficiency of the algorithms. Secs. 3 and 4 aim at a better understanding of the algorithms and the influence of individual parameters. The ultimate goal is to give the user insight into the algorithms and thus support him in applying them.

1.1. Fuzzy Systems
As the goal of fuzzy systems has always been to model human expert knowledge and to produce systems that are easy to understand, we expect fuzzy techniques to play a prominent role in data mining. They are a suitable means to represent similarity, preference, and uncertainty. 1 They can state how similar an object or case is to a prototypical one, they can indicate preferences between suboptimal solutions to a problem, or they can model uncertainty about the true situation, if this situation is described in imprecise terms. For solving practical problems, all of these interpretations are needed, and all have proven useful in applications. In data analysis, fuzzy sets are used in several ways.2 We use fuzzy set methods in two ways: first, as a general and intuitive means to model domain knowledge and problem specific measures, where the extension of classical set theory is most important; secondly, to model linguistic terms to do what Zadeh called Computing with Words.3 We mainly use these linguistic terms in fuzzy if-then rules. The antecedent of such a fuzzy rule consists of fuzzy descriptions of input values, and the consequent defines a — possibly fuzzy — output value for the given input.
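The linguistic terms and fuzzy if-then rules just described can be illustrated with a small sketch. The triangular membership functions, the features (line length and contrast) and all threshold values below are hypothetical, chosen only to show the mechanics of rule evaluation with min as the fuzzy AND.

```python
def tri(a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# Hypothetical linguistic terms for two image features:
long_line = tri(50.0, 100.0, 150.0)      # "length is long" (pixels)
high_contrast = tri(0.4, 0.7, 1.0)       # "contrast is high" (normalized)

def rule_degree(length, contrast):
    """IF length is long AND contrast is high THEN ... (min = fuzzy AND)."""
    return min(long_line(length), high_contrast(contrast))
```

For a line of length 120 with contrast 0.8 the rule fires to degree 0.6, a graded answer where a crisp threshold would give only yes or no.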
1.2. Neural Networks
Neural networks are often considered data mining methods. However, as the knowledge in neural networks is stored in numeric network connections, they do not provide human-understandable information about the data. Therefore, neural networks are often combined with fuzzy systems. The idea of combining fuzzy systems and neural networks into neuro-fuzzy systems is quite intuitive: we use a fuzzy system to represent knowledge in an interpretable manner, and use the learning ability of neural networks to determine membership values. The drawbacks of both individual approaches — the black box behavior common to neural networks, and the problem of finding suitable membership values for fuzzy systems — can thus be avoided. A neuro-fuzzy system constitutes an interpretable model that can use problem-specific prior knowledge, and which is capable of learning, e.g. of inducing fuzzy rules from sample data. There are, however, other neural network architectures that are better suited for data mining. Since visualization techniques play an important role in understanding high-dimensional data, network architectures like self-organizing maps are more applicable to data mining. Self-organizing maps allow us to create two-dimensional visualizations of high-dimensional datasets. 4

1.3. Evolutionary Algorithms
Evolutionary algorithms are a versatile class of search and optimization techniques. 5 In contrast to neural network learning, they can optimize not only parameters, but also structure. This makes them well suited for complex search spaces. As they perform a stochastic search using only a performance function, they can easily be integrated with fuzzy techniques. An example of such a combination is shown in Sec. 3.

1.4. Images as Information Sources for Data Mining
Images are a natural and rich source of information for humans. In spite of decades of intensive research, machine vision is still far from being competitive with the human visual system. This applies especially to small groups of images. However, there are many successful applications of automated image processing, including areas with large numbers of images and strongly repetitive tasks. These tasks are often beyond the capabilities of humans.
There are several current developments that have led to massive increases in the number of accessible images. The most important are the growth of the World Wide Web, advances in sensor technology, and improved transmission and storage capabilities in remote sensing. The World Wide Web has become an enormous, distributed, heterogeneous and unstructured source of information. While its textual parts can more or less be handled with current search engines, the information contained in the images cannot be searched automatically. Indexing and retrieval of image databases are still in their infancy. Content-based image retrieval, where queries are performed by the color, texture, and shape of image objects and regions, is a current research topic. 6-8 Another area where enormous amounts of data are being gathered is remote sensing. Due to technological advances, analysts working with remotely sensed imagery are today confronted with huge amounts of data. The integration of multi-sensor data in particular calls for support of the observer by automatic image processing algorithms. Some prominent examples of such giant image databases are:

• the classification of (about 5 x 10^8) stellar objects in about 3 terabytes of sky images reported in the POSS-II sky survey project, 9
• the search for volcanos in about 30,000 radar images of the surface of Venus in the Magellan study, 10 or
• the mosaicing of thousands of high resolution satellite images of the Earth into a large continuous image. 11

An essential aim of information mining in large image databases is to direct the user's attention to useful information. For images, traditional data mining techniques will almost always work hand in hand with automatic image processing techniques. In the following we present two examples which differ slightly from that view: we analyze data that were gathered during image processing. In these applications, the aim is to gain insight into the vision process itself and thus help to improve it.

2. Analyzing the Machine Vision Process

The domain of our studies is the "screening" of collections of high resolution aerial images for man-made structures. Applications of the detection of such structures include remote reconnaissance or image registration, where
man-made structures deliver valuable alignment hints. The following analyses have been done in cooperation with FGAN/FOM. 12 In the framework of structural analysis of complex scenes, a blackboard-based production system (BPI) is presented in Ref. 13. The image processing starts by extracting object primitives (i.e. edges) from the image, and then uses production rules to combine them into more and more complex intermediate objects. The algorithm stops when either no more productions apply, or the target high-level object is detected. 14,15 Figure 1a shows an example image with the extracted edge segments. Figure 1b shows the runway detected as a result of applying the production system. An analysis of the process for this image shows that only 20 lines out of about 37,000 are used to construct this stripe (see Fig. 2a). However, the analyzing system examines all of the lines. As the algorithm has a time complexity of at least O(n²), where n is the number of line segments, processing can take quite a while. The idea of our analysis was the following: if we knew which primitive objects are most promising, we could start the analysis with these objects and thus significantly speed up the production process. The approach is to extract features from the image that describe the primitive objects, and to train a classifier to decide which lines can be discarded. We applied the neuro-fuzzy classifier NEFCLASS to this task; we briefly outline NEFCLASS in the next section. We considered a concept hierarchy for the screening of runways and airfields. The rules for this hierarchy state that collinear adjacent (short) lines are concatenated to longer lines. If their length exceeds a certain threshold, they conform to the concept long lines. The productions used to extend long lines can bridge larger gaps caused by taxiways or image distortions. If these long lines meet certain properties they may be used to generate runway lines, which could be part of a runway.
Parallel runway lines within a given distance to each other are considered runway stripes. In a last step, these stripes may be identified as runways. Although we considered only runways, the production rule approach can be used for all kinds of stripe-like objects. It can be adapted to a variety of tasks by varying parameters. The model was intended to be used with image data from different sensors, e.g. visible, infrared, or synthetic aperture radar (SAR) systems.
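The test behind a single production rule of this kind, deciding whether two adjacent line segments are collinear enough to be concatenated, might look as in the following sketch. The angle and gap thresholds are illustrative defaults, not the system's actual tolerance parameters.

```python
import math

def nearly_collinear(l1, l2, max_angle_deg=5.0, max_gap=20.0):
    """Toy production-rule test: two segments may be concatenated when their
    directions agree and the gap between them is small. Segments are
    ((x1, y1), (x2, y2)); thresholds are illustrative only."""
    def angle(seg):
        (x1, y1), (x2, y2) = seg
        # undirected line orientation in [0, 180) degrees
        return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    da = abs(angle(l1) - angle(l2))
    da = min(da, 180.0 - da)                 # wrap-around of orientations
    gap = math.dist(l1[1], l2[0])            # end of l1 to start of l2
    return da <= max_angle_deg and gap <= max_gap
```

Applying such a test to every pair of segments is exactly what makes the analysis O(n²) in the number of segments, which motivates the line-filtering classifier described next.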
Fig. 1. (a) 37,659 lines extracted from a SAR image by Burns' edge detector (short segments painted in white); (b) runway found by production system.
2.1. NEFCLASS: A Hybrid Neuro-Fuzzy Classifier
NEFCLASS is a hybrid neuro-fuzzy classifier that belongs to the class of structure-oriented approaches. These approaches accept initial fuzzy sets and thus structure the data space as a multidimensional fuzzy grid. A rule base is created by selecting those grid cells that contain data, which can be done in a single pass through the training data. This way of learning fuzzy rules was suggested by Wang and Mendel 16 and extended in the NEFCLASS model. 17 After the rule base of a fuzzy system has been generated, the membership functions must be fine-tuned in order to improve performance. In the NEFCLASS model, the fuzzy sets are modified by simple backpropagation-like heuristics, which are motivated by neural network learning. In the learning phase, constraints are used to ensure that the fuzzy sets still fit their associated linguistic terms after learning. For example, membership functions of adjacent linguistic terms must not change position, and must overlap to a certain degree. 17
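The structure-oriented rule-creation step can be sketched as follows, in the spirit of the Wang and Mendel procedure: each training sample selects, per input dimension, the fuzzy set it belongs to most; the resulting grid cell becomes a rule antecedent, and the sample's class votes for the consequent. This is a simplified illustration, not the actual NEFCLASS implementation.

```python
from collections import Counter, defaultdict

def grid_rules(X, y, partitions):
    """One-pass grid-based rule learning.

    X          : iterable of input tuples
    y          : class labels, aligned with X
    partitions : per dimension, a dict {term_name: membership_function}
    Returns a rule base: {antecedent cell -> majority class consequent}.
    """
    votes = defaultdict(Counter)
    for xs, cls in zip(X, y):
        cell = tuple(
            max(terms, key=lambda t: terms[t](xs[d]))   # best-matching term
            for d, terms in enumerate(partitions)
        )
        votes[cell][cls] += 1
    # keep, per occupied grid cell, the majority class as the rule consequent
    return {cell: counts.most_common(1)[0][0] for cell, counts in votes.items()}
```

Only occupied cells yield rules, which is why the rule base stays far smaller than the full fuzzy grid; the membership-tuning and pruning steps described in the text would follow this initialization.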
Fig. 2. (a) The 20 lines used for the construction of the runway (true positives); (b) 3,281 lines classified as positive by NEFCLASS.
The NEFCLASS model has been continuously improved and extended over the last few years, with several implementations for different machine platforms. a Most of these extensions address the specific characteristics and problems of real world data and their analysis.

Symbolic Attributes. Real world data often contains symbolic (class-valued) information. Data mining algorithms that expect numerical attributes usually transform these attributes to artificial metric scales. However, it would be useful to be able to create fuzzy rules directly from data with symbolic attributes. NEFCLASS has been extended to deal with symbolic data by using mixed fuzzy rules, i.e. rules with fuzzy class memberships in antecedents and consequents. However, defining a linguistic term to label a consequent might be difficult, which might in turn lead to a reduction of interpretability. 18

a The most recent version, NEFCLASS-J, implemented in Java, is publicly available from our web site at http://fuzzy.cs.uni-magdeburg.de
Missing Values. Missing values are a common problem in many applications. NEFCLASS has been extended with a rather simple strategy to deal with missing values. 19 In the classification phase a missing value belongs to any fuzzy set with degree 1; thus, we do not have to make any assumptions about its real value. The same method can be used in the learning phase. When we encounter a missing value during rule creation, any fuzzy set can be included in the antecedent for the corresponding variable. Therefore, we create all combinations of fuzzy sets that are possible for the current training pattern. Similarly, the fuzzy sets of missing attributes remain unchanged by the fine-tuning heuristic.

Pruning Techniques. When a rule base is induced from data it often has too many rules to be easily readable, and thus gives little insight into the structure of the data. Therefore, to reduce the rule base, NEFCLASS uses several pruning techniques. 17,20,21 These methods are effective both in reducing the number of rules and in increasing generalization ability.

Learning from Unbalanced Data. In many practical domains the available training data is unbalanced. If the number of cases of each class varies too much, this causes problems for many classifiers, especially if the classes are not well separated. Classifiers tend to predict the majority class and do not take into account the special semantics of the problem. Often, false positives and false negatives are not equally harmful. NEFCLASS can handle such problems by allowing the user to specify the misclassification costs in a matrix. 12 These modifications allow us to use NEFCLASS in domains with even highly unbalanced class frequencies, where many classifiers fail.
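The missing-value treatment in the classification phase can be sketched in a few lines: an unknown input is compatible with every fuzzy set to degree 1, so it never restricts a rule's activation. The helper names below are ours, not NEFCLASS identifiers.

```python
def membership(mu, x):
    """Degree of membership, with the missing-value convention: an unknown
    input (represented here as None) belongs to any fuzzy set with degree 1."""
    return 1.0 if x is None else mu(x)

def rule_activation(antecedent, pattern):
    """Activation of one fuzzy rule: min over the per-input memberships
    (min acting as the fuzzy AND). `antecedent` is one membership function
    per input variable."""
    return min(membership(mu, x) for mu, x in zip(antecedent, pattern))
```

A pattern with a missing attribute is thus classified using only the information that is actually present, instead of being imputed or discarded.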
2.2. Determining Line Importances 12
In our study a set of 17 images depicting 5 different airports was used. Each of the images was analyzed by the production system to detect the runway(s). The lines were labeled as positive if they were used for runway construction, and negative otherwise. Four of the 17 images form the training dataset used to train NEFCLASS. The training set contains 253 runway lines and 31,330 negatives; the dataset is thus highly unbalanced. Experiments showed that the regions next to the lines bear useful information. For each line, a set of statistical (e.g. mean and standard deviation) and textural features (e.g. energy, entropy, etc.) was calculated from the gray values next to that line.
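Plausible versions of such features, computed from the gray values in a region next to a line, are sketched below. The study does not specify its feature definitions in this form, so these standard histogram-based definitions of energy and entropy are illustrative assumptions.

```python
import math

def line_features(gray_values, levels=256):
    """Statistical (mean, std) and textural (energy, entropy) features from
    the gray values of a region next to a line; gray values are assumed to
    be integers in [0, levels)."""
    n = len(gray_values)
    mean = sum(gray_values) / n
    var = sum((g - mean) ** 2 for g in gray_values) / n
    hist = [0] * levels
    for g in gray_values:
        hist[g] += 1
    p = [h / n for h in hist if h]                   # non-empty histogram bins
    energy = sum(pi * pi for pi in p)                # high for uniform regions
    entropy = -sum(pi * math.log2(pi) for pi in p)   # high for varied texture
    return {"mean": mean, "std": math.sqrt(var),
            "energy": energy, "entropy": entropy}
```

Such per-line feature vectors are the inputs from which a classifier like NEFCLASS can learn to separate promising lines from discardable ones.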
The semantics of misclassifications in this task are asymmetric. The positive lines are the minority class and thus easily ignored by a classifier. However, every missed positive can turn out to be very expensive, as it can hinder successful object recognition, whereas misclassifying negative lines just increases processing time. With NEFCLASS this could be represented by specifying asymmetric misclassification costs: the costs of false negatives (i.e. positive lines wrongly classified as negative) were empirically set to 300 times the costs of false positives. After learning, the NEFCLASS pruning techniques were used to reduce the number of rules from over 500 to under 20. The best result was obtained with 16 rules. The lines from the remaining 13 images were used as test data. The quality of the result can be characterized by a detection and a reduction rate. The detection rate is defined as the ratio of correctly detected positives to all positives; the higher this value, the higher the probability of a successful recognition. The average detection rate on the unseen images was 84%, varying from 50% to 100%. The second measure is the reduction rate, defined as the ratio of lines classified as positive to the total number of lines; the lower this value, the shorter the processing time. The average reduction rate on the unseen images was 17%. For most of the test images — even those with lower detection rates — the image analysis was successful, as the missed lines are mainly shorter and less important. Figure 2b shows the lines NEFCLASS classified as positive in the example image, which was one of the unseen images. On this image the number of lines was reduced to one tenth, which, given the quadratic time complexity, means a reduction of processing time by a factor of more than 100.

3. Analyzing Optimal Parameters

The production systems described in Sec. 2 are parameterized.
Typical examples of parameters are thresholds, like the maximum length of a gap between two line segments that may be bridged, or the minimal and maximal distance between two long parallel lines to build a stripe. These parameters are used to adapt the algorithms to varying scenarios, different applications and changing image material. However, this adaptation of the parameters is not always obvious. Some of these parameters depend on the geometric features of the modeled objects. For example, the exact descriptions of an airfield can be taken from maps or construction plans. Thus, we can specify how long or wide runways can be in the real world. Together
A. Klose & R. Kruse
with knowledge about the sensor used, sensor parameters, image resolution and image scaling in a specific image, we can determine the parameter values in the pixel domain. These parameters are called model parameters. Other parameter values cannot be derived from this easily accessible image information. Parameters like the maximum tolerated deviations from collinearity of lines, or parameters of the (low level) edge detection, strongly depend on the image quality (e.g. low contrast, blurred images, low signal-to-noise ratio, partial occlusions). These so-called tolerance parameters may also depend on meta data like sensor type and image resolution. In our case, after fixing all known model parameters to suitable values, there are still more than ten variable parameters for the whole processing chain. However, human experts with experience and profound knowledge about the algorithms can often find reasonable parameter settings by looking at the image, often with just a few tries. Unfortunately, the experts cannot easily explain their intuitive tuning procedure. Therefore, we investigated approaches to support and automate the adaptation of the image processing algorithms to changing requirements. We try to discover relationships between image properties and optimal parameters by using data mining techniques.

To analyze the dependencies, we needed a set of example images with corresponding optimal parameter values. Our first step was to set up a database of sample images with varying quality and from different sensors. When choosing the images we had to make a compromise. On the one hand, we had to cover an adequate variety of images to obtain statistically significant results. On the other hand, we could only process a limited number of images due to time restrictions. Optimization implies many iterations of the structural analysis, which can be quite time-consuming for these high-resolution images.
To stay within tolerable processing time spans we limited the initial database to 50 images (Figure 3 shows two examples); we might extend this database in the future. For all images in our database we manually defined a ground truth, i.e. the exact position of the runways. We defined a measure to assess the results of edge detection, i.e. to assess how well extracted line segments match a runway defined in the ground truth. The structural image analysis was then embedded in an evolutionary optimization loop, and tolerance parameters were determined to maximize the assessment measure.22

Our first suggestion for the problem of finding a suited parameter tuple p ∈ P for a given image i ∈ I was to assume a set of suitable image
Fig. 3. Images from the sample database.
features f1(i), ..., fn(i) ∈ ℝ and a function φ: ℝⁿ → P such that p = φ(f1(i), ..., fn(i)). If we had appropriate image features, finding φ could be understood as a regression or function approximation task, for which a set of standard techniques exists. We experimented with a set of features that were adapted versions of those that proved useful in the line filtering study from Section 2.2. The results are described in Ref. 22. It turned out that the ad hoc choice of image features was not satisfactory. The main problem of this approach is that we do not know which image features contain the relevant information for parameter selection. It may even be that such features do not exist, or are too complex to be calculated in a reasonable amount of time. Therefore, another idea was to group similar images and then determine a commonly suited parameter tuple for each group. This approach would have several advantages over the parameter regression approach. Most importantly, by finding optimal parameters for groups of images, more general parameters are preferred, and thus local optima are avoided. Thus, the parameters will probably be more robust for new images. Determination of
parameters for a new image then means finding the most similar group and using the corresponding parameters. The search for groups of objects is a cluster analysis task. However, from our previous analyses22 we conclude that we should avoid using image features until we know which features reflect relevant information. The key to our solution is to analyze image "behavior." If we find groups of images that behave similarly with respect to different parameters, i.e. commonly yield good results for some parameters and commonly worse results for others, we can analyze what these groups have in common in a second step. This supports the finding of hypotheses for suitable image features. Searching for common behaviors can be done by applying the same set of parameter tuples to all images, and analyzing the results. If we apply a set of parameter tuples to each image, we get a corresponding set of assessment values for each image, forming an assessment vector which characterizes image behavior. Thus, we do not directly use the parameter space for analysis, nor do we need image features. However, an adequate meaning of similar has to be defined. We use fuzzy measures to define an appropriate similarity measure. Fuzzy set theory offers a flexible tool to incorporate our knowledge about the domain into this definition. Many standard cluster analysis algorithms rely on numeric (i.e. real-valued) data and standard metrics like the Euclidean distance. The use of non-standard similarity or distance measures makes other cluster methods necessary. Suitable methods include hierarchical clustering23 or partitioning methods using evolutionary algorithms.24 Hierarchical clustering algorithms are relatively fast, due to their straightforward mechanism, but they tend to run into local optima and thus can deliver unstable results. Therefore, we decided to use evolutionary clustering algorithms.
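Gathering this "behavior" data can be sketched as follows: apply every parameter tuple to every image and record the assessment, yielding one assessment vector per image. `process` and `assess` are hypothetical stand-ins for the structural analysis chain and the ground-truth-based assessment measure:

```python
# Sketch (with invented stand-in functions) of building the assessment
# vectors: one row per image, one column per parameter tuple.

def assessment_matrix(images, truths, param_tuples, process, assess):
    """Row i is the assessment vector of image i over all parameter tuples."""
    return [[assess(process(img, p), gt) for p in param_tuples]
            for img, gt in zip(images, truths)]

# Toy stand-ins for the real analysis chain (purely illustrative):
process = lambda img, p: abs(img - p)          # pretend analysis "result"
assess = lambda res, gt: max(0.0, 1.0 - res)   # pretend quality in [0, 1]

m = assessment_matrix([0.2, 0.9], [None, None], [0.1, 0.5, 1.0], process, assess)
# Row m[i] characterizes how image i "behaves" across the parameter set.
```

In the study this matrix has 50 rows (images) and 495 columns (parameter tuples); rows serve as the image-behavior descriptions analyzed below.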
3.1. Evolutionary Cluster Analysis
Evolutionary algorithms have been established as widely applicable global optimization methods.5,25,26 The mimicking of biological reproduction schemes that underlies evolutionary algorithms has been successfully applied to all kinds of combinatorial problems. Even for many NP-hard problems, for which complete searches for the optimum solution are infeasible, evolutionary algorithms yield near-optimal solutions of high quality. Successful solutions of graph-theoretic problems (e.g. the subset sum problem, the maximum cut problem, the minimum tardy problem, and
equivalent problems) have been presented.27 Non-hierarchical clustering is a related problem. Evolutionary algorithms have been applied to this problem class for a very long time; one of the earliest experiments on this topic goes back to 1978.28 There are, however, many recent results.24,29,30 Evolutionary algorithms are largely problem independent. However, it is necessary to choose an appropriate scheme to encode potential solutions as chromosomes, to define a function to assess the fitness of these chromosomes, and to choose reasonable operators for selection, recombination and mutation.

There are basically two intuitive ways to encode a partitioning of n objects into k clusters:

• Each object o is assigned to a cluster c(o) ∈ {1, ..., k}. A chromosome thus has n genes, and the solution space has a cardinality of kⁿ. Notice that not all solutions are different, and that not all solutions make use of all clusters.

• Each cluster c is represented by a prototype p(c) ∈ P. Each object o is assigned to the cluster with the most similar prototype. The chromosomes thus have k genes, where k is smaller than the number of objects (otherwise clustering becomes trivial). The search space cardinality is |P|ᵏ. If P = {o1, ..., on}, i.e. only the objects themselves are allowed as prototypes, the search space size is nᵏ (which is much smaller than kⁿ in the first encoding). Again, if not constrained explicitly, some of the solutions are identical (i.e. there are only (n choose k) different choices of prototypes).

The fitness function must be defined to assess the quality of chromosomes, i.e. the quality of partitionings. We suggest the use of fuzzy sets to define these measures, as this enables us to incorporate our domain specific knowledge (see Sec. 3.2). We chose tournament selection as our selection operator.31 When we select parents in the mating phase, we randomly choose two chromosomes from the pool and take the fitter of the two.
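A minimal sketch of the prototype encoding, the fitness function, and tournament selection just described; `sim(i, j)` is a hypothetical stand-in for the fuzzy similarity measure defined in Sec. 3.2:

```python
import random

# Sketch of the prototype encoding: a chromosome holds k object indices
# serving as cluster prototypes. Fitness sums, over all objects, the
# similarity to the most similar prototype. `sim` below is invented.

def fitness(chrom, n_objects, similarity):
    return sum(max(similarity(o, p) for p in chrom)
               for o in range(n_objects))

def tournament_select(pool, fitness_values):
    # Randomly pick two chromosomes from the pool, return the fitter one.
    a, b = random.randrange(len(pool)), random.randrange(len(pool))
    return pool[a] if fitness_values[a] >= fitness_values[b] else pool[b]

# Toy example: six 1-d "objects" in two obvious groups, k = 2 prototypes.
objs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
sim = lambda i, j: 1.0 / (1.0 + abs(objs[i] - objs[j]))
good = [1, 4]   # one prototype per group
bad = [0, 1]    # both prototypes in the same group
assert fitness(good, len(objs), sim) > fitness(bad, len(objs), sim)
```

The fitness function rewards chromosomes whose prototypes cover all groups, which is exactly what the evolutionary search exploits.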
In comparison to fitness proportional approaches, this algorithm does not depend on the scaling of the fitness function, and is thus more robust. Additionally, it is computationally cheap in comparison to rank-based selection, which implies sorting the chromosomes by fitness. Children are combined from their parents using a two-point crossover of the chromosome strings. Two positions in the chromosomes are randomly chosen. The genes in between these positions are copied from one parent to
the offspring, while the remaining genes are copied from the other parent. This simple operator works fine in our example, as we do not impose any constraints and thus have no need for a repair mechanism. The use of two-point crossover avoids the positional bias of one-point crossover.26 For the mutation operator, we also use a common approach. We change each of the g genes in a chromosome with a probability of 1/g, where changing means setting the gene to a new random value.27 The evolutionary algorithm is randomly initialized with a pool of 400 chromosomes. It then produces new generations of partitions by applying selection, recombination and mutation. The algorithm is stopped when the best fitness does not increase for 80 generations or a limit of 1000 generations is exceeded.

3.2. Analyzing Image Similarities
The fitness function is supposed to measure the quality of the corresponding partitionings. It is common in cluster analysis to maximize the inner cluster homogeneity, i.e. the similarity between the objects of one cluster. Our objects, the images, are described by their assessment vectors. Thus we have to define a similarity on these vectors. We first considered geometrically motivated measures like linear correlation, or the cosine between vectors. However, this raises difficulties. Due to the high dimensionalities of the vectors, correlations are near zero and angles near 90° for most pairs of vectors. Furthermore, these measures do not explicitly differentiate between high and low assessment values. Our goal is a definition of a similarity measure that captures the behavior of the images and groups them accordingly. We believe that it is more appropriate not to expect images to behave identically for all parameters, but to group images that can be processed with an equal subset of parameters. Thus, high assessment values are more important, and must be treated differently from low values. We can think of the assessment vectors as fuzzy sets, i.e. functions a_i: P → [0, 1] that for a given image i ∈ I = {1, ..., n} map parameter tuples p ∈ P to assessment values a_i(p) ∈ [0, 1]. This formally satisfies the definition of a fuzzy set. An interpretation of such a fuzzy set a_i could be: "The set of parameters that are suited to processing image i." We can easily create crisp sets a_i,crisp with the same meaning by applying a threshold to the assessments, and thus having a binary decision "suited/not suited." If we wanted to define a condition on these crisp sets
for a suitable (crisp) prototype a_p,crisp for a subset C ⊆ I of images, we could intuitively demand that

a_p,crisp ⊆ a_i,crisp for all i ∈ C,

i.e. that the prototypical parameters are suited for all images. We use fuzzy set operations to extend the same intuitive idea to our assessment fuzzy sets a_i. We used the following measure of (partial) subsethood:32

s(a, b) = |a ∩ b| / |a|,    (1)

where a and b are fuzzy sets, the intersection ∩ is the minimum function, and the cardinality is defined as |a| = Σ_x a(x). The measure of subsethood tends to assume higher values for fuzzy sets with small cardinalities. However, we want to find prototypes that also perform well. Thus, for our similarity measure, we multiplied the subsethood by the assessment for the best common parameter, i.e. sup_x (a ∩ b)(x). As a fitness function we took the sum over all similarity values between each image and its associated (i.e. most similar) prototype.

For our analysis we took the 10 best parameter tuples for all but one of the 50 images. Since the last image was of very low quality, we took only 5 parameters for it. Thus, we used a total of 495 parameter tuples. The choice of parameters is actually of secondary importance, as long as it ensures adequate coverage of the parameter space. This means that we should have parameters with good and bad assessments for each image. We then used these 495 parameters in the processing of each of the 50 images and calculated the corresponding vector of 495 assessments. We used the second partition encoding scheme, with the possible prototypes chosen from the set of images. We repeatedly ran the evolutionary clustering with values k ∈ {3, ..., 30} for the number of clusters. The fitness value reached its maximum after about one to two hundred generations. For all values of k we found the same partitionings for most of the repeated runs. Thus, the search procedure seems to be quite robust and adequate for the problem. We can observe that the cluster prototypes are rather stable over the number of clusters, i.e. we do not observe many changes in the most suitable prototypes. Figure 4b shows which images are chosen as prototypes and how many images are assigned to them.
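The fuzzy similarity just defined, i.e. the subsethood of Eq. (1) multiplied by the assessment of the best common parameter, can be sketched as follows; the toy assessment vectors are invented:

```python
# Sketch of the similarity between an image's assessment vector and a
# prototype's, treated as fuzzy sets over the parameter tuples.

def subsethood(a, b):
    """Eq. (1): s(a, b) = |a ∩ b| / |a|, with minimum intersection
    and sigma-count cardinality."""
    inter = [min(x, y) for x, y in zip(a, b)]
    return sum(inter) / sum(a)

def similarity(a_img, a_proto):
    """Subsethood weighted by the assessment of the best common parameter."""
    inter = [min(x, y) for x, y in zip(a_img, a_proto)]
    return subsethood(a_img, a_proto) * max(inter)

a_i = [0.9, 0.8, 0.1, 0.0]   # image: works well with the first two parameters
a_p = [1.0, 0.7, 0.2, 0.1]   # prototype: also favors the first two
print(similarity(a_i, a_p))  # ≈ 0.85
```

The weighting factor max(inter) ensures that a prototype is only considered similar if there is at least one parameter tuple that works well for both.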
We see that the three images used as prototypes in the case of three clusters seem to be relatively general, as they are also used for higher numbers of clusters. However, the fitness value increases continuously with an increasing number of clusters
Fig. 4. (a) Fitness value of best partition vs. number of clusters; (b) assignments of images to prototypes.
(see Fig. 4a). As the curve does not contain any marked steps, we conclude that the images do not have a natural partitioning (with respect to our similarity measure) that prefers a certain number of clusters. Actually, this is not surprising and does not mean that the resulting clusters are random, because most of the possibly relevant features like image quality change gradually. Visual inspection of the images in the clusters shows some regularities. Most apparent, many groups contain only images from one sensor type (e.g. only visible band or only SAR). This means that images from different sensors need different parameter values — a result that we actually expected. It confirms that our definition of similarity on the assessment vectors is reasonable. Another observation was more surprising. We expected to get groups of images with different quality (e.g. varying noise, contrast, etc.). However, there were several groups that contained several (also quite different) images of one airfield only. Our assumption is that some of the parameters might still be model dependent and thus could be eliminated from the processing chain. However, this assumption will have to be validated in the future.
4. Analyzing Parameter Similarities

To retrieve more knowledge about the image processing, it might be reasonable to analyze the parameters and their relations. One option is the analysis of parameter similarities. A definition of similarity in the parameter
space does not seem to be a promising approach. As we have no a priori information about the influence of the individual parameters, defining a similarity measure on them would be similar to the use of the uncertain image features in the system identification approach outlined above. Analogous to the analysis of image similarities, we suggest an analysis of the gathered assessment data instead, and define a measure of parameter behavior. The assessments can be seen as a table with the images in the rows and the different parameter settings in the columns. In Sec. 3, we used the rows as descriptions of image behavior. Correspondingly, we now use the columns as descriptions of parameter behavior, i.e. how well a fixed parameter set performs on the different sample images. Thus, we get 495 assessment vectors of 50 dimensions each (i.e. one dimension per image). For clustering we did not use the evolutionary approach, for several reasons. First, the intuitive motivation used for the fuzzy similarity definition above cannot be applied to the parameter assessment vectors. Second, the parameter settings themselves are vectors of (real-valued) individual parameters, which have to be analyzed. As clusters of vectors are less expressive than clusters of images, we used a different approach, namely self-organizing maps (SOMs). We applied SOMs successfully to document retrieval in a recent project, where text document collections were mapped to two dimensions by SOMs and an appropriate similarity measure.33 We will next briefly review the basic ideas of self-organizing maps.

4.1. Self-Organizing Maps4
Self-organizing maps are a special neural network architecture that clusters high-dimensional data vectors according to a similarity measure. The clusters are arranged in a low-dimensional topology that preserves the neighborhood relations of the high-dimensional data. Thus, not only are objects that are assigned to one cluster similar to each other (as in every cluster analysis), but objects of nearby clusters are also expected to be more similar than objects in more distant clusters. Usually, two-dimensional grids of squares or hexagons are used (as in Fig. 5). Although other topologies are possible, two-dimensional maps have the advantage of intuitive visualization and thus good exploration possibilities. Self-organizing maps are used for unsupervised clustering of high-dimensional sample vectors. The network structure has two layers. The
neurons in the input layer correspond to the input dimensions. The output layer (the map) contains as many neurons as clusters needed. All neurons in the input layer are connected with all neurons in the output layer. The weights of the connections between the input and output layers encode positions in the high-dimensional data space. Thus, every unit in the output layer represents a prototype. Before the learning phase of the network, the two-dimensional structure of the output units is fixed and the weights are initialized randomly. During learning, the sample vectors are repeatedly propagated through the network. The weights w_s of the most similar prototype (the winner neuron) are modified such that the prototype moves toward the input vector w_i:

w′_s = w_s + α · (w_i − w_s),    (2)

where α is a learning rate. To preserve the neighborhood relations, prototypes that are close to the winner neuron in the two-dimensional structure are also moved in the same direction. The closeness is defined by a neighborhood function v that monotonically decreases with the distance from the winner neuron. Therefore, the adaptation rule is extended by v:

w′_s = w_s + v(i, s) · α · (w_i − w_s).    (3)
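The update scheme of Eqs. (2)-(3) can be sketched as a minimal runnable example (not the authors' implementation); a Gaussian neighborhood stands in for v, and α is the learning rate:

```python
import math
import random

# Minimal SOM sketch: a 2-d grid of prototypes; the winner (and its grid
# neighbors, weighted by a Gaussian neighborhood v) moves toward each input.

def train_som(data, rows, cols, dim, epochs=50, alpha=0.3, radius=1.0):
    random.seed(0)
    # One prototype (weight vector) per grid node.
    w = {(r, c): [random.random() for _ in range(dim)]
         for r in range(rows) for c in range(cols)}
    for _ in range(epochs):
        for x in data:
            # Winner: node whose weights are closest to the input.
            s = min(w, key=lambda n: sum((wi - xi) ** 2
                                         for wi, xi in zip(w[n], x)))
            for n in w:
                # Neighborhood weight decreasing with grid distance to winner.
                d = math.dist(n, s)
                v = math.exp(-(d * d) / (2 * radius * radius))
                w[n] = [wi + v * alpha * (xi - wi)
                        for wi, xi in zip(w[n], x)]
    return w

# Toy run: two separated 2-d clusters should map to different grid regions.
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = train_som(data, rows=2, cols=3, dim=2)
```

Because neighbors of the winner are dragged along, nearby grid nodes end up with similar prototypes, which is what preserves the topology.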
By this learning procedure, the structure in the high-dimensional sample data is non-linearly projected to the lower-dimensional topology. After learning, arbitrary vectors (i.e. vectors from the sample set or previously unknown vectors) can be propagated through the network and are mapped to the output units. For further details on self-organizing maps see Ref. 34.

4.2. Visualizing the Parameters
We used a SOM to assign the 495 parameter assessment vectors to a two-dimensional map. Figure 5 shows the results for a map of 10×7 neurons. To visualize the corresponding parameters, we colored the map nodes using the node-wise minimum or maximum (left and right columns of Fig. 5) of a single parameter of the parameter set. In Fig. 5 we show the results for three of the ten parameters (minwidth, minmag, and distS). The white spots are nodes with no assigned parameter sets. If we analyze the resulting maps, we can see that the parameter values are distributed quite differently. Interestingly, for minwidth and
"S
(a)
(b)
(c)
(d)
I (f)
(e)
Fig. 5. Examples of parameter visualizations, (a), (c), and (e) are the minimal and (b), (d), and (f) the maximal values of individual parameters minwidth, minmag and distS per map node.
minmag the minimal and maximal values differ only slightly, i.e. the colorings of the maps in the left and right columns are quite similar. In contrast, for distS most nodes cover widely differing minimum and maximum values, i.e. the values are evenly distributed over the map nodes. Therefore we can assume that distS has little influence on the image processing. (Note that we did not use the parameter values themselves to determine their similarity.)
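The node-wise coloring can be sketched as follows: for every map node, collect the raw values of one parameter over the parameter tuples assigned to that node, and take their minimum and maximum (the data here is invented):

```python
# Sketch of the coloring step: per map node, the min and max of one raw
# parameter over the parameter tuples assigned to that node.

def node_min_max(assignments, param_values):
    """assignments: the map-node id for each parameter tuple;
    param_values: the raw value of one parameter for each tuple."""
    per_node = {}
    for node, value in zip(assignments, param_values):
        per_node.setdefault(node, []).append(value)
    return {n: (min(v), max(v)) for n, v in per_node.items()}

# Toy data: 5 tuples mapped onto 2 nodes. Node 0 sees a narrow value range
# (the parameter matters there); node 1 a wide one (the parameter is
# arbitrary, hinting at little influence, as observed for distS).
colors = node_min_max([0, 0, 1, 1, 1], [2.0, 2.1, 0.5, 9.0, 4.2])
print(colors)  # {0: (2.0, 2.1), 1: (0.5, 9.0)}
```

A node whose min and max nearly coincide constrains that parameter; a node spanning the whole value range does not.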
The maps for the two other parameters both show similar colors in adjacent nodes. This shows that these parameters are more important for the behavior of the processing. However, we have a horizontal split for minwidth and a vertical split for minmag. We can see that we have basically two levels for parameter minwidth, which are about equally frequent. The parameter minmag is mostly rather small. The few greater values are squeezed to the upper border of the map, which indicates a somewhat distinct behavior. This information can be used to remove parameters or to choose likely appropriate combinations of parameter values. We will have to analyze whether the good parameter sets for certain images are restricted to certain areas of the map, which would be helpful for choosing appropriate parameters.

5. Concluding Remarks

Images can be important sources of information. As the saying goes, "A picture is worth a thousand words." Unfortunately this does not scale: a thousand pictures, for instance, are not necessarily worth a million words, especially since human users are easily overwhelmed by too many images. In contrast to numerical data, often only a few images, or even a single one, are likely to bear interesting information. Therefore, in large image archives, one prominent goal of data mining (as a tool to enable users to understand and access their data) is to guide user attention and to reduce the number of images to just the interesting ones. We believe that a variety of techniques will be necessary for this task, and that soft computing techniques will play an important role. We have shown examples to support this process, where techniques from fuzzy sets, neural networks and evolutionary algorithms have been useful tools to model our knowledge about the domain, and to get understandable results. Ultimately, these techniques also allow new insights into the image analysis process.

References

1. D.
Dubois, H. Prade, and R. R. Yager. Information engineering and fuzzy logic. In Proc. 5th IEEE Intl. Conf. on Fuzzy Systems FUZZ-IEEE'96, New Orleans, LA, pp. 1525-1531 (IEEE Press, Piscataway, NJ, 1996).
2. H. Bandemer and W. Näther. Fuzzy Data Analysis: Mathematical and Statistical Methods (Kluwer, Dordrecht, 1992).
3. L. A. Zadeh. Computing with words. IEEE Transactions on Fuzzy Systems, 4:103-111 (1996).
4. T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59-69 (1982).
5. D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, MA, 1989).
6. C. Faloutsos, R. Barber, M. Flickner, W. Niblack, D. Petkovic, and W. Equitz. Efficient and effective querying by image content. J. of Intelligent Information Systems, 3:231-262 (1994).
7. D. Florescu, A. Levy, and A. Mendelzon. Database techniques for the world-wide web: A survey. SIGMOD Record, 27(3):59-74 (1998).
8. N. Vasconcelos and A. Lippman. A Bayesian framework for semantic content characterization. In Proc. Intl. Conf. Computer Vision and Pattern Recognition CVPR, pp. 566-571 (1998).
9. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Advances in Knowledge Discovery and Data Mining (AAAI Press / MIT Press, Cambridge, MA, 1996).
10. U. Fayyad and P. Smyth, Eds. Image Database Exploration: Progress and Challenges (AAAI Press, Menlo Park, CA, 1993).
11. S. Gibson, O. Kosheleva, L. Longpre, B. Penn, and S. A. Starks. An optimal FFT-based algorithm for mosaicing images, with applications to satellite imaging and web search. In Proc. 5th Joint Conf. on Information Sciences JCIS 2000, pp. 248-251 (2000).
12. A. Klose, R. Kruse, K. Schulz, and U. Thönnessen. Controlling asymmetric errors in neuro-fuzzy classification. In Proc. ACM SAC'00, pp. 505-509 (ACM Press, 2000).
13. U. Stilla, E. Michaelsen, and K. Lütjen. Automatic extraction of buildings from aerial images. In Leberl, Kalliany, and Gruber, Eds., Mapping Buildings, Roads and other Man-Made Structures from Images, Proc. IAPR-TC7 Workshop, Graz, pp. 229-244 (Oldenbourg, München, 1996).
14. R. Scharf, H. Schwan, and U. Thönnessen. Reconnaissance in SAR images. In Proc. of the Eur. Conf. on Synthetic Aperture Radar, Friedrichshafen, pp. 343-346 (VDE-Verlag, Berlin, 1998).
15. H. Schwan, R. Scharf, and U. Thönnessen. Reconnaissance of extended targets in SAR image data. In Serpico, Ed., Image and Signal Processing for Remote Sensing IV, Proc. Eur. Symposium on Remote Sensing, Barcelona, September 21-24, pp. 164-171 (1998).
16. L.-X. Wang and J. M. Mendel. Generating fuzzy rules by learning from examples. IEEE Trans. Syst., Man, Cybern., 22(6):1414-1427 (1992).
17. D. Nauck, F. Klawonn, and R. Kruse. Foundations of Neuro-Fuzzy Systems (Wiley, Chichester, 1997).
18. D. Nauck and R. Kruse. Fuzzy classification rules using categorical and metric variables. In Proc. 6th Int. Workshop on Fuzzy-Neuro Systems FNS'99, pp. 133-144 (Leipziger Universitätsverlag, Leipzig, 1999).
19. D. Nauck, U. Nauck, and R. Kruse. NEFCLASS for JAVA - new learning algorithms. In Proc. 18th Intl. Conf. of the North American Fuzzy Information Processing Society NAFIPS'99, pp. 474-476 (IEEE Press, New York, 1999).
20. A. Klose, A. Nürnberger, and D. Nauck. Some approaches to improve the interpretability of neuro-fuzzy classifiers. In Proc. EUFIT'98, pp. 629-633 (1998).
21. A. Klose and A. Nürnberger. Applying Boolean transformations to fuzzy rule bases. In Proc. EUFIT'99, 6 pages (published on CD-ROM, 1999).
22. A. Klose, R. Kruse, H. Gross, and U. Thönnessen. Tuning on the fly of structural image analysis algorithms using data mining. In Priddy, Keller, and Fogel, Eds., Applications and Science of Computational Intelligence III, Proc. SPIE AeroSense'00, Orlando, FL, pp. 311-321 (SPIE Press, 2000).
23. A. Jain and R. Dubes. Algorithms for Clustering Data (Prentice Hall, Englewood Cliffs, NJ, 1988).
24. E. Falkenauer. The grouping genetic algorithms - widening the scope of the GAs. Belgian Journal of Operations Research, Statistics and Computer Science, 33:79-102 (1993).
25. J. H. Holland. Adaptation in Natural and Artificial Systems (The University of Michigan Press, Ann Arbor, MI, 1975).
26. M. Mitchell. An Introduction to Genetic Algorithms (MIT Press, Cambridge, MA, 1998).
27. S. Khuri, Th. Bäck, and J. Heitkötter. An evolutionary approach to combinatorial optimization problems. In Proc. 22nd Annual ACM Computer Science Conf. CSC'94, Phoenix, pp. 66-73 (ACM Press, New York, 1994).
28. V. V. Raghavan and K. Birchard. A clustering strategy based on a formalism of the reproductive process in natural systems. In Proc. 2nd Intl. Conf. of Research and Development in Information Retrieval, pp. 10-22 (1978).
29. D. R. Jones and M. A. Beltramo. Solving partitioning problems with genetic algorithms. In R. Belew and L. Booker, Eds., Proc. 4th Intl. Conf. Genetic Algorithms, pp. 442-449 (Morgan Kaufmann, Los Altos, CA, 1991).
30. T. Van Le. Fuzzy evolutionary programming for image processing. In Proc. Int. Conf. on Intelligent Processing and Manufacturing of Materials, Gold Coast, Australia, pp. 497-503 (1997).
31. D. E. Goldberg and K. Deb. A comparative analysis of selection schemes used in genetic algorithms. In G. Rawlins, Ed., Foundations of Genetic Algorithms, pp. 69-93 (Morgan Kaufmann, 1991).
32. G. J. Klir and B. Yuan. Fuzzy Sets and Fuzzy Logic (Prentice Hall, Englewood Cliffs, NJ, 1995).
33. A. Klose, A. Nürnberger, R. Kruse, G. Hartmann, and M. Richards. Interactive text retrieval based on document similarities. Physics and Chemistry of the Earth, 25(8):649-654 (2000).
34. T. Kohonen. Self-Organization and Associative Memory (Springer-Verlag, Berlin, 1984).
CHAPTER 11

ROBUST FINGERPRINT IDENTIFICATION BASED ON HYBRID PATTERN RECOGNITION METHODS
D.-W. Jung* and R.-H. Park
Department of Electronic Engineering, Sogang University
C.P.O. Box 1142, Seoul 100-611, Korea
E-mail: [email protected]
For personal identification, fingerprint identification has been much more widely used than other automatic biometric identification methods. For reliable feature extraction we analyze the direction of ridges, and the structural characteristics, such as the ridge line width and interval, are automatically estimated based on the ridge line following algorithm. To eliminate false features outside the region of interest (ROI), we generate adaptive matching boundaries. The concept of fuzzy sets is applied to the extracted features and the quantitative feature values are defined. In the feature matching stage, we introduce a weighted matching score using quantitative feature values to guarantee reliable matching results. Furthermore, a two-step estimation of transformation parameters is employed to reduce the computational complexity. The experimental results show that our system can achieve a fast personal identification with a good performance.
1. Introduction

In the electronically interconnected information society, large amounts of information are easily exchanged through computer networks. As today's society becomes more complex, the security of information becomes more important.1,2 Obtaining positive identification of parties to an exchange of information is a vital element of data security. Thus, various methods for automatic personal identification have been proposed. Biometrics

*He is now with NGT Co., Jayang B/D, 31-8, Munjong-dong, Songpa-Gu, Seoul, Korea (E-mail: reoidom710ngt.co.kr).
is the science of analyzing biological observations, and is the basis of automated methods for recognizing a person based upon physical or behavioral characteristics.3 Biometric approaches use fingerprints, speech patterns, facial features, retinal scans, handwriting, hand veins and hand geometry as recognizable and identifying human traits.4,5 For personal identification, fingerprint identification has been much more widely used than other biometric identification methods. The major reason for the wide usage and popularity of fingerprints as a means of personal identification is that fingerprints are unique and do not change as a person ages.6 Fingerprint identification is one of the most interesting pattern recognition methods for personal identification.7 Fingerprint identification techniques guarantee confidence and stability in personal identification, and thus they have been used in various applications.7,8,9 Fingerprint identification is applied in computer industries such as e-business, networks, groupware, software licensing, and peripherals (mouse and keyboard). Moreover, it is utilized in automated teller machines (ATMs), car ignition systems, locks on safes or doors, and smart cards.

There are several practical problems in fingerprint identification systems. Each time a fingerprint is acquired, location and shape distortion occurs because of the elasticity of the skin.10,11,12 Moreover, high confidence and real-time processing are important factors in automatic fingerprint identification systems (AFIS). To solve these problems, extraction of features13,14 from the fingerprint image and its application to matching have been investigated.15,16

Figure 1 illustrates the system flowchart of the proposed fingerprint identification system. The fingerprint image is first acquired from the inkless fingerprint sensor, and it is verified or stored in the database.
To obtain more reliable feature extraction results, the input fingerprint image is analyzed and the ridge line parameters are estimated. Then, the ridge line following algorithm is applied to extract features. To reflect the characteristics of each fingerprint image, we generate adaptive matching boundaries for eliminating false features outside the region of interest (ROI). We apply the concept of fuzzy sets to the features and evaluate quantitative feature values. If the features are to be registered, they are inserted into the database. Otherwise, they are passed to the matching stage. In the fingerprint matching stage, the estimation of transformation parameters is divided into two steps for fast computation, and the matching score is computed based on fuzzy weighting. Finally
Robust Fingerprint Identification
Fig. 1. Flowchart of the proposed fingerprint identification system: fingerprint image acquisition; fingerprint image analysis and ridge line parameter estimation; ridge line following and feature extraction; generation of adaptive matching boundaries; feature analysis (evaluation of the quantitative feature value using the concept of fuzzy sets); fingerprint matching (two-step estimation of transformation parameters) based on fuzzy weighting; personal authentication.
the degree of matching (indicating the degree of similarity) is computed for all fingerprints in the database, and personal identification is complete. The rest of this chapter is organized as follows. Section 2 explains robust feature extraction methods and the evaluation of quantitative feature values using the concept of fuzzy sets. In Sec. 3, we propose the weighted matching score and a two-step estimation of transformation parameters. Simulation results of the proposed hybrid method applied to real fingerprints are shown in Sec. 4. Finally, conclusions are given in Sec. 5.

2. Feature Extraction from Fingerprint Images

In a fingerprint, the main features are ridges and valleys, which alternate. Within ridge lines there are anomalies such as ridge bifurcations and ridge endings. We can define these anomalies as features. Fingerprint identification is based on the analysis of the features in a fingerprint. Thus, the performance of an AFIS depends on the accuracy of the extracted features. A large number of methods for detecting features in fingerprint images have been proposed. However, they are all essentially similar. Most of them transform gray level fingerprint images into binary images. Next, a thinning process is applied to the binary images. The features are then extracted from the thinned fingerprint images. However, the binary images may lose a lot of feature information and are sensitive to noise. Furthermore, the entire process is computationally intensive. Our feature extraction method is based on the ridge line following algorithm.17 The direction of the ridges and the structural characteristics of the ridge lines, such as the ridge line width and interval, are automatically estimated from the gray scale fingerprint image.

2.1. Feature Information in Fingerprint Images
When fingerprints are used for personal identification, most fingerprint identification systems use features. For user registration, features are extracted from the acquired fingerprint image and stored in the database. For personal identification, the extracted features are compared with the ones registered in the database. Various representations of features can be defined. We represent the feature information in terms of three fields: the location of the feature, the angle of the ridge line tangential to the feature with respect to the horizontal
direction, and the feature type (ending or bifurcation). Figure 2 shows the feature representation depending on the type, where (x0, y0) and θ denote the location of the feature and the angle of the ridge line, respectively.

Fig. 2. Feature representation: (a) ending feature; (b) bifurcation feature.

2.2. Modified Ridge Line Following Algorithm
In our system, the feature extraction algorithm is based on the ridge line following algorithm proposed by Maio and Maltoni.17 They proposed a technique, based on ridge line following, in which the features are extracted directly from gray level images. The basic idea of the ridge line following method is to follow the ridge lines on the gray level image by tracking according to the directional image of the fingerprint. A set of starting points is determined by superimposing a square-meshed grid on the gray level image. For each starting point, the algorithm keeps track of the ridge lines until they terminate or intersect other ridge lines. However, there are several problems in the ridge line following algorithm, so we have modified it for more reliable feature extraction. Figure 3 shows the overall block diagram of the modified ridge line following algorithm. The width of the fingerprint ridge line is estimated from the fingerprint image acquired by the input sensor. The section of the ridge line is searched for the local maximum. Once the location of the local maximum is determined, the local tangential direction of the ridge line is computed at that point.18 At each step, the stopping criteria are tested. If they are satisfied, we extract the feature; otherwise, we proceed with the ridge line following. We add an estimation of the ridge line width, and modify the stopping criteria.
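The following loop is a minimal, runnable sketch of this tracking scheme on a synthetic image. The function name, the parameter values (step size, section half-length, gray-level threshold), and the bright-ridge-on-dark-background convention are illustrative assumptions, and the intersection and excessive-bending stopping criteria of the full algorithm are omitted for brevity.

```python
import numpy as np

def follow_ridge(img, start, theta, mu=3, section_half=4, min_gray=64, max_steps=200):
    """Sketch of gray-scale ridge line following (illustrative parameters).

    From the current point, step `mu` pixels along the tangential direction
    `theta`, sample a short section orthogonal to it, and re-centre on the
    section's gray-level maximum.  Tracing stops on departure from the image
    (the ROI) or on termination of the ridge (an ending feature)."""
    h, w = img.shape
    y, x = start
    path = [(y, x)]
    for _ in range(max_steps):
        # tentative point one tracking step further along the ridge
        yt = y + mu * np.sin(theta)
        xt = x + mu * np.cos(theta)
        # sample the section orthogonal to the tangential direction
        offs = np.arange(-section_half, section_half + 1)
        ys = np.round(yt + offs * np.cos(theta)).astype(int)
        xs = np.round(xt - offs * np.sin(theta)).astype(int)
        if ((ys < 0) | (ys >= h) | (xs < 0) | (xs >= w)).any():
            return path, "roi"        # stopping criterion: left the ROI
        section = img[ys, xs]
        if section.max() < min_gray:
            return path, "ending"     # stopping criterion: ridge terminates
        k = int(np.argmax(section))   # re-centre on the section maximum
        y, x = int(ys[k]), int(xs[k])
        path.append((y, x))
    return path, "max_steps"
```

On a synthetic image containing one bright horizontal ridge, the trace stays on the ridge rows and returns "ending" when the ridge stops, or "roi" when it runs off the image.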
Fig. 3. Block diagram of the modified ridge line following algorithm: fingerprint image → estimation of the ridge line width → sectioning and maximum determination → computation of tangential directions → extracted features.
2.3. Estimation of the Ridge Line Width
In the ridge line following algorithm, the length of the section set and the interval of ridge line tracking are parameters. The optimal parameter values are set by the average width of the ridge lines. We use automatic estimation of the average ridge line width in the ridge line following algorithm to reliably extract the features. The length of the section set is two times the average ridge line width plus one pixel, and its direction is orthogonal to the direction tangential to the ridge line. The interval of ridge line tracking is the distance over which the direction tangential to the ridge line is estimated. There are several steps in estimating the average ridge line width. First, a normalization process is used to adjust the mean and variance of the gray levels in the acquired fingerprint image. To obtain an oriented window, the local orientation19,20,21 is estimated from the input fingerprint image, as shown in Fig. 4. The average widths of ridge lines and valleys are computed by analyzing the gray level patterns within the oriented window. The gray level pattern is very similar to a discrete sinusoidal wave, which has the same period as the ridge lines and valleys within the oriented window.22,23 In Fig. 4, the average ridge line and valley widths are 7.12 and 4.13 pixels, respectively. These values are applied to the ridge line following algorithm for automatic determination of the ridge line parameters.
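As a sketch of this width estimation, the function below measures average ridge and valley widths from a one-dimensional gray-level signature taken orthogonal to the local ridge orientation; the oriented-window construction itself is assumed to have already produced the signature. Treating ridges as the dark runs and splitting at the mean gray level are illustrative assumptions.

```python
import numpy as np

def average_widths(signature, thresh=None):
    """Estimate average ridge and valley widths from a 1-D gray-level
    signature (ridges = dark runs, valleys = bright runs, split at the
    mean gray level; a sketch, not the chapter's exact procedure)."""
    sig = np.asarray(signature, float)
    if thresh is None:
        thresh = sig.mean()
    dark = sig < thresh                                   # ridge pixels
    # indices where the dark/bright state flips, then the run segments
    change = np.flatnonzero(np.diff(dark.astype(int))) + 1
    runs = np.split(dark, change)
    ridge_runs = [len(r) for r in runs if r[0]]           # dark runs
    valley_runs = [len(r) for r in runs if not r[0]]      # bright runs
    ra = float(np.mean(ridge_runs)) if ridge_runs else 0.0
    va = float(np.mean(valley_runs)) if valley_runs else 0.0
    return ra, va
```

The resulting averages are the values that parameterize the ridge line following (section length and tracking interval).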
Fig. 4. Procedure of average ridge and valley width estimation.

2.4. Stopping Criteria
There are four stopping criteria that terminate ridge line following: departure from the ROI, termination, intersection, and excessive bending. The termination criterion is relatively sensitive to noise. Thus, we propose a new termination criterion, which is robust against noise and more reliable for extracting ending features than Maio and Maltoni's method.17 Figure 5 shows the procedure for detecting an ending feature. Figure 5(a) illustrates Maio and Maltoni's ending feature detection method. The point (i_c, j_c) becomes the current point, and a ridge line direction, φ_c,
Fig. 5. Detection of an ending feature: (a) Maio and Maltoni's method; (b) proposed method.
is computed. The section segment, with median point (i_t, j_t) and direction orthogonal to φ_c, is obtained, and the point (i_n, j_n) is chosen from the set of local maxima of the section segment. Then, the points (i_c, j_c) and (i_n, j_n) form the angle θ with respect to the direction φ_c.

2.5. Evaluation of the Quantitative Feature Value Using Fuzzy Sets
In the ridge line following algorithm, some false bifurcation features are generated by false ridge line following. Figure 6 illustrates the problem of false bifurcation features. The dotted lines illustrate the procedures of the ridge line following. A detected ridge line following (1) crosses a ridge line, and false bifurcation features are generated from this incorrect detection. In following (2), the incorrect trace of (1) is regarded as another ridge line,
Fig. 6. Example of false bifurcation feature declaration.
and a false bifurcation feature is extracted. Followings (3), (4), and (5) also intersect the trace of (1), and false bifurcation features are likewise extracted. In Fig. 6, four bifurcation features are incorrectly extracted. The false features lead to a low matching score or a false identification result, because the fingerprint matching process performs matching on the basis of the detected features. As a solution to this problem, we apply the concept of fuzzy sets24 to the features and evaluate quantitative feature values. Fuzzy sets were introduced by Zadeh25 as a new way to represent the vagueness of everyday life. Fuzzy interpretations of data structures are a very natural and intuitively plausible way to formulate and solve various problems. The concept of fuzzy sets can be used to represent input data as an array of membership values, denoting the degree to which the data possess certain properties and providing an estimate of missing information.26 This representation has been used to represent uncertainty in pattern recognition tasks.27,28,29 Figure 7 shows enlarged examples of detecting false features (caused by false ridge line following) in real fingerprint images, as depicted in Fig. 6. False ridge line following occurs because of excessively thin valleys or unclear ridge lines, which result in false features. The distribution of false features tends to be concentrated. Since the features are the critical primitives for representing fingerprint patterns, the reliability of personal identification is reduced if point matching is used with false features. To improve reliability, we apply the concept of fuzzy sets to the extracted features and define quantitative feature values, resulting in a robust fingerprint identification algorithm based on hybrid pattern recognition methods.
Fig. 7. Examples of detecting false features.
Let A = (a, c) denote a fuzzy set, where a represents the center and c signifies the spread or width of the fuzzy set. Let μ_A(x) denote the
Fig. 8. Membership function employed in the proposed method.
membership function of A, which is given by

    μ_A(x) = R((x − a)/c),  c > 0,    (1)

where R(x) is the reference function with the properties: R(x) = R(−x), R(0) = 1, and R is strictly decreasing on [0, ∞). If the reference function is chosen as R(x) = exp(−|x|^p), then the membership function30,31 becomes

    μ_A(x) = exp(−|x − a|^p / c^p),    (2)

where the parameter p determines the shape of the fuzzy membership function. If the parameters a, c, and p in Eq. (2) are set to 0, √30, and 2, respectively, then we obtain the membership function used in our study:

    μ_A(x) = exp(−x²/30),    (3)
which is illustrated in Fig. 8. To evaluate the quantitative feature value, we use Eq. (2). Figure 9 shows the weighted feature value window used to define the feature value, where each feature to be evaluated is centered at pixel (i, j) and the window size is M × M. Figure 9(a) shows the feature distribution (Case A), in which at least one feature exists in the neighborhood within the window. In Fig. 9(b), there is only one feature, centered in the window (Case B). Let d(u, v) denote the Euclidean distance32,33 between the center
Fig. 9. Weighted feature value window: (a) Case A; (b) Case B.
pixel (i, j) and the neighboring pixel (u, v), expressed as

    d(u, v) = √((u − i)² + (v − j)²).    (4)
Let μ_A(i, j) denote the two-dimensional (2-D) membership function of A. Eq. (4) is combined with Eq. (2), and thus the 2-D membership function for evaluating the feature value is given by

    μ_A(i, j) = Σ_{(u,v)≠(i,j)} f(u, v) exp(−d²(u, v)/c²),    (5)

where the sum is taken over the pixels (u, v) of the M × M window centered at (i, j), and

    f(u, v) = 1, if a feature exists at (u, v) within the window,
    f(u, v) = 0, otherwise,    (6)
where M is assumed to be odd. The parameter c determines the window size. The optimum window size is determined by the average widths of ridge lines and valleys in an image. Let WS(n) denote the size of the nth weighted feature value window, corresponding to the nth fingerprint image. Then WS(n) is defined by

    WS(n) = 3·Ra + 2·Va,    (7)

where Ra and Va denote the average widths of ridge lines and valleys, respectively. Assume that the window contains three ridge lines and two valleys. If the detected feature is located at the center of the window, we search for neighboring features around the center within the window
(see Fig. 9(a)). In such a case, most of these features are false features extracted by false ridge line following. To overcome this problem, the neighboring features around the center pixel are considered. Let QFV(n) indicate the quantitative feature value of the nth extracted feature from the fingerprint image. Then we define the proposed feature value by

    QFV(n) = max(1 − μ_A(i, j), 0),  for Case A,
    QFV(n) = 1,                      for Case B.    (8)

Equation (8) assigns a quantitative value to the feature centered in the weighted feature value window, based on a fuzzy determination of whether it is a feature or not. If neighboring features exist within the window (Case A of Fig. 9(a)), the feature value of the center feature of the window is evaluated by subtracting Eq. (5) from 1, where Eq. (5) is calculated only for the neighboring pixels (u, v) in the window, excluding the center pixel. If the calculated quantitative feature value is smaller than 0, it is set to 0. If there are no neighboring features within the window (Case B of Fig. 9(b)), the value of the feature centered at the window is set to 1. Thus, the values of the extracted features in a fingerprint image are mapped into [0, 1].

3. Feature Matching in Fingerprint Images

In Section 2, we described the process of feature extraction using the modified ridge line following algorithm. In this section, we describe a method of personal identification that takes advantage of the extracted features. Feature matching for automatic fingerprint identification is performed by point pattern matching. A large number of point pattern matching methods have been presented in the literature.34,35 In this study, we use the point matching method for fingerprint identification. We generate an adaptive matching boundary to sort out the false features. Quantitative feature values for each feature are used in calculating the matching scores. We also propose a fast feature matching method based on a two-step estimation of transformation parameters.

3.1. Determination of Adaptive Matching Boundary
Features are extracted from fingerprint images using the ridge line following algorithm. However, this method can extract false features near the
Fig. 10. Example of feature extraction result: (a) Acquired fingerprint image, (b) False features caused by the uniform ROI applied to the entire fingerprint image.
boundary of the acquired fingerprint images, resulting in false matching. We analyze the patterns of ridge lines in the acquired fingerprint image, and construct a region of interest (ROI) that excludes these boundary regions. Figure 10 shows an example of ROI determination from the input fingerprint image. Figure 10(a) shows the acquired fingerprint image. If the ROI is the entire fingerprint image, false ending features are produced around the ROI boundary, as shown in Fig. 10(b). In Fig. 10(b), the line (–) marks denote the ridge directions, whereas the box (□) marks represent the feature locations. To reduce false features, we must select desirable features based on the characteristics of each fingerprint image. To determine the ROI in an image, the variation of gray level values in the fingerprint image is considered. Let I(i, j) represent the gray level value at pixel (i, j), and let m and σ² denote the mean and variance of I, respectively. Let N(i, j) signify the normalized gray level value at pixel (i, j), and let m0 and σ0² denote the desired mean and variance of N, respectively. Then the normalized image N(i, j)22 is given by

    N(i, j) = m0 + (σ0/σ)|I(i, j) − m|,  if I(i, j) > m,
    N(i, j) = m0 − (σ0/σ)|I(i, j) − m|,  otherwise.    (9)
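A minimal sketch of this normalization follows; the desired mean and variance values are illustrative placeholders, since the chapter does not state its settings.

```python
import numpy as np

def normalize(img, m0=100.0, var0=100.0):
    """Gray-level normalization of Eq. (9): map the image to a desired
    mean m0 and variance var0 (illustrative values)."""
    img = np.asarray(img, float)
    m, var = img.mean(), img.var()
    dev = np.sqrt(var0 / var) * np.abs(img - m)
    # pixels brighter than the mean move above m0, darker ones below
    return np.where(img > m, m0 + dev, m0 - dev)
```

After this mapping the image has mean m0 and variance var0 regardless of the sensor's original gray-level range, which makes the block-variance ROI test that follows comparable across acquisitions.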
a We divide a 288 x 240 normalized fingerprint image into a number of 16 x 16 nonoverlapping blocks, and compute the standard deviation of each block. Let cr block denote the standard deviation of each block in the
Fig. 11. Example of feature extraction result: (a) Modified fingerprint image using the ROI, (b) Features extracted by the adaptive ROI.
normalized fingerprint image. Then the ROI of the acquired fingerprint image is defined by

    ROI_block = 1, if σ_block > σ_N,
    ROI_block = 0, otherwise,    (10)

where σ_N denotes the standard deviation of the whole normalized fingerprint image. From Eq. (10), the ROI is constructed of the blocks in which the standard deviation is larger than that of the normalized fingerprint image. The variations of gray level are relatively large in the ROI blocks because ridge lines and valleys, whose gray level values differ greatly from each other, coexist. Figure 11(a) shows the construction of the adaptive matching boundary for Fig. 10(a) by excluding pixels outside the ROI. Our method adaptively determines an ROI that is appropriate for a given image. Figure 11(b) shows that the false ending features are eliminated by excluding them from the ROI. There are two advantages in sorting out the false features by using the adaptive matching boundary. First, the feature matching process becomes more reliable. Second, the time required for feature matching is reduced.
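The block-wise ROI selection of Eq. (10) can be sketched as follows; the 16 × 16 block size follows the text, and the function name and boolean-mask representation are illustrative choices.

```python
import numpy as np

def roi_mask(norm_img, block=16):
    """Adaptive ROI of Eq. (10): keep the blocks whose gray-level standard
    deviation exceeds that of the whole normalized image."""
    h, w = norm_img.shape
    sigma_img = norm_img.std()          # std of the whole normalized image
    mask = np.zeros((h // block, w // block), dtype=bool)
    for by in range(h // block):
        for bx in range(w // block):
            blk = norm_img[by * block:(by + 1) * block,
                           bx * block:(bx + 1) * block]
            mask[by, bx] = blk.std() > sigma_img   # Eq. (10)
    return mask
```

Blocks containing alternating ridges and valleys have high local variance and survive; flat background blocks near the image boundary do not, which is exactly how the false ending features of Fig. 10(b) get excluded.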
3.2. Feature Matching
Feature matching searches for the best match between the features of the query image and those of the images in the fingerprint database. For each feature in the query image, we check whether there is a matched feature in the database fingerprint image. In feature matching, the features in the query image are rotated, translated, and superimposed on the features of the database
images. The best match gives the user's identity. In order to estimate the unknown rotation and translation parameters, Ratha et al. introduced the generalized Hough transform.36 The conventional Hough transform37 can be generalized for point pattern matching. We estimate the transformation parameters by discretizing the parameter space and exhaustively searching the discrete space. Let T_{θ,Δx,Δy} denote the transformation. Then the transform of a query feature located at (x, y) is given by

    T_{θ,Δx,Δy}(x, y) = (x cos θ + y sin θ + Δx, −x sin θ + y cos θ + Δy),    (11)

i.e., a rotation by θ followed by a translation by (Δx, Δy), where θ and (Δx, Δy) are the rotation and translation parameters, respectively. The transformation parameters are discretized into

    θ  ∈ {θ1, θ2, ..., θJ},
    Δx ∈ {Δx1, Δx2, ..., ΔxK},
    Δy ∈ {Δy1, Δy2, ..., ΔyL},    (12)
where J, K, and L denote the total numbers of discretized values of θ, Δx, and Δy, respectively. J·K·L transforms are performed for each image in the database, and the matching score is calculated for each combination of parameter values. Over all combinations of parameter values, the best set of transformation parameters is selected, based on the maximum matching score. Figure 12 illustrates the matching criterion for a query feature and a database feature. A tolerance for matching is needed due to the elasticity of the skin. The '■', '□', and '/' marks denote the database features, query features, and ridge directions, respectively. Figure 12(a) shows two matched
Fig. 12. Examination of paired features: (a) matched features; (b) unmatched features.
features, in which the query feature lies inside the tolerance area of the database feature, and the ridge directions are within some tolerance of each other. Figure 12(b) shows two unmatched features, in which the query feature is outside the tolerance area or the ridge directions are not within a tolerance of each other.

3.3. Weighted Matching Scores
A matching score for fingerprint features establishes a correspondence between the query and all the elements of the database feature set. In our algorithm, the extracted features of the query fingerprint are rotated, translated, and superimposed on the features of a database fingerprint. Those query features satisfying the condition of Fig. 12(a) are regarded as matched features. Other proposed matching algorithms were based on finding the number of paired features between each database and query fingerprint.36,38 The matching score in these algorithms is computed based on the number of paired features. In this paper, we propose a weighted matching score using quantitative feature values. Let MS denote the matching score between a fingerprint in the database and the query fingerprint, given by

    MS = 100 · [Σ_{k=1}^{NQ} g(k)·max(a_k, b_k)] / max(Σ_{i=1}^{NQ} a_i, Σ_{j=1}^{NR} b_j),    (13)

where NQ (NR) signifies the number of query (a certain user's) features, b_k denotes the QFV of the database feature paired with query feature k, and g(k) is the discrimination function representing whether the two features are matched or not, defined by

    g(k) = 1, if the two features are matched,
    g(k) = 0, otherwise,  k = 1, 2, ..., NQ.    (14)
MS is a confidence value, representing how well the query matches a fingerprint in the database. The coefficient a_i represents the QFV of the ith query feature, defined by Eq. (8). The coefficient b_j signifies the QFV of the jth feature of a certain user in the fingerprint database. The denominator of Eq. (13) normalizes the matching score: the maximum of the accumulated QFVs is selected as the normalization term. The numerator of Eq. (13) represents the QFVs of the matched features between the query image and a certain user's image. If the number of paired (matched) features is Np, then Np QFVs
are accumulated, each obtained by selecting the maximum of the query feature QFV and the paired database feature QFV.

3.4. Two-Step Estimation of Transformation Parameters
Estimating the transformation parameters in Eq. (12) is computationally expensive. Since the computational burden is proportional to the number of features registered in the database, it is hard to perform real-time personal identification for large fingerprint databases. To reduce the computational complexity, we propose a fast feature matching method based on a two-step estimation of the transformation parameters. Any parameter estimation algorithm must consider the correspondence between query and database features for all the combinations of transform parameters represented in Eq. (12).36 Rotation and translation of the extracted query features with respect to the registered fingerprints must be handled whenever a query fingerprint is acquired. Since there are wide ranges of rotation and translation parameter values, effective estimation of the transformation parameters is a critical factor in a reliable fingerprint identification system. We employ a two-step estimation of transformation parameters. In the first step, a coarse matching is performed, and matching scores are computed between the query image and all the images registered in the database. In the second step, a fine matching is performed between the query image and the high-scoring database images from the first step. The registered user with the maximum matching score is then identified. Our fingerprint identification system identifies a user based on the matching score. Figure 13 shows the distribution of matching scores for correct and incorrect matching. The dotted line denotes the decision threshold
Fig. 13. Distribution of correct and incorrect matching scores.
MS_th for deciding whether a matching is correct or not; (a) and (b) denote the regions of false rejection (FR) and false acceptance (FA), respectively. The decision criterion should establish a decision boundary that minimizes the FR rate for a prespecified FA rate. The prespecified FA rate of a biometric system is usually very small.5,39,40 Figure 14 shows the flowchart of the proposed feature matching algorithm, using a two-step estimation of transformation parameters for fast
Fig. 14. Flowchart of the proposed fast feature matching algorithm using two-step estimation of transformation parameters.
matching. B(j, k, l) represents the accumulator array in the generalized Hough transform, with j, k, and l denoting the indices of the discretized parameters θ, Δx, and Δy, respectively. The left part of the flowchart illustrates the coarse matching process, in which the matching score is evaluated over a coarse subset of the discretized parameter values. If a certain user's matching score after coarse matching is smaller than one third of the prespecified threshold matching score MS_th, the user is excluded from the fine matching process. The right part of the flowchart represents the fine matching process, in which matching is performed with the remaining registered users over the remaining parameter values. The matching score is evaluated and added to the coarse matching score. Because the features of most registered users are greatly different from the query features, most of the registered users are excluded in the coarse matching process. Let NQ and ND denote the numbers of query features and of all registered database features, respectively, and let J, K, and L signify the total numbers of discretized parameter values of θ, Δx, and Δy. Then the required number of operations is reduced from ND·J·K·L·NQ to about (1/2)·ND·J·K·L·NQ by the two-step estimation of transformation parameters.

4. Experimental Results and Discussions

In our experiments, we capture fingerprint images using an NITGen SecuGen sensor. Figure 15 shows the fingerprint image acquisition sensor used in our experiments. Its resolution is 450 dpi with an LED light source, and it is connected to a PC using the FDA01 Developer's Kit software. We conduct our experiments on a 400 MHz Pentium II using programs written in C. We capture fingerprint images from 50 individuals and obtain five
Fig. 15. Fingerprint scanner used in experiments.
Fig. 16. Examples of good-quality images.

Fig. 17. Examples of poor-quality images.
images for each person, yielding a total of 250 images. The size of these fingerprint images is 288 × 240. The quality of the images varies with the condition of the finger; dryness is one important condition for obtaining a good image. About 80 percent of the obtained images are of good quality, with some examples shown in Fig. 16. About 20 percent of the obtained images are of poor quality, with some examples shown in Fig. 17. Figure 18 shows the overall procedure of the proposed fast feature matching algorithm. When the query image is acquired from the input sensor, the features are extracted from the image and the quantitative feature values are evaluated. After extraction of the query features, the registered database features are retrieved. The transformation parameters for the query features are estimated using all database features, and the weighted matching score is computed for all registered users in the retrieval process. If the maximum matching score is larger than the prespecified threshold, the user with that maximum score is selected as the correct person. Figure 19 illustrates the evaluation of the QFV for the extracted features using the weighted feature value window. The weighted feature value
Fig. 18. Overall procedure of the fingerprint identification: query fingerprint → query features; database fingerprint → database features; matching result.

Fig. 19. Examples of the QFV: (a) concentrated features (QFVs ≈ 0.350 and 0.348); (b) normal extracted features (QFV = 1.000).
window is centered at each extracted feature, and the QFV of the feature is calculated by Eq. (8). Figure 19(a) shows the evaluation of the QFV for concentrated features caused by false ridge line following, which occurs because the intervals of the ridge lines are close to each other. The QFVs take relatively low values. Figure 19(b) shows the QFV for normally extracted features. The bifurcation features are easily recognized, and their QFVs are set to 1.
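The QFV evaluation of Eqs. (4), (5), and (8) can be sketched as a short, runnable function; the window size M = 13 is an illustrative stand-in, c² = 30 follows the membership function of Eq. (3), and features are given simply as coordinate pairs.

```python
import numpy as np

def qfv(features, M=13, c2=30.0):
    """Quantitative feature value of Eq. (8): isolated features score 1
    (Case B); features with neighbours inside the M x M window have their
    value lowered by the summed Gaussian membership of Eq. (5) (Case A)."""
    pts = np.asarray(features, float)
    values = []
    for i in range(len(pts)):
        d = np.linalg.norm(pts - pts[i], axis=1)        # Euclidean distance, Eq. (4)
        # neighbours inside the window, excluding the feature itself
        inside = (d > 0) & (np.abs(pts - pts[i]) <= M // 2).all(axis=1)
        if not inside.any():
            values.append(1.0)                          # Case B
        else:
            mu = float(np.exp(-d[inside] ** 2 / c2).sum())   # Eq. (5)
            values.append(max(1.0 - mu, 0.0))           # Case A, clipped at 0
    return values
```

Clusters of features produced by false ridge line following end up close together, so their summed membership is large and their QFVs drop toward 0, while genuine isolated minutiae keep the full weight of 1.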
For our experiments, two fingerprint images per person are registered in the database, for a total of 100 fingerprint images in the database. The remaining three fingerprint images per person are used as the query images. Figure 20 shows the distributions of correct and incorrect matching scores, over 15,000 combinations of query images and database images. The dotted line illustrates the distribution of the matching score using the method of counting the number of paired features.36,38 The solid line represents the distribution of the matching score obtained by our proposed method. The distributions of incorrect matching scores have similar shapes. However, the distribution of correct matching scores based on the QFV gives higher matching scores than the method of counting the number of paired features. As a result, for incorrect matching, the proposed method gives smaller percentage values. This trend leads to easier selection of a threshold that minimizes the FR rate for the specified FA rate.
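The weighted matching score of Eq. (13) can be sketched as below for features that are already aligned; the rotation/translation search and the ridge-direction tolerance are omitted, and the greedy nearest-pairing and the tolerance value are illustrative assumptions, not the chapter's exact procedure.

```python
def matching_score(query, database, tol=8.0):
    """Weighted matching score of Eq. (13) for aligned features.

    `query` and `database` are lists of ((x, y), qfv) pairs; a query
    feature pairs with an unused database feature when their locations
    fall within the tolerance distance."""
    matched = 0.0
    used = set()
    for (qx, qy), qa in query:
        for j, ((dx, dy), db) in enumerate(database):
            if j in used:
                continue
            if (qx - dx) ** 2 + (qy - dy) ** 2 <= tol ** 2:
                matched += max(qa, db)   # numerator: larger QFV of the pair
                used.add(j)
                break
    # denominator of Eq. (13): the larger accumulated QFV normalizes MS
    norm = max(sum(a for _, a in query), sum(b for _, b in database))
    return 100.0 * (matched / norm) if norm else 0.0
```

Because each pair contributes its QFV rather than a flat count of 1, low-confidence (likely false) features barely move the score, which is the mechanism behind the separation of the two distributions in Fig. 20.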
Fig. 20. Distribution of correct and incorrect matching scores (Method of counting the number of paired features vs. Proposed method).
Table 1 presents the FR rate and correct recognition rate of the method of counting the number of paired features for various MS_th. Table 2 shows the FR rate and correct recognition rate of the proposed method for various MS_th. While a maximum correct recognition rate is required (minimum FA rate)
Table 1. FR rate and identification rate of the method of counting the number of paired features.

    MS_th    FR rate (%)    Identification rate (%)
    18       29.08          99.757
    20       39.04          99.931
    22       49.01          99.993
    24       57.62          100
Table 2. FR rate and identification rate of the proposed method.

    MS_th    FR rate (%)    Identification rate (%)
    18       15.20          99.359
    20       19.59          99.903
    22       27.36          99.938
    24       33.45          100
Table 3. Required average matching time.

    Parameter estimation method    Average matching time (sec.)
    Full search method             2.03
    Two-step estimation            1.12
in the fingerprint identification system, a low FR rate is also desirable. From Tables 1 and 2, the FR rate of the proposed method is much smaller than that of the method of counting the number of paired features at a similar identification rate. This indicates that the proposed fingerprint identification method can effectively differentiate users. This performance improvement is due to the robustness of the QFV. Table 3 shows the average matching time per image. The average matching time of the two-step estimation method is reduced to 55% of that of the full search method. Note that the proposed two-step estimation method does not result in performance degradation, and thus this fast fingerprint identification method can be applied to practical systems with large fingerprint databases. The performance of any fingerprint identification algorithm is sensitive to the quality of the fingerprint images. Most low matching scores or high FR rates are due to the poor quality of the input images. If the quality of the
acquired fingerprint images is enhanced, the performance of our fingerprint identification system can be improved.

5. Conclusions
We have implemented a robust fingerprint identification system using hybrid methods. In the feature extraction stage, the average ridge line width is estimated in the ridge line following algorithm, and a new stopping criterion is proposed. Moreover, a fuzzy technique is applied to the features, and quantitative feature values are evaluated using the weighted feature value window. In the feature matching stage, the patterns of ridge lines in the fingerprint image are analyzed and an adaptive matching boundary is constructed to eliminate false features. The weighted matching score is computed using the quantitative feature value for more reliable matching. A two-step estimation of transformation parameters is also employed to reduce the computation time. Experimental results show that the proposed fingerprint identification approach is effective for personal identification. Further research will focus on improving the system's tolerance for degraded or poor-quality fingerprint images.

References

1. R. Clarke, Human identification in information systems: Management challenges and public policy issues, Info. Technol. People, 7(4):6-37, 1994.
2. S. G. Davies, Touching Big Brother: How biometric technology will fuse flesh and machine, Info. Technol. People, 7(4):60-69, 1994.
3. A. K. Jain, L. Hong, S. Pankanti, and R. Bolle, An identity authentication system using fingerprints, Proceedings of the IEEE, 85(9):1365-1388, Sept. 1997.
4. A. K. Jain, R. M. Bolle, and S. Pankanti, Eds., Biometrics: Personal Identification in a Networked Society, Norwell, MA: Kluwer, 1999.
5. J. G. Daugman, High confidence visual recognition of persons by a test of statistical independence, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-15(11):1148-1161, Nov. 1993.
6. N. K. Ratha, S. Chen, and A. K. Jain, Adaptive flow orientation based feature extraction in fingerprint images, Pattern Recognition, 28(11):1657-1672, Nov. 1995.
7. H. C. Lee and R. E. Gaensslen, Eds., Advances in Fingerprint Technology, New York: Elsevier, 1991.
8. B. Miller, Vital signs of identity, IEEE Spectrum, 31(2):22-30, Feb. 1994.
9. J. Wood, Invariant pattern recognition: A review, Pattern Recognition, 29(1):1-17, Jan. 1996.
10. L. Coetzee and E. C. Botha, Fingerprint recognition in low quality images, Pattern Recognition, 26(10):1441-1460, Oct. 1993.
11. A. R. Rao and K. Balck, Type classification of fingerprints: A syntactic approach, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2(3):223-231, Mar. 1980.
12. C. L. Wilson, G. T. Candela, and C. I. Watson, Neural-network fingerprint classification, J. Artificial Neural Networks, 1(2):203-228, Feb. 1994.
13. B. M. Mehtre, N. N. Murthy, and S. Kapoor, Segmentation of fingerprint images using the directional image, Pattern Recognition, 20(4):429-435, Apr. 1987.
14. L. O'Gorman and J. V. Nickerson, An approach to fingerprint filter design, Pattern Recognition, 22(1):29-38, Jan. 1989.
15. Z. R. Li and D. P. Zhang, A fingerprint recognition system with microcomputer, in Proc. 6th ICPR, Montreal, Canada, 1984, pp. 939-941.
16. D. Maio, D. Maltoni, and S. Rizzi, An efficient approach to online fingerprint verification, in Proc. 8th Int. Symposium on AI, Monterrey, Mexico, Oct. 1995, pp. 132-138.
17. D. Maio and D. Maltoni, Direct gray-scale minutiae detection in fingerprints, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-19(1):27-40, Jan. 1997.
18. M. J. Donahue and S. I. Rokhlin, On the use of level curves in image analysis, Image Understanding, 57(2):185-203, Feb. 1993.
19. M. Kass and A. Witkin, Analyzing oriented patterns, Comput. Vis. Graph. Image Processing, 37(3):362-385, Mar. 1987.
20. M. Kawagoe and A. Tojo, Fingerprint pattern classification, Pattern Recognition, 17(3):295-303, May 1984.
21. A. R. Rao, A Taxonomy for Texture Description and Identification, Springer-Verlag, New York, 1990.
22. L. Hong, Y. Wan, and A. K. Jain, Fingerprint image enhancement: Algorithm and performance evaluation, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-20(8):777-789, Aug. 1998.
23. A. K. Jain, S. Prabhakar, L. Hong, and S. Pankanti, Filterbank-based fingerprint matching, IEEE Transactions on Image Processing, IP-9(5):846-859, May 2000.
24. L. A. Zadeh, Fuzzy logic, IEEE Computer, 21(4):83-93, 1988.
25. L. A. Zadeh, Fuzzy sets, Information and Control, 8(3):338-353, 1965.
26. Y. S. Choi and R. Krishnapuram, A robust approach to image enhancement based on fuzzy logic, IEEE Transactions on Image Processing, IP-6(6):808-825, June 1997.
27. T. L. Huntsberger, C. Rangarajan, and S. N. Jayaramamurthy, Representation of uncertainty in computer vision using fuzzy sets, IEEE Transactions on Computers, C-35(2):145-156, Feb. 1986.
28. S. K. Pal, Fuzzy tools for the management of uncertainty in pattern recognition, image analysis, vision and expert systems, Int. J. Syst. Sci., 22(3):511-549, Mar. 1991.
29. S. K. Pal and B. Chakraborty, Fuzzy set theoretic measure for automatic feature evaluation, IEEE Transactions on Systems, Man, and Cybernetics, SMC-16(5):754-760, Sept. 1986.
30. C. C. Lee, Fuzzy logic in control systems: Fuzzy logic controller - Part I & Part II, IEEE Transactions on Systems, Man, and Cybernetics, SMC-20(2):404-435, 1990.
31. K. N. Plataniotis, D. Androutsos, and A. N. Venetsanopoulos, Adaptive fuzzy systems for multichannel signal processing, Proceedings of the IEEE, 87(9):1601-1622, Sept. 1999.
32. M. Barni, V. Cappellini, and A. Mecocci, Fast vector median filter based on Euclidean norm approximation, IEEE Signal Processing Lett., 1(6):92-94, June 1994.
33. J. Chaudhuri, C. A. Murthy, and B. B. Chaudhuri, A modified metric to compare distances, Pattern Recognition, 25(5):667-677, May 1992.
34. A. Ranade and A. Rosenfeld, Point pattern matching by relaxation, Pattern Recognition, 12(4):269-275, Dec. 1980.
35. J. P. P. Starink and E. Backer, Finding point correspondence using simulated annealing, Pattern Recognition, 28(2):231-240, May 1995.
36. N. K. Ratha, K. Karu, S. Chen, and A. K. Jain, A real-time matching system for large fingerprint databases, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-18(8):799-813, Aug. 1996.
37. D. H. Ballard, Generalizing the Hough transform to detect arbitrary shapes, Pattern Recognition, 13(2):111-122, 1981.
38. A. K. Jain, L. Hong, and R. Bolle, On-line fingerprint verification, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-19(4):302-314, Apr. 1997.
39. J. G. Daugman and G. O. Williams, A proposed standard for biometric decidability, in Proc. CardTech/SecureTech Conf., Atlanta, GA, 1996, pp. 223-234.
40. L. Hong and A. K. Jain, Integrating faces and fingerprints for personal identification, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-20(12):1295-1307, Dec. 1998.
CHAPTER 12

TEXT CATEGORIZATION USING LEARNED DOCUMENT FEATURES
M. Junker, A. Abecker and A. Dengel
German Research Center for Artificial Intelligence (DFKI) GmbH
P. O. Box 2080, 67608 Kaiserslautern, Germany
E-mail: {markus.junker, andreas.abecker, andreas.dengel}@dfki.de

The categorization of texts or text fragments into a given set of classes is a problem with ever increasing practical relevance. In this chapter we first present a symbolic approach for learning expressive document features, so-called complexes. This approach can be used, e.g., for learning features in the form of specific word sequences or substring tests. Such features are employed in two different learning approaches for generating text classifiers. In the first approach, the learning of complexes is embedded in a symbolic rule learner. In the second approach we use symbolically learned complexes to represent documents for a subsymbolic learning approach, namely, the Support Vector Machine (SVM). We compare the results of the purely symbolic rule learning, a purely statistical approach based on the SVM, and the hybrid approach. It is shown experimentally that rule learning as well as the SVM benefit from learning expressive document features in the form of complexes. In particular, best results are obtained by the hybrid approach combining the symbolic learning of document features with the SVM.

1. Introduction

Categorization of texts and text fragments into a given set of content-oriented categories is a job people are often confronted with in the office, e.g.:

• e-mails are sorted into mail folders for storage, or to indicate priority and further processing;
• business letters are classified according to their type (invoice, purchase order, etc.) or to the business process affected in order to dispatch them
accordingly within the company, or to trigger subsequent electronic processing such as automatic extraction of the total amount of an invoice, etc.;
• incoming news about markets, products, or new technologies are categorized according to their topic in order to forward them to interested subscribers of the respective information services;
• newspaper articles, technical documents, scientific information, internal bulletins, etc. are indexed for storage and retrieval in document management systems, libraries, and corporate archives.

It is apparent that automated text or document categorization systems save time and human labor, and ensure a consistent assignment of categories to documents, which is not guaranteed when relying upon human indexers. Today, automatic categorization systems usually employ hand-crafted categorization rules of the form

if text pattern pc occurs in document d then document d belongs to category c

Languages for describing categorization patterns typically allow for boolean combinations of tests for word occurrences. For instance, the pattern (or (and gold jewelry) (and silver jewelry)) can be used to find documents which deal with gold or silver jewelry. More elaborate pattern languages provide, e.g., tests for word sequences or specific word properties (such as inclusion of a given substring). The most prominent example of a rule-based categorization system is the TCS shell, which has been successfully applied to the categorization of business newswire articles.1,2 In the literature a number of other examples of such systems are given.3,4,5,6 In order to reduce human effort for the manual design of the categorization rules, there is a growing interest in learning categorization systems which take categorized example documents as input, and automatically extract category-specific classifiers.
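As an illustration only (this is not the TCS shell's syntax), such boolean word-occurrence patterns can be evaluated in a few lines; the nested-tuple encoding and function name here are our own assumptions:

```python
def matches(pattern, words):
    """Evaluate a boolean word-occurrence pattern against a document's word set."""
    if isinstance(pattern, str):          # leaf: a single word test
        return pattern in words
    op, *args = pattern
    if op == "and":
        return all(matches(a, words) for a in args)
    if op == "or":
        return any(matches(a, words) for a in args)
    if op == "not":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op}")

doc = set("a ring made of silver jewelry for sale".split())
pattern = ("or", ("and", "gold", "jewelry"), ("and", "silver", "jewelry"))
print(matches(pattern, doc))   # True: the document mentions silver jewelry
```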
Such systems are especially useful in situations where rapid developments in the application domain (e.g., new content categories arising, typical vocabulary for specific topics shifting over time, etc.) or frequent changes of the usage environments (new users, other use of categorization results) require continuous maintenance and evolution of classifiers. The majority of approaches for learning document categories rely on subsymbolic approaches such as Naive Bayes classifiers or the Support
Vector Machine (SVM).7,8,9,10,11,12,13 Until now, only few symbolic learners (i.e., decision tree or rule learning algorithms) have been applied to text categorization.14,15 A few rule learners try to use more elaborated document features such as specific phrases or the occurrence of specific substrings.16,17 In contrast to the manual design of categorization rules, the majority of learning approaches is based on simple document representations borrowed from traditional information retrieval18,19 (cf. Fig. 1): a document is described by a vector. Each position represents one of the set of all words occurring somewhere in the document collection (or the set of all words in the collection considered relevant for classification); the value at a certain position reflects whether (or, how often) a word is contained in the encoded document. The main hypothesis of this chapter is that more effective document classifiers can be built on the basis of richer document representations than those only counting the words within documents. In many other pattern recognition problems, the learning algorithm can exploit sophisticated example representations based on a thorough understanding of the classification problem, which is translated into an (ideally small) set of meaningful features describing a particular example. In text categorization, this is not so simple. At first glance, only the pure words are available for example representation. No meta information or abstract descriptions exist; no information about the context of a word occurrence can be easily kept. There is no way to discern words which are relevant for a classification decision and words which are not. Hence, the majority of learning approaches in the text categorization domain employ the above mentioned vector representations
Fig. 1. Simple feature-value representation of documents.
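The feature-value representation of Fig. 1 can be sketched as follows; here with binary occurrence values (a count-based variant would store word frequencies instead), and the small vocabulary is a toy assumption:

```python
def vectorize(doc_words, vocabulary):
    """One vector position per vocabulary word: 1 if the word occurs in the document."""
    return [1 if w in doc_words else 0 for w in vocabulary]

vocab = ["a", "able", "bank", "create", "creative", "currency"]
doc = set("commerce bank houston said it filed an application with the "
          "controller of the currency in an effort to create the largest".split())
print(vectorize(doc, vocab))   # [0, 0, 1, 1, 0, 1]
```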
for documents, and rely on a linear classification approach. Since there is no easy way to distinguish between the important and unimportant document features, the classifiers let as many features as possible contribute to a categorization decision. In our approach, we use a richer document representation which maintains information about the sequence of words in a document. In order to learn expressive classifiers (similar to hand-crafted categorization rules) which heavily exploit this information, we distinguish two learning phases: (1) In the feature construction phase, we provide a powerful hypothesis language for accurate characterization of document sets using complex document features, called complexes. In Section 2 we present a general framework for this task which is based upon an extensible set of declaratively specified operators for refinement of hypotheses, each of them extending the hypothesis language by a specific syntactic construct (like substring tests, conjunction of tests, etc.). This sophisticated feature construction step turns out to be a learning problem of its own which is subject to the known problems typical for learning with rich hypothesis languages, namely, huge search spaces and overfitting. In order to address the former one, we introduce a strategy language which allows for an easy, declarative specification of control knowledge to guide the search process by heuristics. In Section 4 we demonstrate the use of strategies to show how specific refinement operators can be utilized efficiently. The overfitting problem is dealt with in Section 3 where we instantiate the search procedure of our framework with specific parameters which — in our experimental work — turned out to be well-suited for text categorization. (2) In the classifier learning phase we build classifiers for specific text categories, which employ the complex document features already constructed. 
This two-phase approach opens up new possibilities for hybrid classification algorithms, because the separate feature construction can be coupled as a stand-alone module with other arbitrary learning paradigms. In Section 5 we show how feature construction is embedded as an inner loop into a separate-and-conquer approach for symbolic rule learning. In Section 6 we describe how the Support-Vector-Machine (SVM) can be improved by using our feature construction phase as a preprocessing step to generate the input for the SVM. Both approaches are backed up with extensive experimental evaluation.
2. General Framework

2.1. Document Representation
In our framework, real world documents (usually given as character sequences) are transformed into sequences of words (w1 ... wn), where a word wi is represented as a character sequence:

dexample: ... commerce bank houston said it filed an application with the controller of the currency in an effort to create the largest ...

(with the words numbered consecutively as w41, w42, ..., w61)
There is some freedom in how to separate a character sequence in the original document into a word sequence. For instance, words with a hyphenation mark "-" may be split or not, punctuation marks may be split off or not, words can be transformed into lower case or not, etc. In initial experimental evaluations it turned out that the methods described later were not sensitive to such decisions. It is important to note that the transformation preserves almost all information contained in the original document. This allows us, e.g., to define arbitrary features that rely on word sequences or complex word properties as described in the next section.

2.2. Complexes
Document features can be described by so-called complexes. The term complex is taken from the symbolic rule learning algorithm CN2.20 In CN2 a complex is a conjunction of tests on specific feature values. A complex in our sense is an arbitrary predicate computed based on the document representation. In order to define predicates, a language for complexes C together with an appropriate semantics is provided. This language is based on pattern languages as used in text categorization systems in which rules are written manually, like TCS.1 The semantics of complexes C is defined by a match function which assigns a truth value to all documents D: match : C × D → {0, 1}. If the value assigned is 1, the complex matches (or, as we say, it covers the document). If the value assigned is 0, the complex does not match (respectively, it does
not cover the document). A typical complex is a conjunction of word tests w1 ∧ w2 ∧ ... ∧ wn. It matches for a given document iff all words wi occur in the respective document. Word tests wi can also be replaced by so-called substring tests *wi, wi*, *wi*, which only require that a document contains a word which contains wi at the end, at the beginning, or somewhere in the middle. Word sequence tests of the form w1[s1]w2[s2]...[sn-1]wn (si ∈ ℕ) can be used to test whether the words w1, w2, ..., wn occur in a document in this order. The numbers si indicate the maximal number of words allowed between wi and wi+1.
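The three kinds of tests just described can be sketched as follows; the function names and the list-of-words document layout are our own illustration, not the chapter's implementation:

```python
def word_test(w, doc):
    """Plain word test: w occurs somewhere in the document."""
    return w in doc

def substring_test(pattern, doc):
    """Substring tests *w, w*, *w* over the document's words."""
    core = pattern.strip("*")
    if pattern.startswith("*") and pattern.endswith("*"):
        return any(core in w for w in doc)                  # *w*: anywhere
    if pattern.endswith("*"):
        return any(w.startswith(core) for w in doc)         # w*: at the beginning
    return any(w.endswith(core) for w in doc)               # *w: at the end

def sequence_test(words, gaps, doc):
    """words = [w1, ..., wn]; gaps[i] = max. words allowed between wi and wi+1."""
    def search(i, pos):
        if i == len(words):
            return True
        limit = len(doc) if i == 0 else pos + gaps[i - 1] + 1
        for j in range(pos, min(limit, len(doc))):
            if doc[j] == words[i] and search(i + 1, j + 1):
                return True
        return False
    return search(0, 0)

doc = "it filed an application with the controller of the currency".split()
print(word_test("currency", doc))                          # True
print(substring_test("*controll*", doc))                   # True ("controller")
print(sequence_test(["filed", "application"], [1], doc))   # True: one word between
```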
2.3. Searching for Complexes
Complexes are the basis for expressing powerful document features. Good complexes should cover at least some of the documents in a category with high precision. The precision is defined as the number of documents correctly covered by a complex, divided by the number of all documents covered by the complex. In our approach, complexes are learned for a target category based on positive and negative training documents E = E⊕ ∪ E⊖ for each category. The search for good complexes based on the training examples is done heuristically. For describing heuristics, we have introduced so-called strategy expressions which are described in the following. The basic components for building strategy expressions are refinement operators. Based on a given complex c they generate a set of new complexes c1, c2, ..., cn. Well-defined refinement operators should derive promising candidates for improved complexes based on c and a given document set. A refinement operator is a function v : 2^D × C → 2^C. It is written in the form:

(v)   c ⟶ {c1, c2, ..., cn}
In order to describe the heuristic search, we need some functions on complexes in addition to refinement operators:

• A weighting function w : 2^D × C → ℝ. It defines an order on complexes based on a given document set. Better ranked complexes are believed to yield a better precision.
• The selection function sel_{w,b} : 2^D × 2^C → 2^C. The selection function is used to find the top-ranked complexes of a set of complexes according to
their weight. The parameters w and b provide the weighting function and the number of complexes to be chosen. Applying the weighting function requires a set of evaluation documents which is also given as a parameter.
• Even if a complex is given a good weight it can still be of low precision. A second means for avoiding low precision complexes is provided by a significance measure sigm : 2^D × C × 2^C × C → ℝ. By applying a threshold ϑ, the value sigm(D, c_pre, S, c) is used to decide whether the complex c should be excluded from the search space.

Later, we will introduce the weighting function and significance measure used in our experiments. Using the introduced terms, strategy expressions are defined as follows. Let C be a set of complexes and V be a set of refinement operators. A strategy expression (or, in short: a strategy) s is a function 2^D × 2^C → 2^C. The set S of strategies based on V is given by:

• v ∈ S, if v ∈ V
• s1 ∘ s2 ∈ S, if s1, s2 ∈ S (concatenation)
• s^n ∈ S, if s ∈ S, n ∈ ℕ (n-fold concatenation)
• s+ ∈ S, if s ∈ S (closure of a strategy)
• s1 ∪ s2 ∈ S, if s1, s2 ∈ S (union)
• s|b ∈ S, if s ∈ S, w a weighting function and b ∈ ℕ (restricted refinement); b is also called beam
• [s] ∈ S, if s ∈ S (significance filter)
• (s) ∈ S, if s ∈ S

Let w be a weighting function and 'sigm' be a significance measure. The semantics of strategy expressions is given by a function 2^D × 2^C → 2^C with:

• v(D, C) = ∪_{c ∈ C} v(D, c)
• (s1 ∘ s2)(D, C) = s2(D, C) ∪ s1(s2(D, C))
• s^n(D, C) = (s ∘ s^{n-1})(D, C) if n > 1, and s^1(D, C) = s(D, C)
• (s+)(D, C) = {c | ∃n ∈ ℕ: c ∈ s^n(D, C)}
• (s1 ∪ s2)(D, C) = s1(D, C) ∪ s2(D, C)
• s|b(D, C) = s(sel_{w,b}(D, C))
• [s](D, {c1, c2, ..., cn}) = ∪_i {c' ∈ s(D, {ci}) | sigm(D, ci, s(D, {ci}), c') > ϑ}
• (s)(D, C) = s(D, C)
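A toy sketch of some of these combinators, with complexes modeled as frozensets of words (read as conjunctions of word tests) and a single word-adding refinement operator; all names and the set-based encoding are our own assumptions:

```python
def K(D, C):
    """Toy refinement operator: extend each conjunction by one word from the documents."""
    vocab = set().union(*D)
    return {c | {w} for c in C for w in vocab if w not in c}

def concat(s1, s2):
    """(s1 ∘ s2)(D, C) = s2(D, C) ∪ s1(s2(D, C))."""
    return lambda D, C: s2(D, C) | s1(D, s2(D, C))

def union(s1, s2):
    """(s1 ∪ s2)(D, C) = s1(D, C) ∪ s2(D, C)."""
    return lambda D, C: s1(D, C) | s2(D, C)

def restrict(s, w, b):
    """s|b: apply s only to the b best-weighted complexes."""
    return lambda D, C: s(D, set(sorted(C, key=lambda c: w(D, c))[-b:]))

docs = [frozenset({"gold", "jewelry"}), frozenset({"silver", "coin"})]
start = {frozenset()}                     # the complex `true`
refined = concat(K, K)(docs, start)       # two refinement steps
print(max(len(c) for c in refined))       # 2: conjunctions of two word tests
```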
A complex c' ∈ s(D, {c}) is also called a refinement of c. The concatenation s1 ∘ s2 describes the application of two strategies, one after the other. Here, to every complex returned by the strategy s2, the strategy s1 is applied. By s^n the n-fold concatenation of the same strategy is described. The closure s+ of a strategy allows us to repeat the application of one strategy until no new complex can be obtained by a subsequent application. The union s1 ∪ s2 of the results of separate strategies can be obtained by applying the union operator. Restrictions s|b allow us to restrict further refinements only to the best-ranked complexes with respect to a given weighting function. The significance filter [s] selects those complexes which pass a given significance criterion. The strategies introduced allow us to imitate human strategies when manually creating rules based on sample documents. Starting with a set of complexes (generally the complex true) they can be used to describe complex search spaces.

3. Framework Instantiation: Avoiding Overfitted Complexes

As mentioned earlier, a good complex should cover some part of the documents of a category with high precision. High precision of a complex should not only hold for the example documents but also for unseen documents. When searching for such complexes based on the training samples, there is some risk in tuning complexes too much to these examples. This results in a poor precision on unseen documents. This problem is an example of the well-known overfitting problem addressed in the machine learning literature.21,22,23 In symbolic learning it is very common to address the problem of overfitting by specific weighting functions and significance measures as contained in our definition of strategy expressions. We briefly introduce the instantiation we have chosen to avoid overfitting in the experiments described later.
To simplify the description, we introduce p = |match(c, E⊕)| and n = |match(c, E⊖)| as the number of positive and negative example documents a complex covers.

3.1. Weighting Function
The precision of a complex for unseen documents can be estimated by the maximum likelihood estimate p/(p + n) on the training set. Unfortunately,
this estimate is too optimistic if the complex to be estimated covers only a small number of documents in the sample set. In order to avoid this problem we evaluated a range of weighting functions for learning complexes, of which the m-estimate as described in Ref. 24 turned out to be the best one25:

w(c, E) = (p + m · pr) / (p + n + m),   where pr = |E⊕|/|E| is the prior probability of the target category
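The m-estimate can be sketched numerically as follows (a minimal stand-alone sketch; `prior` corresponds to the fraction of positive documents in the training set):

```python
def m_estimate(p, n, prior, m):
    """m-estimate of precision: pulls small-coverage complexes toward the prior."""
    return (p + m * prior) / (p + n + m)

# A complex covering only 2 positives and 0 negatives is not yet trusted:
print(m_estimate(2, 0, prior=0.1, m=10))    # 0.25, far below p/(p+n) = 1.0
# With growing coverage the weight approaches p/(p+n):
print(m_estimate(200, 0, prior=0.1, m=10))  # ~0.957
```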
Using the m-estimate, complexes which cover only a few documents (i.e., with p + n small) will be weighted with a value close to the prior probability |E⊕|/|E|. With an increasing number of covered documents, the weight approaches p/(p + n). The value m is used to control the influence of the number of covered documents.

3.2. Significance Measure
Overfitting is not completely avoided by using the m-estimate as a weighting function. Often, the application of a strategy to a complex results in a huge number of new complexes. With the increasing number of resulting complexes there is also an increased risk that one of the complexes accidentally has a high weight on the training documents. This insight was the basis for the development of a new significance measure we describe in Ref. 25. In contrast to other significance measures known in rule learning (see, e.g., Ref. 26) for rating a complex c, it also incorporates other complexes c' obtained in the same refinement step from an original complex c_pre. More precisely, for rating the significance of c, it takes into account those complexes c' which cover the same number of documents as c. The new significance measure is called Z-measure and is defined as:

sigm(E, c_pre, S, c) = 1 - (1 - Bin(p + n, w, p))^K,   K = |{c' ∈ S : |match(c', E)| = |match(c, E)|}|

with S = s(E, c_pre) being the local search space obtained by applying the strategy s to c_pre, and

Bin(p + n, w, p) = Σ_{i=p}^{p+n} C(p + n, i) · w^i · (1 - w)^{p+n-i}.
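The binomial tail Bin and the resulting Z-measure can be sketched as follows (w and K as defined above; the function names are our own):

```python
from math import comb

def bin_tail(N, w, p):
    """Bin(N, w, p): probability of covering at least p positives out of N."""
    return sum(comb(N, i) * w**i * (1 - w)**(N - i) for i in range(p, N + 1))

def z_measure(N, w, p, K):
    """1 - (1 - Bin(N, w, p))^K for K complexes with identical coverage:
    the chance that at least one of K equal-coverage siblings reaches
    p positives purely by accident."""
    return 1 - (1 - bin_tail(N, w, p)) ** K

print(round(bin_tail(10, 0.5, 8), 4))   # 0.0547
# With many sibling complexes, an accidental high coverage becomes more likely:
print(z_measure(10, 0.5, 8, 20) > z_measure(10, 0.5, 8, 1))   # True
```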
4. Framework Instantiation: Expressive Document Features

Search strategies are designed to learn expressive document features in the form of complexes. In this section we present three of the complexes which
we have investigated in detail: conjunctions of word tests, word sequence tests, and substring tests.

4.1. Learning Conjunctions of Word Tests
Learning of conjunctions of word tests is based on the refinement operator

(K)   w1 ∧ ... ∧ wn  ⟶  w1 ∧ ... ∧ wn ∧ w,   with n ∈ ℕ0; w is a word in E⊕

It extends a given conjunction of word tests by adding new word tests. For building new word tests only those words are accepted which occur in at least one positive example document. This avoids constructing new conjunctions which cover no positive example document. Based on the refinement operator K, the strategy
([K]|1)+

defines the search space for complexes. Applied to {true}, this strategy first generates word conjunctions with only one word test using K. The significance filter "[...]" then removes all resulting complexes which are not significant, and by using "...|1" the best conjunction is selected from the remaining ones. Due to the operator "...+" the procedure is repeated on the best expression until no further improvement can be obtained.

4.2. Learning Word Sequence Tests
Often, the conjunction of word tests is not satisfactory because the order and distance of words in a document is very substantial for a good complex. By using the order and distance, the following language constructs in particular can be identified: • terms that consist of several words, such as "text categorization" • words in a discriminating context within the phrase, such as "no invoice", "not received", and "sell .... unit" • proper names, such as names of persons, companies, and products These language constructs can be captured by word sequence tests. In order to refine a single word test to a word sequence test, and to refine
a word sequence test by requiring specific words at its left and right, we introduce two refinement operators WSL and WSR:

(WSL(s))   c1 ∧ ... ∧ cn ∧ p  ⟶  c1 ∧ ... ∧ cn ∧ w[s]p
(WSR(s))   c1 ∧ ... ∧ cn ∧ p  ⟶  c1 ∧ ... ∧ cn ∧ p[s]w
with p being a word test or word sequence test and w occurring in E⊕. Based on these refinement operators, the following strategy can be used for learning complexes with word sequence tests.
((([WSL(0) ∪ WSR(0)] ∪ ... ∪ [WSL(smax) ∪ WSR(smax)])|1)+ ∘ [K]|1)+
This strategy first adds the best word test to a given conjunction. This word test is then refined to a word sequence test by repeatedly adding words to the left and right. Different maximum distances of these words to the existing word sequence, varying from 0 to the given parameter smax, are introduced for each extension. Having found the best extension of the word test to a word sequence, the strategy tries to refine the test by adding a regular word test again. This is repeated until no more improvement can be obtained.
4.3. Learning Substring Tests

So-called composite nouns are very common in German, especially in technical domains. While in English a sequence of single words can be used to make up complex terms, in German words are concatenated. Thus, the English term "text categorization system" is translated into a single word in German: "Textkategorisierungssystem". This type of composite noun is a problem if a single word within various composite nouns has to be identified. Properly chosen substring tests of the form *w, w*, *w* are an interesting workaround for this problem. Figure 2 shows by example how substring tests allow us to cluster words with similar meanings. The central problem is how to find good substring tests for a concrete text category. The refinement operator T refines the last word test w in
Fig. 2. Example of merging German words of similar meanings: textkategorisierung and dokumentkategorisierung are covered by the substring test *kategorisierung, and, together with textkategorien, by the substring test *kategori*.
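The substring tests of Fig. 2 can be found mechanically by enumerating, for a composite word, all tests built from words of the positive documents; a sketch with our own helper name and a toy word list:

```python
def substring_candidates(w, positive_words):
    """Candidate substring tests x*, *x, *x* for a word test w."""
    cands = set()
    for x in positive_words:
        if w.startswith(x):
            cands.add(x + "*")
        if w.endswith(x):
            cands.add("*" + x)
        if x in w:
            cands.add("*" + x + "*")
    return cands

cands = substring_candidates("textkategorisierung", {"text", "kategorisierung"})
print(sorted(cands))   # ['*kategorisierung', '*kategorisierung*', '*text*', 'text*']
```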
conjunction with substring tests. For example, a substring test *x* is introduced if the positive sample documents E⊕ contain the word x which is a substring of w.

(T)   c1 ∧ ... ∧ cn ∧ w  ⟶  c1 ∧ ... ∧ cn ∧ t,   with n ∈ ℕ0 and t ∈ {x* | w starts with x, x is a word in E⊕} ∪ {*x | w ends with x, x is a word in E⊕} ∪ {*x* | w contains x, x is a word in E⊕}

In the strategy using this refinement operator, first the best conjunction is refined by adding new word tests. The best b of these word tests according to the weighting function are then refined using T. All complexes obtained by K and T are then tested for their significance:
([K ∪ T|b ∘ K]|1)+

5. Using Complexes for Symbolic Rule Learning

5.1. Learning Symbolic Rules
Complexes are designed to cover some of the documents of a category with high precision. A more complete coverage of the category can be obtained by the pattern c1 ∨ c2 ∨ ... ∨ cn with the semantics match(c1 ∨ c2 ∨ ... ∨ cn, d) = 1 iff ∃i: match(ci, d) = 1. By restricting the interpretation of the pattern
to the first b < n arguments, a high coverage can be traded for a lower precision: match;,(ci V ci V • • • V cn, d) = 1 if match(ci V ci V • • • V q,, d). Using these patterns, we can learn decision rules of the following form: if matchj(ci V C2 V • • • V c„, d) = 1 then assign target category. The search for a pattern is done using a seed-driven separate-andconquer algorithm. The algorithm is similar to the well-known rule learner CN220 but uses a seed for guiding the search for complexes. The inputs of the algorithm are example documents E = E® U E® for the target category, a search heuristic s, a weighting function w, and a significance measure 'sigm' (Fig. 3). The algorithm initializes the document set R by E and the disjunction p by the empty pattern false. Starting with the complex true, it then searches for the best complex Cbest which covers an arbitrarily chosen seed document d from R®. If the best complex is true and not all documents in R® were already tried, a different document is taken as the seed. All documents covered by the best complex Cbest are then removed from R, and the next complex is searched for using a seed from the remaining documents. The loop terminates if the best complex found using all remaining documents in C as a seed is true. In the final step the complexes in the resulting pattern are sorted according to their weight given by w. Input: example documents E = E® U E® for the target category, a strategy expression s, a weighting function w, a significance measure 'sigm' Output: "best" pattern p for the target category R^- E p «— false REPEAT REPEAT d <— random example from R® S = s(E, {true}) U {true} c best <— c G 5, match(c, d) = 1 and Vc/-^cw(J5, c') < w ( £ , c) U N T I L ( ( c b e s t / true) or (all d £ R® considered)) R <- R \ match(c b e s t ,_R e ) V<~V V c b e s t U N T I L (c b e s t = true) sort complexes in p = c\ V • • • V cn with w ( £ , Ci) > w(ECj) for i < j Fig. 3. 
Seed-driven algorithm for finding the best disjunction.
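As an illustration of the b-restricted match, the following Python sketch (our own, not the authors' implementation) reduces complexes to plain conjunctions of word tests; the pattern and document below are hypothetical examples:

```python
# Illustrative sketch: a complex is a set of word tests, a pattern is an
# ordered disjunction of complexes (sorted by decreasing weight), and
# match_b restricts matching to the first b complexes, trading coverage
# for precision.

def match_complex(complex_words, doc_words):
    """A conjunction of word tests matches if every word occurs in the document."""
    return all(w in doc_words for w in complex_words)

def match_b(pattern, doc_words, b):
    """match_b(c1 v ... v cn, d) = 1 if any of the first b complexes matches d."""
    return any(match_complex(c, doc_words) for c in pattern[:b])

# Hypothetical pattern for a category, already sorted by decreasing weight:
pattern = [{"inflation", "statistics"}, {"consumer", "measured"}, {"cost", "living"}]
doc = {"the", "cost", "of", "living", "rose"}

print(match_b(pattern, doc, b=1))  # only the best complex is active: no match
print(match_b(pattern, doc, b=3))  # full disjunction: match
```

Small b keeps only the highest-weight complexes active (high precision, low coverage); raising b toward n adds weaker complexes and increases coverage.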
5.2. Evaluation
For the experimental evaluation we rely on four text collections:

• a collection of 1,004 German technical abstracts in 6 categories (e.g., "computer science and modeling" and "opto-electronics and laser technology")
• a collection of 1,741 OCR'ed German business letters belonging to 5 categories (e.g., "invoice" and "offer")
• a collection of 19,813 English financial news articles belonging to 118 categories (e.g., "earn", "acquisition", and categories corresponding to goods like "crude", "corn", and "rice")^a
• a collection of 20,000 English newsgroup articles belonging to 20 categories (e.g., "alt.atheism", "comp.sys.mac.hardware", and "comp.windows.x")^b

For the evaluation we split each collection 1:1 into a training and a test set. We only incorporated those 97 categories into our experiments for which at least 5 positive example documents existed in the training set. On the test set, we use the effectiveness measures recall and precision, which are widespread in information retrieval.^18 Recall and precision correspond to the characteristic requirements on patterns: they should cover the documents of a category as completely as possible (measured by the recall), and they should cover these documents as correctly as possible (measured by the precision). For each learned pattern p = c1 ∨ ... ∨ cn, a range of recall/precision values was computed by increasing the parameter b in the match function from 1 to n. Averaging over the categories was done at predefined recall points by macro-averaging the interpolated precision (cf. Ref. 18) for all patterns at these recall points. The resulting range of recall/precision points is graphically shown in recall/precision diagrams (here we note recall and precision in percent). Figure 4 compares the results we obtained by learning word sequence tests with those obtained by learning pure conjunctions of word tests. For learning word sequence tests, we used smax ∈ {0, 5, 10}.

a Taken from the Reuters-21578 collection which is available via http://www.research.att.com/~lewis.
b Available via http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html.
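The way raising b trades precision for coverage can be sketched on toy data (the interpolation and macro-averaging of Ref. 18 are omitted here; the document ids and match sets below are invented for illustration):

```python
# Sketch of the per-category evaluation: for a learned pattern, raise b
# from 1 to n and record recall/precision of match_b on a test set.

def recall_precision(predicted, relevant):
    """Recall and precision of a predicted document set against the relevant set."""
    tp = len(predicted & relevant)
    recall = tp / len(relevant) if relevant else 0.0
    precision = tp / len(predicted) if predicted else 1.0
    return recall, precision

# Toy test set: doc id -> True if the document belongs to the category.
labels = {1: True, 2: True, 3: True, 4: False, 5: False}
relevant = {d for d, y in labels.items() if y}

# matches[b] = set of doc ids covered when only the first b complexes are active.
matches = {1: {1}, 2: {1, 2}, 3: {1, 2, 3, 4}}

for b in (1, 2, 3):
    r, p = recall_precision(matches[b], relevant)
    print(f"b={b}: recall={r:.2f} precision={p:.2f}")
```

On this toy data, b = 1 gives perfect precision at low recall, while b = 3 reaches full recall at a lower precision, mirroring the trade-off plotted in the recall/precision diagrams.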
Fig. 4. Introducing word sequence tests with variable maximal distances between words (precision over recall, in percent; curves: word tests only, and word sequences with smax = 0, smax = 5, and smax = 10).
Learning word sequence tests without distances between words (smax = 0) already raises the precision by up to 6% at the low recall end. Admitting distances of at most 5 words (smax = 5) raises the precision by up to 10%. By further increasing smax (smax = 10), no further improvement can be obtained. The reason for this might be that with small distances the words still occur in the same sentence or part of the sentence. This allows the system to learn frequent and important phrases for categorization. With increasing word distances, the number of variations to express the same content grows very rapidly, which makes it hard to learn regularities.

A typical pattern learned using conjunctions of word tests is the one learned for the category cpi (consumer price index) in the Reuters collection:

(statistics ∧ inflation) ∨ (base ∧ prices ∧ year) ∨ (measured ∧ consumer) ∨ (statistics ∧ index ∧ consumer) ∨ inflation ∨ (living ∧ cost) ∨ (clothing ∧ prices) ∨ 1958 ∨ insee ∨ true

Using word sequence tests, the pattern learned for the same category cpi is:

((statistics [5] said [5] the) ∧ inflation) ∨ (> [4] inflation) ∨ ((february [4] statistics) ∧ consumer) ∨ (cost [2] living) ∨ 1958 ∨ (. [0] the [5] inflation) ∨ ((pct [5] statistics) ∧ price) ∨ (, [3] measured) ∨ (year [2] inflation) ∨ (inflation [3] 3 [2] .) ∨ (. [3] , [0] base) ∨ (inflation [5] pct [5] in) ∨ (inflation [5] <) ∨ ((inflation [5] pct) ∧ index) ∨ ((national [5] statistics) ∧ inflation) ∨ (2 [5] inflation) ∨ (clothing [3] and) ∨ (measured [5] index) ∨ insee ∨ true
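Under our reading of the notation, a test like (cost [2] living) requires the words in order with at most two other words in between; the following is a minimal sketch of such a matcher (an illustrative assumption, not the authors' code):

```python
# Sketch of a word sequence test: the test words must occur in order,
# with at most gaps[i] other words between the i-th and (i+1)-th word.

def match_sequence(seq, gaps, doc):
    """seq = [w1, w2, ...]; gaps[i] = max distance between w_i and w_{i+1}."""
    def search(i, pos):
        if i == len(seq):
            return True
        # The first word may occur anywhere; later words must fall in a window.
        lo = pos + 1
        hi = pos + 1 + gaps[i - 1] if i > 0 else len(doc) - 1
        for j in range(lo, min(hi, len(doc) - 1) + 1):
            if doc[j] == seq[i] and search(i + 1, j):
                return True
        return False
    return search(0, -1)

doc = "the cost of living index rose".split()
print(match_sequence(["cost", "living"], [2], doc))  # "cost [2] living": match
print(match_sequence(["cost", "rose"], [2], doc))    # too far apart: no match
```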
Table 1. Effect of using word sequence tests with respect to categories (smax = 5).

at recall (in %):    10   20   30   40   50   60   70   80   90   100
categories with
  improvements        8    5   13   14   15   21   26   22    5     0
  decreases           0    0    2    3    6    5    8    9   15    10
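Our understanding of the p-test of Ref. 12 is a two-proportion z-test on the systems' decisions; the sketch below, including all of its numbers, is an illustrative assumption and not a result from the experiments:

```python
# Hedged sketch of a two-proportion p-test (normal approximation):
# compare the precisions of two systems at a given recall level and
# reject equality at the 5% error probability if |z| > 1.96.

import math

def p_test(correct_a, total_a, correct_b, total_b):
    """z statistic for the difference of two proportions."""
    pa, pb = correct_a / total_a, correct_b / total_b
    p = (correct_a + correct_b) / (total_a + total_b)   # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    return (pa - pb) / se

z = p_test(90, 100, 75, 100)    # system A: 90% precision, system B: 75%
significant = abs(z) > 1.96     # two-sided test at the 5% level
print(f"z = {z:.2f}, significant: {significant}")
```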
The latter shows an improvement of precision of up to 20% as compared to the pattern which is purely based on conjunctions of word tests. Generally, it turns out that improvements from word sequence tests can only be shown in some categories. Table 1 lists the number of categories in which using word sequence tests (smax = 5) significantly outperformed pure conjunctions of word tests (and vice versa) at different recall levels. For significance testing we used the p-test with an error probability of 5% as described in Ref. 12. The table shows that by using word sequence tests with smax = 5, the precision can be significantly increased in up to 26 categories. This is more than one fourth of all 97 categories we have considered in our experiments. On the other hand, there is some tendency towards worse precision at higher recall levels when working with word sequence tests. We suspect an overfitting effect caused by the increased search space when using word sequence tests as the reason for this. The effect of worse accuracy caused by too extensive search has also been described in the literature.^23

The experiments using learned substring tests were only conducted on the collection of German technical abstracts. Figure 5 shows the averaged results over all 6 categories. It can be seen that incorporating learned substring tests yields a big improvement in precision. This result can be improved even more by increasing the beam b. The following typical pattern was learned without using substring tests for the category "opto-electronics and laser technology" in German technical abstracts:

laser ∨ fasern ∨ lasern ∨ lichtleitfasern ∨ (pm ∧ an) ∨ weidel ∨ laserdioden ∨ gaas ∨ optischen ∨ km ∨ gestellt ∨ optischer ∨ berichtet ∨ nut ∨ wellenlange ∨ zeigen ∨ db ∨ gemessen ∨ beschrieben ∨ herstellung ∨ optische ∨ true

Using substring tests, the pattern for this category is much shorter and the precision has been increased by up to 28%:
Fig. 5. Introducing substring tests with strategies of variable beam (precision over recall, in percent; curves: word tests only, and substring tests with b = 1, b = 10, and b = 100).
Table 2. Effect of using substring tests with respect to categories (b = 1000).

at recall (in %):    10   20   30   40   50   60   70   80   90   100
categories with
  improvements        1    3    2    3    3    4    5    5    2     0
  decreases           0    0    0    0    0    0    0    0    0     0
*laser* ∨ *faser* ∨ *optisch* ∨ *gaas* ∨ *zeigen* ∨ *monomode* ∨ gestellt ∨ *30* ∨ pm ∨ *lasern* ∨ *moden* ∨ optischen ∨ *hergestellt* ∨ true
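A wildcard test such as *laser* can be matched with a simple sketch (the semantics assumed here, a leading or trailing * permitting an arbitrary prefix or suffix, is our reading of the notation, and the word list is invented):

```python
# Sketch of a substring test: "*laser*" matches any word containing
# "laser", so inflected and compound forms are covered by a single test.

def match_substring_test(test, doc_words):
    """True if any word in the document satisfies the wildcard test."""
    pre = test.startswith("*")   # wildcard before the core string
    suf = test.endswith("*")     # wildcard after the core string
    core = test.strip("*")
    for w in doc_words:
        if pre and suf and core in w:
            return True
        if pre and not suf and w.endswith(core):
            return True
        if suf and not pre and w.startswith(core):
            return True
        if not pre and not suf and w == core:
            return True
    return False

words = ["die", "laserstrukturen", "wurden", "gemessen"]
print(match_substring_test("*laser*", words))  # covers the compound noun
print(match_substring_test("laser", words))    # an exact word test fails here
```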
It turns out that the substring test "*laser*" alone replaces tests on more than 15 compound nouns containing the word laser, e.g. "injektionslasers" (Engl. "injection laser"), "laserstrukturen" ("laser structures"), "laserlicht" ("laser light"), "laseremission" ("laser emission"), "laserlichtempfindlichkeit" ("laser light sensitivity"), and "lasermodul" ("laser module"). Table 2 lists the number of categories in which learned substrings improve precision significantly at different recall levels (b = 1000). A significant decrease in precision could not be found in any category.

6. Using Complexes for Learning with the SVM

6.1. Learning Complex Features
Learned complexes can also be used to enrich standard word-based vector representations of documents. Figure 6 shows our algorithm for generating
Input: example documents E = E+ ∪ E− for the target category, a minimum number n of word occurrences, a strategy s, a weighting function w, a significance measure 'sigm'
Output: set of complexes C = {c1, c2, ..., cn} for the target category

  R ← E
  C ← {w : |match(w, E)| ≥ n}
  REPEAT
    REPEAT
      d ← random example from R+
      S ← s({true}, E) ∪ {true}
      c ← c ∈ S with match(c, d) = 1 and ∀c' ≠ c: w(c', E) ≤ w(c, E)
    UNTIL ((c ≠ true) or (all d ∈ R+ considered))
    R ← R \ match(c, R+)
    C ← C ∪ {c}
  UNTIL (c = true)

Fig. 6. Seed-driven algorithm for learning features.
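How the returned complexes extend a word-based representation can be sketched as follows (the vocabulary, the single complex, and the simple conjunction-of-words matcher are all illustrative assumptions):

```python
# Sketch: each learned complex c becomes a boolean feature whose value
# for a document d is match(c, d), appended after the plain word features.

def doc_to_vector(doc_words, vocabulary, complexes):
    """Word features followed by boolean complex features."""
    word_features = [1 if w in doc_words else 0 for w in vocabulary]
    complex_features = [1 if all(w in doc_words for w in c) else 0
                        for c in complexes]
    return word_features + complex_features

vocab = ["laser", "faser", "optisch"]
complexes = [{"laser", "gemessen"}]          # one hypothetical learned complex
doc = {"laser", "gemessen", "wurden"}
print(doc_to_vector(doc, vocab, complexes))  # [1, 0, 0, 1]
```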
document features in the form of complexes specific to a category to be learned. In contrast to the previous algorithm for symbolic rule learning, this one extends a simple word-oriented feature-value representation with more complex features. The inputs of the algorithm are positive and negative example documents E = E+ ∪ E− for the target category, the number n as the minimum number of word occurrences for introducing a word as a feature, and a search strategy s for generating the complex features. The complexes returned by the algorithm are interpreted as features for representing documents in a feature-value learner. A complex c corresponds to a boolean feature whose value for a new document d is given by the match function match(c, d).

6.2. Support Vector Machines
Support Vector Machines (SVM) learn linear decision rules of the form if w · f + b > 0 then assign target category, described by the weight vector w and the threshold b. The vector f is the feature vector representing the document to be categorized. SVMs rely on the structural risk minimization principle.^27 Based on this principle they try
to find the hyperplane with the lowest error probability. Vapnik shows that finding this hyperplane corresponds to solving the following optimization problem:

  minimize:   V(w, b, ξ) = (1/2) w · w + C Σ_{i=1}^{n} ξ_i
  subject to: ∀_{i=1}^{n} : y_i [w · f_i + b] ≥ 1 − ξ_i
              ∀_{i=1}^{n} : ξ_i ≥ 0

In the formula, ξ_i is a slack variable which is at least 1 if the corresponding training sample lies on the wrong side of the hyperplane. The factor C is a parameter for trading off training error vs. model complexity.^28,29 The SVM was chosen as a representative of learning algorithms representing documents by features. SVMs were first applied to text categorization by Joachims^10 and belong to the best-performing learning algorithms known for this problem.

6.3. Evaluation
The algorithm for generating category-specific features was evaluated as follows: as a baseline we chose a word-based document representation using all words occurring in at least 3 sample documents as features. This results in a feature vector (f_1, f_2, ..., f_n). For each category, the performance of this representation was compared with the effectiveness of a representation enriched by learned complexes: (f_1, f_2, ..., f_n, f_{n+1}, ..., f_{n+m}) with f_{n+1}, ..., f_{n+m} corresponding to the learned complexes c ∈ C (m = |C|). The evaluation was done using the SVM^light implementation as provided by Joachims.^c For generating different recall/precision points we varied the parameter b in the decision rule. Figure 7 shows the results with and without learned features corresponding to word sequence tests, in comparison to the purely symbolic approach described in Section 5. It turns out that even without learned features the SVM is superior to the symbolic classifier. We think this is caused by the inherent weakness of purely symbolic formalisms in aggregating weak indications for a category into stronger decisions. Nevertheless, even the effectiveness of the SVM can be improved again by the learned features, by up to more than 5% in precision.
c Available via http://ais.gmd.de/~thorsten/svm_light.
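The experiments above use the SVM^light implementation; as a self-contained stand-in, the sketch below trains a plain perceptron (not an SVM, but likewise a linear rule of the form w · f + b) on synthetic enriched vectors whose last column plays the role of a learned complex feature:

```python
# Stand-in sketch: a mistake-driven perceptron learns a linear decision
# rule over enriched feature vectors. The data is synthetic; columns
# f1..f3 are word features, f4 is a learned complex feature.

def train_perceptron(X, y, epochs=20):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for f, t in zip(X, y):
            pred = 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0
            if pred != t:                     # update only on mistakes
                for i, fi in enumerate(f):
                    w[i] += (t - pred) * fi
                b += (t - pred)
    return w, b

X = [[1, 0, 0, 1], [1, 1, 0, 1], [0, 0, 1, 0], [0, 1, 1, 0]]
y = [1, 1, 0, 0]                              # 1 = target category

w, b = train_perceptron(X, y)
score = sum(wi * fi for wi, fi in zip(w, [1, 0, 1, 1])) + b
print("assign category:", score > 0)          # new document with f4 = 1
```

The enriched column lets a linear learner separate documents that share no single decisive word, which is the intended effect of adding learned complexes to the representation.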
Fig. 7. Enriching a word-based feature-value document representation by learned word sequence tests for the SVM (precision over recall; curves: rules, rules with word sequences, SVM, SVM with word sequences).
Fig. 8. Enriching a word-based feature-value document representation by learned substring tests for the SVM (precision over recall; curves: rules, rules with substring tests, SVM, SVM with substring tests).
Figure 8 shows the results when enriching the document representation in the collection of German technical abstracts by learned substring tests. The SVM without learned features is already superior to the symbolic classifier, and the results of the SVM can again be improved considerably (by up to 10% in precision) by adding learned substring tests. Tables 3 and 4 list the number of categories in which word sequence tests and substring tests improved the effectiveness of the SVM. The results
Table 3. Effect of using word sequence tests for the SVM with respect to categories (smax = 5).

at recall (in %):    10   20   30   40   50   60   70   80   90   100
categories with
  improvements        0    2    0    2    5    5   11   14   20    10
  decreases           0    0    0    0    0    0    0    0    1     2
Table 4. Effect of using substring tests for the SVM with respect to categories (b = 1000).

at recall (in %):    10   20   30   40   50   60   70   80   90   100
categories with
  improvements        0    0    0    0    3    3    2    2    2     0
  decreases           0    0    0    0    0    0    0    0    0     0
when using word sequence tests (Table 3) show that using the SVM with the enriched feature set does not decrease the precision at higher recall as much as the symbolic approach does. Using word substrings only led to significantly improved precision in 3 categories (Table 4).

7. Summary

Standard approaches to learning document categorizations represent documents as vectors. The elements of these vectors indicate whether, or how often, a specific word occurs in the encoded document. In this chapter we introduced a conceptual framework for learning complex document features to be used by text categorization algorithms. Based on a document representation which maintains word positions as well as the words themselves, the framework allows the system to construct arbitrary features in texts, named complexes. The search for the best complexes is controlled by a declarative strategy language and can be described as a symbolic approach for learning good features.

For empirically evaluating the advantages achievable using such learned features, we analyzed the effects of incorporating learned document features on two learning text categorization systems which belong to different learning paradigms. First, we embedded our algorithm for searching for optimal features into a CN2-like symbolic learner. Using four test document collections of different domains and languages, we showed that the usage of word
sequence tests and substring tests in complexes improved classifier effectiveness significantly. We then used learned features to enrich the standard word-based vector representation of documents for a statistical classifier, the Support Vector Machine, which belongs to the best known statistical learners for text categorization. It was also shown that the effectiveness of this learning approach can be improved considerably by the learned complexes.

The latter and, in our experiments, better approach is hybrid in the sense that it integrates a symbolic learning approach for constructing features (learning complexes in our framework) and a statistical learning approach (the SVM). It seems to be particularly useful to combine the strength of the symbolic approach — the exploration of the huge search spaces of possibly useful document features — with the strength of statistics-based approaches — the tuning of numerical parameters — to achieve the best-performing classifiers for text categorization.
Acknowledgments

This work has been supported by a grant from The Federal Ministry of Education, Science, Research, and Technology (FKZ 01 IN 902 B 8).

References

1. P. J. Hayes, P. M. Anderson, I. B. Nirenburg, and L. M. Schmandt. TCS: a shell for content-based text categorization. In Proceedings of the 6th Conference on Artificial Intelligence Applications, pages 320-326, Santa Barbara, CA, USA, May 5-9 1990.
2. P. J. Hayes and S. P. Weinstein. Construe-TIS: A System for Content-Based Indexing of a Database of News Stories. In Alain Rappaport and Reid Smith, editors, Innovative Applications of Artificial Intelligence 2, pages 49-64. AAAI Press/MIT Press, 1991.
3. R. M. Tong and D. G. Shapiro. Experimental Investigations of Uncertainty in a Rule-Based System for Information Retrieval. International Journal of Man-Machine Studies, 22:265-282, 1985.
4. L. Gilardoni, P. Prunotto, and G. Rocca. Hierarchical pattern matching for knowledge based news categorization. In RIAO 94 Conference Proceedings, pages 67-81, New York, NY, USA, October 11-13 1994.
5. S. Agne and H.-G. Hein. A Pattern Matcher for OCR Corrupted Documents and its Evaluation. In Proceedings of the IS&T/SPIE 10th International Symposium on Electronic Imaging Science and Technology (Document Recognition V), pages 160-168, San Jose, CA, USA, January 24-30 1998.
6. C. Wenzel and R. Hoch. Text categorization of scanned documents applying a rule-based approach. In Proceedings of the 4th Annual Symposium on Document Analysis and Information Retrieval (SDAIR 95), pages 333-346, Las Vegas, NV, USA, April 24-26 1995.
7. D. D. Lewis. Representation and learning in information retrieval. PhD thesis, Department of Computer Science, University of Massachusetts, 1992.
8. W. W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization. In Proceedings of the 19th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (SIGIR 96), pages 307-316, Zurich, Switzerland, August 18-22 1996.
9. D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka. Training Algorithms for Linear Text Classifiers. In Proceedings of the 19th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (SIGIR 96), pages 298-306, Zurich, Switzerland, August 18-22 1996.
10. T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML 98), pages 137-148, Chemnitz, Germany, April 21-24 1998.
11. M. Junker and R. Hoch. An experimental evaluation of OCR text representations for learning document classifiers. International Journal on Document Analysis and Recognition, 1(2):116-122, June 1998.
12. Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (SIGIR 99), pages 42-49, Berkeley, CA, USA, August 15-19 1999.
13. R. E. Schapire and Y. Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.
14. R. Tong, A. Winkler, and P. Gage. Classification Trees for Document Routing, a Report on the TREC Experiment. In Proceedings of the First Text REtrieval Conference (TREC-1), pages 209-227, 1993.
15. D. D. Lewis and M. Ringuette. A Comparison of Two Learning Algorithms for Text Categorization. In Proceedings of the 3rd Symposium on Document Analysis and Information Retrieval (SDAIR 94), pages 81-93, Las Vegas, NV, USA, April 11-13 1994.
16. W. W. Cohen. Learning to Classify English Text with ILP Methods. In Advances in Inductive Logic Programming, pages 124-143. IOS Press, 1996.
17. Markus Junker. Heuristisches Lernen von Regeln für die Textkategorisierung. PhD thesis, University of Kaiserslautern, Germany, 2001.
18. C. van Rijsbergen. Information Retrieval. Butterworth, London, England, 1979.
19. G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing Company, 1989.
20. P. Clark and T. Niblett. The CN2 Algorithm. Machine Learning, 3(4):261-283, 1989.
21. D. W. Aha and R. L. Bankert. A comparative evaluation of sequential feature selection algorithms. In Learning from Data - Artificial Intelligence and Statistics, chapter 19, pages 199-205. Springer, 1996.
22. S. W. Norton and H. Hirsh. Learning DNF via probabilistic evidence combination. In Machine Learning: Proceedings of the 10th International Conference (ML 93), pages 220-227, Amherst, MA, USA, June 27-29 1993.
23. J. R. Quinlan and R. M. Cameron-Jones. Oversearching and layered search in empirical learning. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 95), pages 1019-1024, Montréal, Canada, 1995.
24. S. Dzeroski. Handling imperfect data in inductive logic programming. In Fourth Scandinavian Conference on Artificial Intelligence, pages 111-125, 1993.
25. M. Junker and A. Dengel. Preventing overfitting in learning text patterns for document categorisation. In Second International Conference on Advances in Pattern Recognition (ICAPR 2001), Rio de Janeiro, Brazil, 2001.
26. J. Fürnkranz. Separate-and-conquer rule learning. Artificial Intelligence Review, 13(1):3-54, 1999.
27. V. Vapnik. Statistical Learning Theory. Wiley, 1998.
28. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, pages 121-167, 1998.
29. N. Cristianini and J. Shawe-Taylor. Support Vector Machines and other kernel-based learning methods. Cambridge University Press, 2000.
HYBRID METHODS IN PATTERN RECOGNITION The field of pattern recognition has seen enormous progress since its beginnings almost 50 years ago. A large number of different approaches have been proposed. Hybrid methods aim at combining the advantages of different paradigms within a single system. This book presents a collection of articles describing recent progress in this emerging field. It covers topics such as the combination of neural nets with fuzzy systems or hidden Markov models, neural networks for the processing of symbolic data structures, hybrid methods in data mining, the combination of symbolic and subsymbolic learning, and others. Also included is recent work on multiple classifier systems. Furthermore, the book deals with applications in on-line and off-line handwriting recognition, remotely sensed image interpretation, fingerprint identification, and automatic text categorization.
ISBN 981-02-4832-6
www.worldscientific.com